Bug #9485
Monitor crash due to wrong crush rule set
Description
I created a customized crush rule for an EC pool:
step take default
step choose firstn 6 type rack
step chooseleaf firstn 2 type host
step emit
It works fine, but if I change "firstn" to "indep" and set the crushmap, the monitors crash and cannot be restarted.
My cluster has only 6 racks. I guess that "choose indep 6 type rack" may return holes in the result, and at the next step CRUSH cannot choose a leaf from a hole, so it fails.
Unfortunately, the bad crush rule has already been set, and every time I try to restart a monitor it crashes for the same reason ("can not apply rule ... segment fault ..."). The whole cluster then becomes unusable.
Even though I set a crush rule that cannot be satisfied, the monitor should not crash permanently.
Related issues
History
#1 Updated by Loïc Dachary about 9 years ago
- Status changed from New to Need More Info
- Assignee set to Loïc Dachary
- Priority changed from High to Normal
Could you add the stack trace of the mon crash to the ticket? I remember the discussion we had on the mailing list, and you apparently checked (using crushtool, I presume) that the mapping succeeds. In any case, it should not crash the mon, and I wonder how this is related.
#2 Updated by Dong Lei about 9 years ago
Hi Loïc,
I'm currently running some tests on my dev environment; once they are finished, I will reproduce the crash and give you a stack trace.
On the mailing list I used
step choose firstn 6 type rack
Sometimes this step returns only 5 racks, so the final acting list is 5 * 2 = 10 < 11 (what I want) and some PGs get stuck in remapped.
Then I made a small change:
step choose indep 6 type rack   // indep instead of firstn
and it makes the monitor crash.
I guess that in this step some PGs can only get 5 racks instead of 6, and because of the indep mode the result is 5 racks plus a hole. Then, at the next step
step chooseleaf firstn 2 type host
CRUSH cannot chooseleaf from the hole, so it crashes.
Could you check how CRUSH behaves when it has to handle the holes left by indep mode?
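I think the results of the two modes differ roughly like this (the rack names are made up; only the shape of the result matters):
step choose firstn 6 type rack  ->  [rack1, rack2, rack3, rack4, rack5]          (short result, no placeholder)
step choose indep 6 type rack   ->  [rack1, rack2, rack3, HOLE, rack5, rack6]    (fixed-length result, the missing slot stays as a hole)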
I will add some log soon.
#3 Updated by Loïc Dachary about 9 years ago
Hi,
It should not crash; it should give you an error of some kind instead. Could you please attach to this ticket a dump of the crushmap obtained with
ceph osd getcrushmap > c
crushtool -d c -o c.txt
and the output of
ceph osd tree
?
#4 Updated by Dong Lei about 9 years ago
- File ceph_osd_tree.txt View added
- File c.txt View added
Hi Loïc,
The log, the "ceph osd tree" output, and the crush map are added.
log:
0> 2014-09-19 09:43:08.462737 7f92d9674700 -1 *** Caught signal (Segmentation fault) **
in thread 7f92d9674700
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
1: /usr/bin/ceph-mon() [0x86c391]
2: /lib64/libpthread.so.0() [0x3012c0f500]
3: (crush_do_rule()+0x38e) [0x834f4e]
4: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0xa6) [0x78ec06]
5: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >, int, unsigned int*) const+0x93) [0x77e903]
6: (OSDMap::_pg_to_up_acting_osds(pg_t, std::vector<int, std::allocator<int> >, int, std::vector<int, std::allocator<int> >, int) const+0x123) [0x77ed43]
7: (PGMonitor::map_pg_creates()+0x25b) [0x624f2b]
8: (PGMonitor::update_from_paxos(bool*)+0xa48) [0x63a598]
9: (PaxosService::refresh(bool*)+0x193) [0x5b1173]
10: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54f557]
11: (Paxos::do_refresh()+0x36) [0x59ed16]
12: (Paxos::begin(ceph::buffer::list&)+0xc46) [0x5a6206]
13: (Paxos::propose_queued()+0x273) [0x5a6773]
14: (Paxos::finish_round()+0x106) [0x5a6aa6]
15: (Paxos::begin(ceph::buffer::list&)+0xc64) [0x5a6224]
16: (Paxos::propose_queued()+0x273) [0x5a6773]
17: (Paxos::propose_new_value(ceph::buffer::list&, Context*)+0x160) [0x5a6960]
18: (PaxosService::propose_pending()+0x386) [0x5b07b6]
19: (Context::complete(int)+0x9) [0x582f69]
20: (SafeTimer::timer_thread()+0x453) [0x763f53]
21: (SafeTimerThread::entry()+0xd) [0x76610d]
22: /lib64/libpthread.so.0() [0x3012c07851]
23: (clone()+0x6d) [0x30128e890d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
crush map and osd tree are attached.
I use crush ruleset 1, "ecpool".
#5 Updated by Loïc Dachary about 9 years ago
Could you also please add the output of ceph osd dump? It looks like you have run into http://tracker.ceph.com/issues/9492
#6 Updated by Dong Lei about 9 years ago
Because the monitor crashed and cannot be restarted, I currently cannot get the "ceph osd dump" output.
I checked issue 9492 and I believe this is not the same issue. I have run into 9492 as well: when it happens, even though the monitor crashes, it can be restarted, because the rule was not successfully set. In my situation the bad rule has been set and the monitor cannot be restarted.
As I said, my original rule is
step take default
step choose firstn 6 type rack
step chooseleaf firstn 2 type host
step emit
and it also has
min_size 11
max_size 11
If I set min_size to anything from 1 to 5, the monitor also crashes (but can be restarted); this may be related to 9492.
But if I change "firstn" to "indep" in "step choose firstn 6 type rack", the monitor crashes and stays down. That is my issue.
#7 Updated by Loïc Dachary about 9 years ago
What probably happens is that you created an erasure code profile whose k+m is lower than the number of OSDs provided by the crush rule you are trying to use, and that would trigger #9492. What is the profile used for the ecpool?
ceph osd erasure-code-profile get theprofile
The size of the erasure coded pool also shows up in the ceph osd dump output, if you can manage to get it.
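The pool line to look for would be something like this (all values are illustrative, not taken from your cluster):
pool 3 'ecpool' erasure size 11 min_size 8 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 ...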
#8 Updated by Loïc Dachary about 9 years ago
Could you also attach the log of the monitor crash you are seeing? Note that if you change a crush rule that is currently in use, the result is undefined. You need to make sure the pools using the rule are removed first.
#9 Updated by Dong Lei about 9 years ago
The profile used for the ecpool is K=8 M=3.
If I set min_size = 3 and max_size = 12 (the defaults), the monitor crashes as soon as I set the rule, before I create any pool. I think it fails at rule validation, because CRUSH will test the rule with replica_size = 3 while my rule chooses 6 racks in its first step, and CRUSH crashes somehow.
Then I set min_size = 11; the rule can be set and the pool can be created.
Then I delete all pools and change "firstn" to "indep"; the rule is set successfully.
Then I create a pool with the rule, and the monitor crashes.
(Sometimes it does not crash, and I can check that every PG finds 6 racks. Then I create another pool with the same rule and the monitor crashes, so I guess some of those PGs cannot find 6 racks and hit a hole.)
I always delete all pools first, then create the rule, then create a new pool with the rule.
The log is in #4; do you mean it's not enough?
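If it helps, I think the two sizes involved could be compared offline with something like the following (the ruleset id and filename are what I assume from my setup):
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings    # the small size that validation would try
crushtool -i crushmap.bin --test --rule 1 --num-rep 11 --show-bad-mappings   # the size the pool actually uses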
#10 Updated by Loïc Dachary about 9 years ago
You have K=8 M=3, which means your pool needs 11 OSDs. However, the rule you defined will always provide 12 OSDs, and you will run into #9492. Does it work if you set K=8 M=4?
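Something along these lines should do it (the profile and pool names are arbitrary):
ceph osd erasure-code-profile set profile_k8m4 k=8 m=4
ceph osd erasure-code-profile get profile_k8m4
ceph osd pool create ecpool12 128 128 erasure profile_k8m4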
#11 Updated by Dong Lei about 9 years ago
I know that I need 11 and the rule provides 12; it looks like CRUSH will truncate the result.
That doesn't seem to be the issue, because I don't run into any problems if I set min_size = 11 and use firstn when choosing racks.
The problem is:
1. If I set min_size = 3, I cannot set the rule; the monitor crashes (this doesn't matter much, because I can restart the monitor).
2. If I set min_size = 11 and use indep instead of firstn, the monitor crashes and never comes back.
#12 Updated by Loïc Dachary about 9 years ago
Could you please let me know if it always works with K=8 M=4?
#13 Updated by Dong Lei about 9 years ago
K=8 M=4 doesn't work.
I rebuilt the cluster and performed the following steps.
(All pools deleted.)
1. Create a profile with K=8 M=4.
2. Create a pool with the profile and delete the pool. (This creates a rule in the crushmap.)
3. Dump the crushmap and modify the crush rule to the following (the dump/edit/inject commands I use are sketched at the end of this comment):
rule ecpool {
ruleset 1
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take default
step choose firstn 6 type rack
step chooseleaf firstn 2 type osd
step emit
}
The monitor crashes. // This is the first problem.
4. Restart the monitor and change
min_size 12
max_size 12
then compile and set the crushmap: OK.
5. Dump the crushmap, change the rule to
step choose indep 6 type rack
and set the crushmap. The monitor crashes. Log:
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
1: /usr/bin/ceph-mon() [0x86c391]
2: /lib64/libpthread.so.0() [0x3012c0f500]
3: (crush_do_rule()+0x38e) [0x834f4e]
4: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0xa6) [0x78ec06]
5: (CrushTester::test()+0xecc) [0x7990ec]
6: (OSDMonitor::prepare_command_impl(MMonCommand*, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >&)+0xff6) [0x5deb36]
7: (OSDMonitor::prepare_command(MMonCommand*)+0x2cf) [0x5ebc4f]
8: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x26b) [0x5ec08b]
9: (PaxosService::dispatch(PaxosServiceMessage*)+0xa4a) [0x5b1e7a]
10: (Monitor::handle_command(MMonCommand*)+0xde0) [0x579a70]
11: (Monitor::dispatch(MonSession*, Message*, bool)+0x3ca) [0x580e5a]
12: (Monitor::_ms_dispatch(Message*)+0x20e) [0x58139e]
13: (Monitor::ms_dispatch(Message*)+0x32) [0x59d572]
14: (DispatchQueue::entry()+0x5a2) [0x843542]
15: (DispatchQueue::DispatchThread::entry()+0xd) [0x83e87d]
16: /lib64/libpthread.so.0() [0x3012c07851]
17: (clone()+0x6d) [0x30128e890d]
6. Restart the monitor; this time it restarts successfully.
// I haven't created any pool here; merely setting the crushmap makes the monitor crash.
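For completeness, the dump/edit/inject cycle I refer to in steps 3-5 is roughly this (the filenames are arbitrary):
ceph osd getcrushmap -o cm.bin
crushtool -d cm.bin -o cm.txt       # decompile, then edit the rule in cm.txt
crushtool -c cm.txt -o cm.new       # recompile
ceph osd setcrushmap -i cm.new      # inject; this is where the monitor crashes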
#14 Updated by Loïc Dachary about 9 years ago
Thanks for the detailed instructions. I'll try them to repeat the problem.
#15 Updated by Dong Lei about 9 years ago
Thanks so much.
BTW:
I reproduced this in my dev environment with 60 OSDs on one host, organized into 6 virtual racks (you can see this in the "ceph osd tree" output I attached earlier).
Hopefully that will help you set up a test environment easily.
#16 Updated by Loïc Dachary almost 9 years ago
- Status changed from Need More Info to In Progress
Did not forget about it, just busy with other things.
#17 Updated by Loïc Dachary almost 9 years ago
Did not forget about it, just busy with other things (the OpenStack summit after the Giant release).
#18 Updated by Loïc Dachary almost 9 years ago
- Status changed from In Progress to Verified
#19 Updated by Panayiotis Gotsis almost 9 years ago
Hello, I can verify that I am facing the same problem.
After editing the crushmap to separate groups of OSDs according to their disk technology, my mons failed to come back up after a required restart.
Using ceph-monstore-tool I managed to extract the crushmap from the store.db, and I can see that a ruleset refers to a non-existent root. I cannot say for sure that my edited crushmap was set up properly, and the "lost" reference is probably related to that. But I cannot find a way to reset the offline store.db with a fixed crushmap.
I even checked the source code: MonitorDBStore only exposes get methods, the put methods are part of the class's Transaction struct, and I have too little experience with the code to tell whether that structure can be used to initiate a transaction that updates an offline store.db.
Having a way to fix an offline store.db that is broken by a bad crushmap would be nice :D
#20 Updated by Panayiotis Gotsis almost 9 years ago
This is the crashing crushmap
https://www.dropbox.com/s/gbusu8jf2ku6k62/crushmap.orig?dl=0
#21 Updated by Panayiotis Gotsis almost 9 years ago
For the attached link, this is the result of the command (crushtool as supplied by the Debian packages, 0.80).
#22 Updated by Panayiotis Gotsis almost 9 years ago
For the attached link, this is the result of the command (crushtool as compiled from the git tree with --with-debug, 0.89).
#24 Updated by Sage Weil almost 9 years ago
I've fixed Panayiotis's issue, but it is different than the original bug.
Dong Lei, I've tried to reproduce this but can't make it happen. Is the attached c.txt the exact map that triggers the mon crash?
Are you able to make crushtool crash with
crushtool -c c.txt -o map
crushtool -i map --test
? This is what I'm trying to do to reproduce and I'm failing. :(
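Pinning the test to the specific ruleset and sizes being discussed might also be worth a try (ruleset 1 and sizes 11/12 are my guess from the thread):
crushtool -i map --test --rule 1 --num-rep 11 --show-bad-mappings
crushtool -i map --test --rule 1 --num-rep 12 --show-bad-mappings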
#25 Updated by Panayiotis Gotsis almost 9 years ago
Just to offer some follow-up on the issue.
After installing the patch, I managed to get the monitor up and running. I noticed, however, that it started failing again after a couple of minutes, so I restarted it and quickly injected the correct crushmap.
That solved my problem.
Thanks a lot
#26 Updated by Dong Lei almost 9 years ago
Hi Sage,
According to my earlier tests, crushtool may not be able to reproduce the crash. I remember that crushtool simply returns 10 OSDs when it can only find 5 racks while trying to find 6.
Loïc has marked this issue as verified. Could you talk to him to find out how he reproduced it?
Sage Weil wrote:
I've fixed Panayiotis's issue, but it is different than the original bug.
Dong Lei, I've tried to reproduce this but can't make it happen. Is the attached c.txt the exact map that triggers the mon crash?
Are you able to make crushtool crash with
crushtool -c c.txt -o map
crushtool -i map --test
? This is what I'm trying to do to reproduce and I'm failing. :(
#27 Updated by Loïc Dachary almost 9 years ago
Although I've marked the issue as verified, I did not actually get to reproduce it. I meant to try the procedure you provided a number of times, but never did. My mistake.
#28 Updated by Dong Lei almost 9 years ago
But you do understand that when CRUSH cannot find enough racks in indep mode, things go wrong and the bad rule still gets committed to the store, right?
Loic Dachary wrote:
Although I've marked the issue as verified, I did not actually get to reproduce it. I meant to try the procedure you provided a number of times, but never did. My mistake.
#29 Updated by Sage Weil almost 9 years ago
- Priority changed from Normal to High
#30 Updated by Samuel Just over 8 years ago
- Status changed from Verified to Resolved