Bug #9485

Monitor crash due to wrong crush rule set

Added by Dong Lei over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I created a customized crush rule for an EC pool:

1 step take default
2 step choose firstn 6 type rack
3 step chooseleaf firstn 2 type host
4 step emit

It works fine, but if I change "firstn" to "indep" and set the crushmap, the monitor crashes and cannot be restarted.

My cluster has only 6 racks. I guess that "choose indep 6 type rack" may return holes in the result, and at the next step CRUSH cannot choose a leaf from a hole, so it fails.

Unfortunately, the wrong crush rule has already been set, and every time I try to restart the monitor, it crashes for the same reason ("can not apply rule ... segmentation fault ..."). The whole cluster then becomes unusable.

Even though I set a crush rule that cannot be satisfied, the monitor should not crash forever.
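The firstn/indep difference described above can be sketched with a toy model (illustrative Python only, not Ceph's actual crush_do_rule() logic; the hole marker mimics Ceph's CRUSH_ITEM_NONE):

```python
# Toy model of CRUSH's firstn vs indep choose modes when fewer
# buckets exist than requested. Not Ceph code; for illustration only.
CRUSH_ITEM_NONE = -1  # marker for a "hole" in an indep result

def choose(available, want, mode):
    # Pick up to `want` items from `available`.
    # firstn: a shortfall simply shrinks the result (replicas have no
    #         fixed position, so a shorter list is acceptable).
    # indep:  every position must be kept, so shortfalls are padded
    #         with CRUSH_ITEM_NONE (erasure-coded shards are positional).
    picked = list(available[:want])
    if mode == "indep":
        picked += [CRUSH_ITEM_NONE] * (want - len(picked))
    return picked

racks = ["rack1", "rack2", "rack3", "rack4", "rack5"]  # only 5 exist

print(choose(racks, 6, "firstn"))  # 5 entries, no holes
print(choose(racks, 6, "indep"))   # 6 entries, the last one is a hole
```

The reporter's hypothesis is that the subsequent chooseleaf step does not handle such a hole and crashes instead of propagating it.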

ceph_osd_tree.txt - ceph osd tree output (1.31 KB) Dong Lei, 09/19/2014 02:59 AM

c.txt - crush map output (4.63 KB) Dong Lei, 09/19/2014 02:59 AM


Related issues

Related to RADOS - Bug #9492: Crush Mapper crashes when number of replicas is less than total number of osds to be selected. Resolved 09/16/2014
Duplicates RADOS - Support #8600: MON crashes on new crushmap injection New 06/14/2014

History

#1 Updated by Loic Dachary over 5 years ago

  • Status changed from New to Need More Info
  • Assignee set to Loic Dachary
  • Priority changed from High to Normal

Could you add the stack trace of the mon crash to the ticket? I remember the discussion we had on the mailing list, and you apparently checked (using crushtool, I presume) that the mapping succeeds. In any case, it should not crash the mon, and I wonder how it is related.

#2 Updated by Dong Lei over 5 years ago

Hi, loic.

Currently I'm running some tests in my dev environment; after the tests are finished, I will reproduce the crash and give you a stack trace.

On the mailing list I used

2 choose firstn 6 type rack

so sometimes this step returns only 5 racks, and the final acting list is 5 * 2 = 10 < 11 (what I want), so some PGs get stuck remapped.

Then I made a small change:

2 choose indep 6 type rack // indep vs firstn

and it causes the monitor to crash.

I guess in this step some PGs cannot get 6 racks, only 5, and the result is 5 racks plus a hole because of indep mode. And in the next step

3 chooseleaf firstn 2 type host

CRUSH cannot chooseleaf for the hole, so it crashes.

So can you check the behavior of CRUSH when it handles holes produced by indep mode?
I will add some logs soon.

#3 Updated by Loic Dachary over 5 years ago

Hi,

It should not crash; at most it should give you an error of some kind. Could you please attach to this ticket a dump of the crushmap obtained with

ceph osd getcrushmap > c
crushtool -d c -o c.txt

and the output of ceph osd tree?

#4 Updated by Dong Lei over 5 years ago

Hi Loic,

Log, "ceph osd tree" output, and crush map added.

log:

0> 2014-09-19 09:43:08.462737 7f92d9674700 -1 *** Caught signal (Segmentation fault) **
in thread 7f92d9674700
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
1: /usr/bin/ceph-mon() [0x86c391]
2: /lib64/libpthread.so.0() [0x3012c0f500]
3: (crush_do_rule()+0x38e) [0x834f4e]
4: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0xa6) [0x78ec06]
5: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >, int, unsigned int*) const+0x93) [0x77e903]
6: (OSDMap::_pg_to_up_acting_osds(pg_t, std::vector<int, std::allocator<int> >, int, std::vector<int, std::allocator<int> >, int) const+0x123) [0x77ed43]
7: (PGMonitor::map_pg_creates()+0x25b) [0x624f2b]
8: (PGMonitor::update_from_paxos(bool*)+0xa48) [0x63a598]
9: (PaxosService::refresh(bool*)+0x193) [0x5b1173]
10: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54f557]
11: (Paxos::do_refresh()+0x36) [0x59ed16]
12: (Paxos::begin(ceph::buffer::list&)+0xc46) [0x5a6206]
13: (Paxos::propose_queued()+0x273) [0x5a6773]
14: (Paxos::finish_round()+0x106) [0x5a6aa6]
15: (Paxos::begin(ceph::buffer::list&)+0xc64) [0x5a6224]
16: (Paxos::propose_queued()+0x273) [0x5a6773]
17: (Paxos::propose_new_value(ceph::buffer::list&, Context*)+0x160) [0x5a6960]
18: (PaxosService::propose_pending()+0x386) [0x5b07b6]
19: (Context::complete(int)+0x9) [0x582f69]
20: (SafeTimer::timer_thread()+0x453) [0x763f53]
21: (SafeTimerThread::entry()+0xd) [0x76610d]
22: /lib64/libpthread.so.0() [0x3012c07851]
23: (clone()+0x6d) [0x30128e890d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

crush map and osd tree are attached.

I use crush ruleset 1, "ecpool".

#5 Updated by Loic Dachary over 5 years ago

Could you also please add the output of ceph osd dump? It looks like you have run into http://tracker.ceph.com/issues/9492

#6 Updated by Dong Lei over 5 years ago

Because the monitor crashed and cannot be restarted, I currently cannot get "ceph osd dump".

I checked issue 9492, and I believe this is not the same issue. I have encountered 9492, and when it happens the monitor crashes but can be restarted, because the rule is not successfully set. In my situation the wrong rule is set and the monitor cannot be restarted.

As I said, my original rule is

1 step take default
2 step choose firstn 6 type rack
3 step chooseleaf firstn 2 type host
4 step emit

The rule also has
min_size 11
max_size 11

If I set min_size = 1~5, the monitor will also crash (but can be restarted). This may be related to 9492.

But if I change "firstn" to "indep" in "choose firstn 6 type rack", the monitor crashes forever. That is my issue.

#7 Updated by Loic Dachary over 5 years ago

What probably happens is that you created an erasure code profile with k+m lower than the number of OSDs provided by the crush rule you are trying to use. That would trigger #9492. What is the profile used for the ecpool?

ceph osd erasure-code-profile get theprofile

the size of the erasure coded pool also shows with ceph osd dump, if you can manage to get it to work.

#8 Updated by Loic Dachary over 5 years ago

Could you also attach the log of the monitor crash you are seeing? Note that if you change a crush rule that is currently in use, the result is undefined. You need to first make sure the pools using the rule are removed.

#9 Updated by Dong Lei over 5 years ago

The profile used for the ecpool is K=8 M=3.

If I set min_size = 3, max_size = 12 (the default), the monitor will crash when I set the rule, before I create any pools. I think it fails at rule validation, because CRUSH will test the rule with replica_size = 3 but my rule chooses 6 racks in the first step, and CRUSH somehow crashes.

Then I set min_size = 11; the rule can be set and the pool can be created.

Then I delete all pools and change "firstn" to "indep"; the rule is set successfully.
Then I create a pool with the rule, and the monitor crashes.
(Sometimes it does not crash, and I can check that all the PGs found 6 racks; then I create another pool with the same rule and the monitor crashes, so I guess those PGs cannot find 6 racks and hit a hole.)

I always delete all pools first, then create a rule, and then create a new pool with the rule.

The log is in #4; is that not enough?

#10 Updated by Loic Dachary over 5 years ago

You have K=8 M=3 which means your pool needs 11 OSDs. However the rule you defined will always provide 12 OSDs and you will run into #9492. Does it work if you set K=8 M=4 ?

#11 Updated by Dong Lei over 5 years ago

I know that I need 11 and the rule provides 12; it looks like CRUSH will truncate.

It doesn't seem to be an issue because I don't run into any problems if I set min_size = 11 and use firstn when choosing racks.

The problem is:
1. If I set min_size = 3, I cannot set the rule; the monitor crashes (this doesn't matter, because I can restart the monitor).
2. If I set min_size = 11 and use indep instead of firstn, it crashes forever.
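The interplay being discussed here can be sketched as a toy model (under the reporter's hypothesis, not Ceph's implementation): the rule emits 6 racks x 2 hosts = 12 OSDs, a k=8, m=3 pool keeps the first 11, and while a firstn shortfall merely shortens the list, an indep shortfall leaves a hole inside the 11 slots the pool actually uses:

```python
# Toy model (not Ceph code) of how truncation interacts with holes.
CRUSH_ITEM_NONE = -1

k, m = 8, 3
pool_size = k + m              # 11 shards needed; the rule emits 12 OSDs

# firstn with only 5 racks found: 5 racks x 2 hosts = 10 OSDs.
firstn_osds = list(range(10))
firstn_acting = firstn_osds[:pool_size]   # 10 < 11: PGs stay remapped, no crash

# indep with only 5 racks found: the missing rack becomes a hole, and
# the hole propagates through the per-rack chooseleaf step.
indep_racks = [0, 1, 2, 3, 4, CRUSH_ITEM_NONE]
indep_osds = []
for r in indep_racks:
    if r == CRUSH_ITEM_NONE:
        indep_osds += [CRUSH_ITEM_NONE, CRUSH_ITEM_NONE]
    else:
        indep_osds += [r * 2, r * 2 + 1]  # 2 hosts per rack

indep_acting = indep_osds[:pool_size]     # a hole lands inside the 11 shards
print(firstn_acting)
print(indep_acting)
```

This is why truncating the firstn result from 12 to 11 is harmless, while the indep variant hands a hole to code that, per this report, did not expect one.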

#12 Updated by Loic Dachary over 5 years ago

Could you please let me know if it always works with K=8 M=4?

#13 Updated by Dong Lei over 5 years ago

K=8 M=4 doesn't work.

I rebuilt the cluster and did the following steps.

(delete all pools)
1. create a profile with K=8 M=4.

2. Create a pool with the profile and delete the pool. (This creates a rule in the crushmap.)

3. Dump the crush map and modify the crush rule to
rule ecpool {
ruleset 1
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take default
step choose firstn 6 type rack
step chooseleaf firstn 2 type osd
step emit
}

The monitor crashes. // This is the first problem

4. Restart the monitor, and change
min_size 12
max_size 12

Compile and set the crushmap: OK.

5. dump the crushmap, modify it to
choose indep 6 type rack.

set the crushmap.

The monitor crashes. Log:
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
1: /usr/bin/ceph-mon() [0x86c391]
2: /lib64/libpthread.so.0() [0x3012c0f500]
3: (crush_do_rule()+0x38e) [0x834f4e]
4: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0xa6) [0x78ec06]
5: (CrushTester::test()+0xecc) [0x7990ec]
6: (OSDMonitor::prepare_command_impl(MMonCommand*, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >&)+0xff6) [0x5deb36]
7: (OSDMonitor::prepare_command(MMonCommand*)+0x2cf) [0x5ebc4f]
8: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x26b) [0x5ec08b]
9: (PaxosService::dispatch(PaxosServiceMessage*)+0xa4a) [0x5b1e7a]
10: (Monitor::handle_command(MMonCommand*)+0xde0) [0x579a70]
11: (Monitor::dispatch(MonSession*, Message*, bool)+0x3ca) [0x580e5a]
12: (Monitor::_ms_dispatch(Message*)+0x20e) [0x58139e]
13: (Monitor::ms_dispatch(Message*)+0x32) [0x59d572]
14: (DispatchQueue::entry()+0x5a2) [0x843542]
15: (DispatchQueue::DispatchThread::entry()+0xd) [0x83e87d]
16: /lib64/libpthread.so.0() [0x3012c07851]
17: (clone()+0x6d) [0x30128e890d]

6. Restart the monitor; this time it restarts successfully.

// I haven't created any pool here; just setting the crushmap makes the monitor crash.

#14 Updated by Loic Dachary over 5 years ago

Thanks for the detailed instructions. I'll try them to reproduce the problem.

#15 Updated by Dong Lei over 5 years ago

Thanks so much.

BTW:
I reproduced this in my dev environment with 60 OSDs on one host, arranged into 6 virtual racks (you can see this in the "ceph osd tree" output I attached earlier).

I hope this helps you set up a test environment easily.

#16 Updated by Loic Dachary over 5 years ago

  • Status changed from Need More Info to In Progress

Did not forget about it, just busy with other things.

#17 Updated by Loic Dachary over 5 years ago

Did not forget about it, just busy with other things (the OpenStack summit after the Giant release).

#18 Updated by Loic Dachary over 5 years ago

  • Status changed from In Progress to 12

#19 Updated by Panayiotis Gotsis over 5 years ago

Hello, I can verify that I am facing the same problem.

After trying to edit the crushmap in order to separate groups of OSDs according to their disk technology, my mons failed to restart after a needed restart.

Using ceph-monstore-tool I managed to extract the crushmap from the store.db, and I can see that a ruleset refers to a non-existing root object. I cannot say for sure that I set up my crushmap properly through the edit, and it seems the problem of the "lost" reference is related to that. But I cannot find a way to reset the offline store.db with a fixed crushmap.

I am even checking the source code, and I see that MonitorDBStore has only get methods. The put methods are part of the class's Transaction struct, and I have too little experience with the source code to understand whether this structure can be used to initiate a transaction to update an offline store.db.

Having a way of fixing an offline store.db that is broken by a bad crushmap would be nice :D

#21 Updated by Panayiotis Gotsis over 5 years ago

For the attached link, this is the result of the command (crushtool as supplied by the Debian packages, 0.80):

http://pastie.org/9764824

#22 Updated by Panayiotis Gotsis over 5 years ago

For the attached link, this is the result of the command (crushtool as compiled from the git tree with --with-debug, 0.89):

http://pastie.org/9764828

#24 Updated by Sage Weil over 5 years ago

I've fixed Panayiotis's issue, but it is different from the original bug.

Dong Lei, I've tried to reproduce this but can't make it happen. Is the attached c.txt the exact map that triggers the mon crash?

Are you able to make crushtool crash with

crushtool -c c.txt -o map
crushtool -i map --test

? This is what I'm trying to do to reproduce and I'm failing. :(

#25 Updated by Panayiotis Gotsis over 5 years ago

Just to offer some debriefing on the issue.

After installing the patch, I managed to get the monitor up and running. I noticed however that the monitor started failing after a couple of minutes so I restarted and quickly injected the correct crushmap.

With this action my problem was solved.

Thanks a lot

#26 Updated by Dong Lei over 5 years ago

Hi sage:

According to my earlier tests, crushtool may not be able to reproduce the crash. I remember that crushtool returns 10 OSDs if it can only find 5 racks when trying to find 6.

Loic has marked this issue as verified. Could you talk to him to find out how he reproduces it?

Sage Weil wrote:

I've fixed Panayiotis's issue, but it is different than the original bug.

Dong Lei, I've tried to reproduce this but can't make it happen. Is the attached c.txt the exact map that triggers the mon crash?

Are you able to make crushtool crash with

crushtool -c c.txt -o map
crushtool -i map --test

? This is what I'm trying to do to reproduce and I'm failing. :(

#27 Updated by Loic Dachary over 5 years ago

Although I've marked the issue as verified, I did not actually get to reproduce it. I meant to, a number of times, using the procedure you provided, but never did. My mistake.

#28 Updated by Dong Lei over 5 years ago

But you do understand that when CRUSH cannot find enough racks in indep mode, things go wrong and the bad rule is written to the db, right?

Loic Dachary wrote:

Although I've marked the issue as verified, I did not actually get to reproduce it. I meant to a number of times using the procedure you provided but did not. My mistake.

#29 Updated by Sage Weil over 5 years ago

  • Priority changed from Normal to High

#30 Updated by Samuel Just about 5 years ago

  • Status changed from 12 to Resolved

Also available in: Atom PDF