Bug #12047: monitor segmentation fault on faulty crushmap
Status: Closed (Duplicate)
Description
Hi,
I accidentally removed a root bucket ("osd crush remove platter", root=platter). The platter bucket was still used by a ruleset, and the ruleset was still in use by a pool. My issue looks similar to #9485.
Outline of the crushmap:

root platter {
	id -1		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item platter-rack1 weight 4.000
	item platter-rack2 weight 4.000
	item platter-rack3 weight 4.000
}

rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take platter
	step chooseleaf firstn 0 type rack
	step emit
}
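The failure above boils down to a rule whose "step take" references a bucket that no longer exists. A minimal sketch of that consistency check follows; the dict layout and function name are illustrative only, not the real CrushWrapper/crushtool structures:

```python
# Sketch: detect rules whose 'take' step references a missing bucket.
# The data layout here is illustrative; the real crush map is a binary
# structure handled by CrushWrapper and crushtool.

def find_dangling_takes(buckets, rules):
    """Return (rule_name, bucket_name) pairs where a 'take' step
    references a bucket that is not in the map."""
    known = set(buckets)
    dangling = []
    for rule in rules:
        for step in rule["steps"]:
            if step[0] == "take" and step[1] not in known:
                dangling.append((rule["name"], step[1]))
    return dangling

# 'platter' was removed, but replicated_ruleset still takes it:
buckets = ["default", "platter-rack1"]
rules = [{"name": "replicated_ruleset",
          "steps": [("take", "platter"),
                    ("chooseleaf", "rack"),
                    ("emit",)]}]
print(find_dangling_takes(buckets, rules))  # [('replicated_ruleset', 'platter')]
```

A map that passes this kind of check is safe for crush_do_rule; a dangling take is what later crashes the monitors and OSDs.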
So far I have not been able to get the monitors up and running again.
Updated by Kefu Chai almost 9 years ago
- Status changed from New to Duplicate
Jonas, it should be addressed by the latest master branch, and the fix will be backported to the next hammer release.
Updated by Jonas Weismüller almost 9 years ago
Hi Kefu,
thanks a lot for the information. Is there any chance to manually recover the cluster?
Or will the new release fix it automatically?
I am also willing to test the new fix. Is there already a pre-built deb package to test with? Or do I have to manually build it?
Cheers Jonas
Updated by Kefu Chai almost 9 years ago
Jonas Weismüller wrote:
Hi Kefu,
thanks a lot for the information. Is there any chance to manually recover the cluster?
I'd say "yes, but it's non-trivial". Is your cluster in production?
Or will the new release fix it automatically?
See #11815: we plan to have a CLI tool to help users fix it manually, and I am working on it. You can watch that ticket if you want to stay updated.
I am also willing to test the new fix. Is there already a pre-built deb package to test with? Or do I have to manually build it?
The fix I mentioned in #12047-1 will prevent you from injecting a faulty crush map, but it won't bring your monitors back up.
Cheers Jonas
Updated by Jonas Weismüller almost 9 years ago
Luckily it is not a production cluster, it is a development cluster. Still, I would prefer to get the cluster fixed and up and running again.
So could I be an alpha tester of your CLI tool? I would really love to help. The only thing is that I need my development cluster fairly soon for further testing, so it depends on how long the alpha testing would take.
Updated by Kefu Chai almost 9 years ago
Jonas Weismüller wrote:
So I could be an alpha tester of your cli tool? I would really love to offer my help.
Jonas, sorry to get back to you late. Thanks! I pulled together a script at https://github.com/ceph/ceph/pull/5052 .
But this patch also includes a change to a CLI tool, so you will need to recompile or download the latest package from one of our gitbuilders. I will update you once the package is built.
The only thing is that I need my development cluster fairly soon for further testing, so it depends on how long the alpha testing would take.
in minutes =)
Updated by Jonas Weismüller almost 9 years ago
I can wait for the package. Thanks a lot so far.
Updated by Kefu Chai almost 9 years ago
Jonas, the packages are ready. May I ask what your distro/release/arch is?
The repo addresses for the packages under testing differ from distro to distro. For example:
ubuntu precise amd64: http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/sha1/b8795313b2d0028f1e71b584bb28e014b5e06984/pool/main/c/ceph/
Please note, the patch is still under testing; it seems there are cases where the mon store does not get fixed. I am looking into it.
Updated by Jonas Weismüller almost 9 years ago
# lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 7.8 (wheezy)
Release:	7.8
Codename:	wheezy

# uname -r
3.2.0-4-amd64
Updated by Kefu Chai almost 9 years ago
jonas, just fixed the bug!
the package will be soon ready at http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/sha1/0e78e54e07e637fa458adaae6165b7259c79e089/pool/main/c/ceph/ .
will update you once it's done.
Updated by Kefu Chai almost 9 years ago
The package is ready. You need to install ceph-test_9.0.1-1080-g0e78e54-1wheezy_amd64.deb from the repo; /usr/lib/ceph/ceph-monstore-update-crush.sh will then help you with the broken mon store.
Updated by Jonas Weismüller almost 9 years ago
Which additional ceph packages do I have to update as well?
Can you give me a short hint which commands to execute?
I tried the following, which ends up in a loop:
# /usr/lib/ceph/ceph-monstore-update-crush.sh --rewrite /var/lib/ceph/mon/ceph-vs64/
no action specified; -h for help
no action specified; -h for help
no action specified; -h for help
no action specified; -h for help
no action specified; -h for help
Updated by Kefu Chai almost 9 years ago
no action specified
This is printed by crushtool, which recently received updates enabling it to check the completeness of a crush map.
It is packaged in "ceph". The most recent build of my fix can be found at http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/sha1/4f4c8fe40e3912cb895d5f9cd4d40b085b3e12ab/pool/main/c/ceph/ . You can extract crushtool from the Debian package using dpkg-deb without the risk of installing it.
Updated by Jonas Weismüller almost 9 years ago
- File store.db.tar.bz2 added
It does not revert to a previous monmap:
# /usr/lib/ceph/ceph-monstore-update-crush.sh --mon-store /var/lib/ceph/mon/ceph-vs64/ --rewrite
good crush map found at epoch 1547/1547 and mon store has no faulty crush maps.
Attached you will find the store.db folder as a tar.bz2 file.
Updated by Jonas Weismüller almost 9 years ago
I downloaded:
http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/sha1/0e78e54e07e637fa458adaae6165b7259c79e089/pool/main/c/ceph/ceph_9.0.1-1080-g0e78e54-1wheezy_amd64.deb
http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/sha1/0e78e54e07e637fa458adaae6165b7259c79e089/pool/main/c/ceph/ceph-test_9.0.1-1080-g0e78e54-1wheezy_amd64.deb
Then I replaced crushtool, osdmaptool, and ceph-monstore-tool in /usr/bin on the leader monitor. I extracted the crushmap and fixed the faulty references to non-existing buckets (see diff):
# ceph-monstore-tool /var/lib/ceph/mon/ceph-vs64/store.db/ get osdmap -- -v 1547 -o /tmp/osdmap.1547
# ceph-monstore-tool /var/lib/ceph/mon/ceph-vs64/ get osdmap -- -v 1547 -o /tmp/osdmap.1547
# osdmaptool --export-crush /tmp/crush.1547 /tmp/osdmap.1547
# crushtool --decompile /tmp/crush.1547 --outfn /tmp/crush.1547.src_broken
# cp /tmp/crush.1547.src_broken /tmp/crush.1547.src
# diff /tmp/crush.1547.src_broken /tmp/crush.1547.src
148c148
< step take bucket0
---
> step take default
159c159
< step take bucket0
---
> step take default
168c168
< step take bucket15
---
> step take default
177c177
< step take bucket15
---
> step take default
180c180
< step take bucket0
---
> step take default
I recompiled the crushmap and injected it back to the leader monitor.
# crushtool -c /tmp/crush.1547.src -o /tmp/crush.1547.src_compiled
# /usr/bin/ceph-monstore-tool /var/lib/ceph/mon/ceph-vs64/ rewrite-crush -- --crush /tmp/crush.1547.src_compiled --good-epoch 1546
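The hand edit above points every dangling "step take" at an existing bucket. A rough sketch of the same rewrite applied to the decompiled crushmap text (illustrative only, not part of the Ceph tooling):

```python
import re

def retarget_dangling_takes(crush_src, valid_buckets, fallback="default"):
    """Rewrite 'step take <bucket>' lines whose bucket is unknown so they
    take `fallback` instead. Operates on decompiled crushmap text, the
    format produced by 'crushtool --decompile'."""
    def fix(match):
        bucket = match.group(1)
        if bucket in valid_buckets:
            return match.group(0)        # reference is fine, keep it
        return f"\tstep take {fallback}"  # dangling, retarget it
    return re.sub(r"^\s*step take (\S+)$", fix, crush_src, flags=re.MULTILINE)

# 'bucket0' no longer exists in the map, so it is retargeted to 'default':
src = "rule r {\n\tstep take bucket0\n\tstep emit\n}\n"
print(retarget_dangling_takes(src, {"default"}))
```

The result still has to be recompiled with crushtool and injected with ceph-monstore-tool as shown above.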
Then I restarted all other monitors and ended up in a "healthy" monitor state:
    cluster 76632a36-a0fe-4a3a-b8c0-b08cec03efd0
     health HEALTH_WARN
            103 pgs degraded
            16 pgs down
            42 pgs peering
            223 pgs stale
            103 pgs stuck degraded
            60 pgs stuck inactive
            223 pgs stuck stale
            512 pgs stuck unclean
            103 pgs stuck undersized
            103 pgs undersized
            recovery 1429/12927 objects degraded (11.054%)
            recovery 10423/12927 objects misplaced (80.630%)
            too few PGs per OSD (21 < min 30)
     monmap e1: 5 mons at {vs64=10.1.128.6:6789/0,vs65=10.1.128.7:6789/0,vs66=10.1.128.8:6789/0,vs67=10.1.128.9:6789
            election epoch 224, quorum 0,1,2,3,4 vs64,vs65,vs66,vs67,vs68
     osdmap e1548: 24 osds: 24 up, 24 in; 289 remapped pgs
      pgmap v41595: 512 pgs, 1 pools, 4309 MB data, 4309 objects
            14193 MB used, 2382 GB / 2396 GB avail
            1429/12927 objects degraded (11.054%)
            10423/12927 objects misplaced (80.630%)
                 258 active+remapped
                  91 stale+active+remapped
                  58 stale+active+undersized+degraded+remapped
                  31 active+undersized+degraded+remapped
                  26 stale+remapped+peering
                  18 stale+remapped
                  14 stale+down+remapped+peering
                  14 stale+active+undersized+degraded
                   2 stale+down+peering
Updated by Jonas Weismüller almost 9 years ago
Now I have the problem that the PGs are not recovering. This is probably because approx. half of the OSDs are crashing right now:
2015-06-30 13:43:00.878364 7f9feb23d840  0 osd.5 1547 load_pgs
2015-06-30 13:43:01.076863 7f9feb23d840  0 osd.5 1547 load_pgs opened 43 pgs
2015-06-30 13:43:01.077486 7f9feb23d840 -1 osd.5 1547 log_to_monitors {default=true}
2015-06-30 13:43:01.086179 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
2015-06-30 13:43:01.089618 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
2015-06-30 13:43:02.086065 7f9fccea8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fccea8700

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7f9fea1500a0]
 3: /usr/bin/ceph-osd() [0xd5c934]
 4: (crush_do_rule()+0x390) [0xd5d570]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x8b) [0xcb146b]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x7c) [0xc9a68c]
 7: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x10f) [0xc9a82f]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x1e2) [0x7d1dc2]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x262) [0x7d2cd2]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x8385c4]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccce79]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xccee70]
 13: (()+0x6b50) [0x7f9fea147b50]
 14: (clone()+0x6d) [0x7f9fe8b6395d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -31> 2015-06-30 13:43:00.127913 7f9feb23d840  5 asok(0x50be000) register_command perfcounters_dump hook 0x5082050
   -30> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command 1 hook 0x5082050
   -29> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf dump hook 0x5082050
   -28> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perfcounters_schema hook 0x5082050
   -27> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command 2 hook 0x5082050
   -26> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf schema hook 0x5082050
   -25> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf reset hook 0x5082050
   -24> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command config show hook 0x5082050
   -23> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command config set hook 0x5082050
   -22> 2015-06-30 13:43:00.128058 7f9feb23d840  5 asok(0x50be000) register_command config get hook 0x5082050
   -21> 2015-06-30 13:43:00.128060 7f9feb23d840  5 asok(0x50be000) register_command config diff hook 0x5082050
   -20> 2015-06-30 13:43:00.128067 7f9feb23d840  5 asok(0x50be000) register_command log flush hook 0x5082050
   -19> 2015-06-30 13:43:00.128067 7f9feb23d840  5 asok(0x50be000) register_command log dump hook 0x5082050
   -18> 2015-06-30 13:43:00.128085 7f9feb23d840  5 asok(0x50be000) register_command log reopen hook 0x5082050
   -17> 2015-06-30 13:43:00.130445 7f9feb23d840  0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 2947
   -16> 2015-06-30 13:43:00.132297 7f9feb23d840  1 finished global_init_daemonize
   -15> 2015-06-30 13:43:00.152866 7f9feb23d840  0 filestore(/var/lib/ceph/osd/ceph-5) backend xfs (magic 0x58465342)
   -14> 2015-06-30 13:43:00.244827 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: FIEMAP ioctl is supported and appears to work
   -13> 2015-06-30 13:43:00.244853 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -12> 2015-06-30 13:43:00.604742 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: syscall(SYS_syncfs, fd) fully supported
   -11> 2015-06-30 13:43:00.604855 7f9feb23d840  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: disabling extsize, kernel 3.2.0-4-amd64 is older than 3.5 and has buggy extsize ioctl
   -10> 2015-06-30 13:43:00.714798 7f9feb23d840  0 filestore(/var/lib/ceph/osd/ceph-5) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
    -9> 2015-06-30 13:43:00.869904 7f9feb23d840  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
    -8> 2015-06-30 13:43:00.877680 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400, adjusting msgr requires for clients
    -7> 2015-06-30 13:43:00.878332 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
    -6> 2015-06-30 13:43:00.878332 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400, adjusting msgr requires for osds
    -5> 2015-06-30 13:43:00.878364 7f9feb23d840  0 osd.5 1547 load_pgs
    -4> 2015-06-30 13:43:01.076863 7f9feb23d840  0 osd.5 1547 load_pgs opened 43 pgs
    -3> 2015-06-30 13:43:01.077486 7f9feb23d840 -1 osd.5 1547 log_to_monitors {default=true}
    -2> 2015-06-30 13:43:01.086179 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
    -1> 2015-06-30 13:43:01.089618 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
     0> 2015-06-30 13:43:02.086065 7f9fccea8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fccea8700

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7f9fea1500a0]
 3: /usr/bin/ceph-osd() [0xd5c934]
 4: (crush_do_rule()+0x390) [0xd5d570]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x8b) [0xcb146b]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x7c) [0xc9a68c]
 7: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x10f) [0xc9a82f]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x1e2) [0x7d1dc2]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x262) [0x7d2cd2]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x8385c4]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccce79]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xccee70]
 13: (()+0x6b50) [0x7f9fea147b50]
 14: (clone()+0x6d) [0x7f9fe8b6395d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Jonas Weismüller almost 9 years ago
I also rebooted all of the OSD nodes, but half of them are still crashing (DOWN).
# ceph osd tree
ID  WEIGHT   TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-22 24.00000 root default
-28  8.00000     rack rack1
-23  4.00000         host vs72
 23  1.00000             osd.23      up  1.00000          1.00000
 22  1.00000             osd.22    down        0          1.00000
 20  1.00000             osd.20    down        0          1.00000
 21  1.00000             osd.21    down        0          1.00000
-21  4.00000         host vs73
 12  1.00000             osd.12      up  1.00000          1.00000
  0  1.00000             osd.0     down        0          1.00000
  8  1.00000             osd.8       up  1.00000          1.00000
  1  1.00000             osd.1     down        0          1.00000
-29  8.00000     rack rack2
-24  4.00000         host vs74
 10  1.00000             osd.10      up  1.00000          1.00000
 13  1.00000             osd.13      up  1.00000          1.00000
  7  1.00000             osd.7     down        0          1.00000
  3  1.00000             osd.3     down        0          1.00000
-25  4.00000         host vs75
 14  1.00000             osd.14      up  1.00000          1.00000
  9  1.00000             osd.9       up  1.00000          1.00000
  6  1.00000             osd.6     down        0          1.00000
  4  1.00000             osd.4     down        0          1.00000
-30  8.00000     rack rack3
-26  4.00000         host vs76
 15  1.00000             osd.15      up  1.00000          1.00000
 11  1.00000             osd.11      up  1.00000          1.00000
  5  1.00000             osd.5     down        0          1.00000
  2  1.00000             osd.2     down        0          1.00000
-27  4.00000         host vs77
 19  1.00000             osd.19      up  1.00000          1.00000
 18  1.00000             osd.18      up  1.00000          1.00000
 16  1.00000             osd.16    down        0          1.00000
 17  1.00000             osd.17    down        0          1.00000
Updated by Jonas Weismüller almost 9 years ago
Opened #12193 to further track the osdmap issue.
Updated by Kefu Chai almost 9 years ago
Jonas Weismüller wrote:
Hi,
I accidentally removed a root bucket ("osd crush remove platter", root=platter). The platter bucket was still used by a ruleset, and the ruleset was still in use by a pool. My issue looks similar to #9485.
Weird; the monitor should fail this command:

# ceph osd crush rm default
Error EBUSY: (16) Device or resource busy
The above is the output from my vstart cluster. I removed all descendants of the "default" bucket before issuing this command.
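The EBUSY refusal can be sketched as a reference check performed before removal; the function and data layout below are illustrative, not the actual monitor code:

```python
import errno

def crush_remove_bucket(buckets, rules, name):
    """Remove `name` from the bucket list, but refuse (EBUSY) while any
    rule still takes it. Mirrors the guard the monitor is expected to
    apply; illustrative data layout, not Ceph's internal structures."""
    in_use = any(step[0] == "take" and step[1] == name
                 for rule in rules for step in rule["steps"])
    if in_use:
        return -errno.EBUSY  # "Error EBUSY: (16) Device or resource busy"
    buckets.remove(name)
    return 0

buckets = ["default", "platter"]
rules = [{"name": "replicated_ruleset",
          "steps": [("take", "platter"), ("emit",)]}]
print(crush_remove_bucket(buckets, rules, "platter"))  # -16 (EBUSY)
```

The original report suggests this guard was bypassed somewhere, which is how the dangling reference ended up in the map in the first place.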