Bug #11373
Status: Closed
OSD crash in OSDService::get_map
Added by Ilja Slepnev about 9 years ago. Updated about 9 years ago.
Description
CentOS 7.1, packages from http://ceph.com/rpm-hammer/el7/
Upgraded MONs from 0.87.1 to 0.94, restarted MONs, HEALTH_OK.
Upgraded OSDs and restarted them - none came up; all crashed on start.
OSD store_version is 4.
See attached log. It looks like bug #6430.
Files
osd-log.txt (26.8 KB) — Ilja Slepnev, 04/11/2015 06:09 PM
Updated by Kefu Chai about 9 years ago
The backtrace:
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) ** in thread 7f4bf83a7880
ceph version 0.94 (e61c4f093f88e44961d157f65091733580cea79a)
 1: ceph-osd() [0xac51c2]
 2: (()+0xf130) [0x7f4bf6d3e130]
 3: (gsignal()+0x37) [0x7f4bf57585d7]
 4: (abort()+0x148) [0x7f4bf5759cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f4bf605c9b5]
 6: (()+0x5e926) [0x7f4bf605a926]
 7: (()+0x5e953) [0x7f4bf605a953]
 8: (()+0x5eb73) [0x7f4bf605ab73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc538a]
 10: (OSDService::get_map(unsigned int)+0x3f) [0x6ff77f]
 11: (OSD::load_pgs()+0x17c9) [0x6b7479]
 12: (OSD::init()+0x729) [0x6b8b99]
 13: (main()+0x27f3) [0x643b63]
 14: (__libc_start_main()+0xf5) [0x7f4bf5744af5]
 15: ceph-osd() [0x65cdc9]
2015-04-11 20:59:43.350376 7f4bf83a7880 -1 *** Caught signal (Aborted) ** in thread 7f4bf83a7880
Most recent log entries:
-13> 2015-04-11 20:59:43.320175 7f4bf83a7880  2 osd.30 0 boot
-12> 2015-04-11 20:59:43.322833 7f4bf83a7880  1 <cls> cls/refcount/cls_refcount.cc:231: Loaded refcount class!
-11> 2015-04-11 20:59:43.322946 7f4bf83a7880  1 <cls> cls/replica_log/cls_replica_log.cc:141: Loaded replica log class!
-10> 2015-04-11 20:59:43.323048 7f4bf83a7880  1 <cls> cls/statelog/cls_statelog.cc:306: Loaded log class!
-9> 2015-04-11 20:59:43.323386 7f4bf83a7880  1 <cls> cls/log/cls_log.cc:312: Loaded log class!
-8> 2015-04-11 20:59:43.325417 7f4bf83a7880  1 <cls> cls/rgw/cls_rgw.cc:3046: Loaded rgw class!
-7> 2015-04-11 20:59:43.325541 7f4bf83a7880  1 <cls> cls/version/cls_version.cc:227: Loaded version class!
-6> 2015-04-11 20:59:43.325667 7f4bf83a7880  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
-5> 2015-04-11 20:59:43.325767 7f4bf83a7880  1 <cls> cls/user/cls_user.cc:367: Loaded user class!
-4> 2015-04-11 20:59:43.326800 7f4bf83a7880  0 osd.30 28642 crush map has features 1107558400, adjusting msgr requires for clients
-3> 2015-04-11 20:59:43.326810 7f4bf83a7880  0 osd.30 28642 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
-2> 2015-04-11 20:59:43.326817 7f4bf83a7880  0 osd.30 28642 crush map has features 1107558400, adjusting msgr requires for osds
-1> 2015-04-11 20:59:43.326833 7f4bf83a7880  0 osd.30 28642 load_pgs
0> 2015-04-11 20:59:43.346451 7f4bf83a7880 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f4bf83a7880 time 2015-04-11 20:59:43.344831
osd/OSD.h: 716: FAILED assert(ret)
Updated by Ilja Slepnev about 9 years ago
Enabled more debugging. Osdmap epoch 25584 is missing from the store.
What could be the reason for the osdmap loss, and how can it be worked around with minimal data loss?
Startup log from OSD.58:
-9> 2015-04-13 23:40:29.333979 7f4fc5205880 10 osd.58 28557 pgid 5.1 coll 5.1_head
-8> 2015-04-13 23:40:29.333993 7f4fc5205880 15 filestore(/var/lib/ceph/osd/ceph-58) omap_get_values 5.1_head/1//head//5
-7> 2015-04-13 23:40:29.334021 7f4fc5205880 15 filestore(/var/lib/ceph/osd/ceph-58) collection_getattr /var/lib/ceph/osd/ceph-58/current/5.1_head 'info'
-6> 2015-04-13 23:40:29.334036 7f4fc5205880 10 filestore(/var/lib/ceph/osd/ceph-58) collection_getattr /var/lib/ceph/osd/ceph-58/current/5.1_head 'info' = 1
-5> 2015-04-13 23:40:29.334048 7f4fc5205880 15 filestore(/var/lib/ceph/osd/ceph-58) omap_get_values meta/16ef7597/infos/head//-1
-4> 2015-04-13 23:40:29.334297 7f4fc5205880 20 osd.58 0 get_map 25584 - loading and decoding 0x5618000
-3> 2015-04-13 23:40:29.334308 7f4fc5205880 15 filestore(/var/lib/ceph/osd/ceph-58) read meta/4eb33da9/osdmap.25584/0//-1 0~0
-2> 2015-04-13 23:40:29.334339 7f4fc5205880 10 filestore(/var/lib/ceph/osd/ceph-58) error opening file /var/lib/ceph/osd/ceph-58/current/meta/DIR_9/DIR_A/osdmap.25584__0_4EB33DA9__none with flags=2: (2) No such file or directory
-1> 2015-04-13 23:40:29.334353 7f4fc5205880 10 filestore(/var/lib/ceph/osd/ceph-58) FileStore::read(meta/4eb33da9/osdmap.25584/0//-1) open error: (2) No such file or directory
0> 2015-04-13 23:40:29.336211 7f4fc5205880 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f4fc5205880 time 2015-04-13 23:40:29.334365
osd/OSD.h: 716: FAILED assert(ret)
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (OSDService::get_map(unsigned int)+0x3f) [0x6ff77f]
 3: (OSD::load_pgs()+0x17c9) [0x6b7479]
 4: (OSD::init()+0x729) [0x6b8b99]
 5: (main()+0x27f3) [0x643b63]
 6: (__libc_start_main()+0xf5) [0x7f4fc25a5af5]
 7: ceph-osd() [0x65cdc9]
Another log, from OSD.59:
-9> 2015-04-13 23:35:16.834060 7f3db534c880 10 osd.59 28521 pgid 5.13 coll 5.13_head
-8> 2015-04-13 23:35:16.834078 7f3db534c880 15 filestore(/var/lib/ceph/osd/ceph-59) omap_get_values 5.13_head/13//head//5
-7> 2015-04-13 23:35:16.834120 7f3db534c880 15 filestore(/var/lib/ceph/osd/ceph-59) collection_getattr /var/lib/ceph/osd/ceph-59/current/5.13_head 'info'
-6> 2015-04-13 23:35:16.834142 7f3db534c880 10 filestore(/var/lib/ceph/osd/ceph-59) collection_getattr /var/lib/ceph/osd/ceph-59/current/5.13_head 'info' = 1
-5> 2015-04-13 23:35:16.834154 7f3db534c880 15 filestore(/var/lib/ceph/osd/ceph-59) omap_get_values meta/16ef7597/infos/head//-1
-4> 2015-04-13 23:35:16.834512 7f3db534c880 20 osd.59 0 get_map 25584 - loading and decoding 0x4f08000
-3> 2015-04-13 23:35:16.834527 7f3db534c880 15 filestore(/var/lib/ceph/osd/ceph-59) read meta/4eb33da9/osdmap.25584/0//-1 0~0
-2> 2015-04-13 23:35:16.834569 7f3db534c880 10 filestore(/var/lib/ceph/osd/ceph-59) error opening file /var/lib/ceph/osd/ceph-59/current/meta/DIR_9/DIR_A/osdmap.25584__0_4EB33DA9__none with flags=2: (2) No such file or directory
-1> 2015-04-13 23:35:16.834590 7f3db534c880 10 filestore(/var/lib/ceph/osd/ceph-59) FileStore::read(meta/4eb33da9/osdmap.25584/0//-1) open error: (2) No such file or directory
0> 2015-04-13 23:35:16.837324 7f3db534c880 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f3db534c880 time 2015-04-13 23:35:16.834606
osd/OSD.h: 716: FAILED assert(ret)
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (OSDService::get_map(unsigned int)+0x3f) [0x6ff77f]
 3: (OSD::load_pgs()+0x17c9) [0x6b7479]
 4: (OSD::init()+0x729) [0x6b8b99]
 5: (main()+0x27f3) [0x643b63]
 6: (__libc_start_main()+0xf5) [0x7f3db26ecaf5]
 7: ceph-osd() [0x65cdc9]
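Both OSDs fail on the same epoch (25584): get_map tries to read the on-disk file meta/.../osdmap.25584__0_..., the FileStore read returns ENOENT, and the OSD asserts. A quick way to confirm which osdmap epochs a FileStore OSD actually holds, and where the gap is, might be the following sketch (the directory layout is taken from the log paths above; the function name and the OSD data-dir argument are my own, not a Ceph tool):

```shell
# Sketch: scan an OSD's FileStore meta directory for osdmap.<epoch>__... files
# and report any gaps in the epoch sequence. Run against a stopped OSD's data
# dir, e.g.: list_osdmap_gaps /var/lib/ceph/osd/ceph-58
list_osdmap_gaps() {
  # $1: OSD data directory
  find "$1/current/meta" -name 'osdmap.*__*' 2>/dev/null \
    | sed 's/.*osdmap\.\([0-9][0-9]*\)__.*/\1/' \
    | sort -n | uniq \
    | awk 'NR > 1 && $1 != prev + 1 { print "missing epochs " (prev + 1) "-" ($1 - 1) } { prev = $1 }'
}
```

If the output names the same missing epoch range on every crashing OSD, that matches the single missing map (25584) seen in the logs.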
Updated by Ilja Slepnev about 9 years ago
Found a workaround.
Some time ago, before Giant, I deleted some pools. I now see that their data was never erased from the OSDs for some reason.
Up to Hammer this was not a problem, but after the upgrade to Hammer the OSD daemons failed to start.
I moved the old, unused data to a safe place.
The OSDs then started successfully. All PGs are active+clean.
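The workaround described above can be sketched roughly as follows. This is a hedged illustration, not a Ceph tool: the function name is mine, pool id 5 is inferred from the 5.*_head collections in the logs, and you must first verify the pool really is deleted (e.g. it no longer appears in `ceph osd lspools`) and stop the OSD, since moving PG directories of a live pool would lose data:

```shell
# Sketch: move PG directories belonging to an already-deleted pool out of a
# stopped OSD's current/ directory, so the OSD no longer tries to load those
# PGs (and the missing osdmap epoch) on startup.
# Example: move_stale_pgs /var/lib/ceph/osd/ceph-58 5 /root/stale-pgs-osd58
move_stale_pgs() {
  # $1: OSD data dir   $2: deleted pool id   $3: backup destination
  osd_dir=$1; pool=$2; backup=$3
  mkdir -p "$backup"
  for d in "$osd_dir"/current/"$pool".*_head; do
    [ -e "$d" ] || continue   # glob matched nothing; skip
    mv "$d" "$backup"/
  done
}
```

Keeping the moved directories in a backup location (rather than deleting them) leaves a way back if a PG turns out to still be referenced.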
Updated by Samuel Just about 9 years ago
- Status changed from New to Duplicate
I think you are quite right. The original bug is #10617, and I opened #11429 to handle upgrades from OSDs that hit this bug.