Bug #11373


OSD crash in OSDService::get_map

Added by Ilja Slepnev about 9 years ago. Updated about 9 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

CentOS 7.1, packages from http://ceph.com/rpm-hammer/el7/
Upgraded MONs from 0.87.1 to 0.94 and restarted them; cluster reported HEALTH_OK.
Upgraded the OSDs and restarted them - none came up, all crashed on start.
OSD store_version is 4.

See attached log. It looks like bug #6430.


Files

osd-log.txt (26.8 KB), Ilja Slepnev, 04/11/2015 06:09 PM

Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #11429: OSD::load_pgs: we need to handle the case where an upgrade from earlier versions which ignored non-existent pgs resurrects a pg with a prehistoric osdmap (Resolved, Samuel Just, 04/20/2015)

Actions #1

Updated by Kefu Chai about 9 years ago

The backtrace:

terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f4bf83a7880
 ceph version 0.94 (e61c4f093f88e44961d157f65091733580cea79a)
 1: ceph-osd() [0xac51c2]
 2: (()+0xf130) [0x7f4bf6d3e130]
 3: (gsignal()+0x37) [0x7f4bf57585d7]
 4: (abort()+0x148) [0x7f4bf5759cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f4bf605c9b5]
 6: (()+0x5e926) [0x7f4bf605a926]
 7: (()+0x5e953) [0x7f4bf605a953]
 8: (()+0x5eb73) [0x7f4bf605ab73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc538a]
 10: (OSDService::get_map(unsigned int)+0x3f) [0x6ff77f]
 11: (OSD::load_pgs()+0x17c9) [0x6b7479]
 12: (OSD::init()+0x729) [0x6b8b99]
 13: (main()+0x27f3) [0x643b63]
 14: (__libc_start_main()+0xf5) [0x7f4bf5744af5]
 15: ceph-osd() [0x65cdc9]
2015-04-11 20:59:43.350376 7f4bf83a7880 -1 *** Caught signal (Aborted) **
 in thread 7f4bf83a7880

Most recent log lines before the crash:

   -13> 2015-04-11 20:59:43.320175 7f4bf83a7880  2 osd.30 0 boot
   -12> 2015-04-11 20:59:43.322833 7f4bf83a7880  1 <cls> cls/refcount/cls_refcount.cc:231: Loaded refcount class!
   -11> 2015-04-11 20:59:43.322946 7f4bf83a7880  1 <cls> cls/replica_log/cls_replica_log.cc:141: Loaded replica log class!
   -10> 2015-04-11 20:59:43.323048 7f4bf83a7880  1 <cls> cls/statelog/cls_statelog.cc:306: Loaded log class!
    -9> 2015-04-11 20:59:43.323386 7f4bf83a7880  1 <cls> cls/log/cls_log.cc:312: Loaded log class!
    -8> 2015-04-11 20:59:43.325417 7f4bf83a7880  1 <cls> cls/rgw/cls_rgw.cc:3046: Loaded rgw class!
    -7> 2015-04-11 20:59:43.325541 7f4bf83a7880  1 <cls> cls/version/cls_version.cc:227: Loaded version class!
    -6> 2015-04-11 20:59:43.325667 7f4bf83a7880  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
    -5> 2015-04-11 20:59:43.325767 7f4bf83a7880  1 <cls> cls/user/cls_user.cc:367: Loaded user class!
    -4> 2015-04-11 20:59:43.326800 7f4bf83a7880  0 osd.30 28642 crush map has features 1107558400, adjusting msgr requires for clients
    -3> 2015-04-11 20:59:43.326810 7f4bf83a7880  0 osd.30 28642 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
    -2> 2015-04-11 20:59:43.326817 7f4bf83a7880  0 osd.30 28642 crush map has features 1107558400, adjusting msgr requires for osds
    -1> 2015-04-11 20:59:43.326833 7f4bf83a7880  0 osd.30 28642 load_pgs
     0> 2015-04-11 20:59:43.346451 7f4bf83a7880 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f4bf83a7880 time 2015-04-11 20:59:43.344831
osd/OSD.h: 716: FAILED assert(ret)
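
For context, the assertion at osd/OSD.h:716 fires when get_map() cannot load the requested osdmap epoch back from the object store. Below is a minimal, self-contained sketch of that pattern; the types and helpers are simplified stand-ins, not the actual Ceph source, and the epoch numbers are taken from the logs in this ticket.

#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical, simplified stand-ins for the real Ceph types (illustration only).
using epoch_t = uint32_t;
struct OSDMap { epoch_t epoch; };
using OSDMapRef = std::shared_ptr<const OSDMap>;

// Pretend "meta" store: only epochs still present on disk can be read back.
static std::map<epoch_t, OSDMap> meta_store = { { 28642, { 28642 } } };

// Roughly what try_get_map() does: return a null ref when the osdmap object
// cannot be read from the store ("(2) No such file or directory" in the log).
OSDMapRef try_get_map(epoch_t e) {
  auto it = meta_store.find(e);
  if (it == meta_store.end())
    return OSDMapRef();
  return std::make_shared<const OSDMap>(it->second);
}

// Roughly the pattern at osd/OSD.h:716: the caller assumes the epoch exists,
// so a missing osdmap object aborts the daemon with FAILED assert(ret).
OSDMapRef get_map(epoch_t e) {
  OSDMapRef ret = try_get_map(e);
  assert(ret);
  return ret;
}

int main() {
  get_map(28642);   // present on disk: fine
  get_map(25584);   // missing epoch: aborts, like the OSDs in this ticket
  return 0;
}

Running this aborts on the second call, mirroring the FAILED assert(ret) in the backtrace above.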

Actions #2

Updated by Ilja Slepnev about 9 years ago

Enabled more debugging. The OSD is missing osdmap epoch 25584.
What could be the reason for the osdmap loss, and how can it be worked around with minimal data loss?

Startup log from OSD.58

    -9> 2015-04-13 23:40:29.333979 7f4fc5205880 10 osd.58 28557 pgid 5.1 coll 5.1_head
    -8> 2015-04-13 23:40:29.333993 7f4fc5205880 15 filestore(/var/lib/ceph/osd/ceph-58) omap_get_values 5.1_head/1//head//5
    -7> 2015-04-13 23:40:29.334021 7f4fc5205880 15 filestore(/var/lib/ceph/osd/ceph-58) collection_getattr /var/lib/ceph/osd/ceph-58/current/5.1_head 'info'
    -6> 2015-04-13 23:40:29.334036 7f4fc5205880 10 filestore(/var/lib/ceph/osd/ceph-58) collection_getattr /var/lib/ceph/osd/ceph-58/current/5.1_head 'info' = 1
    -5> 2015-04-13 23:40:29.334048 7f4fc5205880 15 filestore(/var/lib/ceph/osd/ceph-58) omap_get_values meta/16ef7597/infos/head//-1
    -4> 2015-04-13 23:40:29.334297 7f4fc5205880 20 osd.58 0 get_map 25584 - loading and decoding 0x5618000
    -3> 2015-04-13 23:40:29.334308 7f4fc5205880 15 filestore(/var/lib/ceph/osd/ceph-58) read meta/4eb33da9/osdmap.25584/0//-1 0~0
    -2> 2015-04-13 23:40:29.334339 7f4fc5205880 10 filestore(/var/lib/ceph/osd/ceph-58) error opening file /var/lib/ceph/osd/ceph-58/current/meta/DIR_9/DIR_A/osdmap.25584__0_4EB33DA9__none with flags=2: (2) No such file or directory
    -1> 2015-04-13 23:40:29.334353 7f4fc5205880 10 filestore(/var/lib/ceph/osd/ceph-58) FileStore::read(meta/4eb33da9/osdmap.25584/0//-1) open error: (2) No such file or directory
     0> 2015-04-13 23:40:29.336211 7f4fc5205880 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f4fc5205880 time 2015-04-13 23:40:29.334365
osd/OSD.h: 716: FAILED assert(ret)

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (OSDService::get_map(unsigned int)+0x3f) [0x6ff77f]
 3: (OSD::load_pgs()+0x17c9) [0x6b7479]
 4: (OSD::init()+0x729) [0x6b8b99]
 5: (main()+0x27f3) [0x643b63]
 6: (__libc_start_main()+0xf5) [0x7f4fc25a5af5]
 7: ceph-osd() [0x65cdc9]

Another log from OSD.59

    -9> 2015-04-13 23:35:16.834060 7f3db534c880 10 osd.59 28521 pgid 5.13 coll 5.13_head
    -8> 2015-04-13 23:35:16.834078 7f3db534c880 15 filestore(/var/lib/ceph/osd/ceph-59) omap_get_values 5.13_head/13//head//5
    -7> 2015-04-13 23:35:16.834120 7f3db534c880 15 filestore(/var/lib/ceph/osd/ceph-59) collection_getattr /var/lib/ceph/osd/ceph-59/current/5.13_head 'info'
    -6> 2015-04-13 23:35:16.834142 7f3db534c880 10 filestore(/var/lib/ceph/osd/ceph-59) collection_getattr /var/lib/ceph/osd/ceph-59/current/5.13_head 'info' = 1
    -5> 2015-04-13 23:35:16.834154 7f3db534c880 15 filestore(/var/lib/ceph/osd/ceph-59) omap_get_values meta/16ef7597/infos/head//-1
    -4> 2015-04-13 23:35:16.834512 7f3db534c880 20 osd.59 0 get_map 25584 - loading and decoding 0x4f08000
    -3> 2015-04-13 23:35:16.834527 7f3db534c880 15 filestore(/var/lib/ceph/osd/ceph-59) read meta/4eb33da9/osdmap.25584/0//-1 0~0
    -2> 2015-04-13 23:35:16.834569 7f3db534c880 10 filestore(/var/lib/ceph/osd/ceph-59) error opening file /var/lib/ceph/osd/ceph-59/current/meta/DIR_9/DIR_A/osdmap.25584__0_4EB33DA9__none with flags=2: (2) No such file or directory
    -1> 2015-04-13 23:35:16.834590 7f3db534c880 10 filestore(/var/lib/ceph/osd/ceph-59) FileStore::read(meta/4eb33da9/osdmap.25584/0//-1) open error: (2) No such file or directory
     0> 2015-04-13 23:35:16.837324 7f3db534c880 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f3db534c880 time 2015-04-13 23:35:16.834606
osd/OSD.h: 716: FAILED assert(ret)

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (OSDService::get_map(unsigned int)+0x3f) [0x6ff77f]
 3: (OSD::load_pgs()+0x17c9) [0x6b7479]
 4: (OSD::init()+0x729) [0x6b8b99]
 5: (main()+0x27f3) [0x643b63]
 6: (__libc_start_main()+0xf5) [0x7f3db26ecaf5]
 7: ceph-osd() [0x65cdc9]
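
To connect the two logs: load_pgs() walks every on-disk PG collection at startup and asks get_map() for the epoch recorded in each PG's metadata, so a stale collection left over from a long-deleted pool (5.1_head / 5.13_head here) can reference an osdmap epoch (25584) that the OSD has already trimmed. A rough, hypothetical sketch of that control flow follows; names and most values are simplified, not the real OSD::load_pgs.

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for illustration only.
using epoch_t = uint32_t;

struct PGOnDisk {
  std::string coll;    // collection name, e.g. "5.1_head"
  epoch_t map_epoch;   // osdmap epoch recorded in the PG's metadata
};

// Maps older than this were trimmed from the OSD's meta collection, so e.g.
// osdmap.25584 is gone and reading it fails with ENOENT.
static const epoch_t oldest_map_on_disk = 28000;

bool try_get_map(epoch_t e) { return e >= oldest_map_on_disk; }

void get_map(epoch_t e) {
  bool ret = try_get_map(e);
  assert(ret);   // the FAILED assert(ret) seen in both backtraces
}

// Rough shape of the startup path: every surviving PG collection is loaded,
// including stale ones left behind by pools that were deleted long ago.
void load_pgs(const std::vector<PGOnDisk>& pgs) {
  for (const auto& pg : pgs) {
    std::printf("loading %s, needs osdmap %u\n", pg.coll.c_str(), pg.map_epoch);
    get_map(pg.map_epoch);
  }
}

int main() {
  load_pgs({ { "2.7f_head", 28642 },    // live PG: its map is still on disk
             { "5.1_head", 25584 } });  // leftover PG from a deleted pool: aborts
  return 0;
}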

Actions #3

Updated by Ilja Slepnev about 9 years ago

Found a workaround.

In the past I deleted some pools; that was before Giant. Now I see that their data was not erased from the OSDs for some reason.
Up to Hammer this was not a problem, but after the upgrade to Hammer the OSD daemons failed to start.
I moved the old, unused data to a safe place, and the OSDs started successfully. All PGs are active+clean.

Actions #4

Updated by Samuel Just about 9 years ago

  • Status changed from New to Duplicate

I think you are quite right. The original bug is #10617, and I opened #11429 to handle upgrades from OSDs which hit this bug.
