Bug #16278
Ceph: one bluestore OSD crashes on start (Closed)
Description
Hi,
the canary bluestore OSD of my cluster can't start anymore after 2 days of service; 4 PGs were marked inconsistent.
It was run from the Docker image ceph/daemon:jewel:
# docker pull ceph/daemon:jewel
jewel: Pulling from ceph/daemon
Digest: sha256:6a96e8a09670a30ca005b2fb92a35229564d7a9dd91a64a4df3515ef43ad987f
Status: Image is up to date for ceph/daemon:jewel
The log is attached.
It seems the Docker image is not up-to-date (10.2.1 instead of 10.2.2), so I will have to check with that version to confirm.
Updated by Mikaël Cluseau almost 8 years ago
BTW no, docker is ok, 10.2.2 is not out yet :)
Updated by Mikaël Cluseau almost 8 years ago
- File ceph-osd-3.log added
Tried on ceph/daemon:tag-build-master-jewel-ubuntu-16.04, same result.
I'm keeping the OSD drive as is for now.
Updated by Mikaël Cluseau almost 8 years ago
- File ceph-osd-3.log added
This is still happening with the image from 3 days ago.
Updated by Mikaël Cluseau almost 8 years ago
Looking at other issues, it seems like the relevant error is:
     0> 2016-07-13 08:41:42.469640 7f65cbfb18c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f65cbfb18c0 time 2016-07-13 08:41:42.468056
osd/OSD.h: 885: FAILED assert(ret)
 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x563f92a23040]
 2: (OSDService::get_map(unsigned int)+0x5d) [0x563f9239d93d]
 3: (OSD::init()+0x1f91) [0x563f9234d8b1]
 4: (main()+0x2ea5) [0x563f922bee55]
 5: (__libc_start_main()+0xf0) [0x7f65c8df6830]
 6: (_start()+0x29) [0x563f923004e9]
Meaning try_get_map returns NULL. With debug_osd at 30, I can see the path taken:
    -1> 2016-08-02 00:53:01.381602 7fc155ef18c0 20 osd.3 0 get_map 11409 - loading and decoding 0x5652a952b200
     0> 2016-08-02 00:53:01.383467 7fc155ef18c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc155ef18c0 time 2016-08-02 00:53:01.381845
 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
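For context, here is the code path as I read it in the 10.2.x source, paraphrased and simplified (not verbatim). The "loading and decoding" line above is printed just before the map blob is read back from the store:

    // paraphrase of osd/OSD.h / osd/OSD.cc in 10.2.x, simplified
    OSDMapRef OSDService::try_get_map(epoch_t epoch)
    {
      OSDMap *map = new OSDMap;
      dout(20) << "get_map " << epoch << " - loading and decoding " << map << dendl;
      bufferlist bl;
      if (!_get_map_bl(epoch, bl)) {  // read the full-map blob for this epoch
        delete map;
        return OSDMapRef();           // NULL: blob missing or unreadable
      }
      map->decode(bl);                // a corrupt blob would throw here, not return NULL
      return OSDMapRef(map);          // caching details omitted
    }

    OSDMapRef OSDService::get_map(epoch_t e)
    {
      OSDMapRef ret = try_get_map(e);
      assert(ret);                    // <-- osd/OSD.h: 885: FAILED assert(ret)
      return ret;
    }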
Going deeper is harder because I don't have logs to trace the path. I'm not used to the codebase, so it's hard for me to know the most likely fault path. I'd say it's map->decode(bl) but... how do I test the local osdmap?
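One check that should work on jewel, assuming the mons are healthy: fetch the monitors' copy of the same epoch (11409, from the log above) and make sure it decodes cleanly, to at least rule out the map content itself:

# ceph osd getmap 11409 -o /tmp/osdmap.11409
# osdmaptool --print /tmp/osdmap.11409

If the monitors' copy prints fine, the suspect becomes the OSD's locally stored blob rather than the map content.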
Updated by Sage Weil about 7 years ago
- Status changed from New to Can't reproduce
if you see this on kraken or later, please reopen! we haven't encountered this in qa or in our test clusters.