Bug #16278
Ceph OSD: one bluestore crashes on start
Status: Closed
Description
Hi,
the canary bluestore OSD of my cluster can't start anymore after 2 days in the cluster. 4 PGs were marked inconsistent.
It was run from the Docker image ceph/daemon:jewel:
# docker pull ceph/daemon:jewel
jewel: Pulling from ceph/daemon
Digest: sha256:6a96e8a09670a30ca005b2fb92a35229564d7a9dd91a64a4df3515ef43ad987f
Status: Image is up to date for ceph/daemon:jewel
The log is attached.
It seems the Docker image is not up to date (10.2.1 instead of 10.2.2), so I will have to check with that version to confirm.
Updated by Mikaël Cluseau almost 8 years ago
BTW no, the Docker image is OK, 10.2.2 is not out yet :)
Updated by Mikaël Cluseau almost 8 years ago
- File ceph-osd-3.log added
Tried on ceph/daemon:tag-build-master-jewel-ubuntu-16.04, same result.
I'm keeping the OSD drive as is for now.
Updated by Mikaël Cluseau almost 8 years ago
- File ceph-osd-3.log added
This is still happening with the image from 3 days ago.
Updated by Mikaël Cluseau over 7 years ago
Looking at other issues, it seems like the relevant error is:
0> 2016-07-13 08:41:42.469640 7f65cbfb18c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f65cbfb18c0 time 2016-07-13 08:41:42.468056
osd/OSD.h: 885: FAILED assert(ret)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x563f92a23040]
2: (OSDService::get_map(unsigned int)+0x5d) [0x563f9239d93d]
3: (OSD::init()+0x1f91) [0x563f9234d8b1]
4: (main()+0x2ea5) [0x563f922bee55]
5: (__libc_start_main()+0xf0) [0x7f65c8df6830]
6: (_start()+0x29) [0x563f923004e9]
Meaning try_get_map returns NULL. With debug_osd at 30, I can see the path taken:
-1> 2016-08-02 00:53:01.381602 7fc155ef18c0 20 osd.3 0 get_map 11409 - loading and decoding 0x5652a952b200
0> 2016-08-02 00:53:01.383467 7fc155ef18c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc155ef18c0 time 2016-08-02 00:53:01.381845
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
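For the record, output like the above can be obtained by starting the failing OSD in the foreground with the debug level raised, roughly like this (osd.3 is the failing OSD here; when using the ceph/daemon container the same options can be passed through to ceph-osd):
# run the OSD in the foreground, logging to stderr, with verbose OSD debugging
ceph-osd -d -i 3 --debug_osd 30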
Going deeper is harder because I don't have logs to trace the path. I'm not used to the codebase though, so it's hard for me to know the most likely fault path. I'd say it's map->decode(bl), but... how do I test the local osdmap?
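One way I can think of to test it, assuming a ceph-objectstore-tool build that supports the get-osdmap op (I'm not sure the 10.2.2 one does), would be to dump the stored full map for the asserting epoch and try to decode it offline; the data path below is the usual default and the epoch is the one from the log above, adjust as needed:
# with the OSD stopped, extract the full osdmap for epoch 11409 from the OSD's store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op get-osdmap --epoch 11409 --file /tmp/osdmap.11409
# try to decode and print it; a map that fails to decode in the OSD should fail here too
osdmaptool /tmp/osdmap.11409 --print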
Updated by Sage Weil about 7 years ago
- Status changed from New to Can't reproduce
If you see this on kraken or later, please reopen! We haven't encountered this in QA or in our test clusters.
Updated by Mikaël Cluseau almost 7 years ago
for now my canary is still alive ;)