Bug #16278 (closed)

Ceph OSD: one bluestore OSD crashes on start

Added by Mikaël Cluseau almost 8 years ago. Updated almost 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

The canary bluestore OSD of my cluster can't start anymore after 2 days in the cluster. 4 PGs were marked inconsistent.
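
For reference, a quick way to list which PGs are flagged (a sketch, assuming the jewel CLI tools; the pool name below is only an example):

    # cluster-wide health detail names the inconsistent PGs
    ceph health detail | grep inconsistent
    # or per pool, using the scrub inconsistency listing added in jewel ("rbd" is just an example pool)
    rados list-inconsistent-pg rbd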

It was run from the Docker image ceph/daemon:jewel:

# docker pull ceph/daemon:jewel
jewel: Pulling from ceph/daemon
Digest: sha256:6a96e8a09670a30ca005b2fb92a35229564d7a9dd91a64a4df3515ef43ad987f
Status: Image is up to date for ceph/daemon:jewel

log is attached.

It seems the Docker image is not up to date (10.2.1 instead of 10.2.2), so I will have to check with that version to confirm.
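
A quick way to confirm which version the image actually ships (a sketch, assuming the image has the ceph binary on its PATH and that overriding the entrypoint is acceptable):

    # run `ceph --version` inside the image instead of the normal entrypoint
    docker run --rm --entrypoint ceph ceph/daemon:jewel --version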


Files

ceph-osd-3.log (76.8 KB) - Mikaël Cluseau, 06/14/2016 02:55 AM
ceph-osd-3.log (48.9 KB) - Mikaël Cluseau, 06/14/2016 04:56 AM
ceph-osd-3.log (48 KB) - Mikaël Cluseau, 07/13/2016 08:47 AM
#1 Updated by Mikaël Cluseau almost 8 years ago

BTW no, the Docker image is OK, 10.2.2 is not out yet :)

#2 Updated by Mikaël Cluseau almost 8 years ago

Tried with ceph/daemon:tag-build-master-jewel-ubuntu-16.04, same result.

I'm keeping the OSD drive as is for now.

#3 Updated by Mikaël Cluseau almost 8 years ago

This is still happening with the image from 3 days ago.

#4 Updated by Mikaël Cluseau over 7 years ago

Looking at other issues, it seems like the relevant error is:

     0> 2016-07-13 08:41:42.469640 7f65cbfb18c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f65cbfb18c0 time 2016-07-13 08:41:42.468056
osd/OSD.h: 885: FAILED assert(ret)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x563f92a23040]
 2: (OSDService::get_map(unsigned int)+0x5d) [0x563f9239d93d]
 3: (OSD::init()+0x1f91) [0x563f9234d8b1]
 4: (main()+0x2ea5) [0x563f922bee55]
 5: (__libc_start_main()+0xf0) [0x7f65c8df6830]
 6: (_start()+0x29) [0x563f923004e9]

This means try_get_map() returned NULL. With debug_osd at 30, I can see the path taken:

    -1> 2016-08-02 00:53:01.381602 7fc155ef18c0 20 osd.3 0 get_map 11409 - loading and decoding 0x5652a952b200
     0> 2016-08-02 00:53:01.383467 7fc155ef18c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc155ef18c0 time 2016-08-02 00:53:01.381845

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

Going deeper is harder because I don't have logs to trace the path. I'm not used to the codebase, so it's hard for me to know the most likely fault path. I'd say it's map->decode(bl), but... how can I test the local osdmap?
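
One way to test the stored osdmap directly would be to extract it with ceph-objectstore-tool and try to decode it with osdmaptool. This is only a sketch: it assumes the OSD is stopped, that the data path is /var/lib/ceph/osd/ceph-3, and that this ceph-objectstore-tool build supports the bluestore backend and the get-osdmap op.

    # pull the full map for the epoch the failing get_map asked for out of the OSD's store
    ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3 \
        --op get-osdmap --epoch 11409 --file /tmp/osdmap.11409
    # then try to decode and print it; a corrupt map should fail here as well
    osdmaptool --print /tmp/osdmap.11409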

#5 Updated by Sage Weil about 7 years ago

  • Status changed from New to Can't reproduce

If you see this on kraken or later, please reopen! We haven't encountered this in QA or in our test clusters.

#6 Updated by Mikaël Cluseau almost 7 years ago

For now my canary is still alive ;)
