Bug #38250


assert failure crash prevents ceph-osd from running

Added by Adam DC949 about 5 years ago. Updated about 5 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

One of my OSDs keeps crashing shortly after startup, which prevents it from joining the cluster. The core issue appears to be a failed assert on line 1337 of BlueFS.cc. Because this crash prevents the OSD from operating, I'm marking this as major severity; if that's not correct, feel free to change it.

https://github.com/ceph/ceph/blob/2d4468c8958255fd6df4c813e7d112be07a111e6/src/os/bluestore/BlueFS.cc#L1337
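
For reference, the pattern at that line looks roughly like this (my paraphrase of the linked source, simplified, not an exact copy):

    // Paraphrased/simplified from BlueFS::_read_random in the linked source.
    // read_random() returns 0 on success and a negative error code on failure.
    int r = bdev[p->bdev]->read_random(p->offset + x_off, l, out,
                                       cct->_conf->bluefs_buffered_io);
    assert(r == 0);  // line 1337: any nonzero return aborts ceph-osd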

I tried to dig into the code to figure out why this is happening. The backing storage device is a LUKS-mounted device, which I'm guessing maps to KernelDevice::read_random, although I have not verified this. From there, it looks like it goes to pread (from ceph_io.h), and then I get lost (perhaps I got lost a few calls up the stack, since I'm not really familiar with the code).
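
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names; not the actual Ceph implementation) of how a low-level media error typically becomes a nonzero return code that would then trip an assert(r == 0) further up the stack:

    // Hypothetical sketch; KernelDevice::read_random is similar in spirit,
    // but this is not the actual Ceph code.
    #include <cerrno>
    #include <cstdint>
    #include <unistd.h>

    int read_random_sketch(int fd, char *buf, uint64_t off, uint64_t len) {
      ssize_t r = ::pread(fd, buf, len, off);
      if (r < 0)
        return -errno;                        // e.g. -EIO on a failing sector
      if (static_cast<uint64_t>(r) < len)
        return -EIO;                          // short read treated as an error
      return 0;                               // only 0 survives assert(r == 0)
    }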

I'm running on Ubuntu, using the packages installed by ceph-deploy. I upgraded from 13.2.2 to 13.2.4 to see if that would fix it, but it did not. I ran a long test on the drive using smartctl; it passed, but it also indicated some read failures (10%). This is an Ubuntu 18.04.1 LTS machine running the 4.15.0-45-generic kernel.

If this is caused by a partial hardware failure (a read error), I would expect that it could be handled by marking the affected sectors bad and moving on. In that case, a log message would be appropriate to indicate that the backing storage device(s) may be failing, but there is no need to crash the ceph-osd process.
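
Purely to illustrate what I mean (hypothetical code, not an existing Ceph API), the handling could look something like:

    // Hypothetical "log and propagate" handling instead of assert-and-crash.
    #include <cstring>
    #include <iostream>

    int handle_read_result(int r, const char *dev) {
      if (r < 0) {
        // Warn the operator instead of aborting the whole ceph-osd process.
        std::cerr << "bluefs: read from " << dev << " failed: "
                  << std::strerror(-r)
                  << " -- the backing device may be failing\n";
      }
      return r;  // let the caller decide how to recover
    }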

If this is an error from which the code cannot recover, the "assert and crash" response makes sense, as it prevents corruption of the cluster. In that case, I would request a better error message to indicate what's going on, and perhaps even a suggestion of what to do next (run a utility to mark sectors as bad so this doesn't keep happening, replace the drive, use smartctl to run a "long" test on the drive, etc.). As a user, I would really appreciate a more meaningful error message so I know what to do next.
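
Again just a sketch (hypothetical; today the log only shows the failed assert), an unrecoverable error could still abort, but with an actionable message:

    // Hypothetical: aborting is fine for unrecoverable errors, but the
    // message should tell the operator what to do next.
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    [[noreturn]] void fatal_read_error(int err, const char *dev) {
      std::fprintf(stderr,
                   "bluefs: unrecoverable read error on %s: %s\n"
                   "hint: run 'smartctl -t long' on the drive, mark bad "
                   "sectors or replace it, then recreate the OSD\n",
                   dev, std::strerror(err));
      std::abort();
    }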

I have a small cluster, so it will take a few days to recover. Unless someone here wants more debugging info, my plan once it has recovered is to remove, wipe, and re-add this drive to see if that fixes things. If not, I'll replace the drive. Either way, I'll post back here so people of the future who run into this error have some idea of what to expect.


Files

ceph-osd_crash_today.txt (22 KB) - Logs showing the exact stack traces. Adam DC949, 02/09/2019 05:11 PM