Bug #38250 (closed): assert failure crash prevents ceph-osd from running

Added by Adam DC949 about 5 years ago. Updated about 5 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of my OSDs keeps crashing shortly after startup, which prevents it from joining the cluster. The core issue seems to be a failed assert on line 1337 of BlueFS.cc. Because this crash prevents the OSD from operating, I'm marking this as major severity. If this is not correct, feel free to change it.

https://github.com/ceph/ceph/blob/2d4468c8958255fd6df4c813e7d112be07a111e6/src/os/bluestore/BlueFS.cc#L1337

I tried to dig into the code to figure out why this is happening. The backing storage device is a LUKS-mounted device, which I'm guessing would map to KernelDevice::read_random, although I have not verified this. From there, it looks like the call goes to pread from ceph_io.h, and then I get lost (perhaps I got lost a few calls up the stack, since I'm not really familiar with the code).
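
To be concrete about the path I think I'm describing, here is a rough sketch (simplified, not the actual Ceph code; the names and signatures are illustrative): a pread(2)-based random read that returns -errno on failure, and a caller that asserts the read returned zero.

    #include <cassert>
    #include <cerrno>
    #include <cstdint>
    #include <fcntl.h>
    #include <unistd.h>

    // Illustrative stand-in for a KernelDevice-style random read: returns 0 on
    // success and -errno (e.g. -EIO on a media error) on failure.
    int read_random(int fd, uint64_t off, uint64_t len, char *buf) {
      ssize_t r = ::pread(fd, buf, len, off);
      if (r < 0)
        return -errno;
      return 0;
    }

    int main() {
      int fd = ::open("/dev/zero", O_RDONLY);
      char buf[4096];
      int r = read_random(fd, 0, sizeof(buf), buf);
      // When the device reports a read error, r is negative and this assert
      // fires and aborts the process, which is what the OSD appears to be doing.
      assert(r == 0);
      ::close(fd);
      return 0;
    }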

I'm running on Ubuntu, using the packages installed by ceph-deploy. I upgraded from 13.2.2 to 13.2.4 to see if that would fix it, but it did not. I ran a long test on the drive using smartctl and it passed; however, it also indicated there were some read failures (10%). This is an Ubuntu 18.04.1 LTS machine running the 4.15.0-45-generic kernel.

If this is caused by a partial hardware failure (read error), I would expect that it could be handled by marking the sectors bad and moving along. In this case, it seems like a log message would be appropriate to give some indication that the backing storage device(s) may be failing, but no need to crash the ceph-osd process.

If this is an error the code cannot recover from, the "assert and crash" response makes sense, as it prevents corruption of the cluster. In that case, I would request a better error message to indicate what's going on, and perhaps even a suggestion of what to do next (run a utility to mark sectors as bad so this doesn't keep happening, replace the drive, use smartctl to do a "long" test on the drive, etc.). As a user, I would really appreciate a more meaningful error message so I know what to do next.
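
Purely as an illustration of the kind of message I mean (this is not Ceph's logging code, and the function and device names here are made up):

    #include <cerrno>
    #include <cstdlib>
    #include <cstring>
    #include <iostream>

    // Hypothetical example of a more descriptive failure path: say what failed,
    // with which errno, and suggest a next step before aborting. Not Ceph code.
    [[noreturn]] void fail_read(const char* dev, int err) {
      std::cerr << "read from " << dev << " failed: " << std::strerror(err)
                << " (errno " << err << ")\n"
                << "this often indicates failing media; consider a SMART long "
                   "self-test and redeploying or replacing the device\n";
      std::abort();
    }

    int main() {
      fail_read("/dev/sdX", EIO);  // "/dev/sdX" is just a placeholder
    }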

I have a small cluster, so it is going to take a few days to recover. Unless I hear from someone here wanting more debugging info, once recovered, my plan is to remove, wipe, and re-add this drive to see if that fixes things. If not, then I'll replace the drive. In any event, I'll post back here so people who run into this error in the future will have some idea of what works.


Files

ceph-osd_crash_today.txt (22 KB): Logs showing exact stacktraces. Adam DC949, 02/09/2019 05:11 PM
#1 Updated by Sage Weil about 5 years ago

  • Status changed from New to Need More Info

Is the errno EIO in this case?

On read error we do crash and fail the OSD. There is generally no recovery path for errors on read.
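
For concreteness, a minimal sketch of what that distinction could look like, assuming the read helper returns 0 on success and -errno on failure (illustrative names only):

    #include <cerrno>
    #include <iostream>

    // Illustrative only: classify a return code of the form 0 or -errno.
    void classify(int r) {
      if (r == -EIO)
        std::cout << "EIO: hardware/media read error; no recovery path\n";
      else if (r < 0)
        std::cout << "read failed with errno " << -r << "\n";
      else
        std::cout << "read succeeded\n";
    }

    int main() {
      classify(-EIO);  // what a failing sector would most likely produce
      classify(0);     // a healthy read
    }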

#2 Updated by Adam DC949 about 5 years ago

I'm not sure how to get the errno value; I don't see it anywhere in the logs. However, SMART started complaining about that drive, which means it's in serious trouble, so I pulled it. It's highly likely that the error was due to a read error at the hardware level.

Could the sector that fails to read be marked as bad to recover from this error? The data should be replicable from other nodes, so I would not expect any data loss. Forcing users to throw away entire disks as soon as one sector is unreadable seems a bit harsh, especially given the large number of sectors on disks nowadays.

If there is no recovery path then this ticket can be closed out as WONTFIX.

#3 Updated by Igor Fedotov about 5 years ago

@Adam DC949 - IMO you can keep using this disk, but you should redeploy the OSD over it. The rationale: some data is corrupt and nobody can tell whether it is critical or not, and bringing the existing OSD back to life might be tricky as well.
But generally I'd strongly discourage using such disks for any even moderately valuable data store - it will likely cause more issues than the savings you get from keeping it.

#4 Updated by Igor Fedotov about 5 years ago

  • Status changed from Need More Info to Rejected