Bug #3615

closed

Reproducible OSD crash when recovering the journal

Added by Faidon Liambotis over 11 years ago. Updated over 11 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

After an abrupt power cycle of one of the servers, one of its OSDs has trouble booting up: it gets a SIGABRT raised by a failed assert.

This is on Ceph 0.55. Attached is the output of /usr/bin/ceph-osd -d --debug_ms 20 -i 17 -c /etc/ceph/ceph.conf, run under gdb, along with a bt and bt full.
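
For reference, the capture was done roughly like this (the gdb invocation below is an assumption; only the ceph-osd command line above is verbatim):

  gdb --args /usr/bin/ceph-osd -d --debug_ms 20 -i 17 -c /etc/ceph/ceph.conf
  (gdb) run
  ... ceph-osd runs until the assert fires and gdb stops on the SIGABRT ...
  (gdb) bt
  (gdb) bt full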

I can also provide the journal, as there's nothing I consider private in it (it's a test cluster). It's the default 1 GB, probably less once compressed. We have enough bandwidth to push that :)


Files

ceph-journal-crash-backtrace.txt (23.2 KB) - gdb backtrace full & ceph-osd -d --debug_ms 20 - Faidon Liambotis, 12/13/2012 09:15 AM
Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to Need More Info

On pg 6.7111 the info appears corrupt somehow. Can you attach the contents of the meta/.../pginfo file for 6.7111 on the failing OSD?

Actions #2

Updated by Faidon Liambotis over 11 years ago

Thanks for the quick reply. I couldn't find a 6.7111 pg. Am I doing something wrong?

  find /var/lib/ceph/osd/ceph-17/*/meta -name 'pginfo*6.71*'
  /var/lib/ceph/osd/ceph-17/current/meta/DIR_E/pginfo\u6.714__0_A158572E__none
  /var/lib/ceph/osd/ceph-17/current/meta/DIR_E/pginfo\u6.719__0_A1586BBE__none
  /var/lib/ceph/osd/ceph-17/current/meta/DIR_F/DIR_E/pginfo\u6.71b__0_A15877EF__none

Note that this was on a power-cycled box, with /var/lib/ceph/osd/ceph-17/ on an XFS filesystem mounted with nobarrier,logbufs=8.

Thanks,

Actions #3

Updated by Samuel Just over 11 years ago

  • Status changed from Need More Info to Rejected

Running with nobarrier could explain this error if the pginfo file link wasn't flushed when the sync finished. Our usage of XFS relies on barriers, so you probably want to switch your mount options. You might be able to recover the OSD by renaming the pg directory out of the way and restarting it (the pg will recover from replicas).
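
As a hedged sketch only (the device name, mount point, and init command below are assumptions, not taken from this cluster), that means dropping nobarrier from the fstab entry so XFS write barriers stay at their default, then remounting and restarting the OSD:

  # /etc/fstab: keep XFS barriers enabled (the default) for the OSD filesystem
  /dev/sdb1  /var/lib/ceph/osd/ceph-17  xfs  noatime,logbufs=8  0  0

  # a plain remount may not pick up the barrier change on older kernels,
  # so unmount/mount (or reboot) before starting the OSD again
  umount /var/lib/ceph/osd/ceph-17
  mount /var/lib/ceph/osd/ceph-17
  service ceph start osd.17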

Actions #4

Updated by Faidon Liambotis over 11 years ago

Thanks, I figured as much when you mentioned the existence of that file.

I still think it's a bug, though: I believe Ceph shouldn't crash with a SIGABRT on a corrupted filesystem; it should at least warn instead, preferably pointing to the pg in question. I could go as far as asking for Ceph to recover from that gracefully instead of me renaming the pg directory, but that may be asking too much right now :)

Actions #5

Updated by Faidon Liambotis over 11 years ago

Samuel was kind enough to clarify a bit on IRC: I should be looking for a pginfo/directory named 6.1bc7, since they're stored in hex.
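
(The on-disk name is just the decimal pg number written in hex, e.g.:)

  printf '%x\n' 7111    # prints 1bc7, so pg 6.7111 is stored as 6.1bc7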

So there was a pginfo, but it had a file size of 0, which is a typical sign of XFS crash corruption when running with nobarrier.

I removed the pginfo, the pglog, and the directory for 6.1bc7, and the OSD started successfully and is now recovering.
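
Roughly along these lines (a hedged sketch, not the exact commands used; the paths and the 6.1bc7_head directory name are assumptions based on the default FileStore layout for this version):

  osd_dir=/var/lib/ceph/osd/ceph-17
  backup=/root/pg-6.1bc7-rescue
  mkdir -p "$backup"

  # move the corrupt pginfo and pglog objects aside rather than deleting them
  find "$osd_dir/current/meta" \( -name 'pginfo*6.1bc7*' -o -name 'pglog*6.1bc7*' \) \
      -exec mv -v {} "$backup/" \;

  # move the pg's data directory aside as well, then start the OSD again
  mv -v "$osd_dir"/current/6.1bc7_head "$backup/"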

Thanks :-)

(I still maintain that ceph-osd should report an error explaining what happened rather than a plain assert(), though :)
