Bug #3615

closed

Reproducible OSD crash when recovering the journal

Added by Faidon Liambotis over 11 years ago. Updated over 11 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

After an abrupt power cycle of one of the servers, one of its OSDs has trouble booting up: it gets a SIGABRT raised by a failed assert.

This is on Ceph 0.55. Attached is the output of /usr/bin/ceph-osd -d --debug_ms 20 -i 17 -c /etc/ceph/ceph.conf, run under gdb, along with a bt and bt full.
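
For reference, the capture was done roughly like this (the gdb invocation below is an assumption; only the ceph-osd command line above is verbatim):

  gdb --args /usr/bin/ceph-osd -d --debug_ms 20 -i 17 -c /etc/ceph/ceph.conf
  (gdb) run
  ... ceph-osd runs until the assert fires and gdb stops on the SIGABRT ...
  (gdb) bt
  (gdb) bt full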

I can also provide the journal, as there's nothing I consider private in it (it's a test cluster). It's the default 1 GB, probably less once compressed. We have enough bandwidth to push that :)


Files

ceph-journal-crash-backtrace.txt (23.2 KB) - gdb backtrace full & ceph-osd -d --debug_ms 20 - Faidon Liambotis, 12/13/2012 09:15 AM
Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to Need More Info

On pg 6.7111 the info appears corrupt somehow. Can you attach the contents of the meta/.../pginfo file for 6.7111 on the failing OSD?

Actions #2

Updated by Faidon Liambotis over 11 years ago

Thanks for the quick reply. I couldn't find a 6.7111 pg. Am I doing something wrong?

  find /var/lib/ceph/osd/ceph-17/*/meta -name 'pginfo*6.71*'
  /var/lib/ceph/osd/ceph-17/current/meta/DIR_E/pginfo\u6.714__0_A158572E__none
  /var/lib/ceph/osd/ceph-17/current/meta/DIR_E/pginfo\u6.719__0_A1586BBE__none
  /var/lib/ceph/osd/ceph-17/current/meta/DIR_F/DIR_E/pginfo\u6.71b__0_A15877EF__none

Note that this was on a power-cycled box, with /var/lib/ceph/osd/ceph-17/ on an XFS filesystem mounted with nobarrier,logbufs=8.

Thanks,

Actions #3

Updated by Samuel Just over 11 years ago

  • Status changed from Need More Info to Rejected

Running with nobarrier could explain this error if the pginfo file link wasn't flushed when the sync finished. Our usage of XFS relies on barriers, so you probably want to switch your mount options. You might be able to recover the OSD by renaming the pg directory out of the way and restarting it (the pg will recover from replicas).
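
As a hedged sketch only (the device name, mount point, and init command below are assumptions, not taken from this cluster), that means dropping nobarrier from the fstab entry so XFS write barriers stay at their default, then remounting and restarting the OSD:

  # /etc/fstab: keep XFS barriers enabled (the default) for the OSD filesystem
  /dev/sdb1  /var/lib/ceph/osd/ceph-17  xfs  noatime,logbufs=8  0  0

  # a plain remount may not pick up the barrier change on older kernels,
  # so unmount/mount (or reboot) before starting the OSD again
  umount /var/lib/ceph/osd/ceph-17
  mount /var/lib/ceph/osd/ceph-17
  service ceph start osd.17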

Actions #4

Updated by Faidon Liambotis over 11 years ago

Thanks, I figured as much when you mentioned the existence of that file.

I still think it's a bug, though: I believe Ceph shouldn't crash with a SIGABRT on a corrupted filesystem; it should at least warn instead, preferably pointing to the pg in question. I could go as far as asking for Ceph to recover from that gracefully instead of me renaming the pg directory, but that may be asking too much right now :)

Actions #5

Updated by Faidon Liambotis over 11 years ago

Samuel was kind enough to clarify a bit on IRC: I should be looking for a pginfo/directory named 6.1bc7, since they're stored in hex.
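
(The on-disk name is just the decimal pg number written in hex, e.g.:)

  printf '%x\n' 7111    # prints 1bc7, so pg 6.7111 is stored as 6.1bc7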

So there was a pginfo, but it had a file size of 0, which is a typical sign of XFS crash corruption when running with nobarrier.

I removed the pginfo, the pglog, and the directory for 6.1bc7, and the OSD started successfully and is now recovering.
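
Roughly along these lines (a hedged sketch, not the exact commands used; the paths and the 6.1bc7_head directory name are assumptions based on the default FileStore layout for this version):

  osd_dir=/var/lib/ceph/osd/ceph-17
  backup=/root/pg-6.1bc7-rescue
  mkdir -p "$backup"

  # move the corrupt pginfo and pglog objects aside rather than deleting them
  find "$osd_dir/current/meta" \( -name 'pginfo*6.1bc7*' -o -name 'pglog*6.1bc7*' \) \
      -exec mv -v {} "$backup/" \;

  # move the pg's data directory aside as well, then start the OSD again
  mv -v "$osd_dir"/current/6.1bc7_head "$backup/"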

Thanks :-)

(I still maintain that ceph-osd should report an error explaining what happened rather than a plain assert(), though :)
