Bug #14983
osd: handle EIO in handle_sub_read
Status: Closed
Description
My 41-OSD cluster suffered four drive failures in rapid succession several weeks ago. I added new drives and have been waiting for recovery to finish and health to return to OK. Unfortunately, one particular pg seems to have a problem. Four OSDs (2, 5, 22, 33) started crashing repeatedly in ECBackend::handle_sub_read() and/or ReplicatedPG::scan_range(). After many restarts, I have narrowed the issue down to osd.5: if 5 is down, 2, 22, and 33 run without problems, but as soon as 5 is started, all four crash within minutes. The error messages always seem to relate to pg 2.15.
Log files showing one boot->crash cycle for all four OSDs are attached. Also attached is the main ceph.conf file -- the only difference on some machines was that OSD logging was raised to 20/20.
The problem was present on v9.2.0, so I tried upgrading to v9.2.1, but the issue recurred.
The cluster comprises Gentoo Linux machines running kernel 4.1.12-gentoo. All OSDs use XFS, and I've done a full offline xfs_repair on all four crashing OSDs; no problems were reported.
I'm anxious to get the cluster online again. I'm happy to adjust settings and/or compile proposed patches to try to resolve the issue.
Files
Updated by Aaron T about 8 years ago
Attaching the log files (as plaintext or bzipped plaintext) failed, so they can instead be downloaded individually or as a tarball from my website:
https://aarontc.com/ceph/14983/ceph-osd.2.log.bz2
https://aarontc.com/ceph/14983/ceph-osd.5.log.bz2
https://aarontc.com/ceph/14983/ceph-osd.22.log.bz2
https://aarontc.com/ceph/14983/ceph-osd.33.log.bz2
https://aarontc.com/ceph/14983/ceph-osd-logs.tar.bz2
Updated by Aaron T about 8 years ago
- File osd-dump.json osd-dump.json added
- File pg-dump.json pg-dump.json added
Attaching output of 'ceph pg dump --format=json' and 'ceph osd dump --format=json'
Updated by Samuel Just about 8 years ago
- Subject changed from osd: v9.2.0 and v9.2.1 Crash in ECBackend::handle_sub_read() and ReplicatedPG::scan_range() to osd: handle EIO in handle_sub_read
osd 5 is getting an EIO on object 2/ab64d095/100000578e9.00001458/head. We don't handle this well, updating the subject. In your case, if you can recover without osd.5, do so. If you need osd.5 to survive, you'll probably at least have to remove pg 2.15 from it using the ceph-objectstore-tool (I assume that 2.15 at least will recover without osd.5?).
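For anyone hitting the same situation, the pg removal Sam describes might look roughly like the following. This is a sketch, not an exact recipe: it assumes the default /var/lib/ceph data and journal paths for osd.5 and a sysvinit service script, and ceph-objectstore-tool flags can differ between Ceph releases, so check the tool's --help on your version first.

```shell
# The OSD must be stopped before ceph-objectstore-tool touches its store
# (service manager and paths are assumptions; adjust to your deployment).
/etc/init.d/ceph stop osd.5

# Export pg 2.15 first as a safety copy...
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /var/lib/ceph/osd/ceph-5/journal \
    --pgid 2.15 --op export --file /root/pg2.15.export

# ...then remove it from osd.5.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /var/lib/ceph/osd/ceph-5/journal \
    --pgid 2.15 --op remove
```

After restarting osd.5, the cluster should backfill 2.15 from the surviving shards, which is why it matters that 2.15 can recover without osd.5.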
Updated by Aaron T about 8 years ago
Samuel Just wrote:
osd 5 is getting an EIO on object 2/ab64d095/100000578e9.00001458/head. We don't handle this well, updating the subject. In your case, if you can recover without osd.5, do so. If you need osd.5 to survive, you'll probably at least have to remove pg 2.15 from it using the ceph-objectstore-tool (I assume that 2.15 at least will recover without osd.5?).
Samuel,
Thanks for the advice. On recommendation from davidz in #ceph-devel, I removed the object you mentioned from OSD 5 and the problem recurred with another object from the same pg. After a few (~15) cycles of this, OSD 5 managed to be happy and the entire cluster has recovered! Since the cluster has recovered, I'm not sure how much I can help with testing patches, but I'm certainly willing to try.
-Aaron
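For reference, the per-object removal described above can be done with the same tool while the OSD is stopped. A sketch, with the same path assumptions as before; the object spec placeholder must be replaced with the JSON line that --op list prints for the bad object, and exact output format varies by release:

```shell
# List objects in pg 2.15 to find the JSON spec of the object to delete.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /var/lib/ceph/osd/ceph-5/journal \
    --pgid 2.15 --op list

# Remove a single object by passing its JSON spec (copied verbatim from the
# list output above) as a positional argument, followed by 'remove'.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /var/lib/ceph/osd/ceph-5/journal \
    '<json-object-spec-from-list-output>' remove
```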