Bug #14983: osd: handle EIO in handle_sub_read - Ceph

closed

Added by Aaron T about 8 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Source: Community (user)
Severity: 3 - minor
Affected Versions: v9.2.1
Description

My 41-OSD cluster suffered four drive failures in rapid succession several weeks ago. I added new drives and have been waiting for recovery to finish and health to return to OK. Unfortunately, one particular pg seems to have a problem. Four OSDs (2, 5, 22, 33) started crashing repeatedly in handle_sub_read() and/or scan_range(). After many restarts, I have narrowed the issue down to OSD.5: if 5 is down, 2, 22, and 33 run without problems. As soon as 5 is started, all four crash within minutes. The error messages always seem to relate to pg 2.15.

The log files showing one cycle of boot->crash for all four OSDs are attached. Also attached is the main ceph.conf file -- the only difference on some machines was increasing the OSD logging to 20/20.

The problem was present on 9.2.0, so I tried upgrading to 9.2.1 and the issue recurred.

The cluster consists of Gentoo Linux machines running 4.1.12-gentoo. All OSDs run xfs, and I've done a full offline xfs_repair on all four crashing OSDs. No problems were reported.

I'm anxious to get the cluster online again. I'm happy to adjust settings and/or compile proposed patches to try and resolve the issue.

Files

ceph.conf (5.24 KB), Aaron T, 03/04/2016 08:13 PM
osd-dump.json (28.4 KB), Aaron T, 03/04/2016 08:30 PM
pg-dump.json (403 KB), Aaron T, 03/04/2016 08:30 PM

Updated by Aaron T about 8 years ago

Updated by Aaron T about 8 years ago

Updated by Samuel Just about 8 years ago

Updated by Aaron T about 8 years ago

Updated by Sage Weil about 7 years ago