Bug #14983

closed

osd: handle EIO in handle_sub_read

Added by Aaron T about 8 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

My 41-OSD cluster suffered four drive failures in rapid succession several weeks ago. I added new drives and have been waiting for recovery to finish and health to return to OK. Unfortunately, one particular pg seems to have a problem. Four OSDs (2, 5, 22, 33) started crashing repeatedly in handle_sub_read() and/or scan_range(). After many restarts, I have narrowed the issue down to OSD.5: if 5 is down, 2, 22, and 33 run without problems. As soon as 5 is started, all four crash within minutes. The error messages always seem to relate to pg 2.15.

The log files showing one cycle of boot->crash for all four OSDs are attached. Also attached is the main ceph.conf file -- the only difference on some machines was increasing the OSD logging to 20/20.
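
For reference, the logging change amounts to a fragment like the one below. This is only a sketch: placing the setting under the [osd] section is an assumption, and the attached ceph.conf remains the authoritative copy of what was actually used.

```ini
# Sketch of the logging change described above (assumed to sit in the [osd] section).
[osd]
# Raise OSD debug output to 20 for both the log file and the in-memory log.
debug osd = 20/20
```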

The problem was present on 9.2.0, so I upgraded to 9.2.1, but the issue recurred.

The cluster consists of Gentoo Linux machines running kernel 4.1.12-gentoo. All OSDs run xfs, and I've done a full offline xfs_repair on all four crashing OSDs. No problems were reported.

I'm anxious to get the cluster online again. I'm happy to adjust settings and/or compile proposed patches to try to resolve the issue.


Files

ceph.conf (5.24 KB), Aaron T, 03/04/2016 08:13 PM
osd-dump.json (28.4 KB), Aaron T, 03/04/2016 08:30 PM
pg-dump.json (403 KB), Aaron T, 03/04/2016 08:30 PM