Bug #14983
closed
osd: handle EIO in handle_sub_read
Description
My 41-OSD cluster suffered four drive failures in rapid succession several weeks ago. I added new drives and have been waiting for recovery to finish and health to return to OK. Unfortunately, one particular pg seems to have a problem. Four OSDs (2, 5, 22, 33) started crashing repeatedly in handle_sub_read() and/or scan_range(). After many restarts, I have narrowed the issue down to OSD.5: if OSD.5 is down, OSDs 2, 22, and 33 run without problems. As soon as OSD.5 is started, all four crash within minutes. The error messages always seem to relate to pg 2.15.
The log files showing one cycle of boot->crash for all four OSDs are attached. Also attached is the main ceph.conf file -- the only difference on some machines was increasing the OSD logging to 20/20.
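For anyone going through the attached logs, here is a minimal sketch of how the pg 2.15 correlation could be checked. It assumes only that the logs are plain text, one entry per line; the helper name `summarize_pg_mentions` is hypothetical and nothing here depends on the exact Ceph log layout.

```python
import re
from collections import Counter

# Function names taken from the crash description above.
CRASH_FUNCS = ("handle_sub_read", "scan_range")


def summarize_pg_mentions(lines, pg_id="2.15"):
    """Count lines that mention pg_id, and how many of those name each crash function.

    Hypothetical helper: it matches on substrings only and assumes
    nothing about the surrounding log format.
    """
    # Reject matches embedded in a longer id such as 12.15 or 2.150.
    pattern = re.compile(r"(?<![\w.])" + re.escape(pg_id) + r"(?!\d)")
    counts = Counter()
    for line in lines:
        if not pattern.search(line):
            continue
        counts["total"] += 1
        for func in CRASH_FUNCS:
            if func in line:
                counts[func] += 1
    return counts
```

Passing an open log file to the function (a file object iterates line by line) gives a rough count per OSD log, which can then be compared across the four crashing OSDs.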
The problem was present on 9.2.0, so I tried upgrading to 9.2.1, but the issue recurred.
The cluster consists of Gentoo Linux machines running 4.1.12-gentoo. All OSDs run xfs, and I've done a full offline xfs_repair on all four crashing OSDs. No problems were reported.
I'm anxious to get the cluster online again. I'm happy to adjust settings and/or compile proposed patches to try to resolve the issue.
Files