Bug #14983
osd: handle EIO in handle_sub_read
Status: Closed
Description
My 41-OSD cluster suffered four drive failures in rapid succession several weeks ago. I added new drives and have been waiting for recovery to finish and health to return to OK. Unfortunately, one particular pg seems to have a problem. Four OSDs (2, 5, 22, 33) started crashing repeatedly in ECBackend::handle_sub_read() and/or ReplicatedPG::scan_range(). After many restarts, I have narrowed the issue down to osd.5: if 5 is down, 2, 22, and 33 run without problems, but as soon as 5 is started, all four crash within minutes. The error messages always seem to relate to pg 2.15.
Log files showing one boot->crash cycle for all four OSDs are attached. Also attached is the main ceph.conf file -- the only difference on some machines was that OSD logging was raised to 20/20.
The problem was present on v9.2.0, so I tried upgrading to v9.2.1, but the issue recurred.
The cluster comprises Gentoo Linux machines running kernel 4.1.12-gentoo. All OSDs use XFS, and I've done a full offline xfs_repair on all four crashing OSDs; no problems were reported.
I'm anxious to get the cluster online again. I'm happy to adjust settings and/or compile proposed patches to try to resolve the issue.
Files
Updated by Aaron T about 8 years ago
Attaching the log files (as plaintext or bzipped plaintext) failed, so they can instead be downloaded individually or as a tarball from my website:
https://aarontc.com/ceph/14983/ceph-osd.2.log.bz2
https://aarontc.com/ceph/14983/ceph-osd.5.log.bz2
https://aarontc.com/ceph/14983/ceph-osd.22.log.bz2
https://aarontc.com/ceph/14983/ceph-osd.33.log.bz2
https://aarontc.com/ceph/14983/ceph-osd-logs.tar.bz2
Updated by Aaron T about 8 years ago
- File osd-dump.json osd-dump.json added
- File pg-dump.json pg-dump.json added
Attaching output of 'ceph pg dump --format=json' and 'ceph osd dump --format=json'
Updated by Samuel Just about 8 years ago
- Subject changed from osd: v9.2.0 and v9.2.1 Crash in ECBackend::handle_sub_read() and ReplicatedPG::scan_range() to osd: handle EIO in handle_sub_read
osd 5 is getting an EIO on object 2/ab64d095/100000578e9.00001458/head. We don't handle this well, updating the subject. In your case, if you can recover without osd.5, do so. If you need osd.5 to survive, you'll probably at least have to remove pg 2.15 from it using the ceph-objectstore-tool (I assume that 2.15 at least will recover without osd.5?).
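For anyone hitting the same situation, the pg removal Sam describes might look roughly like the following. This is a sketch, not an exact recipe: it assumes the default /var/lib/ceph data and journal paths for osd.5 and a sysvinit service script, and ceph-objectstore-tool flags can differ between Ceph releases, so check the tool's --help on your version first.

```shell
# The OSD must be stopped before ceph-objectstore-tool touches its store
# (service manager and paths are assumptions; adjust to your deployment).
/etc/init.d/ceph stop osd.5

# Export pg 2.15 first as a safety copy...
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /var/lib/ceph/osd/ceph-5/journal \
    --pgid 2.15 --op export --file /root/pg2.15.export

# ...then remove it from osd.5.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /var/lib/ceph/osd/ceph-5/journal \
    --pgid 2.15 --op remove
```

After restarting osd.5, the cluster should backfill 2.15 from the surviving shards, which is why it matters that 2.15 can recover without osd.5.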
Updated by Aaron T about 8 years ago
Samuel Just wrote:
osd 5 is getting an EIO on object 2/ab64d095/100000578e9.00001458/head. We don't handle this well, updating the subject. In your case, if you can recover without osd.5, do so. If you need osd.5 to survive, you'll probably at least have to remove pg 2.15 from it using the ceph-objectstore-tool (I assume that 2.15 at least will recover without osd.5?).
Samuel,
Thanks for the advice. On recommendation from davidz in #ceph-devel, I removed the object you mentioned from OSD 5 and the problem recurred with another object from the same pg. After a few (~15) cycles of this, OSD 5 managed to be happy and the entire cluster has recovered! Since the cluster has recovered, I'm not sure how much I can help with testing patches, but I'm certainly willing to try.
-Aaron
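For reference, the per-object removal described above can be done with the same tool while the OSD is stopped. A sketch, with the same path assumptions as before; the object spec placeholder must be replaced with the JSON line that --op list prints for the bad object, and exact output format varies by release:

```shell
# List objects in pg 2.15 to find the JSON spec of the object to delete.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /var/lib/ceph/osd/ceph-5/journal \
    --pgid 2.15 --op list

# Remove a single object by passing its JSON spec (copied verbatim from the
# list output above) as a positional argument, followed by 'remove'.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --journal-path /var/lib/ceph/osd/ceph-5/journal \
    '<json-object-spec-from-list-output>' remove
```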