Bug #6333
closedRecovery and/or Backfill Cause QEMU/RBD Reads to Hang
0%
Description
Some (but not all) of our qemu instances booted from rbd copy on write clones experience i/o outages during recovery and/or backfill. Re-adding two osds (3TB SATA) on separate nodes recently took ~15hours. For most of that time, the two new disks hover close to 100% spindle contention. Other OSDs have spikes of elevated utilization, but are significantly calmer. The network does not appear to be a limiting factor.
We attempt to get client i/o flowing freely again on stuck guests by lowering osd_recovery_op_priority to 1, but it does not work. Turning osd_recovery_max_active down to 1 doesn't fix client i/o either, but that may have been due to Issue #6291.
The instances that encounter outages run video surveillance software on Windows. During the recovery, we get seemingly uninterrupted i/o from a different video surveillance package running on Linux. I believe reads may be the primary issue. The Windows application seems to block on reads leaving it unable to write the video stream to rbd while the Linux application seems to have read and write separation.
Files