Bug #6333
Recovery and/or Backfill Cause QEMU/RBD Reads to Hang

Added by Mike Dawson over 10 years ago. Updated about 10 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Some (but not all) of our QEMU instances booted from RBD copy-on-write clones experience I/O outages during recovery and/or backfill. Re-adding two OSDs (3 TB SATA) on separate nodes recently took ~15 hours. For most of that time, the two new disks hovered close to 100% spindle utilization. Other OSDs show spikes of elevated utilization, but are significantly calmer. The network does not appear to be a limiting factor.

We attempted to get client I/O flowing freely again on the stuck guests by lowering osd_recovery_op_priority to 1, but it did not help. Turning osd_recovery_max_active down to 1 did not restore client I/O either, though that may have been due to Issue #6291.
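For reference, the two recovery-throttling settings mentioned above can be changed at runtime with injectargs, or set persistently in ceph.conf. This is a sketch of the approach, not the exact commands used here; option syntax can vary slightly between Ceph releases:

```
# Lower the priority of recovery ops relative to client ops, cluster-wide:
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'

# Limit the number of concurrent recovery ops per OSD:
ceph tell osd.* injectargs '--osd-recovery-max-active 1'

# Equivalent persistent settings in ceph.conf, under [osd]:
#   osd recovery op priority = 1
#   osd recovery max active = 1
```

Runtime changes made via injectargs do not survive an OSD restart, which is why the ceph.conf form is shown as well.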

The instances that encounter outages run video surveillance software on Windows. During recovery, we see seemingly uninterrupted I/O from a different video surveillance package running on Linux. I believe reads may be the primary issue: the Windows application appears to block on reads, leaving it unable to write the video stream to RBD, while the Linux application seems to separate its reads and writes.


Files

ceph-issue-6333.jpg (251 KB) ceph-issue-6333.jpg Mike Dawson, 09/17/2013 11:57 AM
ceph-issue-6333-control.jpg (240 KB) ceph-issue-6333-control.jpg Mike Dawson, 09/17/2013 12:04 PM
ceph-issue-6333-recovery.jpg (88.8 KB) ceph-issue-6333-recovery.jpg Mike Dawson, 09/17/2013 01:47 PM
ceph-issue-6333-recovery-rbd-perf-dump.jpg (233 KB) ceph-issue-6333-recovery-rbd-perf-dump.jpg Mike Dawson, 09/17/2013 01:47 PM