Bug #6333

Recovery and/or Backfill Cause QEMU/RBD Reads to Hang

Added by Mike Dawson over 10 years ago. Updated about 10 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Some (but not all) of our qemu instances booted from rbd copy-on-write clones experience i/o outages during recovery and/or backfill. Re-adding two OSDs (3 TB SATA) on separate nodes recently took ~15 hours. For most of that time, the two new disks hovered close to 100% spindle contention. Other OSDs had spikes of elevated utilization, but were significantly calmer. The network did not appear to be a limiting factor.

We attempted to get client i/o flowing freely again on the stuck guests by lowering osd_recovery_op_priority to 1, but it did not work. Turning osd_recovery_max_active down to 1 didn't fix client i/o either, though that may have been due to Issue #6291.
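
For reference, these knobs can be changed at runtime through the standard injectargs interface; the commands below are a minimal sketch of that approach (assuming an admin node with client.admin access), not necessarily the exact invocation used here.

  # Lower recovery priority/concurrency on all OSDs at runtime (sketch;
  # option names as discussed above, syntax is the usual injectargs form)
  ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
  ceph tell osd.* injectargs '--osd_recovery_max_active 1'

  # To persist across restarts, the same values can go in the [osd] section of ceph.conf:
  #   osd recovery op priority = 1
  #   osd recovery max active = 1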

The instances that encounter outages run video surveillance software on Windows. During the recovery, we get seemingly uninterrupted i/o from a different video surveillance package running on Linux. I believe reads may be the primary issue: the Windows application seems to block on reads, leaving it unable to write the video stream to rbd, while the Linux application seems to have read and write separation.

ceph-issue-6333.jpg (251 KB) Mike Dawson, 09/17/2013 11:57 AM

ceph-issue-6333-control.jpg (240 KB) Mike Dawson, 09/17/2013 12:04 PM

ceph-issue-6333-recovery.jpg (88.8 KB) Mike Dawson, 09/17/2013 01:47 PM

ceph-issue-6333-recovery-rbd-perf-dump.jpg (233 KB) Mike Dawson, 09/17/2013 01:47 PM

History

#1 Updated by Mike Dawson over 10 years ago

In the attached image you'll see graphs of a set of nine Windows guests running video surveillance over the duration of the Ceph recovery process. Under normal conditions (when Ceph isn't recovering/backfilling), the blue line on each graph should track closely with the red dashed baseline. The drops indicate i/o issues due to the Ceph recovery and backfilling process.

As shown in the image, the guests don't halt or resume i/o at the same time. At some points during the recovery, i/o proceeds normally, then falls off. Sometimes the fall-off drops to roughly 50% of the expected bandwidth; other times it drops to 0 Mbps.

#2 Updated by Mike Dawson over 10 years ago

Added a graph of the same time period from last night / this morning for a comparison of what the graphs should look like when RBD is working properly. As you can see, i/o should be roughly consistent and track close to the red dashed baseline. The tall spikes can be ignored, because they only occur during school passing periods when students flood the hallways and create heavier writes.

#3 Updated by Mike Dawson over 10 years ago

We had an OSD process die unexpectedly a few days ago. I run a cronjob that runs 'service ceph start' on all nodes for this very reason. It restarted the OSD daemon and recovery started. During that time, at least one other OSD flapped and was incorrectly marked down, with status like:

HEALTH_WARN 128 pgs backfill; 267 pgs backfilling; 3355 pgs degraded; 3 pgs down; 32 pgs peering; 92 pgs recovering; 94 pgs recovery_wait; 12 pgs stuck inactive; 2172 pgs stuck unclean; 27 requests are blocked > 32 sec; 5 osds have slow requests; recovery 2116832/32281104 degraded (6.557%); 2/64 in osds are down; noscrub,nodeep-scrub flag(s) set

The whole event took ~1.5 hours. Attached are graphs showing three of the guests' traffic versus their baseline during this event (ceph-issue-6333-recovery.jpg), plus four graphs of various rbd perf dumps for the same three guests. You'll notice that reads seem to pile up, then get processed in an abnormally large spike, and that read latency is extremely high.
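
For context, the per-guest "rbd perf dump" numbers referred to above are the kind of counters librbd exposes through its admin socket. A minimal sketch of collecting them, assuming an 'admin socket' path is configured in the [client] section (the glob below is a hypothetical path pattern):

  # Dump librbd/objecter perf counters for each qemu client on this host
  # (socket path pattern is an assumption; it depends on the 'admin socket' setting)
  for sock in /var/run/ceph/ceph-client.*.asok; do
      echo "== $sock =="
      ceph --admin-daemon "$sock" perf dump
  done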

Could this be a situation where we're attempting to read from the primary copy of an object, but its OSD is quite busy? If that is the case, is it possible to detect this and remap the request to another copy that could respond faster?

#4 Updated by Sage Weil over 10 years ago

  • Status changed from New to Need More Info

Hey-

Here is what I think we need to fully understand what is happening:

- debug ms = 1, debug osd = 20 on the osds during recovery/backfill. (hopefully this is doable without torpedoing performance so much)
- wait for a guest to stall
- while guest is stalled, query the admin socket and dump the objecter requests

That will let us see the specific request(s) that are blocked (along with how long they have been blocked), and map that back to the osd logs where we can see why they are stalled out.
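
A minimal sketch of that procedure, assuming admin sockets are enabled for the librbd clients (the socket path below is hypothetical and depends on the 'admin socket' setting):

  # 1) Raise OSD debug levels at runtime during recovery/backfill
  ceph tell osd.* injectargs '--debug-ms 1 --debug-osd 20'

  # 2) Wait for a guest to stall, then
  # 3) dump the in-flight objecter requests from that guest's librbd admin socket
  ceph --admin-daemon /var/run/ceph/ceph-client.<id>.asok objecter_requests

  # Afterwards, debug levels can be turned back down the same way:
  ceph tell osd.* injectargs '--debug-ms 0 --debug-osd 0'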

You may be right that it is reads, but I'm not certain. In general, an individual object that is stuck should not stay that way for long. I have some ideas about what we can do to help, but I'm not sure the status quo should be as bad as it sounds like it is, so we should address that first.

#5 Updated by Sage Weil over 10 years ago

  • Project changed from rbd to Ceph

Hey, any update on this one?

#6 Updated by Mike Dawson over 10 years ago

I believe saaby just saw the same thing on his cluster. Details here:

http://irclogs.ceph.widodh.nl/index.php?date=2013-10-17

#7 Updated by Mike Dawson over 10 years ago

Azrael (who works with saaby) looked through their logs and found 1.3 million slow request warnings during their recovery. On my cluster, once the slow requests start to be reported, they tend to increase in number until rbd client i/o grinds to a halt.
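
(For anyone trying to reproduce that count, a rough way to tally slow-request warnings over a recovery window, assuming default log locations and the usual "slow request" message text, is something like:)

  # Count slow request warnings in the cluster log and per-OSD logs
  # (paths and message text are assumptions; adjust to the local setup)
  grep -c 'slow request' /var/log/ceph/ceph.log
  grep -c 'slow request' /var/log/ceph/ceph-osd.*.log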

Azrael also confirmed that, like me, he sees some VMs affected more than others, possibly depending on the application/workload. The full comments are available at:

http://irclogs.ceph.widodh.nl/index.php?date=2013-10-17

#8 Updated by Sage Weil over 10 years ago

saaby's cluster is still running cuttlefish; I think the problem there is related to the pg logs (fixed in dumpling). Your problem is on dumpling, though.

I think the info in comment #4 is still the best way to fully understand what is happening. Is this something you can reproduce on your system?

#9 Updated by Samuel Just over 10 years ago

  • Priority changed from Urgent to High

#10 Updated by Sage Weil about 10 years ago

  • Status changed from Need More Info to Closed

See #6278 for the snaptrimmer portion of this. The rest is somewhere between resolved, need more info, and can't reproduce.
