Bug #15763
Slow recovery times, slowing further toward the end of the recovery
Status: Closed
Description
I have been evaluating Ceph (Hammer, v0.94.6) in an all-SSD, 2x replication configuration. There are 4 OSD servers, each with 2x Intel Xeon E5-2660v2 (20 physical cores), 64 TB of SAS-attached SSD storage (8x 8 TB drives), and 2x 10 Gbit LACP-bonded Ethernet links.
The OS is Ubuntu 14.04 with a 3.16 Linux kernel.
I am running the JESD219 benchmark (60% write + 40% read) against the cluster: 10 VMs, each running fio with librbd using this fio workload: https://github.com/axboe/fio/blob/master/examples/jesd219.fio, against one pool with 10 RBD devices.
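For context, pointing a fio job at an RBD image through the librbd engine looks roughly like the sketch below; the pool, client, and image names are placeholders, and the actual access pattern (block-size mix and the 60/40 write/read split) comes from the jesd219.fio job linked above.

  [global]
  ioengine=rbd          # fio's librbd engine; no kernel RBD mapping needed
  clientname=admin      # cephx client name (placeholder)
  pool=rbd              # target pool (placeholder)
  rbdname=vm01          # target RBD image (placeholder)
  direct=1
  # workload definition (bssplit, rwmixread/rwmixwrite, etc.) as in examples/jesd219.fio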
When I "fail" one of the OSD servers, I find that the recovery rate (the rate at which the number of degraded PGs decreases) is very low, even though CPU, SAS bandwidth, and network bandwidth are not pegged. I have attached some graphs to illustrate. I had expected Ceph recovery to saturate one of these hardware limits and work aggressively toward reducing the risk of a single-copy situation.
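For reference, a quick way to watch the same degraded counts and recovery rate on a live cluster (standard ceph CLI, nothing specific to this setup):

  # overall cluster state, degraded PG/object counts, recovery throughput
  watch -n 10 ceph -s
  # one-line summary of PG states and recovery I/O
  ceph pg stat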
I could possibly tweak the osd_recovery_* parameters in ceph.conf, or consider 3 copies (having degraded, single-copy PGs in the system for 24+ hours is not production-worthy).
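For illustration, the kind of tuning I have in mind would be something like the following in the [osd] section of ceph.conf; the values are placeholders, not tested recommendations, and higher values trade client I/O latency for faster recovery.

  [osd]
  osd max backfills = 4            # concurrent backfill operations per OSD
  osd recovery max active = 10     # concurrent recovery operations per OSD
  osd recovery op priority = 40    # weight of recovery ops vs. client ops (1-63)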
But I am reporting this as a bug because I found it strange that recovery slows down as the number of degraded PGs decreases (you can see this in the different graphs in the attachment; they taper off toward the end). During all this time the data is at risk (single copy), and under a heavy mixed workload another hardware failure (e.g. a disk dying) would result in significant data loss/corruption. I think the priority of recovery should not be decreased as the number of degraded PGs decreases.
Here are some graphs to illustrate
(please see attached PDF)
Files
Updated by Greg Farnum almost 7 years ago
- Status changed from New to Closed
There's been a bunch of work already to improve recovery handling and prioritization; there's planned work coming up. This ticket doesn't add much to the conversation, so closing! :)