Bug #15763
Slow recovery times, slowing further toward the end of the recovery
Status: Closed
Description
I have been evaluating Ceph (Hammer, v0.94.6) in an all-SSD, 2x replication configuration. I have 4 OSD servers, each with 2x Intel Xeon E5-2660v2 (20 physical cores), 64 TB of SAS-attached SSD (8x 8 TB drives), and 2x 10 Gbit LACP-bonded Ethernet links.
The OS is Ubuntu 14.04 with a 3.16 Linux kernel.
I am running the JESD219 benchmark (60% write / 40% read) against the cluster: 10 VMs using fio + librbd, each running this fio workload: https://github.com/axboe/fio/blob/master/examples/jesd219.fio, against one pool with 10 RBD devices.
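For context, a simplified stand-in for one client looks roughly like the command below. The pool and image names are placeholders, and the real runs use the full jesd219.fio job file linked above (which adds the JEDEC block-size and access-pattern distributions); this only reproduces the 60/40 write/read mix:

    # simplified approximation of one JESD219 client via the librbd engine;
    # "benchpool" and "vm01-disk" are placeholder names
    fio --name=jesd219-probe --ioengine=rbd --clientname=admin \
        --pool=benchpool --rbdname=vm01-disk \
        --rw=randrw --rwmixwrite=60 --bs=4k --direct=1 \
        --time_based --runtime=600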
When I "fail" one of the OSD servers, I find that the recovery rate - the rate at which degraded PGs reduce - is very low, even though CPU, SAS bandwidth, and network bandwidth are not pegged. I have attached some graphs to illustrate. I had expected Ceph recovery to saturate one of these hardware limits and work aggresively toward reducing the risk of a single-copy situation.
I could possibly tweak the osd_recovery_* parameters in ceph.conf, or consider 3 copies (having degraded, single-copy PGs in the system for 24+ hours is not production-worthy).
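For example (the values below are illustrative only, not something I have validated on this cluster), the recovery knobs can be raised at runtime or in ceph.conf:

    # runtime, applied to all OSDs; example values only
    ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'
    ceph tell osd.* injectargs '--osd-recovery-op-priority 20'

    # equivalent [osd] section settings in ceph.conf:
    #   osd max backfills = 4
    #   osd recovery max active = 8
    #   osd recovery op priority = 20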
But I am reporting this as a bug because I found it strange that recovery "slows down" as the number of degraded PGs decreases (you can see this in the different graphs of the attachment - they taper off toward the end). During all this time the data is at risk (single copy), and under a heavy mixed workload another hardware failure (e.g. a disk dying) would result in significant data loss/corruption. I think the priority of recovery should not decrease as the number of degraded PGs goes down.
Here are some graphs to illustrate (please see the attached PDF).
Files