Bug #15763

Slow recovery times, slowing further toward the end of the recovery

Added by Sachin Agarwal almost 8 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Low
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 5 - suggestion
Reviewed:
Affected Versions:
ceph-qa-suite: rbd
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have been evaluating Ceph (Hammer - v0.94.6) in an all-SSD, 2x replication configuration. I have 4 OSD servers, each with 2x Intel Xeon E5-2660 v2 (20 physical cores), 64TB of SAS-connected storage (8x 8TB SSD drives), and 2x 10Gbit LACP-bonded Ethernet links.

The OS is Ubuntu 14.04 with a 3.16 Linux kernel.

I am running the JESD219 benchmark (60% write + 40% read) against the cluster: 10 VMs running fio + librbd with this FIO workload (https://github.com/axboe/fio/blob/master/examples/jesd219.fio) against 1 pool with 10 RBD devices.
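
For reference, a minimal sketch of the kind of fio job that drives such a mixed workload through librbd (the pool, image, and client names below are illustrative, and the real runs used the full JESD219 block-size mix from the linked jesd219.fio rather than this simplified 60/40 example):

    ; jesd219-rbd-sketch.fio - simplified 60% write / 40% read mix via fio's rbd engine
    [jesd219-rbd]
    ioengine=rbd        ; requires fio built with librbd support
    clientname=admin    ; cephx user (illustrative)
    pool=rbd            ; hypothetical pool name
    rbdname=testimg     ; hypothetical RBD image name
    rw=randrw
    rwmixwrite=60       ; JESD219 is roughly 60% writes / 40% reads
    iodepth=32
    time_based=1
    runtime=3600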

When I "fail" one of the OSD servers, I find that the recovery rate - the rate at which degraded PGs reduce - is very low, even though CPU, SAS bandwidth, and network bandwidth are not pegged. I have attached some graphs to illustrate. I had expected Ceph recovery to saturate one of these hardware limits and work aggresively toward reducing the risk of a single-copy situation.

I could possibly tweak the osd_recovery_* parameters in ceph.conf or move to 3 copies (having degraded, single-copy PGs in the system for 24+ hours is not production-worthy).
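
As a rough illustration (not a recommendation, and the values below are arbitrary examples), the recovery throttles that exist in Hammer live in the [osd] section of ceph.conf and can also be adjusted at runtime via ceph tell osd.* injectargs:

    [osd]
    ; higher values push recovery/backfill harder at the cost of client I/O latency
    osd max backfills = 4
    osd recovery max active = 15
    osd recovery op priority = 10   ; weight of recovery ops relative to client ops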

But I am reporting this as a bug because I found it strange that recovery "slows down" as the number of degraded PGs decreases (you can see this in the different graphs in the attachment - they taper off toward the end). During all this time the data is at risk (single copy), and under a heavy mixed workload another hardware failure (e.g. a disk dying) would result in significant data loss/corruption. I think the priority of recovery should not be decreased as the number of degraded PGs goes down.

Here are some graphs to illustrate this (please see the attached PDF).


Files

ceph-recovery-slowdown - Copy.pdf (635 KB) - Sachin Agarwal, 05/06/2016 03:07 PM
#1 - Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Closed

There's been a bunch of work already to improve recovery handling and prioritization; there's planned work coming up. This ticket doesn't add much to the conversation, so closing! :)
