Bug #15763
Slow recovery times, slowing further toward the end of the recovery
Status: Closed
Description
I have been evaluating Ceph (Hammer, v0.94.6) in an all-SSD, 2x replication configuration. There are 4 OSD servers, each with 2x Intel Xeon E5-2660v2 (20 physical cores), 64 TB of SAS-attached SSD storage (8x 8 TB drives), and 2x 10 Gbit LACP-bonded Ethernet links.
The OS is Ubuntu 14.04 with a 3.16 Linux kernel.
I am running the JESD219 benchmark (60% write + 40% read) against the cluster: 10 VMs, each running fio with librbd using this fio workload: https://github.com/axboe/fio/blob/master/examples/jesd219.fio, against one pool with 10 RBD devices.
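For context, pointing a fio job at an RBD image through the librbd engine looks roughly like the sketch below; the pool, client, and image names are placeholders, and the actual access pattern (block-size mix and the 60/40 write/read split) comes from the jesd219.fio job linked above.

  [global]
  ioengine=rbd          # fio's librbd engine; no kernel RBD mapping needed
  clientname=admin      # cephx client name (placeholder)
  pool=rbd              # target pool (placeholder)
  rbdname=vm01          # target RBD image (placeholder)
  direct=1
  # workload definition (bssplit, rwmixread/rwmixwrite, etc.) as in examples/jesd219.fio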
When I "fail" one of the OSD servers, I find that the recovery rate (the rate at which the number of degraded PGs decreases) is very low, even though CPU, SAS bandwidth, and network bandwidth are not pegged. I have attached some graphs to illustrate. I had expected Ceph recovery to saturate one of these hardware limits and work aggressively toward reducing the risk of a single-copy situation.
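For reference, a quick way to watch the same degraded counts and recovery rate on a live cluster (standard ceph CLI, nothing specific to this setup):

  # overall cluster state, degraded PG/object counts, recovery throughput
  watch -n 10 ceph -s
  # one-line summary of PG states and recovery I/O
  ceph pg stat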
I could possibly tweak the osd_recovery_* parameters in ceph.conf, or consider 3 copies (having degraded, single-copy PGs in the system for 24+ hours is not production-worthy).
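For illustration, the kind of tuning I have in mind would be something like the following in the [osd] section of ceph.conf; the values are placeholders, not tested recommendations, and higher values trade client I/O latency for faster recovery.

  [osd]
  osd max backfills = 4            # concurrent backfill operations per OSD
  osd recovery max active = 10     # concurrent recovery operations per OSD
  osd recovery op priority = 40    # weight of recovery ops vs. client ops (1-63)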
But I am reporting this as a bug because I found it strange that recovery slows down as the number of degraded PGs decreases (you can see this in the different graphs in the attachment; they taper off toward the end). During all this time the data is at risk (single copy), and under a heavy mixed workload another hardware failure (e.g. a disk dying) would result in significant data loss/corruption. I think the priority of recovery should not be decreased as the number of degraded PGs decreases.
Here are some graphs to illustrate
(please see attached PDF)
Files
Updated by Greg Farnum almost 7 years ago
- Status changed from New to Closed
There's been a bunch of work already to improve recovery handling and prioritization; there's planned work coming up. This ticket doesn't add much to the conversation, so closing! :)