Bug #18967

Cluster can't process any new requests after 3 hosts crashed in 4+2 EC

Added by Xinying Song about 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi, guys. I'm using rgw on a 4+2 EC pool whose failure domain is host level. There are 20 hosts in total, each with 10 SATA disks. What surprised me is that after 3 OSD hosts crashed, the whole cluster was completely unable to process any new requests, even though 17 hosts were still alive.
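To see why 3 hosts down can hurt even with 17 alive, here is a rough back-of-the-envelope check (just a sketch, assuming CRUSH places each PG's six shards on six distinct, uniformly random hosts; the names and numbers are illustrative, not measured from my cluster):

```python
from math import comb

# Toy availability math for a k=4, m=2 EC pool on 20 hosts,
# assuming each PG's 6 shards land on 6 distinct random hosts.
hosts, shards, k, m = 20, 6, 4, 2
crashed = 3

# A 4+2 PG stays readable while it loses at most m = 2 shards,
# so it goes down only if ALL 3 crashed hosts are in its 6-host set.
p_pg_down = comb(crashed, crashed) * comb(hosts - crashed, shards - crashed) / comb(hosts, shards)
print(f"fraction of PGs expected to go down: {p_pg_down:.4f}")  # ~0.0175
```

So under these assumptions only about 1.75% of PGs actually lose more than m shards, yet that small fraction is enough to stall everything once throttles fill up.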

I found this was caused by throttle limits: an OSD keeps every request in memory until it completes. If the crashed machines are not restarted in time (in my case, 10 minutes), this caching strategy leads to a chain reaction until, finally, every OSD reaches its throttle limit.
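The chain reaction can be illustrated with a toy model (purely illustrative Python, not Ceph code; the `OSD` class, `limit`, and `down_pg_fraction` are made-up assumptions): ops targeting an inactive PG never complete, so they hold their throttle slots forever, and even evenly spread traffic eventually saturates every OSD.

```python
import itertools
import random

class OSD:
    """Toy model: an OSD holds every in-flight op until it completes;
    once `limit` ops are queued it accepts nothing new."""
    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0

    def submit(self, pg_active):
        if self.in_flight >= self.limit:
            return "throttled"       # op can't even be queued
        self.in_flight += 1
        if pg_active:
            self.in_flight -= 1      # healthy PG: op completes, slot freed
            return "done"
        return "stuck"               # inactive PG: op never completes

random.seed(1)
osds = [OSD(limit=100) for _ in range(17)]  # 17 surviving hosts, 1 OSD each
down_pg_fraction = 0.02                     # ~2% of PGs lost > m shards

stuck = throttled = 0
for i in itertools.count():
    osd = osds[i % len(osds)]               # spread ops round-robin
    result = osd.submit(pg_active=random.random() > down_pg_fraction)
    if result == "stuck":
        stuck += 1
    elif result == "throttled":
        throttled += 1
        if all(o.in_flight >= o.limit for o in osds):
            break                           # every OSD saturated: total stall

print(f"ops issued: {i + 1}, stuck ops holding all throttle slots: {stuck}")
```

In this model every throttle slot on every OSD ends up held by a stuck op, which matches what I observed: a tiny fraction of inactive PGs eventually blocks requests to perfectly healthy PGs too.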

Using a rack failure domain in the crushmap can keep IO from dropping completely to zero, but it impacts more PGs than before. I'm wondering: is this a bug, or just what Ceph intended to do? Why not return an error before putting the op into the OSD's work queue when the target PG is inactive?

By the way, replicated pools also have this problem when min_size > 1. The Ceph version in my case is 0.94.

Actions #1

Updated by Xinying Song about 7 years ago

Also, restarting the rgw process won't release the OSDs' throttle.

Actions #2

Updated by Xinying Song about 7 years ago

Wow, just found this problem has already been solved:
https://github.com/ceph/ceph/pull/12342

Actions #3

Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Closed
