Bug #18967

Cluster can't process any new requests after 3 hosts crashed in 4+2 EC

Added by Xinying Song about 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi, guys. I'm using rgw on a 4+2 EC pool whose failure domain is host level. There are 20 hosts in total, each with 10 SATA disks. What surprised me is that after 3 OSD hosts crashed, the whole cluster was completely unable to process any new requests, even though 17 hosts were still alive.
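To see why 3 hosts down can hurt even with 17 alive, here is a rough back-of-the-envelope check (just a sketch, assuming CRUSH places each PG's six shards on six distinct, uniformly random hosts; the names and numbers are illustrative, not measured from my cluster):

```python
from math import comb

# Toy availability math for a k=4, m=2 EC pool on 20 hosts,
# assuming each PG's 6 shards land on 6 distinct random hosts.
hosts, shards, k, m = 20, 6, 4, 2
crashed = 3

# A 4+2 PG stays readable while it loses at most m = 2 shards,
# so it goes down only if ALL 3 crashed hosts are in its 6-host set.
p_pg_down = comb(crashed, crashed) * comb(hosts - crashed, shards - crashed) / comb(hosts, shards)
print(f"fraction of PGs expected to go down: {p_pg_down:.4f}")  # ~0.0175
```

So under these assumptions only about 1.75% of PGs actually lose more than m shards, yet that small fraction is enough to stall everything once throttles fill up.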

I found this was caused by throttle limits: an OSD keeps every request in memory until it completes. If the crashed machines are not restarted in time (in my case, 10 minutes), this caching strategy leads to a chain reaction until, finally, every OSD reaches its throttle limit.
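The chain reaction can be illustrated with a toy model (purely illustrative Python, not Ceph code; the `OSD` class, `limit`, and `down_pg_fraction` are made-up assumptions): ops targeting an inactive PG never complete, so they hold their throttle slots forever, and even evenly spread traffic eventually saturates every OSD.

```python
import itertools
import random

class OSD:
    """Toy model: an OSD holds every in-flight op until it completes;
    once `limit` ops are queued it accepts nothing new."""
    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0

    def submit(self, pg_active):
        if self.in_flight >= self.limit:
            return "throttled"       # op can't even be queued
        self.in_flight += 1
        if pg_active:
            self.in_flight -= 1      # healthy PG: op completes, slot freed
            return "done"
        return "stuck"               # inactive PG: op never completes

random.seed(1)
osds = [OSD(limit=100) for _ in range(17)]  # 17 surviving hosts, 1 OSD each
down_pg_fraction = 0.02                     # ~2% of PGs lost > m shards

stuck = throttled = 0
for i in itertools.count():
    osd = osds[i % len(osds)]               # spread ops round-robin
    result = osd.submit(pg_active=random.random() > down_pg_fraction)
    if result == "stuck":
        stuck += 1
    elif result == "throttled":
        throttled += 1
        if all(o.in_flight >= o.limit for o in osds):
            break                           # every OSD saturated: total stall

print(f"ops issued: {i + 1}, stuck ops holding all throttle slots: {stuck}")
```

In this model every throttle slot on every OSD ends up held by a stuck op, which matches what I observed: a tiny fraction of inactive PGs eventually blocks requests to perfectly healthy PGs too.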

Using a rack failure domain in the crushmap can keep IO from dropping completely to zero, but it impacts more PGs than before. I'm wondering: is this a bug, or just what Ceph intended to do? Why not return an error before putting the op into the OSD's work queue when the target PG is inactive?

By the way, replicated pools also have this problem when min_size > 1. The Ceph version in my case is 0.94.

Actions #1

Updated by Xinying Song about 7 years ago

Also, restarting the rgw process won't release the OSDs' throttle.

Actions #2

Updated by Xinying Song about 7 years ago

Wow, just found this problem has already been solved:
https://github.com/ceph/ceph/pull/12342

Actions #3

Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Closed
