Bug #8907

closed

All user traffic starts to get 500 after some time if (m+1) OSDs of one EC PG are down

Added by Zhi Zhang almost 10 years ago. Updated over 8 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

EC pool configuration: k=8, m=3

Steps to reproduce:

1. Stop 4 OSDs (m+1) of one EC PG so that they are marked down; the PG is then
no longer writable.
2. Keep running load against radosgw for some time (using the S3 API).
3. At first, only requests to this PG get 500, after 30 sec. This is expected,
since they are slow requests.
4. Eventually, however, all requests get 500 if the load keeps running.
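
For reference, the shard arithmetic behind step 1 can be sketched as follows. This is a simplified illustration, not Ceph code: the function name and the state labels are made up for this example, and it ignores pool settings such as min_size, which can block I/O earlier.

```python
def ec_pg_state(k: int, m: int, osds_down: int) -> str:
    """Classify an EC PG by how many of its k+m shards are unavailable.

    Hypothetical helper: the labels returned here are illustrative and
    are not actual Ceph PG state names.
    """
    total = k + m
    if not 0 <= osds_down <= total:
        raise ValueError("osds_down must be between 0 and k+m")
    available = total - osds_down
    if available < k:
        # Fewer than k shards remain, so objects cannot be reconstructed.
        return "unavailable"
    if osds_down > 0:
        # Still at least k shards: data is recoverable but redundancy is reduced.
        return "degraded"
    return "healthy"


# With k=8, m=3 the PG has 11 shards and survives at most 3 failures.
print(ec_pg_state(8, 3, 3))  # degraded
print(ec_pg_state(8, 3, 4))  # unavailable
```

With k=8 and m=3, stopping 4 OSDs leaves 7 shards, one short of the 8 needed, which is why any request that maps to this PG blocks.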

In the apache access log, the HTTP status of every request is 500. In the
apache error log, fastcgi reports the following error:

(11)Resource temporarily unavailable: FastCGI: failed to connect to server "s3gw.fcgi": connect() failed

At this point, the radosgw log contains no user traffic entries at all, only
many timeout messages like the following:

2014-07-23 10:37:36.509795 7fbd44f33700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fbc165f6700' had timed out after 600
2014-07-23 10:37:36.509825 7fbd44f33700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fbc28fa7700' had timed out after 600
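
To see how many distinct worker threads are stuck, messages of this shape can be counted with a small script. This is a rough sketch: the regular expression is derived only from the two lines quoted above, so other radosgw versions may format the message differently, and the function name is made up for this example.

```python
import re

# Pattern inferred from the log lines quoted in this report (an assumption,
# not a documented log format). \s+ also tolerates lines wrapped by a pager.
TIMEOUT_RE = re.compile(
    r"heartbeat_map is_healthy\s+'(?P<pool>\S+) thread (?P<tid>0x[0-9a-f]+)'"
    r"\s+had timed out after (?P<secs>\d+)"
)


def timed_out_threads(log_text: str) -> dict:
    """Map each timed-out thread id to the timeout (in seconds) it exceeded."""
    return {m["tid"]: int(m["secs"]) for m in TIMEOUT_RE.finditer(log_text)}


SAMPLE = """\
2014-07-23 10:37:36.509795 7fbd44f33700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fbc165f6700' had timed out after 600
2014-07-23 10:37:36.509825 7fbd44f33700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fbc28fa7700' had timed out after 600
"""

stuck = timed_out_threads(SAMPLE)
print(len(stuck), "distinct threads timed out")
```

If the number of distinct thread ids reaches the size of the radosgw thread pool, every worker is blocked.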

It looks like the radosgw request queue is full and all of its worker threads are used up, so it can't accept any user traffic.
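
The suspected mechanism can be illustrated with a toy model. This is not radosgw code: it assumes a fixed-size worker pool, that a worker sent to the down PG never returns, and that requests are spread round-robin across PGs. All names and numbers are hypothetical.

```python
def simulate(workers: int, num_pgs: int, blocked_pg: int, requests: int):
    """Count requests that succeed, get stuck, or are rejected.

    Toy model: request i maps to PG (i % num_pgs). A request to the blocked
    PG occupies one worker forever; any other request finishes immediately.
    """
    free = workers
    ok = stuck = rejected = 0
    for i in range(requests):
        if free == 0:
            # No worker left: the client gets an error whichever PG it wanted.
            rejected += 1
            continue
        if i % num_pgs == blocked_pg:
            # Worker blocks on the down PG and is never returned to the pool.
            free -= 1
            stuck += 1
        else:
            # Healthy PG: the worker goes straight back to the pool.
            ok += 1
    return ok, stuck, rejected


# 4 workers, 8 PGs, PG 0 is down, 100 requests.
print(simulate(4, 8, 0, 100))  # (21, 4, 75)
```

Even though only one PG in eight is down in this model, the pool drains one worker at a time, and after the fourth stuck request every later request fails, which matches the behavior described in step 4.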
