Bug #8907: All user traffics will start to get 500 after some time if (m+1) OSDs of one EC PG are down - Ceph - Ceph

Bug #8907

closed

All user traffics will start to get 500 after some time if (m+1) OSDs of one EC PG are down

Added by Zhi Zhang almost 10 years ago. Updated over 8 years ago.

Status: Duplicate
Priority: High
Assignee:
Category: OSD
Target version:
% Done:
Source: Community (dev)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

EC pool configuration: k=8, m=3

Steps to reproduce:

1. Stop 4 OSDs (m+1) of one EC PG so that they are down; this PG is then no longer writable.
2. Keep running load against radosgw for some time (using the S3 API).
3. At first, only requests to this PG get a 500, after 30 sec. This is expected, since they are slow requests.
4. Eventually, however, all requests get a 500 if the load keeps running.
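The reason step 1 makes the PG unwritable can be checked with simple shard arithmetic. The sketch below only encodes the generic erasure-code property that an object needs any k of its k+m shards; it is not Ceph's peering logic, and the function names are made up for illustration.

```python
def surviving_shards(k, m, down):
    """Shards still available when `down` of the k+m OSDs in the PG are down."""
    return k + m - down


def can_reconstruct(k, m, down):
    """Generic EC property: data is recoverable from any k of the k+m shards."""
    return surviving_shards(k, m, down) >= k


if __name__ == "__main__":
    k, m = 8, 3  # pool configuration from this report
    for down in range(m + 2):
        print(down, surviving_shards(k, m, down), can_reconstruct(k, m, down))
```

With k=8 and m=3, three down OSDs still leave 8 shards, while a fourth leaves only 7, which is below k.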

In the apache access log, the HTTP status of every request is 500. In the apache
error log, FastCGI reports the following error:

(11)Resource temporarily unavailable: FastCGI: failed to connect to server "s3gw.fcgi": connect() failed

At this point the radosgw log contains no user traffic entries. It has many
timed-out messages like the following:

2014-07-23 10:37:36.509795 7fbd44f33700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fbc165f6700' had timed out after 600
2014-07-23 10:37:36.509825 7fbd44f33700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fbc28fa7700' had timed out after 600

It looks like the radosgw queue is full and all of its worker threads are used up, so it can't accept any more user traffic.
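A toy model of this suspected exhaustion, to show why a small share of stuck requests can eventually starve everything. It is not radosgw code: the pool size, the fraction of requests hitting the broken PG, and the assumption that a stuck request never gives its worker back are all illustrative.

```python
import random


def simulate(num_threads, stuck_fraction, num_requests, seed=0):
    """Count served and rejected requests for a fixed-size worker pool.

    Assumption: a request to the inactive PG holds its worker forever,
    while any other request finishes and frees its worker immediately.
    """
    rng = random.Random(seed)
    free = num_threads
    served = rejected = 0
    for _ in range(num_requests):
        if free == 0:
            rejected += 1  # no worker left: the client would see a 500
            continue
        if rng.random() < stuck_fraction:
            free -= 1      # worker blocks on the inactive PG and is never returned
        else:
            served += 1    # healthy request completes, worker is reused
    return served, rejected, free


if __name__ == "__main__":
    served, rejected, free = simulate(num_threads=100, stuck_fraction=0.01,
                                      num_requests=100_000)
    print("served:", served, "rejected:", rejected, "free workers:", free)
```

Even when only 1% of requests target the broken PG, the pool drains one worker at a time, and after that every request is rejected regardless of which PG it targets.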

Updated by Guang Yang almost 10 years ago

The problem, as David explained, is that many OPs are stuck on the OSD side and in turn hang the threads at radosgw, which makes the entire cluster unavailable.

What is the harm if the OSD just replies to the client when the PG is inactive, and then lets the user retry if he/she would like to?
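A minimal sketch of what the client side of that proposal could look like, assuming the OSD returned an error right away instead of holding the op. `PGInactiveError` and `retry` are hypothetical names, not part of librados or radosgw.

```python
import time


class PGInactiveError(Exception):
    """Hypothetical error returned at once when the target PG is inactive."""


def retry(op, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run op(); on PGInactiveError wait with exponential backoff and retry.

    The last failure is re-raised, so the caller's thread is released
    after a bounded time rather than blocking indefinitely.
    """
    for attempt in range(attempts):
        try:
            return op()
        except PGInactiveError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The point of the design is that the decision to wait moves to the caller, which can bound the total wait, so a worker thread is never pinned by a PG that cannot serve I/O.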
