Bug #8907
closed
All user traffic starts to get 500 after some time if (m+1) OSDs of one EC PG are down
Description
EC pool configuration: k=8, m=3
Steps to reproduce:
1. Stop 4 OSDs (m+1) of one EC PG so that they are marked down and the PG is no longer writable.
2. Keep running load against radosgw for some time (using the S3 API).
3. At first, only requests to this PG get 500, after 30 seconds. This is expected, since they are slow requests.
4. Eventually, if the load keeps running, all requests get 500.
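For reference, the arithmetic behind step 1 can be sketched as follows. An erasure-coded object is stored as k data chunks plus m coding chunks, and any k of them are enough to reconstruct it, so a PG tolerates at most m lost shards. The function name `pg_can_serve_io` is hypothetical, and the model is deliberately simplified: it ignores the pool's min_size setting and peering details.

```python
def pg_can_serve_io(k: int, m: int, down_osds: int) -> bool:
    """Simplified model (an assumption, not Ceph code): a PG in an EC pool
    can reconstruct data only while at least k of its k+m shards are up."""
    shards_up = (k + m) - down_osds
    return shards_up >= k


# Pool from this report: k=8, m=3
print(pg_can_serve_io(8, 3, down_osds=3))  # m OSDs down: still serviceable
print(pg_can_serve_io(8, 3, down_osds=4))  # m+1 OSDs down: PG is blocked
```

With k=8 and m=3, stopping 4 OSDs leaves 7 shards, one fewer than the 8 needed, which is why requests to that PG stall.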
In the Apache access log, the HTTP status of every request is 500, and in the Apache
error log FastCGI reports the following error:
(11)Resource temporarily unavailable: FastCGI: failed to connect to server "s3gw.fcgi": connect() failed
At this point, the radosgw log contains no user traffic entries, only many timeout
messages like the following:
2014-07-23 10:37:36.509795 7fbd44f33700 1 heartbeat_map is_healthy
'RGWProcess::m_tp thread 0x7fbc165f6700' had timed out after 600
2014-07-23 10:37:36.509825 7fbd44f33700 1 heartbeat_map is_healthy
'RGWProcess::m_tp thread 0x7fbc28fa7700' had timed out after 600
It looks like the radosgw request queue is full and all of its worker threads are used up, so it cannot accept any user traffic.
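The suspected mechanism can be illustrated with a toy model. This is only a sketch under stated assumptions, not radosgw code: the function `simulate`, the round-robin mapping of requests to PGs, and the small worker count are all hypothetical. A worker that picks up a request for the blocked PG never returns, and once every worker is stuck, requests to healthy PGs are rejected too.

```python
def simulate(num_workers, num_pgs, blocked_pg, num_requests):
    """Return one outcome per request: 'ok', 'stuck', or 'rejected'."""
    free_workers = num_workers
    outcomes = []
    for i in range(num_requests):
        pg = i % num_pgs  # round-robin placement, purely illustrative
        if free_workers == 0:
            # No worker thread left: the frontend cannot hand off the request.
            outcomes.append("rejected")
        elif pg == blocked_pg:
            # The worker blocks on the inactive PG and is never released.
            free_workers -= 1
            outcomes.append("stuck")
        else:
            # The worker finishes immediately and is reused.
            outcomes.append("ok")
    return outcomes


print(simulate(num_workers=2, num_pgs=4, blocked_pg=0, num_requests=10))
```

In this model only a quarter of the requests target the blocked PG, yet after two of them arrive every later request fails, which matches the progression from partial to total 500s described above.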
Updated by Guang Yang almost 10 years ago
The problem, as David explained, is that many ops are stuck on the OSD side, which in turn hangs the threads in radosgw and makes the entire cluster unavailable.
What is the harm if the OSD simply replies to the client when the PG is inactive, and then lets the user retry if he or she would like to?
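A toy comparison of the two behaviors, as a sketch only: the function `serve` and its `fail_fast` flag are hypothetical and do not correspond to any existing OSD or radosgw option. With `fail_fast` enabled, a request to an inactive PG gets an immediate error and the worker stays available; without it, the worker is held indefinitely.

```python
def serve(num_workers, targets_inactive_pg, fail_fast):
    """targets_inactive_pg: one bool per request, True if it hits the inactive PG."""
    free = num_workers
    results = []
    for inactive in targets_inactive_pg:
        if free == 0:
            results.append("rejected")  # every worker is already stuck
        elif inactive and fail_fast:
            results.append("error")     # immediate reply; the client may retry
        elif inactive:
            free -= 1                   # worker blocks waiting on the PG
            results.append("stuck")
        else:
            results.append("ok")
    return results


requests = [True, False, True, False, False]
print(serve(2, requests, fail_fast=False))
print(serve(2, requests, fail_fast=True))
```

In the blocking case the last two requests, which target healthy PGs, are rejected; with the fail-fast reply they succeed.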
Updated by Guang Yang over 9 years ago
- Category set to OSD
- Priority changed from Normal to High
- Source changed from Community (user) to Community (dev)
Updated by Guang Yang over 8 years ago
- Status changed from New to Duplicate
- Regression set to No
Duplicate of #12623