Bug #52814
RGW services stop responding when datacenter down
Status: New
Priority: Normal
Assignee: Or Friedmann
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We performed a planned shutdown of half of our cluster hosts in order to move them to a new datacenter. Immediately afterwards, all RGW services stopped responding, which caused a service outage because production depends on Ceph. We did this first in a test environment, so it was not catastrophic, but we still have to move the rest of our datacenter, and almost every host there is real production.
Here is the cluster status while the hosts were down:
$ ceph -s
  cluster:
    id:     b410bef1-f09c-521c-b3e9-0198cb041e6c
    health: HEALTH_WARN
            2 hosts fail cephadm check
            8 large omap objects
            1/3 mons down, quorum cephm00a,cephm00c
            noout,norecover flag(s) set
            2 datacenters (38 osds) down
            38 osds down
            4 hosts (38 osds) down
            Reduced data availability: 88 pgs inactive
            Degraded data redundancy: 315300237/646890369 objects degraded (48.741%), 2132 pgs degraded, 2217 pgs undersized
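For anyone comparing this status with their own cluster, here is a minimal Python sketch that pulls the inactive PG count and the degraded object ratio out of the two health lines above. The `summarize` helper and its dictionary keys are hypothetical names chosen for illustration, and the regular expressions only assume the wording shown in this report.

```python
import re

# Two health lines copied from the `ceph -s` output above.
health_lines = [
    "Reduced data availability: 88 pgs inactive",
    "Degraded data redundancy: 315300237/646890369 objects degraded "
    "(48.741%), 2132 pgs degraded, 2217 pgs undersized",
]


def summarize(lines):
    """Extract inactive PG count and degraded object percentage (hypothetical helper)."""
    text = "\n".join(lines)
    inactive = re.search(r"(\d+) pgs inactive", text)
    degraded = re.search(r"(\d+)/(\d+) objects degraded", text)
    result = {"inactive_pgs": int(inactive.group(1)) if inactive else 0}
    if degraded:
        num, den = int(degraded.group(1)), int(degraded.group(2))
        # Recompute the ratio instead of trusting the printed percentage.
        result["degraded_pct"] = round(100.0 * num / den, 3)
    return result


print(summarize(health_lines))
```

The recomputed percentage should agree with the 48.741% that Ceph printed. The 88 inactive PGs are probably the more relevant number here, since an inactive PG does not serve client I/O until it becomes active again.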
For RGW, I am not sure what to share, since it simply stopped responding. Requests keep arriving, but none of them are ever answered:
2021-10-05T04:10:16.863+0000 7f6ab45f4700 1 ====== starting new request req=0x7f6a2ab75620 =====
2021-10-05T04:10:29.833+0000 7f6a5d546700 1 ====== starting new request req=0x7f6a2aaf4620 =====
2021-10-05T04:10:42.775+0000 7f6af1e6f700 1 ====== starting new request req=0x7f6a2aa73620 =====
2021-10-05T04:10:55.820+0000 7f6a53d33700 1 ====== starting new request req=0x7f6a2a9f2620 =====
2021-10-05T04:11:08.782+0000 7f6adfe4b700 1 ====== starting new request req=0x7f6a2a971620 =====
2021-10-05T04:11:21.790+0000 7f6acbe23700 1 ====== starting new request req=0x7f6a2a8f0620 =====
2021-10-05T04:11:34.822+0000 7f6a7b582700 1 ====== starting new request req=0x7f6a2a86f620 =====
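To confirm from a full log that requests start but never finish, something like the following Python sketch could be used. It matches the "starting new request" lines shown above against completion lines. The completion pattern (`req done req=...`) is an assumption about the RGW log format, since no such line appears in this report, and `pending_requests` is a hypothetical helper name.

```python
import re

# Matches the request-start lines shown in this report.
START_RE = re.compile(r"====== starting new request req=(0x[0-9a-f]+) =====")
# ASSUMPTION: format of the completion line; none appears in this report.
DONE_RE = re.compile(r"====== req done req=(0x[0-9a-f]+) ")


def pending_requests(lines):
    """Return {request id: start timestamp} for requests with no completion line."""
    pending = {}
    for line in lines:
        m = START_RE.search(line)
        if m:
            # The first whitespace-separated field is the timestamp.
            pending[m.group(1)] = line.split()[0]
            continue
        m = DONE_RE.search(line)
        if m:
            # Request finished; forget it (the address may be reused later).
            pending.pop(m.group(1), None)
    return pending


sample = [
    "2021-10-05T04:10:16.863+0000 7f6ab45f4700 1 ====== starting new request req=0x7f6a2ab75620 =====",
    "2021-10-05T04:10:29.833+0000 7f6a5d546700 1 ====== starting new request req=0x7f6a2aaf4620 =====",
]
for req, started in pending_requests(sample).items():
    print(req, "started at", started, "and has no completion line")
```

On the excerpt above, every request would be reported as pending, which matches the observed behaviour.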
The "radosgw-admin" command does not respond either. I guess that is expected, since none of the daemons are responding.