Feature #51655: No automatic alarm / recovery for unresponsive RGW
Status: Open
Description
Running a multisite environment on Ceph Octopus 15.2.9 (container version: v5.0.10-stable-5.0-octopus-centos-8).
Sites A and B each have their own RGW zone, belonging to the same zonegroup.
Shortly after scaling out site A from 3 to 4 MONs and OSDs, one of the RGW instances on site B became unresponsive. Logs from that RGW instance reported:
[qs-admin@prd16134_st_uplevel_sc2b ~]$ sudo journalctl -u ceph-radosgw@rgw.prd16134_st_uplevel_sc2b.rgw0.service
Jul 06 13:05:26 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:26.960+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.147+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.203+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.343+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.348+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.349+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.465+0000 7f1fb3f5f700 0 RGW-SYNC:data:sync:shard56: ERROR: failed to read remote data log info: ret=-5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.471+0000 7f1fb3f5f700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.484+0000 7f1fb4f61700 0 RGW-SYNC:meta: ERROR: failed to fetch all metadata keys
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.489+0000 7f1fb3f5f700 0 data sync zone:5327dead ERROR: failed to run sync
and then nothing.
Symptoms:
1) S3 requests sent directly to this RGW instance get no response.
2) `radosgw-admin sync status` on site A hangs indefinitely.
At the same time, nothing recognises or flags that anything is broken:
3) No alarms on either site to indicate anything is broken.
4) The systemd service running the failed RGW instance does not recognise that the process is dead, so the unit is still marked as active. No automatic restarts occur.
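As an interim monitoring workaround for symptom (2), a check could wrap the command in `timeout` so it fails fast instead of hanging indefinitely. This is only a sketch; the 30-second budget is an assumption, not a value from the report:

```shell
#!/bin/sh
# Hypothetical monitoring probe: "timeout" kills the command and exits
# with status 124 if sync status stalls past the deadline.
if ! timeout 30 radosgw-admin sync status > /dev/null 2>&1; then
    echo "WARNING: radosgw-admin sync status hung or failed" >&2
fi
```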
I recovered from this state by restarting the failed RGW instance manually; symptoms (1) and (2) went away immediately and both sites appeared healthy again.
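Until Ceph itself raises an alarm or restarts the daemon, an external liveness probe could automate the manual restart described above. A minimal sketch, assuming the RGW serves HTTP on a local endpoint and runs under the systemd unit shown in the logs (both names here are illustrative, not from the report):

```python
import subprocess
import urllib.request
import urllib.error

def rgw_responsive(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if the RGW endpoint answers any HTTP request in time."""
    try:
        urllib.request.urlopen(endpoint, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # Got an HTTP status back (e.g. 403 for anonymous S3 access),
        # so the daemon is alive and serving.
        return True
    except (urllib.error.URLError, OSError):
        # Timeout, connection refused, etc. -- treat as unresponsive.
        return False

def restart_if_dead(endpoint: str, unit: str) -> None:
    """Restart the systemd unit when the endpoint stops responding."""
    if not rgw_responsive(endpoint):
        subprocess.run(["systemctl", "restart", unit], check=True)

if __name__ == "__main__":
    # Hypothetical endpoint and unit name; substitute your own.
    restart_if_dead(
        "http://localhost:8080/",
        "ceph-radosgw@rgw.prd16134_st_uplevel_sc2b.rgw0.service",
    )
```

Run from cron or a systemd timer, this papers over the missing health check until the service itself detects the hang.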
Expected outcome:
At the very least, I would have expected Ceph to raise an alarm on site B warning that the RGW instance was unresponsive.
Updated by Casey Bodley over 2 years ago
- Related to Bug #52568: RadosGW's hang when OSD's are in slow OPS state added