Feature #51655: No automatic alarm / recovery for unresponsive RGW
Status: Open
Description
Running a multisite environment on Ceph Octopus 15.2.9 (container version: v5.0.10-stable-5.0-octopus-centos-8).
Sites A and B each have their own RGW zone, belonging to the same zonegroup.
Shortly after scaling out site A from 3 to 4 MONs and OSDs, one of the RGW instances on site B became unresponsive. Logs from that RGW instance reported:
[qs-admin@prd16134_st_uplevel_sc2b ~]$ sudo journalctl -u ceph-radosgw@rgw.prd16134_st_uplevel_sc2b.rgw0.service
Jul 06 13:05:26 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:26.960+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.147+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.203+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.343+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.348+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.349+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.465+0000 7f1fb3f5f700 0 RGW-SYNC:data:sync:shard56: ERROR: failed to read remote data log info: ret=-5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.471+0000 7f1fb3f5f700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.484+0000 7f1fb4f61700 0 RGW-SYNC:meta: ERROR: failed to fetch all metadata keys
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.489+0000 7f1fb3f5f700 0 data sync zone:5327dead ERROR: failed to run sync
and then nothing.
Symptoms:
1) S3 requests sent directly to this RGW instance get no response.
2) `radosgw-admin sync status` on site A hangs indefinitely.
At the same time, nothing recognises or flags that anything is broken:
3) No alarms on either site to indicate anything is broken.
4) The systemd service running the failed RGW instance does not recognise that the process is dead, so the unit is still marked as active. No automatic restarts occur.
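As an interim monitoring workaround for symptom (2), a check could wrap the command in `timeout` so it fails fast instead of hanging indefinitely. This is only a sketch; the 30-second budget is an assumption, not a value from the report:

```shell
#!/bin/sh
# Hypothetical monitoring probe: "timeout" kills the command and exits
# with status 124 if sync status stalls past the deadline.
if ! timeout 30 radosgw-admin sync status > /dev/null 2>&1; then
    echo "WARNING: radosgw-admin sync status hung or failed" >&2
fi
```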
I recovered from this state by restarting the failed RGW instance manually; symptoms (1) and (2) went away immediately and both sites appeared healthy again.
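Until Ceph itself raises an alarm or restarts the daemon, an external liveness probe could automate the manual restart described above. A minimal sketch, assuming the RGW serves HTTP on a local endpoint and runs under the systemd unit shown in the logs (both names here are illustrative, not from the report):

```python
import subprocess
import urllib.request
import urllib.error

def rgw_responsive(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if the RGW endpoint answers any HTTP request in time."""
    try:
        urllib.request.urlopen(endpoint, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # Got an HTTP status back (e.g. 403 for anonymous S3 access),
        # so the daemon is alive and serving.
        return True
    except (urllib.error.URLError, OSError):
        # Timeout, connection refused, etc. -- treat as unresponsive.
        return False

def restart_if_dead(endpoint: str, unit: str) -> None:
    """Restart the systemd unit when the endpoint stops responding."""
    if not rgw_responsive(endpoint):
        subprocess.run(["systemctl", "restart", unit], check=True)

if __name__ == "__main__":
    # Hypothetical endpoint and unit name; substitute your own.
    restart_if_dead(
        "http://localhost:8080/",
        "ceph-radosgw@rgw.prd16134_st_uplevel_sc2b.rgw0.service",
    )
```

Run from cron or a systemd timer, this papers over the missing health check until the service itself detects the hang.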
Expected outcome:
At the very least, I would have expected Ceph to raise an alarm on site B warning that the RGW instance was unresponsive.
Updated by Casey Bodley over 2 years ago
- Related to Bug #52568: RadosGW's hang when OSD's are in slow OPS state added