Feature #51655

open

No automatic alarm / recovery for unresponsive RGW

Added by David Piper almost 3 years ago. Updated about 2 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Running a multisite environment on Ceph Octopus 15.2.9 (container version: v5.0.10-stable-5.0-octopus-centos-8).
Site A and B each have their own rgw zone, belonging to the same zonegroup.

Shortly after scaling out site A from 3 to 4 MONs and OSDs, one of the RGW instances on site B became unresponsive. Logs from the RGW instance reported:

[qs-admin@prd16134_st_uplevel_sc2b ~]$ sudo journalctl -u _st_uplevel_sc2b.rgw0.service
Jul 06 13:05:26 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:26.960+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.147+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.203+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.343+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.348+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.349+0000 7f1fb4f61700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.465+0000 7f1fb3f5f700 0 RGW-SYNC:data:sync:shard56: ERROR: failed to read remote data log info: ret=-5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.471+0000 7f1fb3f5f700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.484+0000 7f1fb4f61700 0 RGW-SYNC:meta: ERROR: failed to fetch all metadata keys
Jul 06 13:05:27 prd16134_st_uplevel_sc2b docker1304: 2021-07-06T13:05:27.489+0000 7f1fb3f5f700 0 data sync zone:5327dead ERROR: failed to run sync

and then nothing.

Symptoms:

1) S3 requests sent directly to this RGW instance get no response.
2) `radosgw-admin sync status` on site A hangs indefinitely.

At the same time, nothing recognises that anything is broken:
3) No alarms on either site to indicate anything is wrong.
4) The systemd service running the failed RGW instance has failed to recognise that the process is dead, so it is still marked as active. No automatic restart occurs.

I recovered from this state by restarting the failed RGW instance manually. Symptoms (1) and (2) went away immediately and both sites appeared to be healthy again.
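For reference, a minimal external liveness probe along these lines would have caught symptom (1) even while systemd still reported the unit as active. This is only a sketch: the URL, port and exit-code convention are assumptions for illustration, not part of any existing Ceph tooling.

#!/usr/bin/env python3
# Minimal RGW liveness probe (sketch). Assumes the RGW frontend listens on
# http://localhost:8080/ -- adjust host/port for your deployment.
import sys
import urllib.error
import urllib.request

RGW_URL = "http://localhost:8080/"  # hypothetical endpoint
TIMEOUT = 10                        # seconds before declaring the RGW unresponsive

try:
    # Any HTTP response (even an error status for an anonymous request)
    # proves the frontend is still answering; only a timeout or connection
    # error counts as "unresponsive".
    urllib.request.urlopen(RGW_URL, timeout=TIMEOUT)
    print("RGW responded")
except urllib.error.HTTPError as exc:
    print(f"RGW responded with HTTP {exc.code}")
except Exception as exc:
    print(f"RGW unresponsive: {exc}", file=sys.stderr)
    sys.exit(1)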

Expected outcome:

At the very least, I'd have expected Ceph to raise an alarm on site B to warn that the RGW instance was unresponsive.
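Until Ceph raises such an alarm itself, one possible stop-gap is an external watchdog that restarts the RGW unit when a probe like the one above keeps failing. Again only a sketch: the unit name is copied from the journalctl output in this report and is deployment-specific, and the probe path is hypothetical.

#!/usr/bin/env python3
# Watchdog sketch: restart the RGW systemd unit after several consecutive
# probe failures.
import subprocess
import time

UNIT = "_st_uplevel_sc2b.rgw0.service"   # deployment-specific unit name
PROBE = ["/usr/local/bin/rgw_probe.py"]  # hypothetical path to the probe above
FAILURES_BEFORE_RESTART = 3
INTERVAL = 30  # seconds between probes

failures = 0
while True:
    if subprocess.run(PROBE).returncode != 0:
        failures += 1
        if failures >= FAILURES_BEFORE_RESTART:
            # Manual restart is the recovery that worked in this report.
            subprocess.run(["systemctl", "restart", UNIT], check=False)
            failures = 0
    else:
        failures = 0
    time.sleep(INTERVAL)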


Related issues: 1 (1 open, 0 closed)

Related to rgw - Bug #52568: RadosGW's hang when OSD's are in slow OPS state (New)

Actions #1

Updated by Sebastian Wagner over 2 years ago

  • Project changed from Ceph to rgw
Actions #2

Updated by Casey Bodley over 2 years ago

  • Related to Bug #52568: RadosGW's hang when OSD's are in slow OPS state added
Actions #3

Updated by Casey Bodley about 2 years ago

  • Tracker changed from Bug to Feature
