Bug #53029

radosgw-admin fails on "sync status" if a single RGW process is down

Added by David Piper over 2 years ago. Updated 30 days ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags: multisite multisite-backlog
Backport: reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We're using ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable) in a containerized deployment.
We have two RGW zones in the same zonegroup.
Each zone is hosted in a separate ceph cluster, and has four RGW endpoints.
The master zone's endpoints are configured as endpoints for the zonegroup.
(We're also using pubsub zones but I don't think this is related.)

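For context, the layout described above can be confirmed from inside either cluster's RGW container with the standard radosgw-admin inspection commands (run through the same docker wrapper as the outputs below); nothing here is specific to our deployment:

radosgw-admin realm get        # geored_realm
radosgw-admin zonegroup get    # geored_zg, with siteA as master zone
radosgw-admin zone get         # the local zone (siteA or siteB)
radosgw-admin period get       # current period, including all zone endpoints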
When a single RGW endpoint in the master zone is stopped or crashes, the 'radosgw-admin sync status' command returns an error on the cluster hosting the non-master zone:

[qs-admin@newbrunswick0 ~]$ radosgw-admin sync status
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
          realm 9d76aa86-99d1-41c3-966f-cc97eab2bfb3 (geored_realm)
      zonegroup 384c36ac-374b-4ae2-bf9f-ae951f25920a (geored_zg)
           zone b113b104-9c84-44ff-9058-4658c6e1df52 (siteB)
  metadata sync syncing
                full sync: 0/64 shards
                failed to fetch master sync status: (5) Input/output error
      data sync source: 0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1 (siteA)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                source: 9be18697-7423-41a7-a338-926aa938f9de (siteBpubsub)
                        not syncing from zone
                source: a2a5b39a-3df5-4be3-9270-68bf90bc2a51 (siteApubsub)
                        not syncing from zone

This is easy to reproduce by stopping any of the RGW containers in the master zone (a minimal repro sequence is sketched after the output below). As far as we can tell, sync itself still takes place while the container is down. Once the container is restarted, the sync status command returns normally again:

[qs-admin@newbrunswick0 ~]$ radosgw-admin sync status
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
          realm 9d76aa86-99d1-41c3-966f-cc97eab2bfb3 (geored_realm)
      zonegroup 384c36ac-374b-4ae2-bf9f-ae951f25920a (geored_zg)
           zone b113b104-9c84-44ff-9058-4658c6e1df52 (siteB)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1 (siteA)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [90,101,107]
                        oldest incremental change not applied: 2021-10-25T13:59:52.007974+0000 [90]
                        6 shards are recovering
                        recovering shards: [2,3,54,57,107,116]
                source: 9be18697-7423-41a7-a338-926aa938f9de (siteBpubsub)
                        not syncing from zone
                source: a2a5b39a-3df5-4be3-9270-68bf90bc2a51 (siteApubsub)
                        not syncing from zone

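For completeness, the repro sequence is roughly the following; the container name is illustrative (ours follow the ceph-rgw-* naming shown in the wrapper trace above), and any of the four master-zone gateways reproduces it:

# On a master-zone (siteA) host: stop one of the four RGW containers.
sudo docker stop <ceph-rgw-siteA-container>

# On a non-master-zone (siteB) host: 'sync status' now fails with
# "failed to fetch master sync status: (5) Input/output error".
radosgw-admin sync status

# Restart the container; 'sync status' on siteB returns normally again.
sudo docker start <ceph-rgw-siteA-container>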
Unless we have misconfigured something, this looks like a bug: the master zone still has three healthy RGW endpoints, so radosgw-admin should be able to fetch the master sync status from any of them instead of returning an I/O error.

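To see which zonegroup endpoint the admin command actually contacts, and how the request fails, the debug levels can be raised for a single run; these are the standard Ceph debug options, nothing specific to our setup:

# One-off run with verbose RGW and messenger logging, inside the RGW container.
radosgw-admin sync status --debug-rgw=20 --debug-ms=1 2>&1 | tee /tmp/sync-status-debug.log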
RGW config:

(newbrunswick0 = 10.245.0.40)

[qs-admin@newbrunswick0 ~]$ radosgw-admin zonegroup get
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
{
    "id": "384c36ac-374b-4ae2-bf9f-ae951f25920a",
    "name": "geored_zg",
    "api_name": "geored_zg",
    "is_master": "true",
    "endpoints": [
        "https://10.245.0.20:7480",
        "https://10.245.0.21:7480",
        "https://10.245.0.22:7480",
        "https://10.245.0.23:7480"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1",
    "zones": [
        {
            "id": "0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1",
            "name": "siteA",
            "endpoints": [
                "https://10.245.0.20:7480",
                "https://10.245.0.21:7480",
                "https://10.245.0.22:7480",
                "https://10.245.0.23:7480"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "9be18697-7423-41a7-a338-926aa938f9de",
            "name": "siteBpubsub",
            "endpoints": [
                "https://10.245.0.40:7481",
                "https://10.245.0.41:7481",
                "https://10.245.0.42:7481",
                "https://10.245.0.43:7481"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "pubsub",
            "sync_from_all": "false",
            "sync_from": [
                "siteB"
            ],
            "redirect_zone": ""
        },
        {
            "id": "a2a5b39a-3df5-4be3-9270-68bf90bc2a51",
            "name": "siteApubsub",
            "endpoints": [
                "https://10.245.0.20:7481",
                "https://10.245.0.21:7481",
                "https://10.245.0.22:7481",
                "https://10.245.0.23:7481"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "pubsub",
            "sync_from_all": "false",
            "sync_from": [
                "siteA"
            ],
            "redirect_zone": ""
        },
        {
            "id": "b113b104-9c84-44ff-9058-4658c6e1df52",
            "name": "siteB",
            "endpoints": [
                "https://10.245.0.40:7480",
                "https://10.245.0.41:7480",
                "https://10.245.0.42:7480",
                "https://10.245.0.43:7480"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": [],
            "storage_classes": [
                "STANDARD"
            ]
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "9d76aa86-99d1-41c3-966f-cc97eab2bfb3",
    "sync_policy": {
        "groups": []
    }
}

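As an untested interim workaround (a sketch only, we have not verified it), the stopped gateway's address could presumably be removed from the siteA zone and zonegroup endpoint lists until it is back, e.g. if 10.245.0.23 is the one that is down:

# NOT verified: temporarily drop the stopped gateway from the endpoint lists
# on the master zone and commit a new period; re-add it once it is back up.
radosgw-admin zone modify --rgw-zone=siteA \
    --endpoints=https://10.245.0.20:7480,https://10.245.0.21:7480,https://10.245.0.22:7480
radosgw-admin zonegroup modify --rgw-zonegroup=geored_zg \
    --endpoints=https://10.245.0.20:7480,https://10.245.0.21:7480,https://10.245.0.22:7480
radosgw-admin period update --commit

That would not fix the underlying behaviour, though; the expectation remains that the command should simply fail over to the remaining endpoints on its own.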

Related issues: 1 (0 open, 1 closed)

Has duplicate: rgw - Bug #62196: multisite sync fairness: "sync status" in I/O error (status: Duplicate, assignee: Shilpa MJ)
