Bug #53029

radosgw-admin fails on "sync status" if a single RGW process is down

Added by David Piper over 2 years ago. Updated 30 days ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags: multisite multisite-backlog
Backport: reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We're using ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable) in a containerized deployment.
We have two RGW zones in the same zonegroup.
Each zone is hosted in a separate ceph cluster, and has four RGW endpoints.
The master zone's endpoints are configured as endpoints for the zonegroup.
(We're also using pubsub zones but I don't think this is related.)

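For context, the layout described above can be confirmed from inside either cluster's RGW container with the standard radosgw-admin inspection commands (run through the same docker wrapper as the outputs below); nothing here is specific to our deployment:

radosgw-admin realm get        # geored_realm
radosgw-admin zonegroup get    # geored_zg, with siteA as master zone
radosgw-admin zone get         # the local zone (siteA or siteB)
radosgw-admin period get       # current period, including all zone endpoints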
When a single RGW endpoint in the master zone is stopped or crashes, the 'radosgw-admin sync status' command returns an error on the cluster hosting the non-master zone:

[qs-admin@newbrunswick0 ~]$ radosgw-admin sync status
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
          realm 9d76aa86-99d1-41c3-966f-cc97eab2bfb3 (geored_realm)
      zonegroup 384c36ac-374b-4ae2-bf9f-ae951f25920a (geored_zg)
           zone b113b104-9c84-44ff-9058-4658c6e1df52 (siteB)
  metadata sync syncing
                full sync: 0/64 shards
                failed to fetch master sync status: (5) Input/output error
      data sync source: 0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1 (siteA)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                source: 9be18697-7423-41a7-a338-926aa938f9de (siteBpubsub)
                        not syncing from zone
                source: a2a5b39a-3df5-4be3-9270-68bf90bc2a51 (siteApubsub)
                        not syncing from zone

This is easy to reproduce by stopping any of the RGW containers in the master zone (a minimal repro sequence is sketched after the output below). As far as we can tell, sync itself still takes place while the container is down. Once the container is restarted, the sync status command returns normally again:

[qs-admin@newbrunswick0 ~]$ radosgw-admin sync status
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
          realm 9d76aa86-99d1-41c3-966f-cc97eab2bfb3 (geored_realm)
      zonegroup 384c36ac-374b-4ae2-bf9f-ae951f25920a (geored_zg)
           zone b113b104-9c84-44ff-9058-4658c6e1df52 (siteB)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1 (siteA)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [90,101,107]
                        oldest incremental change not applied: 2021-10-25T13:59:52.007974+0000 [90]
                        6 shards are recovering
                        recovering shards: [2,3,54,57,107,116]
                source: 9be18697-7423-41a7-a338-926aa938f9de (siteBpubsub)
                        not syncing from zone
                source: a2a5b39a-3df5-4be3-9270-68bf90bc2a51 (siteApubsub)
                        not syncing from zone

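For completeness, the repro sequence is roughly the following; the container name is illustrative (ours follow the ceph-rgw-* naming shown in the wrapper trace above), and any of the four master-zone gateways reproduces it:

# On a master-zone (siteA) host: stop one of the four RGW containers.
sudo docker stop <ceph-rgw-siteA-container>

# On a non-master-zone (siteB) host: 'sync status' now fails with
# "failed to fetch master sync status: (5) Input/output error".
radosgw-admin sync status

# Restart the container; 'sync status' on siteB returns normally again.
sudo docker start <ceph-rgw-siteA-container>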
Unless we have misconfigured something, this looks like a bug: the master zone still has three healthy RGW endpoints, so radosgw-admin should be able to fetch the master sync status from any of them instead of returning an I/O error.

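To see which zonegroup endpoint the admin command actually contacts, and how the request fails, the debug levels can be raised for a single run; these are the standard Ceph debug options, nothing specific to our setup:

# One-off run with verbose RGW and messenger logging, inside the RGW container.
radosgw-admin sync status --debug-rgw=20 --debug-ms=1 2>&1 | tee /tmp/sync-status-debug.log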
RGW config:

(newbrunswick0 = 10.245.0.40)

[qs-admin@newbrunswick0 ~]$ radosgw-admin zonegroup get
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
{
    "id": "384c36ac-374b-4ae2-bf9f-ae951f25920a",
    "name": "geored_zg",
    "api_name": "geored_zg",
    "is_master": "true",
    "endpoints": [
        "https://10.245.0.20:7480",
        "https://10.245.0.21:7480",
        "https://10.245.0.22:7480",
        "https://10.245.0.23:7480"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1",
    "zones": [
        {
            "id": "0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1",
            "name": "siteA",
            "endpoints": [
                "https://10.245.0.20:7480",
                "https://10.245.0.21:7480",
                "https://10.245.0.22:7480",
                "https://10.245.0.23:7480"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "9be18697-7423-41a7-a338-926aa938f9de",
            "name": "siteBpubsub",
            "endpoints": [
                "https://10.245.0.40:7481",
                "https://10.245.0.41:7481",
                "https://10.245.0.42:7481",
                "https://10.245.0.43:7481"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "pubsub",
            "sync_from_all": "false",
            "sync_from": [
                "siteB"
            ],
            "redirect_zone": ""
        },
        {
            "id": "a2a5b39a-3df5-4be3-9270-68bf90bc2a51",
            "name": "siteApubsub",
            "endpoints": [
                "https://10.245.0.20:7481",
                "https://10.245.0.21:7481",
                "https://10.245.0.22:7481",
                "https://10.245.0.23:7481"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "pubsub",
            "sync_from_all": "false",
            "sync_from": [
                "siteA"
            ],
            "redirect_zone": ""
        },
        {
            "id": "b113b104-9c84-44ff-9058-4658c6e1df52",
            "name": "siteB",
            "endpoints": [
                "https://10.245.0.40:7480",
                "https://10.245.0.41:7480",
                "https://10.245.0.42:7480",
                "https://10.245.0.43:7480"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": [],
            "storage_classes": [
                "STANDARD"
            ]
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "9d76aa86-99d1-41c3-966f-cc97eab2bfb3",
    "sync_policy": {
        "groups": []
    }
}

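As an untested interim workaround (a sketch only, we have not verified it), the stopped gateway's address could presumably be removed from the siteA zone and zonegroup endpoint lists until it is back, e.g. if 10.245.0.23 is the one that is down:

# NOT verified: temporarily drop the stopped gateway from the endpoint lists
# on the master zone and commit a new period; re-add it once it is back up.
radosgw-admin zone modify --rgw-zone=siteA \
    --endpoints=https://10.245.0.20:7480,https://10.245.0.21:7480,https://10.245.0.22:7480
radosgw-admin zonegroup modify --rgw-zonegroup=geored_zg \
    --endpoints=https://10.245.0.20:7480,https://10.245.0.21:7480,https://10.245.0.22:7480
radosgw-admin period update --commit

That would not fix the underlying behaviour, though; the expectation remains that the command should simply fail over to the remaining endpoints on its own.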

Related issues: 1 (0 open, 1 closed)

Has duplicate: rgw - Bug #62196: multisite sync fairness: "sync status" in I/O error (status: Duplicate, assignee: Shilpa MJ)
