Bug #53029
radosgw-admin fails on "sync status" if a single RGW process is down
Status: Closed
Description
We're using ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable) in a containerized deployment.
We have two RGW zones in the same zonegroup.
Each zone is hosted in a separate ceph cluster, and has four RGW endpoints.
The master zone's endpoints are configured as endpoints for the zonegroup.
(We're also using pubsub zones but I don't think this is related.)
When a single RGW endpoint in the master zone is stopped or crashes, the 'radosgw-admin sync status' command returns an error on the cluster hosting the non-master zone:
[qs-admin@newbrunswick0 ~]$ radosgw-admin sync status
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
          realm 9d76aa86-99d1-41c3-966f-cc97eab2bfb3 (geored_realm)
      zonegroup 384c36ac-374b-4ae2-bf9f-ae951f25920a (geored_zg)
           zone b113b104-9c84-44ff-9058-4658c6e1df52 (siteB)
  metadata sync syncing
                full sync: 0/64 shards
                failed to fetch master sync status: (5) Input/output error
      data sync source: 0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1 (siteA)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                source: 9be18697-7423-41a7-a338-926aa938f9de (siteBpubsub)
                        not syncing from zone
                source: a2a5b39a-3df5-4be3-9270-68bf90bc2a51 (siteApubsub)
                        not syncing from zone
This is easy to repro by stopping any of the RGW containers in the master zone. As far as we can tell, sync is still taking place. Once the container is restarted, the sync status command returns normally again.
[qs-admin@newbrunswick0 ~]$ radosgw-admin sync status
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
          realm 9d76aa86-99d1-41c3-966f-cc97eab2bfb3 (geored_realm)
      zonegroup 384c36ac-374b-4ae2-bf9f-ae951f25920a (geored_zg)
           zone b113b104-9c84-44ff-9058-4658c6e1df52 (siteB)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1 (siteA)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [90,101,107]
                        oldest incremental change not applied: 2021-10-25T13:59:52.007974+0000 [90]
                        6 shards are recovering
                        recovering shards: [2,3,54,57,107,116]
                source: 9be18697-7423-41a7-a338-926aa938f9de (siteBpubsub)
                        not syncing from zone
                source: a2a5b39a-3df5-4be3-9270-68bf90bc2a51 (siteApubsub)
                        not syncing from zone
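For reference, a minimal repro sketch of the sequence we follow (container names are placeholders; any one of the four master-zone RGW containers reproduces it):

  # on the siteA (master zone) cluster: stop any single RGW container
  sudo docker stop <siteA-rgw-container>

  # on the siteB (non-master) cluster: sync status now fails with (5) Input/output error
  sudo docker exec <siteB-rgw-container> radosgw-admin sync status

  # restart the stopped container on siteA; sync status on siteB returns normally again
  sudo docker start <siteA-rgw-container>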
Unless we have misconfigured something, this looks like a bug: the remaining RGW endpoints in the master zone should still be usable for reporting sync status.
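As an illustrative sanity check (assuming curl is available on the siteB hosts), the master-zone endpoints can be probed directly to confirm which ones still answer while one RGW is down:

  for ep in 10.245.0.20 10.245.0.21 10.245.0.22 10.245.0.23; do
      # -k because the endpoints use self-signed certs in our setup; prints endpoint and HTTP status
      curl -sk -o /dev/null -w "$ep %{http_code}\n" "https://$ep:7480"
  done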
RGW config:
(newbrunswick0 = 10.245.0.40)
[qs-admin@newbrunswick0 ~]$ radosgw-admin zonegroup get
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec aa87acb445c5 radosgw-admin
{
    "id": "384c36ac-374b-4ae2-bf9f-ae951f25920a",
    "name": "geored_zg",
    "api_name": "geored_zg",
    "is_master": "true",
    "endpoints": [
        "https://10.245.0.20:7480",
        "https://10.245.0.21:7480",
        "https://10.245.0.22:7480",
        "https://10.245.0.23:7480"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1",
    "zones": [
        {
            "id": "0bbdd7ae-6e2a-4ad0-996b-5f0ed38443c1",
            "name": "siteA",
            "endpoints": [
                "https://10.245.0.20:7480",
                "https://10.245.0.21:7480",
                "https://10.245.0.22:7480",
                "https://10.245.0.23:7480"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "9be18697-7423-41a7-a338-926aa938f9de",
            "name": "siteBpubsub",
            "endpoints": [
                "https://10.245.0.40:7481",
                "https://10.245.0.41:7481",
                "https://10.245.0.42:7481",
                "https://10.245.0.43:7481"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "pubsub",
            "sync_from_all": "false",
            "sync_from": [
                "siteB"
            ],
            "redirect_zone": ""
        },
        {
            "id": "a2a5b39a-3df5-4be3-9270-68bf90bc2a51",
            "name": "siteApubsub",
            "endpoints": [
                "https://10.245.0.20:7481",
                "https://10.245.0.21:7481",
                "https://10.245.0.22:7481",
                "https://10.245.0.23:7481"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "pubsub",
            "sync_from_all": "false",
            "sync_from": [
                "siteA"
            ],
            "redirect_zone": ""
        },
        {
            "id": "b113b104-9c84-44ff-9058-4658c6e1df52",
            "name": "siteB",
            "endpoints": [
                "https://10.245.0.40:7480",
                "https://10.245.0.41:7480",
                "https://10.245.0.42:7480",
                "https://10.245.0.43:7480"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": [],
            "storage_classes": [
                "STANDARD"
            ]
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "9d76aa86-99d1-41c3-966f-cc97eab2bfb3",
    "sync_policy": {
        "groups": []
    }
}