Support #43327 (open)

rgw multisite: sync errors after enabling

Added by David Piper over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Tags: -
Reviewed: -
Affected Versions: -
Pull request ID: -
Description

Hi,

We're doing some testing with a trial deployment running version 14.2.2 in containers. We had a single-site system (A) running happily, to which we've then introduced a secondary site (B) using the ceph-ansible playbooks.

The data sync seems to have largely succeeded and site B reports that it has caught up with source/master for both data and metadata, but both sites have entries in their RGW sync error list, and site A is reporting that data is behind on some shards. The RGW containers on both sites are logging a constant stream of errors.
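For reference, this is roughly how I'd compare the two zones' view of the realm configuration if that's useful (a sketch only; <rgw-container> is a placeholder for the local RGW/mon container on each site):

# Dump the current period and zonegroup as each zone sees them; both sites
# should report the same period id/epoch and the same zone endpoints.
sudo docker exec <rgw-container> radosgw-admin period get
sudo docker exec <rgw-container> radosgw-admin zonegroup get --rgw-zonegroup=geored_zg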

Site A
------

[ceph-deploy@scotlanda_2 ~]$ sudo docker exec 79ad0a2de3f6 radosgw-admin sync status
          realm b7f31089-0879-4fa2-9cbc-cfdf5f866a35 (geored_realm)
      zonegroup 5d74eb0e-5d99-481f-ae33-43483f6cebc0 (geored_zg)
           zone 033709fc-924a-4582-b00d-97c90e9e61b6 (siteA)
  metadata sync no sync (zone is master)
      data sync source: fecc1fc1-28f9-459e-8227-9a0f677b951f (siteB)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 4 shards
                        behind shards: [3,65,96,110]

[ceph-deploy@scotlanda_2 ~]$ sudo docker exec 79ad0a2de3f6 radosgw-admin sync error list
[
    {
        "shard_id": 0,
        "entries": []
    },
    {
        "shard_id": 1,
        "entries": [
            {
                "id": "1_1576174107.369809_339513.1",
                "section": "data",
                "name": "apollo-scsdata:033709fc-924a-4582-b00d-97c90e9e61b6.533073.1:1",
                "timestamp": "2019-12-12 18:08:27.369809Z",
                "info": {
                    "source_zone": "fecc1fc1-28f9-459e-8227-9a0f677b951f",
                    "error_code": 5,
                    "message": "failed to sync bucket instance: (5) Input/output error"
                }
            },
            {
                "id": "1_1576174112.463455_339520.1",
                "section": "data",
                "name": "mrbounce-scsdata:033709fc-924a-4582-b00d-97c90e9e61b6.476166.1:3",
                "timestamp": "2019-12-12 18:08:32.463455Z",
                "info": {
                    "source_zone": "fecc1fc1-28f9-459e-8227-9a0f677b951f",
                    "error_code": 5,
                    "message": "failed to sync bucket instance: (5) Input/output error"
                }
            }
        ]
    },

< lots more entries, typically 1-2 per shard, all with the same error code and message, and all from a similar timestamp, coinciding with when site B was introduced>
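If a per-bucket view would help, I can also run something like the following against the buckets named in the error list (a sketch only; the container ID is a placeholder):

# Per-bucket sync state for one of the buckets appearing in the error list above:
sudo docker exec <rgw-container> radosgw-admin bucket sync status --bucket=apollo-scsdata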

[ceph-deploy@scotlanda_1 ~]$ sudo docker logs ceph-rgw-scotlanda_1-rgw0 | tail -n 20
2019-12-16 09:37:02.672 7fe94fd7d700 0 RGW-SYNC:data:sync:shard107: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:02.672 7fe94fd7d700 0 RGW-SYNC:data:sync:shard119: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:02.675 7fe94fd7d700 0 RGW-SYNC:data:sync:shard102: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:02.687 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:02.697 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:02.700 7fe94fd7d700 0 RGW-SYNC:data:sync:shard30: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:02.705 7fe94fd7d700 0 RGW-SYNC:data:sync:shard53: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:02.706 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:02.711 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:02.715 7fe94fd7d700 0 RGW-SYNC:data:sync:shard74: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:02.716 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:02.721 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:02.724 7fe94fd7d700 0 RGW-SYNC:data:sync:shard37: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:02.731 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:02.739 7fe94fd7d700 0 RGW-SYNC:data:sync:shard121: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:02.752 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:03.173 7fe94fd7d700 0 RGW-SYNC:data:sync:shard120: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:03.198 7fe94fd7d700 0 RGW-SYNC:data:sync:shard29: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:37:03.375 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:37:03.378 7fe94fd7d700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2

Site B
------

[ceph-deploy@scotlandb_4 ~]$ sudo docker exec ceph-mon-scotlandb_4 radosgw-admin sync status
          realm b7f31089-0879-4fa2-9cbc-cfdf5f866a35 (geored_realm)
      zonegroup 5d74eb0e-5d99-481f-ae33-43483f6cebc0 (geored_zg)
           zone fecc1fc1-28f9-459e-8227-9a0f677b951f (siteB)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 033709fc-924a-4582-b00d-97c90e9e61b6 (siteA)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

[ceph-deploy@scotlandb_4 ~]$ sudo docker exec ceph-mon-scotlandb_4 radosgw-admin sync error list
[
    {
        "shard_id": 0,
        "entries": []
    },
    {
        "shard_id": 1,
        "entries": [
            {
                "id": "1_1576254471.034909_459127.1",
                "section": "data",
                "name": "apollo-sipps:033709fc-924a-4582-b00d-97c90e9e61b6.14634.5",
                "timestamp": "2019-12-13 16:27:51.034909Z",
                "info": {
                    "source_zone": "033709fc-924a-4582-b00d-97c90e9e61b6",
                    "error_code": 5,
                    "message": "failed to sync bucket instance: (5) Input/output error"
                }
            },
< only five entries here, all timestamped a day later, when I restarted my RGW instances on site A>

[ceph-deploy@scotlandb_4 ~]$ sudo docker logs ceph-rgw-scotlandb_4-rgw0 | grep "sync" | tail -n 10
2019-12-16 09:42:02.459 7f2c1090f700 0 RGW-SYNC:data:sync:shard31: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:42:02.475 7f2c1090f700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:42:02.479 7f2c1090f700 0 RGW-SYNC:data:sync:shard17: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:42:02.480 7f2c1090f700 0 RGW-SYNC:data:sync:shard87: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:42:02.501 7f2c1090f700 0 RGW-SYNC:data:sync:shard46: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:42:02.501 7f2c1090f700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:42:02.501 7f2c1090f700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:42:02.512 7f2c1090f700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2019-12-16 09:42:02.597 7f2c1090f700 0 RGW-SYNC:data:sync:shard5: ERROR: failed to read remote data log info: ret=-2
2019-12-16 09:42:02.604 7f2c1090f700 0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2

When we've seen "(5) Input/output error" in the past, our issue has been that the SSL certificates presented by each site's RGW instances were not trusted by the other site. We've corrected that, and indeed since the data sync to B has succeeded, that no longer seems to be the problem.
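For completeness, this is roughly how we checked the certificate trust between the sites (a sketch; the remote endpoint URL and CA bundle path are placeholders, not our real values):

# List the endpoints each zone advertises to its peers:
sudo docker exec <rgw-container> radosgw-admin zonegroup get

# Confirm the remote zone's endpoint presents a certificate this host trusts:
curl --cacert /etc/pki/tls/certs/ca-bundle.crt https://<remote-rgw-endpoint>/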

Have I possibly configured something wrong?

Please let me know what diags would be useful.
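If it helps, I can capture something more verbose along these lines (a sketch; the container ID is a placeholder):

# Sync status with RGW and messenger debug raised, stderr captured to a file:
sudo docker exec <rgw-container> radosgw-admin sync status --debug-rgw=20 --debug-ms=1 2> sync-debug.log

# Full RGW container log rather than the snippets above:
sudo docker logs ceph-rgw-scotlanda_1-rgw0 > rgw-siteA.log 2>&1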

Cheers,

Dave
