Bug #52877

closed

rgw-multisite: Dynamic resharding fails to take effect, even when rgw_max_objs_per_shard value is reached.

Added by Vidushi Mishra over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
multisite-reshard
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a multi-site cluster, dynamic resharding fails to take effect, even when the rgw_max_objs_per_shard value is reached.

ceph version:
ceph version 17.0.0-8051-g15b54dc9 (15b54dc9eaa835e95809e32e8ddf109d416320c9) quincy (dev)

steps:
1. Create a realm "data" and set it as the default (an illustrative command sketch for steps 1-10 follows this list).
2. Create zonegroup "us" for realm "data" and set it as master and default.
3. Create the master zone "east" and set it as the master.
4. Now create a few buckets and upload objects.
5. Set rgw_max_objs_per_shard = 100 and restart all the RGWs for the tuning to take effect.
6. Create 2 buckets [tx/ss-bkt-v1 and tx/ss-bkt-v3] and upload more than 2K objects to each bucket.
7. We observed that, since rgw_max_objs_per_shard = 100, the buckets got resharded to 41 and 29 shards respectively.
8. Now establish multisite by performing a realm pull and a period pull.
9. Create the slave zone "west" and wait for the sync to complete.
10. Ensure rgw_max_objs_per_shard = 100 on both sites, master and slave.
11. Create 2 more buckets, "tx/ms-bkt-v1" and "elizabethd.615-bucky-488-0", and upload 2171 and 12002 objects respectively.
12. Dynamic resharding does not take effect on either site.
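
For reference, a rough sketch of the radosgw-admin/ceph commands behind steps 1-10. The endpoints, host names, RGW service name, and sync-user keys below are placeholders, not the values used in this run:

  # primary cluster: realm, zonegroup and master zone (steps 1-3)
  radosgw-admin realm create --rgw-realm=data --default
  radosgw-admin zonegroup create --rgw-zonegroup=us --rgw-realm=data --master --default --endpoints=http://<primary-rgw>:80
  radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=east --master --default --endpoints=http://<primary-rgw>:80
  radosgw-admin period update --commit

  # lower the reshard trigger and restart the RGWs (step 5)
  ceph config set client.rgw rgw_max_objs_per_shard 100
  ceph orch restart <rgw-service>      # or restart the radosgw daemons however they are deployed

  # secondary cluster: realm/period pull and slave zone (steps 8-10)
  radosgw-admin realm pull --url=http://<primary-rgw>:80 --access-key=<sync-access-key> --secret=<sync-secret>
  radosgw-admin period pull --url=http://<primary-rgw>:80 --access-key=<sync-access-key> --secret=<sync-secret>
  radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=west --endpoints=http://<secondary-rgw>:80 --access-key=<sync-access-key> --secret=<sync-secret>
  radosgw-admin period update --commit
  ceph config set client.rgw rgw_max_objs_per_shard 100   # on the secondary as well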

Actions #2

Updated by Vidushi Mishra over 2 years ago

Update on the issue after 2 days:

1. On the primary site, the buckets have been resharded dynamically.
2. On the secondary site, however, the dynamic reshard process looks stuck, and 'radosgw-admin reshard list' still lists the buckets that are queued for resharding.

A snippet of 'radosgw-admin reshard list' on the secondary:
==========================================================

[ceph: root@magna017 /]# radosgw-admin reshard list
2021-10-11T11:10:28.281+0000 7f20f637a340 1 Realm: data (d62bd711-d486-47be-9c3e-193e49334862)
2021-10-11T11:10:28.281+0000 7f20f637a340 1 ZoneGroup: us (8f3b29b1-ffc6-4c90-9d0c-9bd258028cd8)
2021-10-11T11:10:28.281+0000 7f20f637a340 1 Zone: west (3a571642-9f5e-46d8-8186-9eca8cc79ac6)
2021-10-11T11:10:28.281+0000 7f20f637a340 1 using period configuration: dd132fae-4457-4f49-88b9-55ca2f8adff9:2
[2021-10-11T11:10:28.532+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000000 marker= (2) No such file or directory
2021-10-11T11:10:28.533+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000001 marker= (2) No such file or directory
2021-10-11T11:10:28.533+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000002 marker= (2) No such file or directory

{
"time": "2021-10-10T05:22:11.644198Z",
"tenant": "tx",
"bucket_name": "ms-bkt-v1",
"bucket_id": "5d32949e-6245-422c-b315-9048855d3a9a.28133.1",
"old_num_shards": 11,
"new_num_shards": 43
}2021-10-11T11:10:28.536+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000004 marker= (2) No such file or directory
2021-10-11T11:10:28.538+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000005 marker= (2) No such file or directory
2021-10-11T11:10:28.538+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000006 marker= (2) No such file or directory
2021-10-11T11:10:28.538+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000007 marker= (2) No such file or directory
2021-10-11T11:10:28.539+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000008 marker= (2) No such file or directory
2021-10-11T11:10:28.539+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000009 marker= (2) No such file or directory
2021-10-11T11:10:28.539+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000010 marker= (2) No such file or directory
2021-10-11T11:10:28.540+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000011 marker= (2) No such file or directory
, {
"time": "2021-10-10T05:22:11.813771Z",
"tenant": "",
"bucket_name": "elizabethd.615-bucky-488-0",
"bucket_id": "5d32949e-6245-422c-b315-9048855d3a9a.28133.3",
"old_num_shards": 11,
"new_num_shards": 227
}, {
"time": "2021-10-10T05:22:11.678128Z",
"tenant": "tx",
"bucket_name": "ms-bkt-v2",
"bucket_id": "5d32949e-6245-422c-b315-9048855d3a9a.28841.1",
"old_num_shards": 11,
"new_num_shards": 163
}, {
"time": "2021-10-10T05:22:11.711405Z",
"tenant": "tx",
"bucket_name": "ss-bkt-v1",
"bucket_id": "5d32949e-6245-422c-b315-9048855d3a9a.26963.1",
"old_num_shards": 11,
"new_num_shards": 59
}, {
"time": "2021-10-10T05:22:11.761627Z",
"tenant": "tx",
"bucket_name": "ss-bkt-v3",
"bucket_id": "5d32949e-6245-422c-b315-9048855d3a9a.26963.3",
"old_num_shards": 11,
"new_num_shards": 53
}2021-10-11T11:10:28.541+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000014 marker= (2) No such file or directory

]
2021-10-11T11:10:28.541+0000 7f20f637a340 -1 ERROR: failed to list reshard log entries, oid=reshard.0000000015 marker= (2) No such file or directory
[ceph: root@magna017 /]#

Actions #3

Updated by Vidushi Mishra over 2 years ago

On running a radosgw-admin reshard process, the buckets that were listed in the reshard list get resharded on the slave site as well.

Logs for reshard process with debug_rgw = 20 and debug_ms = 1

http://magna002.ceph.redhat.com/ceph-qe-logs/vidushi/upstream-testing/52877/11-Oct/reshard_process
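
For completeness, the kind of commands one can use on either site to confirm that these reshards actually completed. The tenant/bucket names are from this run; whether 'bucket stats' reports num_shards can vary by build, so treat this as a sketch:

  radosgw-admin reshard list
  radosgw-admin reshard status --bucket=ms-bkt-v1 --tenant=tx
  radosgw-admin bucket stats --bucket=ms-bkt-v1 --tenant=tx | grep num_shards
  radosgw-admin bucket stats --bucket=elizabethd.615-bucky-488-0 | grep num_shards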

Actions #4

Updated by Casey Bodley over 2 years ago

  • Status changed from New to Triaged

Vidushi Mishra wrote:

On running a radosgw-admin reshard process, the buckets that were listed in the reshard list get resharded on the slave site as well.

Logs for reshard process with debug_rgw = 20 and debug_ms = 1

http://magna002.ceph.redhat.com/ceph-qe-logs/vidushi/upstream-testing/52877/11-Oct/reshard_process

thanks for the logs! i see that all 4 of those buckets resharded successfully, so it seems like the issue is that the secondary zone isn't trying these reshards at all. i'll see what i can find

Actions #5

Updated by Casey Bodley over 2 years ago

oops! looks like this part of rgw_rados.cc is to blame:

  /* only the master zone in the zonegroup reshards buckets */
  run_reshard_thread = run_reshard_thread && (zonegroup.master_zone == zone.id);
  if (run_reshard_thread)  {
    reshard->start_processor();
  }

that check should be based on zonegroup features instead
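
roughly, the idea is to gate the reshard thread on whether the zonegroup has the resharding feature enabled, rather than on being the master zone. a sketch of that shape (the supports() helper and the rgw::zone_features::resharding constant are assumed names here, not a quote of the actual change in the PR):

  /* any zone may run the reshard thread, as long as the zonegroup
   * advertises the resharding feature (assumed helper/constant) */
  run_reshard_thread = run_reshard_thread &&
      zonegroup.supports(rgw::zone_features::resharding);
  if (run_reshard_thread) {
    reshard->start_processor();
  }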

Actions #6

Updated by Casey Bodley over 2 years ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 43505
Actions #7

Updated by Casey Bodley over 2 years ago

  • Status changed from Fix Under Review to Resolved

merged into wip-rgw-multisite-reshard and kicked off new builds
