Bug #52896

rgw-multisite: Dynamic resharding takes too long to take effect

Added by Vidushi Mishra over 2 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
% Done: 0%
Source:
Tags: multisite-reshard
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a multi-site cluster, dynamic resharding takes very long to kick in, even when the rgw_max_objs_per_shard value is reached.

ceph version:
ceph version 17.0.0-8051-g15b54dc9 (15b54dc9eaa835e95809e32e8ddf109d416320c9) quincy (dev)

Steps:
1. Set rgw_max_objs_per_shard to 100 and restart the RGWs on the master and the slave.
2. Create a bucket on the master and upload objects from both sites.
3. Upload about 2K objects so that dynamic resharding should take effect (since rgw_max_objs_per_shard = 100).
4. Wait for 2 hours after the object upload and sync have completed.
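
The arithmetic behind step 3 can be sketched as follows. This is a minimal illustration, not the RGW implementation: the function name and the starting shard count of 11 are assumptions, and only the threshold of 100 and the object count of about 2K come from the steps above.

```python
def exceeds_shard_threshold(num_objects: int, num_shards: int,
                            max_objs_per_shard: int) -> bool:
    """Return True when the bucket holds more objects than its shards should carry."""
    return num_objects > num_shards * max_objs_per_shard


# Values from the report: threshold tuned to 100, about 2K objects uploaded.
MAX_OBJS_PER_SHARD = 100
NUM_OBJECTS = 2000
# Assumption: the bucket starts with 11 index shards; the report does not say.
ASSUMED_NUM_SHARDS = 11

capacity = ASSUMED_NUM_SHARDS * MAX_OBJS_PER_SHARD
over = exceeds_shard_threshold(NUM_OBJECTS, ASSUMED_NUM_SHARDS, MAX_OBJS_PER_SHARD)
print(f"capacity={capacity} objects={NUM_OBJECTS} reshard_expected={over}")
```

Under these assumptions the bucket is well over the per-shard limit, which is why a reshard is expected within the 2-hour wait.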

Result:

Dynamic resharding has not yet kicked in on either site.

History

#1 Updated by Casey Bodley over 2 years ago

  • Status changed from New to Triaged

not sure what would cause this delay. the RGWReshard::ReshardWorker thread should be attempting reshards every rgw_reshard_thread_interval (default 10 min), and that runs the same code that `radosgw-admin reshard process` does - and we've seen that command succeed

do the radosgw logs show any error messages starting with 'RGWReshard::process_entry'?

during this delay, can you confirm that 'radosgw-admin reshard list' shows the buckets you expect to be resharded?
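
One way to run the log check asked about above is a small filter over the radosgw log. This is only a sketch: the function name and the sample lines are hypothetical, and the match uses "contains" rather than "starts with" on the assumption that real log lines carry a timestamp and thread prefix before the message.

```python
from typing import Iterable, List

# Message prefix named in the comment above.
ERROR_PREFIX = "RGWReshard::process_entry"


def find_reshard_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention the RGWReshard::process_entry prefix."""
    return [line.rstrip("\n") for line in lines if ERROR_PREFIX in line]


# Hypothetical sample input; the error text is a placeholder, not a real message.
sample_log = [
    "rgw reshard worker thread: processing logshard = 3\n",
    "RGWReshard::process_entry <hypothetical error text>\n",
    "rgw reshard worker thread: finish processing logshard = 3\n",
]

errors = find_reshard_errors(sample_log)
print(len(errors))
```

For a real log, the same function can be fed an open file object, for example `find_reshard_errors(open(path))`, where `path` is wherever the gateway writes its log on that host.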

#2 Updated by Vidushi Mishra over 2 years ago

I have 2 more observations to add here.

1. When I restarted the gateways, dynamic resharding kicked in as expected, at least on the primary site. On the secondary, we see the error https://tracker.ceph.com/issues/52877 after restarting the gateways.

2. radosgw-admin reshard list does not show any entry.

3. 'RGWReshard::process_entry': I do not see these error messages in the rgw logs. What debug level is needed for them to appear in the logs?

#3 Updated by Casey Bodley over 2 years ago

  • Assignee set to Casey Bodley

#4 Updated by Casey Bodley over 2 years ago

Vidushi Mishra wrote:

> I have 2 more observations to add here.
>
> 1. When I restarted the gateways, dynamic resharding kicked in as expected, at least on the primary site. On the secondary, we see the error https://tracker.ceph.com/issues/52877 after restarting the gateways.

huh, i don't see why restart would be required

those "failed to list reshard log entries" error messages on the secondary are just noise, and wouldn't prevent any pending reshards

> 2. radosgw-admin reshard list does not show any entry.

can you clarify this one? if 'reshard list' is empty, i assume that would mean everything resharded successfully on that zone

> 3. 'RGWReshard::process_entry': I do not see these error messages in the rgw logs. What debug level is needed for them to appear in the logs?

errors there would be at log level 0, so they'd show up without special configuration

but it's worth running the gateways under --debug-rgw=20 to make sure we see messages like these every ~10 minutes:

rgw reshard worker thread: processing logshard = x
rgw reshard worker thread: finish processing logshard = x
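
To confirm that the worker thread is cycling, the two messages above can be paired up per logshard. The sketch below assumes only the message text quoted in this comment; the function name and the sample logshard ids are hypothetical, and any prefix on real log lines is ignored by searching within the line.

```python
import re
from typing import Iterable, List, Tuple

# Patterns built from the two messages quoted above.
START_RE = re.compile(r"rgw reshard worker thread: processing logshard = (\S+)")
FINISH_RE = re.compile(r"rgw reshard worker thread: finish processing logshard = (\S+)")


def summarize_worker_activity(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (logshards started, logshards started but never finished)."""
    started: List[str] = []
    finished = set()
    for line in lines:
        match = FINISH_RE.search(line)
        if match:
            finished.add(match.group(1))
            continue
        match = START_RE.search(line)
        if match:
            started.append(match.group(1))
    unfinished = [shard for shard in started if shard not in finished]
    return started, unfinished


# Hypothetical sample: logshard 0 completes, logshard 1 does not.
sample_log = [
    "rgw reshard worker thread: processing logshard = 0",
    "rgw reshard worker thread: finish processing logshard = 0",
    "rgw reshard worker thread: processing logshard = 1",
]

started, unfinished = summarize_worker_activity(sample_log)
print(started, unfinished)
```

If a log captured with --debug-rgw=20 yields an empty `started` list over a window longer than the reshard interval, that would suggest the worker thread is not running at all, which is a different problem from a reshard that starts and fails.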

#5 Updated by Casey Bodley over 2 years ago

  • Status changed from Triaged to Closed
