Bug #62710

multisite replication is super slow when some of the rgws configured in zonegroup are down

Added by Jane Zhu 8 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
multisite
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Multisite replication is super slow when some of the rgws configured in zonegroup are down.
This can be reproduced with the main branch (as of Aug. 29th, 2023).

Multisite clusters configuration:
  • 2 clusters
  • each cluster has 3 rgw nodes, and 16 rgw instances per node (with 8 client-facing on the primary site)
  • all rgw instances are added in the zonegroup settings
  • shutdown all the rgw instances on one primary rgw node
Client traffic:
  • cosbench write only, 15 users, 30 workers, 600 seconds
  • generated 1800 buckets, >2 million objects

Replication lag:
Replication was still not done 50 minutes after the client traffic finished.

Actions #1

Updated by Jane Zhu 8 months ago

I set the severity of this issue to Major because it results in significant replication lag. It does have a workaround, however, which is to remove the downed rgws from the zonegroup settings. So please feel free to change it to Minor if you think that's more appropriate.

Actions #2

Updated by Jane Zhu 8 months ago

I accidentally filed this under the wrong project. Can somebody please move it to the "rgw" project? Thanks!

Actions #4

Updated by Neha Ojha 8 months ago

  • Project changed from teuthology to rgw
  • Status changed from New to Fix Under Review
  • Pull request ID set to 53320
Actions #5

Updated by Daniel Gryniewicz 8 months ago

  • Assignee set to Shilpa MJ
Actions #6

Updated by Casey Bodley 3 months ago

  • Status changed from Fix Under Review to Resolved
Actions #7

Updated by Soumya Koduri about 1 month ago

@Jane,

The changes below - https://github.com/ceph/ceph/pull/53320/commits/e200499bb3c5703862b92a4d7fb534d98601f1bf - seem to have caused a regression in the LC/cloud-transition code - https://tracker.ceph.com/issues/65251.

if (diff >= CONN_STATUS_EXPIRE_SECS) {
  endpoints_status[endpoint].store(ceph::real_clock::zero());
  ldout(cct, 10) << "endpoint " << endpoint << " unconnectable status expired. mark it connectable" << dendl;
  break;
}

Even though there is a valid endpoint, since the status was updated less than 2 seconds ago, it returned a null RGWRESTStreamS3PutObj pointer, resulting in a crash in the tier code. The crash can be avoided with an extra check, but the transition request would still fail with an error at times.

Could you please explain why the above check is needed, and whether it needs to be modified to handle LC cloud transition, and perhaps cloud-sync too (which uses similar routines)? Thanks!
