Bug #62710

multisite replication is super slow when some of the rgws configured in zonegroup are down

Added by Jane Zhu 8 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
multisite
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Multisite replication is super slow when some of the rgws configured in zonegroup are down.
This can be reproduced with the main branch (as of Aug. 29th, 2023).

Multisite clusters configuration:
  • 2 clusters
  • each cluster has 3 rgw nodes, and 16 rgw instances per node (with 8 client-facing on the primary site)
  • all rgw instances are added in the zonegroup settings
  • shutdown all the rgw instances on one primary rgw node
Client traffic:
  • cosbench write only, 15 users, 30 workers, 600 seconds
  • generated 1800 buckets, >2 million objects

Replication lag:
Replication was still not done 50 minutes after the client traffic finished.

Actions #1

Updated by Jane Zhu 8 months ago

I set the severity of this issue to Major because it results in significant replication lag. It does have a workaround, however, which is to remove the downed rgws from the zonegroup settings. So please feel free to change it to Minor if you think that's more appropriate.

Actions #2

Updated by Jane Zhu 8 months ago

I accidentally filed this under the wrong project. Can somebody please move it to the "rgw" project? Thanks!

Actions #4

Updated by Neha Ojha 8 months ago

  • Project changed from teuthology to rgw
  • Status changed from New to Fix Under Review
  • Pull request ID set to 53320
Actions #5

Updated by Daniel Gryniewicz 8 months ago

  • Assignee set to Shilpa MJ
Actions #6

Updated by Casey Bodley 3 months ago

  • Status changed from Fix Under Review to Resolved
Actions #7

Updated by Soumya Koduri about 1 month ago

@Jane,

The changes below - https://github.com/ceph/ceph/pull/53320/commits/e200499bb3c5703862b92a4d7fb534d98601f1bf - seem to have caused a regression in the LC/cloud-transition code - https://tracker.ceph.com/issues/65251.

if (diff >= CONN_STATUS_EXPIRE_SECS) {
  endpoints_status[endpoint].store(ceph::real_clock::zero());
  ldout(cct, 10) << "endpoint " << endpoint << " unconnectable status expired. mark it connectable" << dendl;
  break;
}

Even though there is a valid endpoint, since the status was updated less than 2 seconds ago, it returned a null RGWRESTStreamS3PutObj pointer, resulting in a crash in the tier code. The crash can be avoided with an extra check, but the transition request would still fail with an error at times.

Could you please explain why the above check is needed, and whether it needs to be modified to handle LC cloud transition, and perhaps cloud-sync too (which uses similar routines)? Thanks!
