Bug #53668


Why not add an xxx.retry obj to metadata synchronization in multisite for exception retries

Added by Jinghua Zeng over 2 years ago. Updated about 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags: multisite
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I see that data synchronization supports retries: entries from previously failed instances, recorded in the error repo, are attempted again before synchronization proceeds.


Related issues: 1 (0 open, 1 closed)

Related to rgw - Bug #39657: multisite: metadata sync does not keep retrying failed entries (Resolved, Casey Bodley)

Actions #1

Updated by Josh Durgin about 2 years ago

  • Project changed from Ceph to rgw
Actions #2

Updated by Casey Bodley about 2 years ago

  • Status changed from New to Need More Info
  • Tags set to multisite

in general, object uploads tend to be way more frequent than metadata changes like bucket/user creation. the datalog sees a LOT more traffic than the mdlog, so it is more sensitive to errors/retries

the datalog's error repo allows us to move failing entries out of the datalog, so data sync can continue to advance and process new entries. this adds some extra complexity to data sync, because it has to schedule sync from two different sources

metadata sync could have an error repo, but we rarely see issues with metadata sync catching up from a backlog, so i don't think it's worth the complexity. multisite is already complicated enough that we have trouble maintaining it. and we did have issues with error handling that were resolved as part of https://tracker.ceph.com/issues/39657

i think it would be ideal for rgw to use the same code paths for data- and metadata sync, but we're a long way from being able to do that

@Jinghua Zeng are you interested in working on stuff like this?

Actions #3

Updated by Casey Bodley about 2 years ago

  • Related to Bug #39657: multisite: metadata sync does not keep retrying failed entries added
Actions #4

Updated by Christian Rohmann about 2 years ago

Casey Bodley wrote:

metadata sync could have an error repo, but we rarely see issues with metadata sync catching up from a backlog, so i don't think it's worth the complexity. multisite is already complicated enough that we have trouble maintaining it. and we did have issues with error handling that were resolved as part of https://tracker.ceph.com/issues/39657

Casey, may I bluntly point you to https://tracker.ceph.com/issues/46563, or even the crash I reported at https://tracker.ceph.com/issues/46563#note-9. There, metadata sync simply gets stuck, and only a shutdown of the RADOSGW instances followed by a "metadata sync init" would fix it (the recovery sequence is sketched at the end of this comment).

Is that all going to be fixed with https://tracker.ceph.com/issues/51784 then? But that likely does not fix the crash, does it?

Actions #5

Updated by Jinghua Zeng about 2 years ago

Casey Bodley wrote:

[...]

@Jinghua Zeng are you interested in working on stuff like this?

I'm very interested in that. I think data synchronization depends on metadata. Sometimes the error repo may have many entries to synchronize due to a bucket instance synchronization failure. It also causes the metadata pool to grow too large, because the BILog is not being consumed.
