Bug #53668


Why not add an xxx.retry obj to metadata synchronization in multisite for exception retries

Added by Jinghua Zeng over 2 years ago. Updated about 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags: multisite
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I see that data synchronization supports retries: entries from previously failed instances, recorded in the error repo, are attempted again before synchronization proceeds.


Related issues: 1 (0 open, 1 closed)

Related to rgw - Bug #39657: multisite: metadata sync does not keep retrying failed entries (Resolved, Casey Bodley)

Actions #1

Updated by Josh Durgin about 2 years ago

  • Project changed from Ceph to rgw
Actions #2

Updated by Casey Bodley about 2 years ago

  • Status changed from New to Need More Info
  • Tags set to multisite

in general, object uploads tend to be way more frequent than metadata changes like bucket/user creation. the datalog sees a LOT more traffic than the mdlog, so it is more sensitive to errors/retries

the datalog's error repo allows us to move failing entries out of the datalog, so data sync can continue to advance and process new entries. this adds some extra complexity to data sync, because it has to schedule sync from two different sources

metadata sync could have an error repo, but we rarely see issues with metadata sync catching up from a backlog, so i don't think it's worth the complexity. multisite is already complicated enough that we have trouble maintaining it. and we did have issues with error handling that were resolved as part of https://tracker.ceph.com/issues/39657

i think it would be ideal for rgw to use the same code paths for data- and metadata sync, but we're a long way from being able to do that

@Jinghua Zeng are you interested in working on stuff like this?

Actions #3

Updated by Casey Bodley about 2 years ago

  • Related to Bug #39657: multisite: metadata sync does not keep retrying failed entries added
Actions #4

Updated by Christian Rohmann about 2 years ago

Casey Bodley wrote:

metadata sync could have an error repo, but we rarely see issues with metadata sync catching up from a backlog, so i don't think it's worth the complexity. multisite is already complicated enough that we have trouble maintaining it. and we did have issues with error handling that were resolved as part of https://tracker.ceph.com/issues/39657

Casey, may I bluntly point you to https://tracker.ceph.com/issues/46563, or even the crash I reported at https://tracker.ceph.com/issues/46563#note-9. There, metadata sync simply gets stuck, and only a shutdown of the RADOSGW instances followed by a "metadata sync init" would fix it (the recovery sequence is sketched at the end of this comment).

Is that all going to be fixed with https://tracker.ceph.com/issues/51784 then? But that likely does not fix the crash, does it?

Actions #5

Updated by Jinghua Zeng about 2 years ago

Casey Bodley wrote:

[...]

@Jinghua Zeng are you interested in working on stuff like this?

I'm very interested in that. I think data synchronization depends on metadata. Sometimes the error repo may have many entries to synchronize due to a bucket instance synchronization failure. It also causes the metadata pool to grow too large, because the BILog is not being consumed.
