Bug #37792

multisite: overwrites in versioning-suspended buckets fail to sync

Added by Casey Bodley about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
% Done: 0%
Source:
Tags: multisite versioning
Backport: luminous mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

steps to reproduce in a two-zone multisite configuration:

  1. create a bucket
  2. upload an object "obj"
  3. enable versioning on the bucket
  4. reupload the same object "obj"
  5. suspend versioning on the bucket
  6. reupload the same object "obj"
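
The six steps above can be sketched as a small script. This is only an illustration, not part of the original report: the helper names `repro_steps` and `run_repro` are hypothetical, and the client is assumed to expose the boto3 S3 client methods `create_bucket`, `put_object` and `put_bucket_versioning`.

```python
def repro_steps(bucket, key="obj"):
    """Return the six reproduction steps as (client_method, kwargs) pairs."""

    def upload(body):
        return ("put_object", {"Bucket": bucket, "Key": key, "Body": body})

    def versioning(status):
        return (
            "put_bucket_versioning",
            {"Bucket": bucket, "VersioningConfiguration": {"Status": status}},
        )

    return [
        ("create_bucket", {"Bucket": bucket}),  # 1. create a bucket
        upload(b"first"),                       # 2. upload "obj"
        versioning("Enabled"),                  # 3. enable versioning
        upload(b"second"),                      # 4. reupload "obj"
        versioning("Suspended"),                # 5. suspend versioning
        upload(b"third"),                       # 6. reupload "obj" (this one fails to sync)
    ]


def run_repro(client, bucket, key="obj"):
    """Apply the steps in order to an S3-style client for one zone."""
    for method, kwargs in repro_steps(bucket, key):
        getattr(client, method)(**kwargs)
```

To run it against a real cluster, pass a boto3 S3 client configured for one zone's endpoint, then check the other zone's rgw log for the sync errors.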

the third upload repeatedly fails to sync. the rgw log shows errors like "cls_rgw_bucket_link_olh() returned r=-125" (125 is ECANCELED on Linux), and the osd log shows errors like "NOTICE: op.olh_tag (zxopy27aag3jjr38ddtow7517gdpgz4c) != olh.tag (bne5h7ou7gingobf89ae5crr2p3p284y)". this happens because, in this specific case, fetch_remote_obj() takes the source zone's olh attributes and writes them directly to the head object, instead of first fetching them from the current head object in rados
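
One way to picture the behaviour described above is a toy model of the attribute merge. This is a sketch, not the actual rgw code: the function name `attrs_to_write`, the `user.rgw.olh.` prefix and the attribute keys are assumptions used for illustration only.

```python
OLH_PREFIX = "user.rgw.olh."  # assumed xattr prefix, for illustration only


def attrs_to_write(source_attrs, current_head_attrs):
    """Build the attrs for the synced head object without clobbering local olh state."""
    # Drop any olh attributes that came from the source zone.
    merged = {k: v for k, v in source_attrs.items() if not k.startswith(OLH_PREFIX)}
    # Keep the olh attributes already present on the local head object.
    merged.update(
        {k: v for k, v in current_head_attrs.items() if k.startswith(OLH_PREFIX)}
    )
    return merged


# Hypothetical attribute sets for the source object and the local head object.
source = {"user.rgw.olh.idtag": "source-zone-tag", "user.rgw.etag": "new-etag"}
local_head = {"user.rgw.olh.idtag": "local-zone-tag", "user.rgw.etag": "old-etag"}

merged = attrs_to_write(source, local_head)
```

In this model the object data attributes come from the source zone while the olh tag stays the local one, so a later olh link operation would carry a tag that matches the local state.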


Related issues

Related to rgw - Bug #39118: rgw: remove_olh_pending_entries() does not limit the number of xattrs to remove Resolved 04/04/2019
Duplicated by rgw - Bug #21210: rgw:multisite: put obj in a version-suspended bucket when sync to slave zone, the list_index cannot added corretlly Duplicate
Copied to rgw - Backport #38080: mimic: multisite: overwrites in versioning-suspended buckets fail to sync Resolved
Copied to rgw - Backport #38081: luminous: multisite: overwrites in versioning-suspended buckets fail to sync Resolved

History

#1 Updated by Casey Bodley about 5 years ago

  • Status changed from In Progress to Fix Under Review

please backport both pull requests:
https://github.com/ceph/ceph/pull/25794 (fixes original bug)
https://github.com/ceph/ceph/pull/26157 (repairs damage caused by bug)

#2 Updated by Casey Bodley about 5 years ago

  • Status changed from Fix Under Review to Pending Backport

#3 Updated by Casey Bodley about 5 years ago

  • Copied to Backport #38080: mimic: multisite: overwrites in versioning-suspended buckets fail to sync added

#4 Updated by Casey Bodley about 5 years ago

  • Copied to Backport #38081: luminous: multisite: overwrites in versioning-suspended buckets fail to sync added

#5 Updated by Casey Bodley almost 5 years ago

  • Related to Bug #39118: rgw: remove_olh_pending_entries() does not limit the number of xattrs to remove added

#6 Updated by Nathan Cutler over 4 years ago

  • Pull request ID set to 25974

#7 Updated by duc pham over 4 years ago

I have the same issue. My cluster version is 13.2.6. After I suspend versioning on one site, I cannot re-upload the same object from another site.

#8 Updated by duc pham over 4 years ago

When I re-enable versioning from the site that could not re-upload the object, I get this error:

 20 RGWWQ: empty
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562971ffc600:13RGWOmapAppend: operate()
 15 stack 0x5629729e0000 end
 20 run: stack=0x5629729e0000 is done
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 20 collect(): s=0x5629717c5440 stack=0x5629729e0a20 is still running
 20 collect(): s=0x5629717c5440 stack=0x5629729e0000 is complete
 20 run: stack=0x5629717c5440 is_blocked_by_stack()=0 is_sleeping=0 waiting_for_child()=1
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562971f12000:20RGWContinuousLeaseCR: operate()
 15 stack 0x5629729e0a20 end
 20 run: stack=0x5629729e0a20 is done
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 20 collect(): s=0x5629717c5440 stack=0x5629729e0a20 is complete
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 10 RGW-SYNC:data:sync:shard[80]: incremental sync failed (r=-2)
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate() returned r=-2
 20 cr:s=0x5629717c5440:op=0x562971cae000:25RGWDataSyncShardControlCR: operate()
  5 data sync: Sync:11b2b871:data:DataShard:datalog.sync-status.shard.11b2b871-89ec-4d8d-b72f-8057b2dbf1ec.80:finish
  0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
 20 run: stack=0x5629717c5440 is io blocked
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x5629717bc700:20RGWContinuousLeaseCR: operate()
 20 run: stack=0x5629727565a0 is io blocked
 20 cr:s=0x562971cd6360:op=0x562971e30d00:18RGWDataSyncShardCR: operate()
 10 RGW-SYNC:data:sync:shard[105]: took lease
  5 data sync: Sync:11b2b871:data:DataShard:datalog.sync-status.shard.11b2b871-89ec-4d8d-b72f-8057b2dbf1ec.105:inc sync
 20 cr:s=0x5629729e0a20:op=0x56297204fe00:13RGWOmapAppend: operate()
 20 run: stack=0x5629729e0a20 is_blocked_by_stack()=0 is_sleeping=1 waiting_for_child()=0
 20 cr:s=0x562971cd6360:op=0x562972537200:21RGWRadosGetOmapKeysCR: operate()
 20 cr:s=0x562971cd6360:op=0x562972537200:21RGWRadosGetOmapKeysCR: operate()
 20 run: stack=0x562971cd6360 is io blocked
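
For long debug logs like the one above, a short script can pull out the data sync shards that reported a failure. This is a minimal sketch that assumes the "incremental sync failed (r=...)" message format shown in this log; the helper name `failed_shards` is hypothetical.

```python
import re

# Matches lines such as:
#   10 RGW-SYNC:data:sync:shard[80]: incremental sync failed (r=-2)
SYNC_FAILURE = re.compile(
    r"RGW-SYNC:data:sync:shard\[(\d+)\]: incremental sync failed \(r=(-?\d+)\)"
)


def failed_shards(log_lines):
    """Return (shard_id, return_code) for every failed incremental sync line."""
    failures = []
    for line in log_lines:
        match = SYNC_FAILURE.search(line)
        if match:
            failures.append((int(match.group(1)), int(match.group(2))))
    return failures
```

Called on the lines of an rgw log file, it returns one tuple per failure; a return code of -2 corresponds to ENOENT.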

#9 Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

#10 Updated by J. Eric Ivancich over 4 years ago

  • Pull request ID changed from 25974 to 25794

Updated pr id, which had transposed two digits.

#11 Updated by Casey Bodley over 2 years ago

  • Duplicated by Bug #21210: rgw:multisite: put obj in a version-suspended bucket when sync to slave zone, the list_index cannot added corretlly added
