Bug #37792

multisite: overwrites in versioning-suspended buckets fail to sync

Added by Casey Bodley 7 months ago. Updated 5 days ago.

Status: Pending Backport
Priority: High
Assignee:
Target version: -
Start date: 01/04/2019
Due date:
% Done: 0%
Source:
Tags: multisite versioning
Backport: luminous mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Steps to reproduce in a two-zone multisite configuration:

  1. Create a bucket.
  2. Upload an object "obj".
  3. Enable versioning on the bucket.
  4. Re-upload the same object "obj".
  5. Suspend versioning on the bucket.
  6. Re-upload the same object "obj".

The third upload repeatedly fails to sync, with errors like "cls_rgw_bucket_link_olh() returned r=-125" (ECANCELED) in the rgw log and errors like "NOTICE: op.olh_tag (zxopy27aag3jjr38ddtow7517gdpgz4c) != olh.tag (bne5h7ou7gingobf89ae5crr2p3p284y)" in the osd log. This happens because, in this specific case, fetch_remote_obj() takes the source zone's OLH attributes and writes them directly to the head object, instead of first fetching them from the current head object in RADOS.
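For reference, the reproduction steps above can be scripted with boto3 against the source zone's RGW endpoint. This is only a sketch; the endpoint URL, credentials, and bucket name are placeholders, not values taken from this report.

  # Hedged sketch of the reproduction steps, run against the source zone's RGW.
  import boto3

  s3 = boto3.client(
      's3',
      endpoint_url='http://rgw-zone-a:8080',   # placeholder endpoint
      aws_access_key_id='ACCESS_KEY',          # placeholder credentials
      aws_secret_access_key='SECRET_KEY',
  )

  bucket = 'repro-37792'                       # placeholder bucket name
  s3.create_bucket(Bucket=bucket)                            # 1. create a bucket
  s3.put_object(Bucket=bucket, Key='obj', Body=b'first')     # 2. upload "obj"

  s3.put_bucket_versioning(                                  # 3. enable versioning
      Bucket=bucket,
      VersioningConfiguration={'Status': 'Enabled'},
  )
  s3.put_object(Bucket=bucket, Key='obj', Body=b'second')    # 4. re-upload "obj"

  s3.put_bucket_versioning(                                  # 5. suspend versioning
      Bucket=bucket,
      VersioningConfiguration={'Status': 'Suspended'},
  )
  s3.put_object(Bucket=bucket, Key='obj', Body=b'third')     # 6. re-upload "obj"
  # This third upload is the one that then fails to sync to the other zone.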


Related issues

Related to rgw - Bug #39118: rgw: remove_olh_pending_entries() does not limit the number of xattrs to remove Pending Backport 04/04/2019
Copied to rgw - Backport #38080: mimic: multisite: overwrites in versioning-suspended buckets fail to sync In Progress
Copied to rgw - Backport #38081: luminous: multisite: overwrites in versioning-suspended buckets fail to sync Resolved

History

#1 Updated by Casey Bodley 7 months ago

  • Status changed from In Progress to Need Review

Please backport both pull requests:
https://github.com/ceph/ceph/pull/25794 (fixes original bug)
https://github.com/ceph/ceph/pull/26157 (repairs damage caused by bug)

#2 Updated by Casey Bodley 6 months ago

  • Status changed from Need Review to Pending Backport

#3 Updated by Casey Bodley 6 months ago

  • Copied to Backport #38080: mimic: multisite: overwrites in versioning-suspended buckets fail to sync added

#4 Updated by Casey Bodley 6 months ago

  • Copied to Backport #38081: luminous: multisite: overwrites in versioning-suspended buckets fail to sync added

#5 Updated by Casey Bodley 4 months ago

  • Related to Bug #39118: rgw: remove_olh_pending_entries() does not limit the number of xattrs to remove added

#6 Updated by Nathan Cutler 9 days ago

  • Pull request ID set to 25974

#7 Updated by duc pham 5 days ago

I have the same issue. My cluster version is 13.2.6. When I suspend versioning on one site, the re-uploaded object then fails to sync from the other site.

#8 Updated by duc pham 5 days ago

When I re-enable versioning from the site that could not sync the re-uploaded object, I get the following errors:

 20 RGWWQ: empty
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562971ffc600:13RGWOmapAppend: operate()
 15 stack 0x5629729e0000 end
 20 run: stack=0x5629729e0000 is done
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 20 collect(): s=0x5629717c5440 stack=0x5629729e0a20 is still running
 20 collect(): s=0x5629717c5440 stack=0x5629729e0000 is complete
 20 run: stack=0x5629717c5440 is_blocked_by_stack()=0 is_sleeping=0 waiting_for_child()=1
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562971f12000:20RGWContinuousLeaseCR: operate()
 15 stack 0x5629729e0a20 end
 20 run: stack=0x5629729e0a20 is done
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 20 collect(): s=0x5629717c5440 stack=0x5629729e0a20 is complete
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 10 RGW-SYNC:data:sync:shard[80]: incremental sync failed (r=-2)
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate() returned r=-2
 20 cr:s=0x5629717c5440:op=0x562971cae000:25RGWDataSyncShardControlCR: operate()
  5 data sync: Sync:11b2b871:data:DataShard:datalog.sync-status.shard.11b2b871-89ec-4d8d-b72f-8057b2dbf1ec.80:finish
  0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
 20 run: stack=0x5629717c5440 is io blocked
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x5629717bc700:20RGWContinuousLeaseCR: operate()
 20 run: stack=0x5629727565a0 is io blocked
 20 cr:s=0x562971cd6360:op=0x562971e30d00:18RGWDataSyncShardCR: operate()
 10 RGW-SYNC:data:sync:shard[105]: took lease
  5 data sync: Sync:11b2b871:data:DataShard:datalog.sync-status.shard.11b2b871-89ec-4d8d-b72f-8057b2dbf1ec.105:inc sync
 20 cr:s=0x5629729e0a20:op=0x56297204fe00:13RGWOmapAppend: operate()
 20 run: stack=0x5629729e0a20 is_blocked_by_stack()=0 is_sleeping=1 waiting_for_child()=0
 20 cr:s=0x562971cd6360:op=0x562972537200:21RGWRadosGetOmapKeysCR: operate()
 20 cr:s=0x562971cd6360:op=0x562972537200:21RGWRadosGetOmapKeysCR: operate()
 20 run: stack=0x562971cd6360 is io blocked
