Bug #37792

closed

multisite: overwrites in versioning-suspended buckets fail to sync

Added by Casey Bodley over 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
% Done: 0%
Source:
Tags: multisite versioning
Backport: luminous mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

steps to reproduce in a two-zone multisite configuration:

  1. create a bucket
  2. upload an object "obj"
  3. enable versioning on the bucket
  4. reupload the same object "obj"
  5. suspend versioning on the bucket
  6. reupload the same object "obj"
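
The six steps above can be sketched as a small Python helper. This is only an illustration, not part of the original report: it assumes an S3-compatible client object such as one created with boto3 and pointed at one zone's rgw endpoint, and the function name, bucket name, and object bodies are hypothetical.

```python
def reproduce(s3, bucket="repro-bucket", key="obj"):
    """Run the six reproduction steps against one zone's S3 endpoint.

    `s3` is assumed to be a boto3-style S3 client; the bucket name and
    object bodies are placeholders chosen for this sketch.
    """
    # 1. create a bucket
    s3.create_bucket(Bucket=bucket)
    # 2. upload an object "obj"
    s3.put_object(Bucket=bucket, Key=key, Body=b"v1")
    # 3. enable versioning on the bucket
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
    )
    # 4. reupload the same object "obj"
    s3.put_object(Bucket=bucket, Key=key, Body=b"v2")
    # 5. suspend versioning on the bucket
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Suspended"}
    )
    # 6. reupload the same object "obj" (the upload that fails to sync)
    s3.put_object(Bucket=bucket, Key=key, Body=b"v3")
```

To use it, pass a client configured for the zone that receives the writes, then check the other zone's rgw log for the sync errors.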

the third upload (step 6) will repeatedly fail to sync. the rgw log shows errors like "cls_rgw_bucket_link_olh() returned r=-125" (-ECANCELED), and the osd log shows errors like "NOTICE: op.olh_tag (zxopy27aag3jjr38ddtow7517gdpgz4c) != olh.tag (bne5h7ou7gingobf89ae5crr2p3p284y)". this happens because, in this specific case, fetch_remote_obj() takes the source zone's olh attributes and writes them directly to the head object, instead of first fetching the olh attributes from the current head object in rados
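
A toy model may help show why the tag comparison fails. It is a simplified reading of the description, not the actual rgw or cls_rgw code: the class, the function names, and the attribute key are all hypothetical, and 125 is the Linux value of ECANCELED.

```python
ECANCELED = 125  # Linux errno value, matching r=-125 in the rgw log


class Zone:
    """Toy destination zone: one olh tag in the bucket index, one on the head object."""

    def __init__(self, olh_tag):
        self.index_olh_tag = olh_tag             # tag recorded in the bucket index
        self.head_attrs = {"olh_tag": olh_tag}   # xattrs on the head object


def link_olh(zone):
    """Toy tag check: the op carries the head object's tag to the index."""
    op_tag = zone.head_attrs["olh_tag"]
    if op_tag != zone.index_olh_tag:
        return -ECANCELED  # op.olh_tag != olh.tag
    return 0


def fetch_buggy(dest, source_attrs):
    """Write the source zone's attrs, olh tag included, straight to the head."""
    dest.head_attrs = dict(source_attrs)
    return link_olh(dest)


def fetch_fixed(dest, source_attrs):
    """Read the current head's olh tag first and keep it."""
    merged = dict(source_attrs)
    merged["olh_tag"] = dest.head_attrs["olh_tag"]
    dest.head_attrs = merged
    return link_olh(dest)
```

In this model the two zones hold different olh tags for the same object, so copying the source zone's tag onto the destination head makes every later link attempt fail the comparison, while keeping the local tag lets it pass.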


Related issues 4 (0 open, 4 closed)

Related to rgw - Bug #39118: rgw: remove_olh_pending_entries() does not limit the number of xattrs to remove (Resolved, Casey Bodley, 04/04/2019)
Has duplicate rgw - Bug #21210: rgw:multisite: put obj in a version-suspended bucket when sync to slave zone, the list_index cannot added corretlly (Duplicate, Casey Bodley)
Copied to rgw - Backport #38080: mimic: multisite: overwrites in versioning-suspended buckets fail to sync (Resolved, Nathan Cutler)
Copied to rgw - Backport #38081: luminous: multisite: overwrites in versioning-suspended buckets fail to sync (Resolved, Casey Bodley)

Actions #1

Updated by Casey Bodley over 5 years ago

  • Status changed from In Progress to Fix Under Review

please backport both pull requests:
https://github.com/ceph/ceph/pull/25794 (fixes original bug)
https://github.com/ceph/ceph/pull/26157 (repairs damage caused by bug)

Actions #2

Updated by Casey Bodley about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #3

Updated by Casey Bodley about 5 years ago

  • Copied to Backport #38080: mimic: multisite: overwrites in versioning-suspended buckets fail to sync added
Actions #4

Updated by Casey Bodley about 5 years ago

  • Copied to Backport #38081: luminous: multisite: overwrites in versioning-suspended buckets fail to sync added
Actions #5

Updated by Casey Bodley about 5 years ago

  • Related to Bug #39118: rgw: remove_olh_pending_entries() does not limit the number of xattrs to remove added
Actions #6

Updated by Nathan Cutler almost 5 years ago

  • Pull request ID set to 25974
Actions #7

Updated by duc pham almost 5 years ago

I have the same issue. My cluster version is 13.2.6. After I suspend versioning on one site, I cannot reupload the same obj from another site.

Actions #8

Updated by duc pham almost 5 years ago

When I re-enable versioning from the site that could not reupload the obj, I get this error:

 20 RGWWQ: empty
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562972a00c00:21RGWRadosSetOmapKeysCR: operate()
 20 cr:s=0x5629729e0000:op=0x562971ffc600:13RGWOmapAppend: operate()
 15 stack 0x5629729e0000 end
 20 run: stack=0x5629729e0000 is done
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 20 collect(): s=0x5629717c5440 stack=0x5629729e0a20 is still running
 20 collect(): s=0x5629717c5440 stack=0x5629729e0000 is complete
 20 run: stack=0x5629717c5440 is_blocked_by_stack()=0 is_sleeping=0 waiting_for_child()=1
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562972537200:22RGWSimpleRadosUnlockCR: operate()
 20 cr:s=0x5629729e0a20:op=0x562971f12000:20RGWContinuousLeaseCR: operate()
 15 stack 0x5629729e0a20 end
 20 run: stack=0x5629729e0a20 is done
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 20 collect(): s=0x5629717c5440 stack=0x5629729e0a20 is complete
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate()
 10 RGW-SYNC:data:sync:shard[80]: incremental sync failed (r=-2)
 20 cr:s=0x5629717c5440:op=0x562971da9600:18RGWDataSyncShardCR: operate() returned r=-2
 20 cr:s=0x5629717c5440:op=0x562971cae000:25RGWDataSyncShardControlCR: operate()
  5 data sync: Sync:11b2b871:data:DataShard:datalog.sync-status.shard.11b2b871-89ec-4d8d-b72f-8057b2dbf1ec.80:finish
  0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
 20 run: stack=0x5629717c5440 is io blocked
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x56297204fe00:20RGWSimpleRadosLockCR: operate()
 20 cr:s=0x5629727565a0:op=0x5629717bc700:20RGWContinuousLeaseCR: operate()
 20 run: stack=0x5629727565a0 is io blocked
 20 cr:s=0x562971cd6360:op=0x562971e30d00:18RGWDataSyncShardCR: operate()
 10 RGW-SYNC:data:sync:shard[105]: took lease
  5 data sync: Sync:11b2b871:data:DataShard:datalog.sync-status.shard.11b2b871-89ec-4d8d-b72f-8057b2dbf1ec.105:inc sync
 20 cr:s=0x5629729e0a20:op=0x56297204fe00:13RGWOmapAppend: operate()
 20 run: stack=0x5629729e0a20 is_blocked_by_stack()=0 is_sleeping=1 waiting_for_child()=0
 20 cr:s=0x562971cd6360:op=0x562972537200:21RGWRadosGetOmapKeysCR: operate()
 20 cr:s=0x562971cd6360:op=0x562972537200:21RGWRadosGetOmapKeysCR: operate()
 20 run: stack=0x562971cd6360 is io blocked
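
When reading a log like this one, it can help to pull out only the lines that carry a negative return code and translate the code to its errno name. The following is a minimal sketch using only the Python standard library; the function name and the regular expression are illustrative and cover only the "r=-N" and "returned -N" patterns seen above.

```python
import errno
import re

# Matches "r=-2" or "returned -2"; covers only the patterns in the log above.
RETURN_CODE = re.compile(r"(?:\br=|returned )(-\d+)")


def failed_codes(lines):
    """Return (code, errno_name, line) for each line with a negative return code."""
    results = []
    for line in lines:
        match = RETURN_CODE.search(line)
        if match:
            code = int(match.group(1))
            name = errno.errorcode.get(-code, "unknown")
            results.append((code, name, line.strip()))
    return results
```

Applied to the log above, this would pick out the "incremental sync failed (r=-2)" line and the two "returned" lines that follow it; -2 corresponds to ENOENT.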
Actions #9

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved
Actions #10

Updated by J. Eric Ivancich over 4 years ago

  • Pull request ID changed from 25974 to 25794

Updated pr id, which had transposed two digits.

Actions #11

Updated by Casey Bodley over 2 years ago

  • Has duplicate Bug #21210: rgw:multisite: put obj in a version-suspended bucket when sync to slave zone, the list_index cannot added corretlly added