Bug #56997

closed

bucket lifecycle policy updates breaking metadata sync

Added by Casey Bodley over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2022-08-01T19:11:10.659+0300 7fdfd6ffd640 20 rgw async rados processor: remove lc config for hnjmrt-1
2022-08-01T19:11:10.660+0300 7fdfd6ffd640 10 lifecycle: RGWRados::convert_old_bucket_info(): bucket=:hnjmrt-1[2bd5e553-5d25-4207-8b94-0b96c5d80301.4146.1])
2022-08-01T19:11:10.660+0300 7fdfd6ffd640 10 lifecycle: cache get: name=a2.rgw.meta+root+hnjmrt-1 : miss
2022-08-01T19:11:10.660+0300 7fdfd6ffd640 20 lifecycle: rados->read ofs=0 len=0
2022-08-01T19:11:10.660+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf4406dc20:op=0x7fdf44173b40:20RGWMetaRemoveEntryCR: operate()
2022-08-01T19:11:10.660+0300 7fdfd6ffd640  1 -- 10.46.10.90:0/2917887852 --> [v2:10.46.10.90:6808/3488456,v1:10.46.10.90:6809/3488456] -- osd_op(unknown.0.0:3109 4.0 4:5ad72b23:root::hnjmrt-1:head [call version.read in=11b,read 0~0,getxattrs] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e16) v8 -- 0x7fdfbc018f60 con 0x55c905bfacc0
2022-08-01T19:11:10.660+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf4406dc20:op=0x7fdf44173b40:20RGWMetaRemoveEntryCR: operate()
2022-08-01T19:11:10.660+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf4406dc20:op=0x7fdf44173b40:20RGWMetaRemoveEntryCR: operate()
2022-08-01T19:11:10.660+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf4406dc20:op=0x7fdf44173b40:20RGWMetaRemoveEntryCR: operate()
2022-08-01T19:11:10.660+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf4406dc20:op=0x7fdf44147a80:24RGWMetaSyncSingleEntryCR: operate()
2022-08-01T19:11:10.660+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf4406dc20:op=0x7fdf44147a80:24RGWMetaSyncSingleEntryCR: operate()
2022-08-01T19:11:10.660+0300 7fdf577fe640 10 RGW-SYNC:meta:shard[10]:entry[bucket:hnjmrt-1]: success
2022-08-01T19:11:10.660+0300 7fdf577fe640 15 stack 0x7fdf4406dc20 end
2022-08-01T19:11:10.660+0300 7fdf577fe640 20 run: stack=0x7fdf4406dc20 is done
2022-08-01T19:11:10.660+0300 7fe049ffb640  1 -- 10.46.10.90:0/2917887852 <== osd.0 v2:10.46.10.90:6808/3488456 4310 ==== osd_op_reply(3109 hnjmrt-1 [call,read 0~0,getxattrs] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 236+0+0 (crc 0 0 0) 0x7fe03406b020 con 0x55c905bfacc0
2022-08-01T19:11:10.661+0300 7fdfd6ffd640 20 lifecycle: rados_obj.operate() r=-2 bl.length=0
2022-08-01T19:11:10.661+0300 7fdfd6ffd640 10 lifecycle: cache put: name=a2.rgw.meta+root+hnjmrt-1 info.flags=0x0
2022-08-01T19:11:10.661+0300 7fdfd6ffd640 10 lifecycle: adding a2.rgw.meta+root+hnjmrt-1 to cache LRU end
2022-08-01T19:11:10.661+0300 7fdfd6ffd640  0 lifecycle: ERROR: get_bucket_entrypoint_info() returned -2 bucket=:hnjmrt-1[2bd5e553-5d25-4207-8b94-0b96c5d80301.4146.1])
2022-08-01T19:11:10.661+0300 7fdfd6ffd640  0 lifecycle: ERROR: failed converting old bucket info: -2
2022-08-01T19:11:10.661+0300 7fdfd6ffd640  0 lifecycle: RGWLC::RGWDeleteLC() failed to set attrs on bucket=hnjmrt-1 returned err=-2
2022-08-01T19:11:10.661+0300 7fdfd6ffd640  0 rgw async rados processor: put_post failed to remove lc config for hnjmrt-1
2022-08-01T19:11:10.661+0300 7fdfd6ffd640  0 rgw async rados processor: ERROR: can't store key: bucket.instance:hnjmrt-1:2bd5e553-5d25-4207-8b94-0b96c5d80301.4146.1 ret=-2
2022-08-01T19:11:10.661+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf44171170:op=0x7fdf440d8fe0:19RGWMetaStoreEntryCR: operate()
2022-08-01T19:11:10.661+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf44171170:op=0x7fdf440d8fe0:19RGWMetaStoreEntryCR: operate() returned r=-2
2022-08-01T19:11:10.661+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf44171170:op=0x7fdf44065540:24RGWMetaSyncSingleEntryCR: operate()
2022-08-01T19:11:10.661+0300 7fdf577fe640 20 rgw rados thread: cr:s=0x7fdf44171170:op=0x7fdf44065540:24RGWMetaSyncSingleEntryCR: failed to store metadata entry: bucket.instance:hnjmrt-1:2bd5e553-5d25-4207-8b94-0b96c5d80301.4146.1, got retcode=-2, will retry

this ENOENT error originates from RGWMetadataHandlerPut_BucketInstance::put_post(), where https://github.com/ceph/ceph/pull/46928 recently added new logic to update the lc list

the root cause seems to be this error from get_bucket_entrypoint_info() at the bottom of this call stack:
  • RGWLC::remove_bucket_config()
  • RadosBucket::merge_and_store_attrs()
  • RGWBucketCtl::set_bucket_instance_attrs()
  • RGWBucketCtl::convert_old_bucket_info()

metadata sync can't make any guarantees about the ordering of these sync events. so when it needs to sync a piece of bucket instance metadata, that sync must not depend on the existence of its entrypoint metadata. in this case, metadata sync had just removed this entrypoint metadata because it was deleted on the master zone
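To make the ordering problem concrete, here is a minimal toy model in Python. It is not RGW code: `ToyMetaStore` and both `put_instance_*` methods are hypothetical names invented for illustration. It only shows why a bucket instance write that looks up the entrypoint fails with ENOENT once sync has already removed that entrypoint, while a write that does not depend on the entrypoint succeeds.

```python
import errno


class ToyMetaStore:
    """Hypothetical stand-in for a zone's metadata; not an RGW API."""

    def __init__(self):
        self.entrypoints = {}  # bucket name -> entrypoint metadata
        self.instances = {}    # (bucket name, instance id) -> instance metadata

    def remove_entrypoint(self, bucket):
        # Models sync applying "entrypoint deleted on the master zone".
        self.entrypoints.pop(bucket, None)

    def put_instance_coupled(self, bucket, instance_id, info):
        # Models the failing shape: storing the instance needs the entrypoint.
        if bucket not in self.entrypoints:
            return -errno.ENOENT
        self.instances[(bucket, instance_id)] = info
        return 0

    def put_instance_independent(self, bucket, instance_id, info):
        # Models the required behavior: no dependency on the entrypoint.
        self.instances[(bucket, instance_id)] = info
        return 0


store = ToyMetaStore()
store.entrypoints["hnjmrt-1"] = {}
store.remove_entrypoint("hnjmrt-1")  # the entrypoint removal is synced first

coupled = store.put_instance_coupled("hnjmrt-1", "instance-id", {})
independent = store.put_instance_independent("hnjmrt-1", "instance-id", {})
print(coupled, independent)
```

In this sketch the coupled write returns a negative ENOENT, which corresponds to the `ret=-2` that metadata sync keeps retrying in the log above.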

ultimately, i'm not sure why convert_old_bucket_info() is being called here. but RGWLC::remove_bucket_config() shouldn't be calling merge_and_store_attrs() to remove RGW_ATTR_LC, because RGWMetadataHandlerPut_BucketInstance::put_post() already saw that the attribute isn't there. this code path should only need to call guard_lc_modify()->sal_lc->rm_entry()
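A rough sketch of that suggestion, again as a toy Python model rather than the actual RGWLC code. The `ToyLC` class, the `merge_attrs` flag, the `entrypoint_exists` switch and the `RGW_ATTR_LC` placeholder string are all assumptions made for illustration. The idea is that the sync path skips the attribute write entirely and only removes the lc list entry.

```python
import errno

# Placeholder key; the real attribute name is not given in this report.
RGW_ATTR_LC = "RGW_ATTR_LC"


class ToyLC:
    """Hypothetical model of the lc list plus bucket attrs; not the RGWLC API."""

    def __init__(self, entrypoint_exists):
        self.entrypoint_exists = entrypoint_exists
        self.lc_list = set()  # buckets that currently have an lc entry

    def _merge_and_store_attrs(self, attrs):
        # Stand-in for the attribute write that ends up needing the entrypoint.
        if not self.entrypoint_exists:
            return -errno.ENOENT
        return 0

    def remove_bucket_config(self, bucket, attrs, merge_attrs=True):
        if merge_attrs:
            # Normal path: drop the lc attribute and persist the attrs.
            attrs.pop(RGW_ATTR_LC, None)
            ret = self._merge_and_store_attrs(attrs)
            if ret < 0:
                return ret
        # Sync path (merge_attrs=False): only the lc list entry is removed.
        self.lc_list.discard(bucket)
        return 0


lc = ToyLC(entrypoint_exists=False)
lc.lc_list.add("hnjmrt-1")

with_merge = lc.remove_bucket_config("hnjmrt-1", {})
without_merge = lc.remove_bucket_config("hnjmrt-1", {}, merge_attrs=False)
print(with_merge, without_merge)
```

With the entrypoint gone, the default path fails and leaves the lc entry in place, while the `merge_attrs=False` path removes the entry without touching bucket attributes.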


Related issues 1 (0 open, 1 closed)

Has duplicate: rgw - Bug #57129: rgw: multisite tests are failing on "meta checkpoint" checks (Duplicate)

#1

Updated by Matt Benjamin over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Matt Benjamin
#2

Updated by Matt Benjamin over 1 year ago

From what I can make out, the reason metadata sync must not call merge_and_store_attrs(...) is essentially a layering violation. More importantly, it certainly looks like this call path, if taken, should recover cleanly from the no-entrypoint error.

That said, it's easy to avoid this call path from remove_bucket_config() in more or less the same way set_bucket_config() does, so for now, let's do that.

Matt

#3

Updated by Matt Benjamin over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 47411
#4

Updated by Casey Bodley over 1 year ago

  • Has duplicate Bug #57129: rgw: multisite tests are failing on "meta checkpoint" checks added
#5

Updated by Casey Bodley over 1 year ago

  • Status changed from Fix Under Review to Resolved