Bug #62411
RGWLC::RGWDeleteLC() failed to set attrs on bucket=test-client.0-jja3ud4g2xqpxdd-130 returned err=-125
Description
2023-08-10T04:43:10.746 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_bucket_delete_nonempty PASSED [ 36%]
2023-08-10T04:43:12.033 INFO:tasks.rgw.client.0.smithi134.stdout:2023-08-10T04:43:12.029+0000 4fc27640 -1 req 11380411328366689377 0.097997732s s3:delete_bucket ERROR: could not remove bucket test-client.0-jja3ud4g2xqpxdd-130
2023-08-10T04:43:12.119 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl PASSED [ 36%]
2023-08-10T04:43:12.274 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_object_write_to_nonexist_bucket PASSED [ 36%]
2023-08-10T04:43:12.275 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_object_write_to_nonexist_bucket ERROR [ 36%]
2023-08-10T04:43:12.755 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_object_write_with_chunked_transfer_encoding PASSED [ 36%]
2023-08-10T04:43:12.755 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_object_write_with_chunked_transfer_encoding ERROR [ 36%]
2023-08-10T04:43:13.191 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_bucket_create_delete PASSED [ 36%]
2023-08-10T04:43:13.191 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_bucket_create_delete ERROR [ 36%]
after that "could not remove bucket test-client.0-jja3ud4g2xqpxdd-130" error, every following test fails to clean up with NoSuchBucket when calling nuke_bucket(test-client.0-jja3ud4g2xqpxdd-130)
the rgw log adds some context: http://qa-proxy.ceph.com/teuthology/prsrivas-2023-08-10_03:42:21-rgw-wip-rgw-61916-test-failures-distro-default-smithi/7364436/remote/smithi134/log/rgw.ceph.client.0.log.gz
2023-08-10T04:43:11.993+0000 b4d45640 0 lifecycle: RGWLC::RGWDeleteLC() failed to set attrs on bucket=test-client.0-jja3ud4g2xqpxdd-130 returned err=-125
2023-08-10T04:43:12.018+0000 df59a640 10 req 11380411328366689377 0.086997993s s3:delete_bucket cache get: name=default.rgw.meta+root+test-client.0-jja3ud4g2xqpxdd-130 : hit (requested=0x11, cached=0x17)
2023-08-10T04:43:12.018+0000 df59a640 10 req 11380411328366689377 0.086997993s s3:delete_bucket removing default.rgw.meta+root+test-client.0-jja3ud4g2xqpxdd-130 from cache
2023-08-10T04:43:12.018+0000 df59a640 10 req 11380411328366689377 0.086997993s s3:delete_bucket distributing notification oid=default.rgw.control:notify.7 cni=[op: 1, obj: default.rgw.meta:root:test-client.0-jja3ud4g2xqpxdd-130, ofs0, ns]
2023-08-10T04:43:12.020+0000 16281640 10 rgw watcher librados: RGWWatcher::handle_notify() notify_id 1073741824311 cookie 693002112 notifier 4614 bl.length()=177
2023-08-10T04:43:12.024+0000 12862c640 10 req 11380411328366689377 0.092997849s s3:delete_bucket distributing notification oid=default.rgw.control:notify.7 cni=[op: 1, obj: default.rgw.meta:root:.bucket.meta.test-client.0-jja3ud4g2xqpxdd-130:c4205a4e-e47c-412a-a59c-96762e734842.4620.272, ofs0, ns]
2023-08-10T04:43:12.025+0000 16a82640 10 rgw watcher librados: RGWWatcher::handle_notify() notify_id 1073741824312 cookie 693002112 notifier 4614 bl.length()=236
2023-08-10T04:43:12.029+0000 4fc27640 -1 req 11380411328366689377 0.097997732s s3:delete_bucket ERROR: could not remove bucket test-client.0-jja3ud4g2xqpxdd-130
when we hit that RGWLC::RGWDeleteLC() error, i think we've already deleted the bucket entrypoint, but never removed the bucket from the user's user.buckets list. that causes the s3test cleanup code to see the bucket in ListBuckets, but be unable to ListObjectVersions to clean it up

the -125 (ECANCELED) error would result from a race, which seems to be exercised by test_bucket_concurrent_set_canned_acl. during bucket deletion, we shouldn't need to update the bucket metadata; we should only be removing the bucket's entry from the lc list
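a toy model of that half-deleted state (plain Python, not RGW code; all names here are illustrative): the entrypoint and the per-user bucket list are separate pieces of metadata, so deleting only the former leaves a bucket that lists but doesn't resolve:

```python
class NoSuchBucket(Exception):
    pass

# two separate pieces of state, as in rgw: the bucket entrypoint
# metadata and the per-user bucket list
entrypoints = {"bucket-130": {}}
user_buckets = {"testuser": ["bucket-130"]}

def delete_bucket(name):
    # the failure mode: the entrypoint gets deleted, but the error
    # path skips removing the bucket from the user's list
    del entrypoints[name]
    # missing: user_buckets["testuser"].remove(name)

def list_buckets(user):
    return user_buckets[user]

def list_object_versions(name):
    if name not in entrypoints:
        raise NoSuchBucket(name)
    return []

delete_bucket("bucket-130")
print(list_buckets("testuser"))      # still lists "bucket-130"
try:
    list_object_versions("bucket-130")
except NoSuchBucket:
    print("cleanup sees NoSuchBucket")
```

this matches the s3tests symptom: nuke_bucket() finds the bucket via ListBuckets, then gets NoSuchBucket trying to empty it.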
Updated by Matt Benjamin 9 months ago
hi casey,
to clarify, is this a bug in the lc code, or the api it's calling?
Matt
Updated by Casey Bodley 9 months ago
i was able to reproduce this locally by a) running radosgw under valgrind to slow things down, and b) using the pytest-repeat plugin to run test_bucket_concurrent_set_canned_acl hundreds of times in a row
s3tests diff:
diff --git a/requirements.txt b/requirements.txt
index 7742d8f..39736c2 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -12,4 +12,5 @@ pytz >=2011k
httplib2
lxml
pytest
+pytest-repeat
tox
in this case, the failures started after 335 runs:
$ S3TEST_CONF=s3tests.conf.SAMPLE tox -- -v s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl --count 1000
...
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[332-1000] PASSED [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[333-1000] PASSED [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[334-1000] PASSED [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[335-1000] PASSED [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[336-1000] PASSED [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[336-1000] ERROR [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[337-1000] PASSED [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[337-1000] ERROR [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[338-1000] PASSED [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[338-1000] ERROR [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[339-1000] PASSED [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[339-1000] ERROR [ 33%]
Updated by Casey Bodley 9 months ago
the teuthology run doesn't enable debug-ms=1, so i'm sharing sections of my logs:
2023-08-11T12:29:15.210-0400 7b4d26c0 20 req 10056301545426509648 0.082002886s s3:delete_bucket abort_multiparts INFO: aborting and cleaning up multipart upload(s); bucket=":yournamehere-vkrdgfri5xv6mcl3-335[fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335])"; uploads.size()=0; is_truncated=0
2023-08-11T12:29:15.211-0400 7b4d26c0 10 lifecycle: RGWRados::convert_old_bucket_info(): bucket=:yournamehere-vkrdgfri5xv6mcl3-335[fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335])
2023-08-11T12:29:15.211-0400 7b4d26c0 10 lifecycle: cache get: name=default.rgw.meta+root+yournamehere-vkrdgfri5xv6mcl3-335 : hit (requested=0x17, cached=0x17)
2023-08-11T12:29:15.211-0400 7b4d26c0 20 lifecycle: WARNING: blocking librados call
2023-08-11T12:29:15.212-0400 7b4d26c0 1 -- 192.168.245.130:0/706717837 --> [v2:192.168.245.130:6800/3526074991,v1:192.168.245.130:6801/3526074991] -- osd_op(unknown.0.0:70954 4.0 4:eeb87d7b:root::.bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335%3afbfb299a-87f1-4a5b-bc21-232931f78614.4156.335:head [delete,create,call version.check_conds in=122b,call version.set in=58b,writefull 0~387 in=387b,setxattr user.rgw.acl (209) in=221b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e16) v8 -- 0xbd3e230 con 0x15644f30
2023-08-11T12:29:15.214-0400 13b766c0 1 -- 192.168.245.130:0/706717837 <== osd.0 v2:192.168.245.130:6800/3526074991 78664 ==== osd_op_reply(70954 .bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335:fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335 [delete,create,call,call,writefull 0~387,setxattr (209)] v16'67567 uv67560 ondisk = -125 ((125) Operation canceled)) v8 ==== 446+0+0 (crc 0 0 0) 0x3ef3e610 con 0x15644f30
2023-08-11T12:29:15.215-0400 7b4d26c0 10 lifecycle: removing default.rgw.meta+root+.bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335:fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335 from cache
2023-08-11T12:29:15.215-0400 7b4d26c0 0 lifecycle: RGWLC::RGWDeleteLC() failed to set attrs on bucket=yournamehere-vkrdgfri5xv6mcl3-335 returned err=-125
this error is coming from the call to RGWLC::remove_bucket_config() in RadosBucket::remove_bucket(): https://github.com/ceph/ceph/blob/875dae76/src/rgw/driver/rados/rgw_sal_rados.cc#L453-L455

because that unnecessarily uses the default argument bool merge_attrs = true, it overwrites the bucket metadata to remove RGW_ATTR_LC, which uses cls_version to bump the version number
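the version check that produces -125 behaves like a compare-and-swap: the write carries the version the caller read, and the OSD rejects it if the stored version has since changed. a simplified sketch of that mechanism (a plain Python model, not the cls_version code):

```python
import errno

# Simplified model of a cls_version-guarded write: the caller reads a
# version, and the write only succeeds if the stored version still
# matches; otherwise the OSD returns -ECANCELED (-125).

store = {"ver": 6, "attrs": {}}

def guarded_write(expected_ver, new_attrs):
    if store["ver"] != expected_ver:
        return -errno.ECANCELED      # -125, "(125) Operation canceled"
    store["attrs"] = new_attrs
    store["ver"] += 1                # version.set bumps the version
    return 0

# writer A read version 6, but a racing writer already bumped it to 7,
# so A's guarded write fails
store["ver"] = 7
print(guarded_write(6, {"acl": b"new"}))   # -125
```

this is why the log shows [call version.check_conds, call version.set, ...] ops failing together with ondisk = -125.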
2023-08-11T12:29:15.276-0400 764c86c0 1 -- 192.168.245.130:0/706717837 --> [v2:192.168.245.130:6800/3526074991,v1:192.168.245.130:6801/3526074991] -- osd_op(unknown.0.0:70992 4.0 4:eeb87d7b:root::.bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335%3afbfb299a-87f1-4a5b-bc21-232931f78614.4156.335:head [call version.check_conds in=122b,call version.set in=58b,delete] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e16) v8 -- 0x57ae1430 con 0x15644f30
2023-08-11T12:29:15.278-0400 3ebcb6c0 1 -- 192.168.245.130:0/2158442906 --> [v2:192.168.245.130:6800/3526074991,v1:192.168.245.130:6801/3526074991] -- osd_op(unknown.0.0:1546 2.0 2:6fe6cc47:notif::queues_list_object:head [omap-get-keys in=12b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e16) v8 -- 0x57b14000 con 0x308b6740
2023-08-11T12:29:15.279-0400 13b766c0 1 -- 192.168.245.130:0/706717837 <== osd.0 v2:192.168.245.130:6800/3526074991 78706 ==== osd_op_reply(70992 .bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335:fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335 [call,call,delete] v16'67570 uv67560 ondisk = -125 ((125) Operation canceled)) v8 ==== 320+0+0 (crc 0 0 0) 0x3ece7c80 con 0x15644f30
2023-08-11T12:29:15.280-0400 6bcb36c0 -1 req 10056301545426509648 0.152005360s s3:delete_bucket ERROR: could not remove bucket yournamehere-vkrdgfri5xv6mcl3-335
this ECANCELED error is likely due to the write from remove_bucket_config(). i just don't understand why this doesn't fail on every DeleteBucket request
Updated by Casey Bodley 9 months ago
p.s. the buckets in this test case don't even have a lifecycle policy, so we're running this stuff unnecessarily
Updated by Casey Bodley 9 months ago
Casey Bodley wrote:
this error is coming from the call to RGWLC::remove_bucket_config() in RadosBucket::remove_bucket(): https://github.com/ceph/ceph/blob/875dae76/src/rgw/driver/rados/rgw_sal_rados.cc#L453-L455
because that unnecessarily uses the default argument bool merge_attrs = true, that overwrites the bucket metadata to remove RGW_ATTR_LC, which uses cls_version to bump the version number

this write could cause the later delete to fail with ECANCELED because the version number changed, but in the log it's this write in remove_bucket_config() that fails with ECANCELED - which implies that remove_bucket_config() is racing with some other update. i can't tell from the log what it is yet
Updated by Casey Bodley 9 months ago
i added some extra logging to the RGWObjVersionTracker to follow the cls_version updates. i see one of the put_acls requests bumping the version from 6->7, but the later delete_bucket request is checking for version 6. the delete_bucket starts after put_acls completes, so the requests themselves don't race
it seems to be a race in RGWSI_SysObj_Cache itself. while the put_acls requests are racing, i see the cache notifications arriving out of order:
2023-08-11T16:34:50.594-0400 1c7876c0 20 rgw watcher librados: RGWSI_SysObj_Cache received notify for objv _XqpmbIiCICgx1_zAdi60Nd9:7 of default.rgw.meta+root+.bucket.meta.yournamehere-3a276u7iwxaslg1c-242:40bd6ef6-ca4f-4f22-95d9-978bb0945b94.4156.242
...
2023-08-11T16:34:50.599-0400 1c7876c0 20 rgw watcher librados: RGWSI_SysObj_Cache received notify for objv _XqpmbIiCICgx1_zAdi60Nd9:6 of default.rgw.meta+root+.bucket.meta.yournamehere-3a276u7iwxaslg1c-242:40bd6ef6-ca4f-4f22-95d9-978bb0945b94.4156.242
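the effect of those reordered notifications can be modeled like this (toy Python, not the RGWSI_SysObj_Cache implementation): a cache that applies notifications in arrival order ends up holding whichever version arrived last, so the late :6 notify leaves a stale entry behind:

```python
cache = {}

def handle_notify(obj, version, attrs):
    # naive handler: always overwrite with whatever arrived last
    cache[obj] = {"ver": version, "attrs": attrs}

handle_notify("bucket.meta", 7, {"acl": "new"})  # objv :7 arrives first
handle_notify("bucket.meta", 6, {"acl": "old"})  # objv :6 arrives late
print(cache["bucket.meta"]["ver"])               # 6 -- stale

def handle_notify_checked(obj, version, attrs):
    # comparing object versions would reject the stale notification
    if obj in cache and cache[obj]["ver"] >= version:
        return
    cache[obj] = {"ver": version, "attrs": attrs}
```

if the cache is left at version 6, the next cls_version-guarded write checks against 6 while the stored version is 7, which matches the ECANCELED seen in delete_bucket.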
Updated by Casey Bodley 9 months ago
- Status changed from New to Triaged
- Tags changed from lifecycle to lifecycle cache