Bug #62411 (open)

RGWLC::RGWDeleteLC() failed to set attrs on bucket=test-client.0-jja3ud4g2xqpxdd-130 returned err=-125

Added by Casey Bodley 9 months ago. Updated 9 months ago.

Status: Triaged
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags: lifecycle cache
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

from http://qa-proxy.ceph.com/teuthology/prsrivas-2023-08-10_03:42:21-rgw-wip-rgw-61916-test-failures-distro-default-smithi/7364436/teuthology.log:

2023-08-10T04:43:10.746 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_bucket_delete_nonempty PASSED  [ 36%]
2023-08-10T04:43:12.033 INFO:tasks.rgw.client.0.smithi134.stdout:2023-08-10T04:43:12.029+0000 4fc27640 -1 req 11380411328366689377 0.097997732s s3:delete_bucket ERROR: could not remove bucket test-client.0-jja3ud4g2xqpxdd-130
2023-08-10T04:43:12.119 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl PASSED [ 36%]
2023-08-10T04:43:12.274 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_object_write_to_nonexist_bucket PASSED [ 36%]
2023-08-10T04:43:12.275 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_object_write_to_nonexist_bucket ERROR [ 36%]
2023-08-10T04:43:12.755 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_object_write_with_chunked_transfer_encoding PASSED [ 36%]
2023-08-10T04:43:12.755 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_object_write_with_chunked_transfer_encoding ERROR [ 36%]
2023-08-10T04:43:13.191 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_bucket_create_delete PASSED    [ 36%]
2023-08-10T04:43:13.191 INFO:teuthology.orchestra.run.smithi134.stdout:s3tests_boto3/functional/test_s3.py::test_bucket_create_delete ERROR     [ 36%]

after that "could not remove bucket test-client.0-jja3ud4g2xqpxdd-130" error, every following test fails to clean up with NoSuchBucket when calling nuke_bucket(test-client.0-jja3ud4g2xqpxdd-130)

the rgw log adds some context: http://qa-proxy.ceph.com/teuthology/prsrivas-2023-08-10_03:42:21-rgw-wip-rgw-61916-test-failures-distro-default-smithi/7364436/remote/smithi134/log/rgw.ceph.client.0.log.gz

2023-08-10T04:43:11.993+0000 b4d45640  0 lifecycle: RGWLC::RGWDeleteLC() failed to set attrs on bucket=test-client.0-jja3ud4g2xqpxdd-130 returned err=-125
2023-08-10T04:43:12.018+0000 df59a640 10 req 11380411328366689377 0.086997993s s3:delete_bucket cache get: name=default.rgw.meta+root+test-client.0-jja3ud4g2xqpxdd-130 : hit (requested=0x11, cached=0x17)
2023-08-10T04:43:12.018+0000 df59a640 10 req 11380411328366689377 0.086997993s s3:delete_bucket removing default.rgw.meta+root+test-client.0-jja3ud4g2xqpxdd-130 from cache
2023-08-10T04:43:12.018+0000 df59a640 10 req 11380411328366689377 0.086997993s s3:delete_bucket distributing notification oid=default.rgw.control:notify.7 cni=[op: 1, obj: default.rgw.meta:root:test-client.0-jja3ud4g2xqpxdd-130, ofs0, ns]
2023-08-10T04:43:12.020+0000 16281640 10 rgw watcher librados: RGWWatcher::handle_notify()  notify_id 1073741824311 cookie 693002112 notifier 4614 bl.length()=177
2023-08-10T04:43:12.024+0000 12862c640 10 req 11380411328366689377 0.092997849s s3:delete_bucket distributing notification oid=default.rgw.control:notify.7 cni=[op: 1, obj: default.rgw.meta:root:.bucket.meta.test-client.0-jja3ud4g2xqpxdd-130:c4205a4e-e47c-412a-a59c-96762e734842.4620.272, ofs0, ns]
2023-08-10T04:43:12.025+0000 16a82640 10 rgw watcher librados: RGWWatcher::handle_notify()  notify_id 1073741824312 cookie 693002112 notifier 4614 bl.length()=236
2023-08-10T04:43:12.029+0000 4fc27640 -1 req 11380411328366689377 0.097997732s s3:delete_bucket ERROR: could not remove bucket test-client.0-jja3ud4g2xqpxdd-130

when we hit that RGWLC::RGWDeleteLC() error, i think we've already deleted the bucket entrypoint but never removed the bucket from the user's user.buckets list. that causes the s3test cleanup code to see the bucket in ListBuckets, but be unable to ListObjectVersions to clean it up

the -125 (ECANCELED) error would result from a race, which seems to be exercised by test_bucket_concurrent_set_canned_acl. during bucket deletion, we shouldn't need to update the bucket metadata at all; we should only need to remove the bucket's entry from the lc list
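
A minimal, self-contained sketch of that idea, using hypothetical names (BucketMeta, LcIndex, remove_lifecycle_config) rather than the actual rgw types: when the whole bucket is going away, only drop the bucket's entry from the lifecycle list and leave the bucket metadata alone; the attribute rewrite is only needed when the bucket keeps existing (DeleteBucketLifecycle).

// hypothetical simplification of the rgw lifecycle cleanup path, not the actual Ceph code
#include <map>
#include <set>
#include <string>

struct BucketMeta {
  std::map<std::string, std::string> attrs;  // "user.rgw.lc" stands in for RGW_ATTR_LC
  int version = 0;                           // bumped on every metadata write (cls_version)
};

struct LcIndex {
  std::set<std::string> entries;  // buckets that have a lifecycle configuration
};

void remove_lifecycle_config(LcIndex& lc, BucketMeta& meta,
                             const std::string& bucket,
                             bool bucket_is_being_deleted)
{
  if (!bucket_is_being_deleted) {
    // only needed when the bucket survives (DeleteBucketLifecycle):
    // strip the attribute and rewrite the bucket metadata
    meta.attrs.erase("user.rgw.lc");
    ++meta.version;  // this version-checked metadata write is what races in this bug
  }
  lc.entries.erase(bucket);  // always safe; no version check involved
}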

Actions #1

Updated by Matt Benjamin 9 months ago

hi casey,

to clarify, is this a bug in the lc code, or the api it's calling?

Matt

Actions #2

Updated by Casey Bodley 9 months ago

i was able to reproduce this locally by a) running radosgw under valgrind to slow things down, and b) using the pytest-repeat plugin to run test_bucket_concurrent_set_canned_acl hundreds of times in a row

s3tests diff:

diff --git a/requirements.txt b/requirements.txt
index 7742d8f..39736c2 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -12,4 +12,5 @@ pytz >=2011k
 httplib2
 lxml
 pytest
+pytest-repeat
 tox

in this case, the failures started after 335 runs:

$ S3TEST_CONF=s3tests.conf.SAMPLE tox -- -v s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl --count 1000
...
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[332-1000] PASSED                                                            [ 33%] 
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[333-1000] PASSED                                                            [ 33%] 
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[334-1000] PASSED                                                            [ 33%] 
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[335-1000] PASSED                                                            [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[336-1000] PASSED                                                            [ 33%] 
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[336-1000] ERROR                                                             [ 33%] 
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[337-1000] PASSED                                                            [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[337-1000] ERROR                                                             [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[338-1000] PASSED                                                            [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[338-1000] ERROR                                                             [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[339-1000] PASSED                                                            [ 33%]
s3tests_boto3/functional/test_s3.py::test_bucket_concurrent_set_canned_acl[339-1000] ERROR                                                             [ 33%]

Actions #3

Updated by Casey Bodley 9 months ago

the teuthology run doesn't enable debug-ms=1, so i'm sharing sections of my logs:

2023-08-11T12:29:15.210-0400 7b4d26c0 20 req 10056301545426509648 0.082002886s s3:delete_bucket abort_multiparts INFO: aborting and cleaning up multipart upload(s); bucket=":yournamehere-vkrdgfri5xv6mcl3-335[fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335])"; uploads.size()=0; is_truncated=0
2023-08-11T12:29:15.211-0400 7b4d26c0 10 lifecycle: RGWRados::convert_old_bucket_info(): bucket=:yournamehere-vkrdgfri5xv6mcl3-335[fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335])
2023-08-11T12:29:15.211-0400 7b4d26c0 10 lifecycle: cache get: name=default.rgw.meta+root+yournamehere-vkrdgfri5xv6mcl3-335 : hit (requested=0x17, cached=0x17)
2023-08-11T12:29:15.211-0400 7b4d26c0 20 lifecycle: WARNING: blocking librados call
2023-08-11T12:29:15.212-0400 7b4d26c0  1 -- 192.168.245.130:0/706717837 --> [v2:192.168.245.130:6800/3526074991,v1:192.168.245.130:6801/3526074991] -- osd_op(unknown.0.0:70954 4.0 4:eeb87d7b:root::.bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335%3afbfb299a-87f1-4a5b-bc21-232931f78614.4156.335:head [delete,create,call version.check_conds in=122b,call version.set in=58b,writefull 0~387 in=387b,setxattr user.rgw.acl (209) in=221b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e16) v8 -- 0xbd3e230 con 0x15644f30
2023-08-11T12:29:15.214-0400 13b766c0  1 -- 192.168.245.130:0/706717837 <== osd.0 v2:192.168.245.130:6800/3526074991 78664 ==== osd_op_reply(70954 .bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335:fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335 [delete,create,call,call,writefull 0~387,setxattr (209)] v16'67567 uv67560 ondisk = -125 ((125) Operation canceled)) v8 ==== 446+0+0 (crc 0 0 0) 0x3ef3e610 con 0x15644f30
2023-08-11T12:29:15.215-0400 7b4d26c0 10 lifecycle: removing default.rgw.meta+root+.bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335:fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335 from cache
2023-08-11T12:29:15.215-0400 7b4d26c0  0 lifecycle: RGWLC::RGWDeleteLC() failed to set attrs on bucket=yournamehere-vkrdgfri5xv6mcl3-335 returned err=-125

this error is coming from the call to RGWLC::remove_bucket_config() in RadosBucket::remove_bucket(): https://github.com/ceph/ceph/blob/875dae76/src/rgw/driver/rados/rgw_sal_rados.cc#L453-L455
because that call unnecessarily uses the default argument bool merge_attrs = true, it rewrites the bucket metadata to strip RGW_ATTR_LC, and that write uses cls_version to bump the version number

2023-08-11T12:29:15.276-0400 764c86c0  1 -- 192.168.245.130:0/706717837 --> [v2:192.168.245.130:6800/3526074991,v1:192.168.245.130:6801/3526074991] -- osd_op(unknown.0.0:70992 4.0 4:eeb87d7b:root::.bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335%3afbfb299a-87f1-4a5b-bc21-232931f78614.4156.335:head [call version.check_conds in=122b,call version.set in=58b,delete] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e16) v8 -- 0x57ae1430 con 0x15644f30
2023-08-11T12:29:15.278-0400 3ebcb6c0  1 -- 192.168.245.130:0/2158442906 --> [v2:192.168.245.130:6800/3526074991,v1:192.168.245.130:6801/3526074991] -- osd_op(unknown.0.0:1546 2.0 2:6fe6cc47:notif::queues_list_object:head [omap-get-keys in=12b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e16) v8 -- 0x57b14000 con 0x308b6740
2023-08-11T12:29:15.279-0400 13b766c0  1 -- 192.168.245.130:0/706717837 <== osd.0 v2:192.168.245.130:6800/3526074991 78706 ==== osd_op_reply(70992 .bucket.meta.yournamehere-vkrdgfri5xv6mcl3-335:fbfb299a-87f1-4a5b-bc21-232931f78614.4156.335 [call,call,delete] v16'67570 uv67560 ondisk = -125 ((125) Operation canceled)) v8 ==== 320+0+0 (crc 0 0 0) 0x3ece7c80 con 0x15644f30
2023-08-11T12:29:15.280-0400 6bcb36c0 -1 req 10056301545426509648 0.152005360s s3:delete_bucket ERROR: could not remove bucket yournamehere-vkrdgfri5xv6mcl3-335

this ECANCELED error is likely due to the write from remove_bucket_config(). i just don't understand why this doesn't fail on every DeleteBucket request
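A hedged sketch of the conditional-write pattern behind these -125 results. The real code goes through RGWObjVersionTracker and the cls_version object class (the "version.check_conds" / "version.set" calls in the osd_op lines above); this standalone model, with hypothetical names, only shows why a write issued against a stale cached version comes back as ECANCELED.

#include <cerrno>
#include <string>

// toy stand-in for a versioned sysobj; objv plays the role of the cls_version
// counter that "version.set" bumps on every successful write
struct VersionedObject {
  int objv = 0;
  std::string payload;
};

// write only if the object still has the version the caller read (usually from
// the metadata cache); on mismatch the whole op is rejected, as with
// "version.check_conds"
int guarded_write(VersionedObject& obj, int expected_objv, const std::string& data)
{
  if (obj.objv != expected_objv) {
    return -ECANCELED;  // -125, matching the osd_op_reply lines above
  }
  obj.payload = data;
  ++obj.objv;
  return 0;
}

// if another writer bumped objv in between, every later write issued against
// the stale cached version (here both the remove_bucket_config() rewrite and
// the final delete) returns -ECANCELED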

Actions #4

Updated by Casey Bodley 9 months ago

p.s. the buckets in this test case don't even have a lifecycle policy, so we're running this stuff unnecessarily
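A tiny follow-on sketch, reusing the hypothetical BucketMeta/LcIndex/remove_lifecycle_config names from the sketch in the description: if the bucket never had a lifecycle policy (no RGW_ATTR_LC attribute), the delete path could skip the lifecycle cleanup, and the racy write, entirely.

#include <string>

void on_delete_bucket(LcIndex& lc, BucketMeta& meta, const std::string& bucket)
{
  // no "user.rgw.lc" attr means there is no lc state to clean up at all
  if (meta.attrs.count("user.rgw.lc") == 0) {
    return;
  }
  remove_lifecycle_config(lc, meta, bucket, /*bucket_is_being_deleted=*/true);
}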

Actions #5

Updated by Casey Bodley 9 months ago

Casey Bodley wrote:

this error is coming from the call to RGWLC::remove_bucket_config() in RadosBucket::remove_bucket(): https://github.com/ceph/ceph/blob/875dae76/src/rgw/driver/rados/rgw_sal_rados.cc#L453-L455
because that call unnecessarily uses the default argument bool merge_attrs = true, it rewrites the bucket metadata to strip RGW_ATTR_LC, and that write uses cls_version to bump the version number

this write could cause the later delete to fail with ECANCELED because the version number changed, but in the log it's the write in remove_bucket_config() itself that fails with ECANCELED, which implies that remove_bucket_config() is racing with some other update. i can't tell from the log what that other update is yet

Actions #6

Updated by Casey Bodley 9 months ago

i added some extra logging to the RGWObjVersionTracker to follow the cls_version updates. i see one of the put_acls requests bumping the version from 6->7, but the later delete_bucket request is checking for version 6. the delete_bucket starts after put_acls completes, so the requests themselves don't race

it seems to be a race in RGWSI_SysObj_Cache itself. while the put_acls requests are racing, i see the cache notifications arriving out of order:

2023-08-11T16:34:50.594-0400 1c7876c0 20 rgw watcher librados: RGWSI_SysObj_Cache received notify for objv _XqpmbIiCICgx1_zAdi60Nd9:7 of default.rgw.meta+root+.bucket.meta.yournamehere-3a276u7iwxaslg1c-242:40bd6ef6-ca4f-4f22-95d9-978bb0945b94.4156.242
...
2023-08-11T16:34:50.599-0400 1c7876c0 20 rgw watcher librados: RGWSI_SysObj_Cache received notify for objv _XqpmbIiCICgx1_zAdi60Nd9:6 of default.rgw.meta+root+.bucket.meta.yournamehere-3a276u7iwxaslg1c-242:40bd6ef6-ca4f-4f22-95d9-978bb0945b94.4156.242
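
One way to make the cache tolerate this, sketched with hypothetical names rather than the actual RGWSI_SysObj_Cache code: only apply a notification if it carries a newer objv than the cached copy, so a delayed ":6" notification can no longer overwrite the state that ":7" already installed.

#include <map>
#include <string>

struct CachedObj {
  std::string objv_tag;  // e.g. "_XqpmbIiCICgx1_zAdi60Nd9"
  int objv_ver = 0;      // e.g. 6 or 7
  std::string data;
};

struct SysObjCache {
  std::map<std::string, CachedObj> cache;

  void handle_notify(const std::string& name, const CachedObj& incoming)
  {
    auto it = cache.find(name);
    if (it != cache.end() &&
        it->second.objv_tag == incoming.objv_tag &&
        incoming.objv_ver <= it->second.objv_ver) {
      // stale or duplicate notification (the ":6" arriving after ":7" above):
      // drop it instead of regressing the cached version
      return;
    }
    cache[name] = incoming;
  }
};
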
Actions #7

Updated by Casey Bodley 9 months ago

  • Status changed from New to Triaged
  • Tags changed from lifecycle to lifecycle cache