Bug #63684

RGW segmentation fault when reading object permissions via the swift API

Added by Steve Taylor 6 months ago. Updated 2 months ago.

Status:
Pending Backport
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
swift website backport_processed
Backport:
reef squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The exact circumstances under which this is reproduced are not yet fully understood, but a swift get_obj() call attempts to read object permissions via rados and rgw segfaults. This has been observed in a deployment with three rgw daemons: when the first one hits this condition and segfaults, the request is retried on a second daemon, which segfaults as well, and the same thing then happens on the third daemon.

One note that may or may not be important is that all of the daemons in the cluster are Reef 18.2.0, but the OpenStack clients are using Quincy 17.2.6 libraries. Everything is running on Ubuntu 20.04.

An rgw log file from a daemon that experienced the segmentation fault with debug_rgw=20/20 is attached.


Files

ceph-rgw.log.gz (607 KB) - Debug log showing rgw segmentation fault - Steve Taylor, 11/29/2023 09:51 PM
rgw-swift.sh (1.37 KB) - Simple reproduction script - Steve Taylor, 02/02/2024 03:50 PM

Related issues 2 (1 open, 1 closed)

Copied to rgw - Backport #64833: reef: RGW segmentation fault when reading object permissions via the swift API (In Progress)
Copied to rgw - Backport #64834: squid: RGW segmentation fault when reading object permissions via the swift API (Resolved, Casey Bodley)
Actions #1

Updated by Casey Bodley 6 months ago

  • Assignee set to J. Eric Ivancich
Actions #2

Updated by J. Eric Ivancich 6 months ago

Adding some log info into the tracker:

-7> 2023-11-29T21:03:36.307+0000 7f5aa8ccb700 20 req 5758547824905073806 0.000000000s swift:list_bucket RGWSI_User_RADOS::read_user_info(): anonymous user
-6> 2023-11-29T21:03:36.307+0000 7f5aa8ccb700 2 req 5758547824905073806 0.000000000s swift:list_bucket recalculating target
-5> 2023-11-29T21:03:36.307+0000 7f5aa8ccb700 10 req 5758547824905073806 0.000000000s Starting retarget
-4> 2023-11-29T21:03:36.307+0000 7f5aa8ccb700 20 req 5758547824905073806 0.000000000s get_obj_state: rctx=0x557aaf3bf9d0 obj=aqua-ad2a9d62-TestContainer-86316939:aqua-ad2a9d62-TestObject-342663179 state=0x557ab23ad9e8 s->prefetch_data=1
-3> 2023-11-29T21:03:36.307+0000 7f5ad6d27700 10 req 5758547824905073806 0.000000000s manifest: total_size = 1024
-2> 2023-11-29T21:03:36.307+0000 7f5ad6d27700 20 req 5758547824905073806 0.000000000s get_obj_state: setting s->obj_tag to 315dc471-3e30-4e74-9343-c214f6e0490a.105679.12184763268865788700
-1> 2023-11-29T21:03:36.307+0000 7f5ad6d27700 2 req 5758547824905073806 0.000000000s swift:get_obj reading permissions
0> 2023-11-29T21:03:36.311+0000 7f5ad6d27700 -1 *** Caught signal (Segmentation fault) **
in thread 7f5ad6d27700 thread_name:radosgw
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f5b7df93420]
2: (rgw::sal::RadosObject::RadosReadOp::RadosReadOp(rgw::sal::RadosObject*, RGWObjectCtx*)+0x138) [0x557aa506e7d8]
3: (rgw::sal::RadosObject::get_read_op()+0x37) [0x557aa506ec27]
4: /usr/bin/radosgw(+0x90424f) [0x557aa4d9f24f]
5: (rgw_build_object_policies(DoutPrefixProvider const*, rgw::sal::Driver*, req_state*, bool, optional_yield)+0x25c) [0x557aa4da02dc]
6: (RGWHandler::do_read_permissions(RGWOp*, bool, optional_yield)+0x54) [0x557aa4da0374]
7: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0x499) [0x557aa4bab4a9]
8: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x25f1) [0x557aa4baeba1]
Actions #3

Updated by Steve Taylor 6 months ago

A colleague was able to find a simple sequence that reproduces this segmentation fault consistently. Perform the following operations via the swift api:
  1. Create a bucket
  2. Set up the bucket as a static web bucket (https://docs.openstack.org/swift/latest/api/static-website.html)
  3. Manually add an index.html file to the bucket
  4. Attempt to download index.html via the bucket url only (no "index.html" specification in the url)
  5. Segmentation fault

If "index.html" is specified in the url, the file is downloaded and the segfault does not occur. If no index.html is uploaded and it is auto-generated, then it can be downloaded with or without specifying "index.html" in the url (no segfault).

This is a pretty different process from the one that originally exhibited this behavior, but it seems to be the same segmentation fault. A rough command-line equivalent of the steps above is sketched below.
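For reference, here is a rough sketch of those steps using the swift CLI. The container name, local file name, and $STORAGE_URL endpoint are only placeholders; the ACL and metadata commands are the same ones used later in this thread.

# illustrative names; assumes the swift CLI is already authenticated against rgw
swift post testcontainer                                          # 1. create the container/bucket
swift post -r '.r:*,.rlistings' testcontainer                     # make it publicly readable/listable
swift post -m 'web-index:index.html' testcontainer                # 2. configure the static-website index
swift upload --object-name index.html testcontainer index.html    # 3. upload index.html manually
curl "$STORAGE_URL/testcontainer/"                                # 4/5. container URL only, no index.html -> segfault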

Actions #4

Updated by Steve Taylor 6 months ago

Quick update. The segfault only occurs when the url contains a trailing slash and index.html is omitted. If the trailing slash isn't present, the segfault doesn't occur.
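To illustrate (hypothetical endpoint and bucket name, matching the reproduction discussed later in this thread):

wget http://localhost:8000/swift/v1/bkt1/     # trailing slash, index.html omitted -> segfault
wget http://localhost:8000/swift/v1/bkt1      # no trailing slash -> no segfault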

Steve Taylor wrote:

A colleague was able to find a simple sequence that reproduces this segmentation fault consistently. Perform the following operations via the swift api:
  1. Create a bucket
  2. Set up the bucket as a static web bucket (https://docs.openstack.org/swift/latest/api/static-website.html)
  3. Manually add an index.html file to the bucket
  4. Attempt to download index.html via the bucket url only (no "index.html" specification in the url)
  5. Segmentation fault

If "index.html" is specified in the url, the file is downloaded and the segfault does not occur. If no index.html is uploaded and it is auto-generated, then it can be downloaded with or without specifying "index.html" in the url (no segfault).

This is a pretty different process than the one that originally exhibited this behavior, but it seems to be the same segmentation fault.

Actions #5

Updated by Casey Bodley 6 months ago

The segfault only occurs when the url contains a trailing slash and index.html is omitted. If the trailing slash isn't present, the segfault doesn't occur.

this reminds me of s3website crashes in the past that involved trailing slashes. i think that was https://tracker.ceph.com/issues/56281 (https://github.com/ceph/ceph/pull/46933)

Actions #6

Updated by Steve Taylor 6 months ago

Casey Bodley wrote:

The segfault only occurs when the url contains a trailing slash and index.html is omitted. If the trailing slash isn't present, the segfault doesn't occur.

this reminds me of s3website crashes in the past that involved trailing slashes. i think that was https://tracker.ceph.com/issues/56281 (https://github.com/ceph/ceph/pull/46933)

For what it's worth, the same test suite that reproduces this behavior for me on 18.2.0 has also been run on 17.2.6 many, many times and never exhibited the segfault that is seen consistently with 18.2.0. We upgraded from 17.2.6 to 18.2.0 and haven't tested 17.2.7.

Actions #7

Updated by J. Eric Ivancich 5 months ago

I've been unable to reproduce this error on 18.2.0. Here are my steps:

Add the following to ceph.conf:

    rgw_enable_apis = swift, swift_auth, s3website

The following commands (note: $auth produces the login credentials on the command-line):

swift upload --object-name index.html bkt1 foo.html
swift upload --object-name 401error.html bkt1 foo.html
swift upload --object-name 404error.html bkt1 foo.html
swift post $auth -r '.r:*,.rlistings' bkt1
swift post $auth -m 'web-index:index.html' bkt1
swift post $auth -m 'web-error:error.html' bkt1

wget http://localhost:8000/bkt1/

And it retrieves index.html.

Can you figure out what changes I'd need to make to reproduce your issue?

Actions #8

Updated by J. Eric Ivancich 5 months ago

  • Status changed from New to Need More Info
Actions #9

Updated by Steve Taylor 5 months ago

In my 18.2.0 environment, your steps without the three error-related commands reproduce the segfault. Just the index.html upload, the listing post, the index post, and then the wget segfaults. I wonder if it is somehow related to user permissions or something.

J. Eric Ivancich wrote:

I've been unable to reproduce this error on 18.2.0. Here are my steps:

Add the following to ceph.conf:
[...]

The following commands (note: $auth produces the login credentials on the command-line):

[...]

And it retrieves index.html.

Can you figure out what changes I'd need to make to reproduce your issue?

Actions #10

Updated by J. Eric Ivancich 5 months ago

I think my config isn't quite right....

I get:
2023-12-21T14:09:09.068-0500 7f0563de26c0 2 req 18245755200562823448 0.003000044s s3:get_obj reading permissions

Where your log contains:
2023-11-29T21:03:36.307+0000 7f5ad6d27700 2 req 5758547824905073806 0.000000000s swift:get_obj reading permissions

I've really never played w/ static websites, so I'm not sure where my config is off.... Any thoughts?
So somehow the s3 code is still processing my request.

Actions #11

Updated by Steve Taylor 5 months ago

I am seeing the following OpenStack Tempest tests fail consistently because of this segfault in my environment:
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_services.html#ContainerTest.test_create_container_with_remove_metadata_key
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_services.html#ContainerTest.test_create_container_with_remove_metadata_value

Hopefully that helps. For me this fails consistently with Swift/OpenStack, so I would agree with your assessment that your test using S3 is probably the difference.

J. Eric Ivancich wrote:

I think my config isn't quite right....

I get:
2023-12-21T14:09:09.068-0500 7f0563de26c0 2 req 18245755200562823448 0.003000044s s3:get_obj reading permissions

Where your log contains:
2023-11-29T21:03:36.307+0000 7f5ad6d27700 2 req 5758547824905073806 0.000000000s swift:get_obj reading permissions

I've really never played w/ static websites, so I'm not sure where my config is off.... Any thoughts?
So somehow the s3 code is still processing my request.

Actions #12

Updated by J. Eric Ivancich 5 months ago

Steve Taylor wrote:

I am seeing the following OpenStack Tempest tests fail consistently because of this segfault in my environment:
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_services.html#ContainerTest.test_create_container_with_remove_metadata_key
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_services.html#ContainerTest.test_create_container_with_remove_metadata_value

Hopefully that helps. For me this fails consistently with Swift/OpenStack, so I would agree with your assessment that your test using S3 is probably the difference.

I haven't been able to figure out how to configure my desktop test bench (i.e., with vstart.sh) to provide a swift static website the way yours does. I'll need some bullet-proof instructions there before I can continue.

Actions #13

Updated by Steve Taylor 5 months ago

It looks like the tempest failure report I was looking at was somewhat misleading. The tests I referenced previously failed because RGW had already segfaulted during the static web test:
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_staticweb.html#StaticWebTest.test_web_index

The reproduction I have been observing uses an automated test environment, so I will work on reproducing this completely manually so I can provide better instructions.

J. Eric Ivancich wrote:

Steve Taylor wrote:

I am seeing the following OpenStack Tempest tests fail consistently because of this segfault in my environment:
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_services.html#ContainerTest.test_create_container_with_remove_metadata_key
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_services.html#ContainerTest.test_create_container_with_remove_metadata_value

Hopefully that helps. For me this fails consistently with Swift/OpenStack, so I would agree with your assessment that your test using S3 is probably the difference.

I haven't been able to figure out how to configure my desktop test bench (i.e., with vstart.sh) to provide a swift static website the way yours does. I'll need some bullet-proof instructions there before I can continue.

Actions #14

Updated by Steve Taylor 4 months ago

J. Eric Ivancich wrote:

Steve Taylor wrote:

I am seeing the following OpenStack Tempest tests fail consistently because of this segfault in my environment:
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_services.html#ContainerTest.test_create_container_with_remove_metadata_key
https://docs.openstack.org/tempest/latest/_modules/object_storage/test_container_services.html#ContainerTest.test_create_container_with_remove_metadata_value

Hopefully that helps. For me this fails consistently with Swift/OpenStack, so I would agree with your assessment that your test using S3 is probably the difference.

I haven't been able to figure out how to configure my desktop test bench (i.e., with vstart.sh) to provide a swift static website the way yours does. I'll need some bullet-proof instructions there before I can continue.

I feel a little bit silly after all of the time I have spent trying to figure out how to reproduce this. It turns out it is quite simple. The steps you listed above for creating the static website are great. The only change needed is the url. Instead of http://localhost:8000/bkt1/, try http://localhost:8000/swift/v1/bkt1/.

I wrote a simple script that deploys a single-node cluster via cephadm, then configures a static website using the steps you used, and when I include "swift/v1" in the url as above, it causes rgw to segfault.
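In other words, a minimal sketch of the failing sequence, reusing the commands from your comment above ($auth again stands for the swift credential options):

swift upload --object-name index.html bkt1 foo.html
swift post $auth -r '.r:*,.rlistings' bkt1
swift post $auth -m 'web-index:index.html' bkt1
# requesting the container through the swift API path with a trailing slash crashes rgw:
wget http://localhost:8000/swift/v1/bkt1/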

Actions #15

Updated by Steve Taylor 4 months ago

Now that I have a simple set of steps to reproduce this reliably, I also tested it against 18.2.1 to make sure it hasn't already been fixed. The segfault is still present in 18.2.1.

Actions #16

Updated by Steve Taylor 4 months ago

Attaching a bash script that can deploy a single-node Ceph cluster (currently configured for 18.2.1) and reproduce this segfault. I am running this on a clean Ubuntu 22.04 VM as a user configured with passwordless sudo access.

For me, this script reproduces the segfault every time in just a few minutes. My VM is configured with 3 OSD disk devices that cephadm detects and deploys.

Actions #17

Updated by J. Eric Ivancich 4 months ago

Steve Taylor wrote:

Attaching a bash script that can deploy a single-node Ceph cluster (currently configured for 18.2.1) and reproduce this segfault. I am running this on a clean Ubuntu 22.04 VM as a user configured with passwordless sudo access.

For me, this script reproduces the segfault every time in just a few minutes. My VM is configured with 3 OSD disk devices that cephadm detects and deploys.

Thanks, I'll take a look!

Actions #18

Updated by Steve Taylor 4 months ago

After some debugging, it appears that this bug was introduced here: https://github.com/ceph/ceph/commit/c652f76c97337b39665bcf64b6e655275c3832e7#diff-8d5ab77cf6b24e95c2c7b352c52d1e254781f6c9e7fb3fd148bddc8a3f4321e7R2200

In the case of the swift handler, _source->get_bucket() here returns NULL and _source->get_bucket()->get_info() segfaults. Just a few frames down the call stack, the bucket info is available separately from the RADOS object's bucket pointer.

I'm not sure if it's the object's NULL bucket pointer or the expectation that it isn't NULL that's the problem, but this is the reason for the segfault.

Actions #19

Updated by Steve Taylor 3 months ago

Steve Taylor wrote:

After some debugging, it appears that this bug was introduced here: https://github.com/ceph/ceph/commit/c652f76c97337b39665bcf64b6e655275c3832e7#diff-8d5ab77cf6b24e95c2c7b352c52d1e254781f6c9e7fb3fd148bddc8a3f4321e7R2200

In the case of the swift handler, _source->get_bucket() here returns NULL and _source->get_bucket()->get_info() segfaults. Just a few frames down the call stack, the bucket info is available separately from the RADOS object's bucket pointer.

I'm not sure if it's the object's NULL bucket pointer or the expectation that it isn't NULL that's the problem, but this is the reason for the segfault.

I set a breakpoint at the point where the segfault occurs and queried the bucket via the "standard" URL without "/swift/v1" to examine the state of _source, and it has a valid bucket pointer with valid bucket info. That explains why the segfault doesn't occur for that test.

Given that analysis, it seems that the bug is _source's NULL bucket pointer when the swift handler is used.

Actions #20

Updated by Steve Taylor 3 months ago

Steve Taylor wrote:

Steve Taylor wrote:

After some debugging, it appears that this bug was introduced here: https://github.com/ceph/ceph/commit/c652f76c97337b39665bcf64b6e655275c3832e7#diff-8d5ab77cf6b24e95c2c7b352c52d1e254781f6c9e7fb3fd148bddc8a3f4321e7R2200

In the case of the swift handler, _source->get_bucket() here returns NULL and _source->get_bucket()->get_info() segfaults. Just a few frames down the call stack, the bucket info is available separately from the RADOS object's bucket pointer.

I'm not sure if it's the object's NULL bucket pointer or the expectation that it isn't NULL that's the problem, but this is the reason for the segfault.

I set a breakpoint at the point where the segfault occurs and queried the bucket via the "standard" URL without "/swift/v1" to examine the state of _source, and it has a valid bucket pointer with valid bucket info. That explains why the segfault doesn't occur for that test.

Given that analysis, it seems that the bug is _source's NULL bucket pointer when the swift handler is used.

Disregard that last statement. I realized I made a mistake in my debugging and the result was inconclusive. The segfault analysis is still valid, but I'm not sure about the S3 path.

Actions #21

Updated by Steve Taylor 3 months ago

After some additional testing, it doesn't look like the RGWBucketInfo whose attempted retrieval is causing the segfault is even being used after being passed to the RGWRados::Object constructor on line 2200 of rgw_sal_rados.cc. If I replace that argument with the following, the segfault goes away and the static website test passes as expected.

_source->get_bucket() ? _source->get_bucket()->get_info() : *(new RGWBucketInfo())

That obviously creates a memory leak and isn't a good fix, but it might be a decent workaround until there is a proper fix. It is also possible that this has other, undesired side effects. It seems to address this segfault, however.

Actions #22

Updated by Steve Taylor 3 months ago

Steve Taylor wrote:

After some additional testing, it doesn't look like the RGWBucketInfo whose attempted retrieval is causing the segfault is even being used after being passed to the RGWRados::Object constructor on line 2200 of rgw_sal_rados.cc. If I replace that argument with the following, the segfault goes away and the static website test passes as expected.

_source->get_bucket() ? _source->get_bucket()->get_info() : *(new RGWBucketInfo())

That obviously creates a memory leak and isn't a good fix, but it might be a decent workaround until there is a proper fix. It is also possible that this has other, undesired side effects. It seems to address this segfault, however.

Another minor update. Not being familiar with thread management and things in RGW, I used the new operator in the workaround above to make sure the RGWBucketInfo didn't go out of scope while someone still had a reference to it. I have done some testing with RGWBucketInfo() in place of *(new RGWBucketInfo()), and it seems to work equally well. Still a workaround and likely not a proper fix, but the memory leak doesn't seem to be required.

Actions #23

Updated by J. Eric Ivancich 3 months ago

Steve Taylor wrote:

After some additional testing, it doesn't look like the RGWBucketInfo whose attempted retrieval is causing the segfault is even being used after being passed to the RGWRados::Object constructor on line 2200 of rgw_sal_rados.cc. If I replace that argument with the following, the segfault goes away and the static website test passes as expected.

_source->get_bucket() ? _source->get_bucket()->get_info() : *(new RGWBucketInfo())

That obviously creates a memory leak and isn't a good fix, but it might be a decent workaround until there is a proper fix. It is also possible that this has other, undesired side effects. It seems to address this segfault, however.

Thanks for this work. This isn't quite a solution, but it points to the issue at hand. I'll have to look into why `get_bucket()` is apparently returning a nullptr.

Actions #24

Updated by J. Eric Ivancich 3 months ago

J. Eric Ivancich wrote:

Steve Taylor wrote:

After some additional testing, it doesn't look like the RGWBucketInfo whose attempted retrieval is causing the segfault is even being used after being passed to the RGWRados::Object constructor on line 2200 of rgw_sal_rados.cc. If I replace that argument with the following, the segfault goes away and the static website test passes as expected.

_source->get_bucket() ? _source->get_bucket()->get_info() : *(new RGWBucketInfo())

That obviously creates a memory leak and isn't a good fix, but it might be a decent workaround until there is a proper fix. It is also possible that this has other, undesired side effects. It seems to address this segfault, however.

Thanks for this work. This isn't quite a solution, but it points to the issue at hand. I'll have to look into why `get_bucket()` is apparently returning a nullptr.

Steve, since you've got this running in a debugger, perhaps you can print out the entire stack trace when you get to that point and get_bucket() is returning a nullptr. I would think _source would be fully initialized at this point such that get_bucket() would not return a nullptr.

Eric

Actions #25

Updated by Steve Taylor 3 months ago

J. Eric Ivancich wrote:

J. Eric Ivancich wrote:

Steve Taylor wrote:

After some additional testing, it doesn't look like the RGWBucketInfo whose attempted retrieval is causing the segfault is even being used after being passed to the RGWRados::Object constructor on line 2200 of rgw_sal_rados.cc. If I replace that argument with the following, the segfault goes away and the static website test passes as expected.

_source->get_bucket() ? _source->get_bucket()->get_info() : *(new RGWBucketInfo())

That obviously creates a memory leak and isn't a good fix, but it might be a decent workaround until there is a proper fix. It is also possible that this has other, undesired side effects. It seems to address this segfault, however.

Thanks for this work. This isn't quite a solution, but it points to the issue at hand. I'll have to look into why `get_bucket()` is apparently returning a nullptr.

Steve, since you've got this running in a debugger, perhaps you can print out the entire stack trace when you get to that point and get_bucket() is returning a nullptr. I would think _source would be fully initialized at this point such that get_bucket() would not return a nullptr.

Eric

#0 0x000055555671c07d in rgw::sal::RadosObject::RadosReadOp::RadosReadOp (this=0x55555f4b4d00,
source=0x55555f681900, _rctx=<optimized out>)
at /home/steve/workspace/gitlab/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2200
#1 0x000055555671c7d6 in std::make_unique<rgw::sal::RadosObject::RadosReadOp, rgw::sal::RadosObject*, RGWObjectCtx*&>
() at /usr/include/c++/11/bits/unique_ptr.h:962
#2 rgw::sal::RadosObject::get_read_op (this=0x55555f681900)
at /home/steve/workspace/gitlab/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2193
#3 0x00005555563d40c5 in get_obj_policy_from_attr (dpp=<optimized out>, cct=0x5555581d6000,
driver=driver@entry=0x5555580e5720, bucket_info=..., bucket_attrs=std::map with 2 elements = {...},
policy=policy@entry=0x55555f748c60, storage_class=0x0, obj=0x55555f681900, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:265
#4 0x00005555563d4c81 in read_obj_policy (dpp=<optimized out>, dpp@entry=0x55555f429c00,
driver=driver@entry=0x5555580e5720, s=s@entry=0x7ffec6049ba0, bucket_info=...,
bucket_attrs=std::map with 2 elements = {...}, acl=acl@entry=0x55555f748c60, storage_class=0x0, policy=...,
bucket=0x55555f72ad00, object=0x55555f681900, y=..., copy_src=false)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:405
#5 0x00005555563d588e in rgw_build_object_policies (dpp=0x55555f429c00, driver=0x5555580e5720, s=0x7ffec6049ba0,
prefetch_data=<optimized out>, y=...) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:667
#6 0x00005555563f069b in RGWHandler::do_read_permissions (this=this@entry=0x55555f726000, op=<optimized out>,
op@entry=0x55555f429c00, only_bucket=only_bucket@entry=false, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:8139
#7 0x00005555564427b4 in RGWHandler_REST::read_permissions (this=0x55555f726000, op_obj=0x55555f429c00, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_rest.cc:1932
#8 0x00005555561d6b6e in rgw_process_authenticated (handler=handler@entry=0x55555f726000,
op=@0x7ffec6049a28: 0x55555f429c00, req=req@entry=0x7ffec604a730, s=0x7ffec6049ba0, y=..., driver=0x5555580e5720,
skip_retarget=false) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_process.cc:196
#9 0x00005555561dc4b8 in process_request (penv=..., req=req@entry=0x7ffec604a730, frontend_prefix="",
client_io=client_io@entry=0x7ffec604a7e0, yield=..., scheduler=0x55555d2cfaa8, user=0x7ffec604a940,
latency=0x7ffec604a708, http_ret=0x7ffec604a704) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_process.cc:392
#10 0x000055555610c223 in (anonymous namespace)::handle_connection<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > > (context=..., env=..., stream=...,
timeout=..., header_limit=16384, buffer=..., is_ssl=false, pause_mutex=..., scheduler=0x55555d2cfaa8,
uri_prefix="", ec=..., yield=...) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_asio_frontend.cc:284
#11 0x000055555610cde2 in operator() (__closure=__closure@entry=0x55555f6d42b8, yield=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_asio_frontend.cc:1055
#12 0x000055555610cfab in operator() (__closure=__closure@entry=0x7ffec604bf18, c=...)
at /home/steve/workspace/gitlab/ceph/src/spawn/include/spawn/impl/spawn.hpp:390
#13 0x000055555610d1bc in std::__invoke_impl<boost::context::continuation, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#14 std::__invoke<spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__fn=...) at /usr/include/c++/11/bits/invoke.h:97
#15 std::invoke<spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__fn=...) at /usr/include/c++/11/functional:98
#16 boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)> >::run (fctx=<optimized out>, this=0x7ffec604bf00) at /home/steve/workspace/gitlab/ceph/build/boost/include/boost/context/continuation_fcontext.hpp:143
#17 boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)> > >(boost::context::detail::transfer_t) (t=...) at /home/steve/workspace/gitlab/ceph/build/boost/include/boost/context/continuation_fcontext.hpp:80
#18 0x00005555572dd9ff in make_fcontext ()
#19 0x0000000000000000 in ?? ()

Actions #26

Updated by Steve Taylor 3 months ago

Steve Taylor wrote:

J. Eric Ivancich wrote:

J. Eric Ivancich wrote:

Steve Taylor wrote:

After some additional testing, it doesn't look like the RGWBucketInfo whose attempted retrieval is causing the segfault is even being used after being passed to the RGWRados::Object constructor on line 2200 of rgw_sal_rados.cc. If I replace that argument with the following, the segfault goes away and the static website test passes as expected.

_source->get_bucket() ? _source->get_bucket()->get_info() : *(new RGWBucketInfo())

That obviously creates a memory leak and isn't a good fix, but it might be a decent workaround until there is a proper fix. It is also possible that this has other, undesired side effects. It seems to address this segfault, however.

Thanks for this work. This isn't quite a solution, but it points to the issue at hand. I'll have to look into why `get_bucket()` is apparently returning a nullptr.

Steve, since you've got this running in a debugger, perhaps you can print out the entire stack trace when you get to that point and get_bucket() is returning a nullptr. I would think _source would be fully initialized at this point such that get_bucket() would not return a nullptr.

Eric

#0 0x000055555671c07d in rgw::sal::RadosObject::RadosReadOp::RadosReadOp (this=0x55555f4b4d00,
source=0x55555f681900, _rctx=<optimized out>)
at /home/steve/workspace/gitlab/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2200
#1 0x000055555671c7d6 in std::make_unique<rgw::sal::RadosObject::RadosReadOp, rgw::sal::RadosObject*, RGWObjectCtx*&>
() at /usr/include/c++/11/bits/unique_ptr.h:962
#2 rgw::sal::RadosObject::get_read_op (this=0x55555f681900)
at /home/steve/workspace/gitlab/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2193
#3 0x00005555563d40c5 in get_obj_policy_from_attr (dpp=<optimized out>, cct=0x5555581d6000,
driver=driver@entry=0x5555580e5720, bucket_info=..., bucket_attrs=std::map with 2 elements = {...},
policy=policy@entry=0x55555f748c60, storage_class=0x0, obj=0x55555f681900, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:265
#4 0x00005555563d4c81 in read_obj_policy (dpp=<optimized out>, dpp@entry=0x55555f429c00,
driver=driver@entry=0x5555580e5720, s=s@entry=0x7ffec6049ba0, bucket_info=...,
bucket_attrs=std::map with 2 elements = {...}, acl=acl@entry=0x55555f748c60, storage_class=0x0, policy=...,
bucket=0x55555f72ad00, object=0x55555f681900, y=..., copy_src=false)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:405
#5 0x00005555563d588e in rgw_build_object_policies (dpp=0x55555f429c00, driver=0x5555580e5720, s=0x7ffec6049ba0,
prefetch_data=<optimized out>, y=...) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:667
#6 0x00005555563f069b in RGWHandler::do_read_permissions (this=this@entry=0x55555f726000, op=<optimized out>,
op@entry=0x55555f429c00, only_bucket=only_bucket@entry=false, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:8139
#7 0x00005555564427b4 in RGWHandler_REST::read_permissions (this=0x55555f726000, op_obj=0x55555f429c00, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_rest.cc:1932
#8 0x00005555561d6b6e in rgw_process_authenticated (handler=handler@entry=0x55555f726000,
op=@0x7ffec6049a28: 0x55555f429c00, req=req@entry=0x7ffec604a730, s=0x7ffec6049ba0, y=..., driver=0x5555580e5720,
skip_retarget=false) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_process.cc:196
#9 0x00005555561dc4b8 in process_request (penv=..., req=req@entry=0x7ffec604a730, frontend_prefix="",
client_io=client_io@entry=0x7ffec604a7e0, yield=..., scheduler=0x55555d2cfaa8, user=0x7ffec604a940,
latency=0x7ffec604a708, http_ret=0x7ffec604a704) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_process.cc:392
#10 0x000055555610c223 in (anonymous namespace)::handle_connection<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > > (context=..., env=..., stream=...,
timeout=..., header_limit=16384, buffer=..., is_ssl=false, pause_mutex=..., scheduler=0x55555d2cfaa8,
uri_prefix="", ec=..., yield=...) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_asio_frontend.cc:284
#11 0x000055555610cde2 in operator() (__closure=__closure@entry=0x55555f6d42b8, yield=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_asio_frontend.cc:1055
#12 0x000055555610cfab in operator() (__closure=__closure@entry=0x7ffec604bf18, c=...)
at /home/steve/workspace/gitlab/ceph/src/spawn/include/spawn/impl/spawn.hpp:390
#13 0x000055555610d1bc in std::__invoke_impl<boost::context::continuation, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#14 std::__invoke<spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__fn=...) at /usr/include/c++/11/bits/invoke.h:97
#15 std::invoke<spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__fn=...) at /usr/include/c++/11/functional:98
#16 boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)> >::run (fctx=<optimized out>, this=0x7ffec604bf00) at /home/steve/workspace/gitlab/ceph/build/boost/include/boost/context/continuation_fcontext.hpp:143
#17 boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)> > >(boost::context::detail::transfer_t) (t=...) at /home/steve/workspace/gitlab/ceph/build/boost/include/boost/context/continuation_fcontext.hpp:80
#18 0x00005555572dd9ff in make_fcontext ()
#19 0x0000000000000000 in ?? ()

When I hit this segfault and see this stack trace, evaluating _source->get_bucket() in the debugger gives me 0.

Actions #27

Updated by Steve Taylor 3 months ago

Steve Taylor wrote:

Steve Taylor wrote:

J. Eric Ivancich wrote:

J. Eric Ivancich wrote:

Steve Taylor wrote:

After some additional testing, it doesn't look like the RGWBucketInfo whose attempted retrieval is causing the segfault is even being used after being passed to the RGWRados::Object constructor on line 2200 of rgw_sal_rados.cc. If I replace that argument with the following, the segfault goes away and the static website test passes as expected.

_source->get_bucket() ? _source->get_bucket()->get_info() : *(new RGWBucketInfo())

That obviously creates a memory leak and isn't a good fix, but it might be a decent workaround until there is a proper fix. It is also possible that this has other, undesired side effects. It seems to address this segfault, however.

Thanks for this work. This isn't quite a solution, but it points to the issue at hand. I'll have to look into why `get_bucket()` is apparently returning a nullptr.

Steve, since you've got this running in a debugger, perhaps you can print out the entire stack trace when you get to that point and get_bucket() is returning a nullptr. I would think _source would be fully initialized at this point such that get_bucket() would not return a nullptr.

Eric

#0 0x000055555671c07d in rgw::sal::RadosObject::RadosReadOp::RadosReadOp (this=0x55555f4b4d00,
source=0x55555f681900, _rctx=<optimized out>)
at /home/steve/workspace/gitlab/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2200
#1 0x000055555671c7d6 in std::make_unique<rgw::sal::RadosObject::RadosReadOp, rgw::sal::RadosObject*, RGWObjectCtx*&>
() at /usr/include/c++/11/bits/unique_ptr.h:962
#2 rgw::sal::RadosObject::get_read_op (this=0x55555f681900)
at /home/steve/workspace/gitlab/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2193
#3 0x00005555563d40c5 in get_obj_policy_from_attr (dpp=<optimized out>, cct=0x5555581d6000,
driver=driver@entry=0x5555580e5720, bucket_info=..., bucket_attrs=std::map with 2 elements = {...},
policy=policy@entry=0x55555f748c60, storage_class=0x0, obj=0x55555f681900, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:265
#4 0x00005555563d4c81 in read_obj_policy (dpp=<optimized out>, dpp@entry=0x55555f429c00,
driver=driver@entry=0x5555580e5720, s=s@entry=0x7ffec6049ba0, bucket_info=...,
bucket_attrs=std::map with 2 elements = {...}, acl=acl@entry=0x55555f748c60, storage_class=0x0, policy=...,
bucket=0x55555f72ad00, object=0x55555f681900, y=..., copy_src=false)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:405
#5 0x00005555563d588e in rgw_build_object_policies (dpp=0x55555f429c00, driver=0x5555580e5720, s=0x7ffec6049ba0,
prefetch_data=<optimized out>, y=...) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:667
#6 0x00005555563f069b in RGWHandler::do_read_permissions (this=this@entry=0x55555f726000, op=<optimized out>,
op@entry=0x55555f429c00, only_bucket=only_bucket@entry=false, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_op.cc:8139
#7 0x00005555564427b4 in RGWHandler_REST::read_permissions (this=0x55555f726000, op_obj=0x55555f429c00, y=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_rest.cc:1932
#8 0x00005555561d6b6e in rgw_process_authenticated (handler=handler@entry=0x55555f726000,
op=@0x7ffec6049a28: 0x55555f429c00, req=req@entry=0x7ffec604a730, s=0x7ffec6049ba0, y=..., driver=0x5555580e5720,
skip_retarget=false) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_process.cc:196
#9 0x00005555561dc4b8 in process_request (penv=..., req=req@entry=0x7ffec604a730, frontend_prefix="",
client_io=client_io@entry=0x7ffec604a7e0, yield=..., scheduler=0x55555d2cfaa8, user=0x7ffec604a940,
latency=0x7ffec604a708, http_ret=0x7ffec604a704) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_process.cc:392
#10 0x000055555610c223 in (anonymous namespace)::handle_connection<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > > (context=..., env=..., stream=...,
timeout=..., header_limit=16384, buffer=..., is_ssl=false, pause_mutex=..., scheduler=0x55555d2cfaa8,
uri_prefix="", ec=..., yield=...) at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_asio_frontend.cc:284
#11 0x000055555610cde2 in operator() (__closure=__closure@entry=0x55555f6d42b8, yield=...)
at /home/steve/workspace/gitlab/ceph/src/rgw/rgw_asio_frontend.cc:1055
#12 0x000055555610cfab in operator() (__closure=__closure@entry=0x7ffec604bf18, c=...)
at /home/steve/workspace/gitlab/ceph/src/spawn/include/spawn/impl/spawn.hpp:390
#13 0x000055555610d1bc in std::__invoke_impl<boost::context::continuation, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#14 std::__invoke<spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__fn=...) at /usr/include/c++/11/bits/invoke.h:97
#15 std::invoke<spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)>&, boost::context::continuation> (__fn=...) at /usr/include/c++/11/functional:98
#16 boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)> >::run (fctx=<optimized out>, this=0x7ffec604bf00) at /home/steve/workspace/gitlab/ceph/build/boost/include/boost/context/continuation_fcontext.hpp:143
#17 boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0> > >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(yield_context)>, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::<lambda(boost::context::continuation&&)> > >(boost::context::detail::transfer_t) (t=...) at /home/steve/workspace/gitlab/ceph/build/boost/include/boost/context/continuation_fcontext.hpp:80
#18 0x00005555572dd9ff in make_fcontext ()
#19 0x0000000000000000 in ?? ()

When I hit this segfault and see this stack trace, evaluating _source->get_bucket() in the debugger gives me 0.

And just so it's clear, the radosgw binary that I'm running in the debugger is one I built with debug symbols from a downstream Ceph fork with the v18.2.1 tag checked out. The automated test environment that I originally saw this fail in is using 18.2.0 without debug symbols from packages installed via apt from download.ceph.com.

Actions #28

Updated by Peter Razumovsky 3 months ago

We are facing the same issue on ceph v18.2.1 (rook v1.12.10)

debug    -35> 2024-03-06T11:51:19.896+0000 7f3b9974b640  1 ====== starting new request req=0x7f3b91bb46f0 =====
debug    -34> 2024-03-06T11:51:19.896+0000 7f3b9974b640  2 req 3524652451431343238 0.000000000s initializing for trans_id = tx0000030ea14baec466486-0065e858b7-35d2e-openstack-store
debug    -33> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s rgw api priority: s3=8 s3website=7
debug    -32> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s host=openstack-store.it.just.works
debug    -31> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s subdomain= domain=openstack-store.it.just.works in_hosted_domain=1 in_hosted_domain_s3website=0
debug    -30> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s final domain/bucket subdomain= domain=openstack-store.it.just.works in_hosted_domain=1 in_hosted_domain_s3website=0 s->info.domain=openstack-store.it.just.works s->info.request_uri=/swift/v1/AUTH_e72098eae6174bfb9a2a90f9d0b3b0fa/tempest-TestContainer-617926801/
debug    -29> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s ver=v1 first=tempest-TestContainer-617926801 req=
debug    -28> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s get_handler handler=28RGWHandler_REST_Bucket_SWIFT
debug    -27> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s handler=28RGWHandler_REST_Bucket_SWIFT
debug    -26> 2024-03-06T11:51:19.896+0000 7f3b9974b640  2 req 3524652451431343238 0.000000000s getting op 0
debug    -25> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s cache get: name=openstack-store.rgw.log++script.prerequest. : hit (negative entry)
debug    -24> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s swift:list_bucket scheduling with throttler client=3 cost=1
debug    -23> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s swift:list_bucket op=28RGWListBucket_ObjStore_SWIFT
debug    -22> 2024-03-06T11:51:19.896+0000 7f3b9974b640  2 req 3524652451431343238 0.000000000s swift:list_bucket verifying requester
debug    -21> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket rgw::auth::swift::DefaultStrategy: trying rgw::auth::swift::TempURLEngine
debug    -20> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket rgw::auth::swift::TempURLEngine denied with reason=-13
debug    -19> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket rgw::auth::swift::DefaultStrategy: trying rgw::auth::swift::SignedTokenEngine
debug    -18> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket rgw::auth::swift::SignedTokenEngine denied with reason=-1
debug    -17> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket rgw::auth::swift::DefaultStrategy: trying rgw::auth::keystone::TokenEngine
debug    -16> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket rgw::auth::keystone::TokenEngine denied with reason=-13
debug    -15> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket rgw::auth::swift::DefaultStrategy: trying rgw::auth::swift::SwiftAnonymousEngine
debug    -14> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket rgw::auth::swift::SwiftAnonymousEngine granted access
debug    -13> 2024-03-06T11:51:19.896+0000 7f3b9974b640  2 req 3524652451431343238 0.000000000s swift:list_bucket normalizing buckets and tenants
debug    -12> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s s->object=<NULL> s->bucket=e72098eae6174bfb9a2a90f9d0b3b0fa/tempest-TestContainer-617926801
debug    -11> 2024-03-06T11:51:19.896+0000 7f3b9974b640  2 req 3524652451431343238 0.000000000s swift:list_bucket init permissions
debug    -10> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s swift:list_bucket cache get: name=openstack-store.rgw.meta+root+e72098eae6174bfb9a2a90f9d0b3b0fa/tempest-TestContainer-617926801 : hit (requested=0x11, cached=0x17)
debug     -9> 2024-03-06T11:51:19.896+0000 7f3b9974b640 15 req 3524652451431343238 0.000000000s swift:list_bucket decode_policy Read AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>e72098eae6174bfb9a2a90f9d0b3b0fa$e72098eae6174bfb9a2a90f9d0b3b0fa</ID><DisplayName>tempest-StaticWebTest-914150072</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>e72098eae6174bfb9a2a90f9d0b3b0fa$e72098eae6174bfb9a2a90f9d0b3b0fa</ID><DisplayName>tempest-StaticWebTest-914150072</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>
debug     -8> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s swift:list_bucket cache get: name=openstack-store.rgw.meta+users.uid+e72098eae6174bfb9a2a90f9d0b3b0fa$e72098eae6174bfb9a2a90f9d0b3b0fa : hit (requested=0x13, cached=0x17)
debug     -7> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s swift:list_bucket RGWSI_User_RADOS::read_user_info(): anonymous user
debug     -6> 2024-03-06T11:51:19.896+0000 7f3b9974b640  2 req 3524652451431343238 0.000000000s swift:list_bucket recalculating target
debug     -5> 2024-03-06T11:51:19.896+0000 7f3b9974b640 10 req 3524652451431343238 0.000000000s Starting retarget
debug     -4> 2024-03-06T11:51:19.896+0000 7f3b9974b640 20 req 3524652451431343238 0.000000000s get_obj_state: rctx=0x55c6ddaa1960 obj=tempest-TestContainer-617926801:tempest-TestObject-915263681 state=0x55c6ddd919e8 s->prefetch_data=1
debug     -3> 2024-03-06T11:51:19.900+0000 7f3c0982b640 10 req 3524652451431343238 0.003999995s manifest: total_size = 1024
debug     -2> 2024-03-06T11:51:19.900+0000 7f3c0982b640 20 req 3524652451431343238 0.003999995s get_obj_state: setting s->obj_tag to 6202cd53-54f0-4b8b-9604-fa663854712f.220462.6087511773425337581
debug     -1> 2024-03-06T11:51:19.900+0000 7f3c0982b640  2 req 3524652451431343238 0.003999995s swift:get_obj reading permissions
debug      0> 2024-03-06T11:51:19.900+0000 7f3c0982b640 -1 *** Caught signal (Segmentation fault) **
 in thread 7f3c0982b640 thread_name:radosgw

 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
 1: /lib64/libc.so.6(+0x3e6f0) [0x7f3c47c856f0]
 2: (rgw::sal::RadosObject::RadosReadOp::RadosReadOp(rgw::sal::RadosObject*, RGWObjectCtx*)+0xd4) [0x55c6d7807664]
 3: (rgw::sal::RadosObject::get_read_op()+0x37) [0x55c6d7807af7]
 4: radosgw(+0x5506bd) [0x55c6d75da6bd] 
 5: (rgw_build_object_policies(DoutPrefixProvider const*, rgw::sal::Driver*, req_state*, bool, optional_yield)+0x245) [0x55c6d75db255]
 6: (RGWHandler::do_read_permissions(RGWOp*, bool, optional_yield)+0x54) [0x55c6d76158d4]
 7: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0x34d) [0x55c6d749f2bd]
 8: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x1039) [0x55c6d74a2f39]
 9: radosgw(+0xb4c99f) [0x55c6d7bd699f] 
 10: radosgw(+0x37ad76) [0x55c6d7404d76]
 11: make_fcontext()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Failing tempest test:

{0} tempest.api.object_storage.test_container_staticweb.StaticWebTest.test_web_index [0.492110s] ... FAILED                                                   

Captured traceback:                                                                                                                                           
~~~~~~~~~~~~~~~~~~~                                                                                                                                           
    Traceback (most recent call last):                                                                                                                        

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper                                            
    return func(*func_args, **func_kwargs)                                                                                                                    

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/api/object_storage/test_container_staticweb.py", line 65, in test_web_index               
    resp, body = self.account_client.request("GET",                                                                                                           

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/common/rest_client.py", line 750, in request                                          
    self._error_checker(resp, resp_body)                                                                                                                      

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/common/rest_client.py", line 934, in _error_checker                                   
    raise exceptions.UnexpectedResponseCode(str(resp.status),                                                                                                 

    tempest.lib.exceptions.UnexpectedResponseCode: Unexpected response code received                                                                          
Details: 502

Actions #29

Updated by Peter Razumovsky 3 months ago

Do we have any workaround for this?

Actions #30

Updated by Peter Razumovsky 3 months ago

I see this was already fixed once in https://tracker.ceph.com/issues/56029. I am wondering why it is happening again; maybe this part was re-implemented?

Actions #31

Updated by Peter Razumovsky 3 months ago

It seems https://github.com/ceph/ceph/commit/db8d1d455c7f41b2527fb79ab510f186a7d63109 was somehow lost during the Ceph Reef release process:

19:24:01:~/git/ceph$ git log v17.2.7 | grep db8d1d455c7f41b2527fb79ab510f186a7d63109
commit db8d1d455c7f41b2527fb79ab510f186a7d63109
19:24:06:~/git/ceph$ git log v18.2.1 | grep db8d1d455c7f41b2527fb79ab510f186a7d63109
19:24:13:~/git/ceph$ git log v18.2.2 | grep db8d1d455c7f41b2527fb79ab510f186a7d63109
19:24:18:~/git/ceph$ 
Actions #32

Updated by Casey Bodley 3 months ago

  • Status changed from Need More Info to Fix Under Review
  • Backport set to reef squid
  • Pull request ID set to 56003
Actions #33

Updated by Casey Bodley 3 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Tags set to swift website
Actions #34

Updated by Backport Bot 3 months ago

  • Copied to Backport #64833: reef: RGW segmentation fault when reading object permissions via the swift API added
Actions #35

Updated by Backport Bot 3 months ago

  • Copied to Backport #64834: squid: RGW segmentation fault when reading object permissions via the swift API added
Actions #36

Updated by Backport Bot 3 months ago

  • Tags changed from swift website to swift website backport_processed