Bug #64571
lifecycle transition crashes since reloading bucket attrs for notification
Description
regression from https://github.com/ceph/ceph/pull/52114, whose test results included these failures https://pulpito.ceph.com/yuvalif-2024-01-30_08:46:48-rgw-wip-end2end-tracing-distro-default-smithi/
example rgw/lifecycle job from http://qa-proxy.ceph.com/teuthology/yuriw-2024-02-23_19:55:04-rgw-main-distro-default-smithi/7572634/teuthology.log
2024-02-24T03:30:13.667 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_set_invalid_date PASSED [ 76%]
2024-02-24T03:30:23.431 INFO:tasks.rgw.client.0.smithi171.stderr:daemon-helper: command crashed with signal 11
2024-02-24T03:30:28.964 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-02-24T03:30:34.468 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-02-24T03:30:39.972 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-02-24T03:30:45.475 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-02-24T03:30:50.979 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~22s
2024-02-24T03:30:56.483 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~28s
2024-02-24T03:31:01.986 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~33s
2024-02-24T03:31:05.290 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition FAILED [ 76%]
2024-02-24T03:31:05.291 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition ERROR [ 76%]
2024-02-24T03:31:07.490 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~39s
2024-02-24T03:31:12.994 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~44s
2024-02-24T03:31:18.497 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~50s
2024-02-24T03:31:20.174 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_single_rule_multi_trans FAILED [ 76%]
2024-02-24T03:31:20.174 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_single_rule_multi_trans ERROR [ 76%]
2024-02-24T03:31:24.001 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~55s
2024-02-24T03:31:29.506 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~61s
2024-02-24T03:31:35.009 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~66s
2024-02-24T03:31:36.263 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent_transition FAILED [ 76%]
2024-02-24T03:31:36.263 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent_transition ERROR [ 76%]
2024-02-24T03:31:40.513 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~72s
2024-02-24T03:31:46.017 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~77s
2024-02-24T03:31:48.736 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_transition FAILED [ 77%]
2024-02-24T03:31:48.736 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_transition ERROR [ 77%]
example rgw/cloud-transition job from http://qa-proxy.ceph.com/teuthology/yuriw-2024-02-23_19:55:04-rgw-main-distro-default-smithi/7572619/teuthology.log
2024-02-24T03:12:02.182 INFO:tasks.rgw.client.0.smithi106.stderr:daemon-helper: command crashed with signal 11
2024-02-24T03:12:05.128 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-02-24T03:12:10.732 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-02-24T03:12:16.337 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-02-24T03:12:21.947 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-02-24T03:12:27.553 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~22s
2024-02-24T03:12:33.157 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~28s
2024-02-24T03:12:38.761 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~34s
2024-02-24T03:12:44.365 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~39s
2024-02-24T03:12:49.968 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~45s
2024-02-24T03:12:55.572 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~50s
2024-02-24T03:13:01.177 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~56s
2024-02-24T03:13:06.781 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~62s
2024-02-24T03:13:12.386 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~67s
2024-02-24T03:13:17.990 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~73s
2024-02-24T03:13:23.594 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~78s
2024-02-24T03:13:29.198 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~84s
2024-02-24T03:13:34.803 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~90s
2024-02-24T03:13:40.407 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~95s
2024-02-24T03:13:42.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition FAILED [ 25%]
2024-02-24T03:13:42.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition ERROR [ 25%]
2024-02-24T03:13:46.011 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~101s
2024-02-24T03:13:51.615 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~106s
2024-02-24T03:13:57.219 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~112s
2024-02-24T03:13:59.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_multiple_transition FAILED [ 50%]
2024-02-24T03:13:59.824 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_multiple_transition ERROR [ 50%]
2024-02-24T03:14:02.823 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~118s
2024-02-24T03:14:08.427 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~123s
2024-02-24T03:14:14.031 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~129s
2024-02-24T03:14:19.635 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~135s
2024-02-24T03:14:21.200 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_cloud_transition FAILED [ 75%]
2024-02-24T03:14:21.200 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_cloud_transition ERROR [ 75%]
2024-02-24T03:14:25.240 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~140s
2024-02-24T03:14:30.844 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~146s
2024-02-24T03:14:35.192 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition_large_obj FAILED [100%]
2024-02-24T03:14:35.193 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition_large_obj ERROR [100%]
Updated by Casey Bodley 2 months ago
wasn't able to reproduce under vstart. tested both with a debug build and a release build
required extra storage-class configuration to run the lifecycle transition tests:
~/ceph/build $ bin/radosgw-admin zonegroup placement add --placement-id default-placement --storage-class FROZEN
~/ceph/build $ bin/radosgw-admin zonegroup placement add --placement-id default-placement --storage-class LUKEWARM
~/ceph/build $ bin/radosgw-admin zone placement add --placement-id default-placement --storage-class FROZEN --data-pool default.rgw.frozen.data
~/ceph/build $ bin/radosgw-admin zone placement add --placement-id default-placement --storage-class LUKEWARM --data-pool default.rgw.lukewarm.data
(restart radosgw)
then uncomment storage_classes in s3tests.conf.SAMPLE:

## Storage classes
storage_classes = "LUKEWARM, FROZEN"
~/s3-tests $ S3TEST_CONF=s3tests.conf.SAMPLE tox -- s3tests_boto3/functional/test_s3.py -v -m 'not (fails_on_rgw or sse_s3)'
...
s3tests_boto3/functional/test_s3.py::test_lifecycle_multipart_expiration PASSED [ 74%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_set_invalid_date PASSED [ 74%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_transition PASSED [ 74%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_single_rule_multi_trans PASSED [ 74%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent_transition PASSED [ 75%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_transition PASSED [ 75%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_plain_null_version_current_transition PASSED [ 75%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition SKIPPED (no cloud_storage_class configured) [ 75%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_multiple_transition SKIPPED ([s3 cloud] section missing cloud_storage_class) [ 75%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_cloud_transition SKIPPED ([s3 cloud] section missing cloud_storage_class) [ 76%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition_large_obj SKIPPED ([s3 cloud] section missing cloud_storage_class) [ 76%]
s3tests_boto3/functional/test_s3.py::test_encrypted_transfer_1b PASSED [ 76%]
...
Updated by Casey Bodley 2 months ago
- Subject changed from lifecycle crashes since merge end-to-end tracing to lifecycle transition crashes since merge end-to-end tracing
Updated by Casey Bodley 2 months ago
when testing the squid backport of end2end tracing https://github.com/ceph/ceph/pull/55625, i didn't see these crashes. https://pulpito.ceph.com/cbodley-2024-02-27_20:28:25-rgw-wip-cbodley-testing-distro-default-smithi/
Updated by Casey Bodley about 2 months ago
saw an expiration crash in https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-13_12:30:10-rgw-wip-cbodley-testing-distro-default-smithi/7596563/teuthology.log that has a stack trace. it might be related
Updated by Casey Bodley about 2 months ago
hey Shreyansh and Soumya, any updates here? this one is urgent for the squid release
Updated by Soumya Koduri about 2 months ago
Casey Bodley wrote:
hey Shreyansh and Soumya, any updates here? this one is urgent for the squid release
Hi Casey, I have started looking into it. I couldn't reproduce this issue locally with simple lc/cloud transition tests. Shreyansh mentioned that the crashes were frequently observed when using s3tests. I will run those now and update.
Updated by Soumya Koduri about 2 months ago
Not able to reproduce this issue locally. Pasting the backtraces of the two crashes reported so far -
2024-03-13T13:33:20.678 INFO:tasks.rgw.client.0.smithi106.stdout:*** Caught signal (Segmentation fault)
2024-03-13T13:33:20.679 INFO:tasks.rgw.client.0.smithi106.stdout: in thread 7fb15d0b9640 thread_name:wp_thrd: 1, 1
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: ceph version 19.0.0-2011-g06f9ec2c (06f9ec2cb3c57d3b59cb10ddd4c321a82bb853f4) squid (dev)
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fb285ae5520]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 2: tc_new()
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 3: radosgw(+0x10d45c3) [0x55806f3b75c3]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 4: radosgw(+0x4a64d4) [0x55806e7894d4]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 5: radosgw(+0x6a6213) [0x55806e989213]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 6: (RGWSI_Bucket_SObj::read_bucket_instance_info(ptr_wrapper<RGWSI_MetaBackend::Context, 4>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWBucketInfo*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >, optional_yield, DoutPrefixProvider const*, rgw_cache_entry_info*, boost::optional<obj_version>)+0x13e) [0x55806ece616e]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 7: radosgw(+0xaa5ba9) [0x55806ed88ba9]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 8: (std::_Function_handler<int (RGWSI_MetaBackend_Handler::Op*), RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}>::_M_invoke(std::_Any_data const&, RGWSI_MetaBackend_Handler::Op*&&)+0x34) [0x55806ed84134]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 9: radosgw(+0xa110a0) [0x55806ecf40a0]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 10: (RGWSI_MetaBackend_SObj::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend::Context*)>)+0x5b) [0x55806ecf413b]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 11: (RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend_Handler::Op*)>)+0x77) [0x55806ecf32a7]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 12: (RGWBucketCtl::read_bucket_instance_info(rgw_bucket const&, RGWBucketInfo*, optional_yield, DoutPrefixProvider const*, RGWBucketCtl::BucketInstance::GetParams const&)+0x11b) [0x55806ed86f5b]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 13: (RGWRados::get_bucket_instance_info(rgw_bucket const&, RGWBucketInfo&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >, optional_yield, DoutPrefixProvider const*)+0x55) [0x55806ebc5915]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 14: (RGWRados::BucketShard::init(rgw_bucket const&, rgw_obj const&, RGWBucketInfo*, DoutPrefixProvider const*, optional_yield)+0x8a) [0x55806eba034a]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 15: (RGWRados::Bucket::UpdateIndex::guard_reshard(DoutPrefixProvider const*, rgw_obj const&, RGWRados::BucketShard, std::function<int (RGWRados::BucketShard*)>, optional_yield)+0x368) [0x55806ebad4e8]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 16: (RGWRados::Bucket::UpdateIndex::prepare(DoutPrefixProvider const*, RGWModifyOp, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, optional_yield, bool)+0x113) [0x55806ebadad3]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 17: (RGWRados::Object::Delete::delete_obj(optional_yield, DoutPrefixProvider const*, bool)+0x6fc) [0x55806ebc000c]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 18: (rgw::sal::RadosObject::RadosDeleteOp::delete_obj(DoutPrefixProvider const*, optional_yield, unsigned int)+0x15a) [0x55806ebfbd9a]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 19: radosgw(+0xa53e42) [0x55806ed36e42]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 20: radosgw(+0xa54339) [0x55806ed37339]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 21: (LCOpRule::process(rgw_bucket_dir_entry&, DoutPrefixProvider const*, WorkQ*)+0x433) [0x55806ed3aef3]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 22: radosgw(+0xa588a9) [0x55806ed3b8a9]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 23: radosgw(+0xa56926) [0x55806ed39926]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 24: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fb285b37b43]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 25: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fb285bc9a00]
Updated by Soumya Koduri about 2 months ago
#0 __pthread_kill_implementation (
threadid=<optimized out>, signo=signo@entry=11,
no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff58aff33 in __pthread_kill_internal (
signo=11, threadid=<optimized out>)
at pthread_kill.c:78
#2 0x00007ffff585fab6 in __GI_raise (sig=11)
at ../sysdeps/posix/raise.c:26
#3 0x0000555556d7bcc0 in reraise_fatal (
signum=signum@entry=11)
at /root/ceph/src/global/signal_handler.cc:88
#4 0x0000555556d7d351 in handle_oneshot_fatal_signal (
signum=11)
at /root/ceph/src/global/signal_handler.cc:367
#5 <signal handler called>
#6 RGWHTTPClient::set_send_length (len=4036, this=0x0)
at /root/ceph/src/rgw/rgw_http_client.h:128
#7 RGWLCCloudStreamPut::send_ready (
this=this@entry=0x55555837b260, dpp=0x5555583e0460,
rest_obj=...)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:660
#8 0x0000555556befef5 in cloud_tier_transfer_object (
dpp=<optimized out>, readf=0x555558437b80,
writef=0x55555837b260)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:728
#9 0x0000555556bf04e7 in cloud_tier_plain_transfer (
tier_ctx=...)
at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#10 0x0000555556bf5a9b in rgw_cloud_tier_transfer_object
(tier_ctx=...,
cloud_targets=std::set with 1 element = {...})
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:1298
#11 0x00005555569eed13 in rgw::sal::RadosObject::transition_to_cloud (this=0x55555fa84000,
bucket=0x55555d5db100, tier=0x55555d7ca340, o=...,
cloud_targets=std::set with 1 element = {...},
cct=0x5555583d6000, update_object=true,
dpp=<optimized out>, y=...)
at /root/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:1766
#12 0x0000555556b1f61c in LCOpAction_Transition::transition_obj_to_cloud (this=this@entry=0x55555fa0ffc0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1396
#13 0x0000555556b1fed1 in LCOpAction_Transition::process
(this=0x55555fa0ffc0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1453
#14 0x0000555556b2087c in LCOpAction_CurrentTransition::process (this=<optimized out>, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1503
#15 0x0000555556b0bf7f in LCOpRule::process (
this=this@entry=0x7ffecf9d26a8, o=...,
dpp=<optimized out>, wq=wq@entry=0x55555839f510)
at /root/ceph/src/rgw/rgw_lc.cc:1618
#16 0x0000555556b0c547 in operator() (
__closure=0x55555839f5f0, wk=0x55555d1792c0,
wq=0x55555839f510, wi=...)
at /root/ceph/src/rgw/rgw_lc.cc:1704
#17 0x0000555556b0c77b in std::__invoke_impl<void, RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)>&, RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&> (__f=...)
at /usr/include/c++/12/bits/invoke.h:61
#18 std::__invoke_r<void, RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)>&, RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&> (__fn=...)
at /usr/include/c++/12/bits/invoke.h:111
#19 std::_Function_handler<void(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&), RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)> >::_M_invoke(const std::_Any_data &, RGWLC::LCWorker *&&, WorkQ *&&, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry> &) (
__functor=..., __args#0=<optimized out>,
__args#1=<optimized out>, __args#2=...)
at /usr/include/c++/12/bits/std_function.h:290
#20 0x0000555556b170d9 in std::function<void (RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&)>::operator()(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&) const (
this=this@entry=0x55555839f5f0,
__args#0=<optimized out>, __args#1=<optimized out>,
__args#1@entry=0x55555839f510, __args#2=...)
at /usr/include/c++/12/bits/std_function.h:591
#21 0x0000555556b1b815 in WorkQ::entry (
this=0x55555839f510)
at /root/ceph/src/rgw/rgw_lc.cc:778
#22 0x00007ffff7033c0f in Thread::entry_wrapper (
this=0x55555839f510)
at /root/ceph/src/common/Thread.cc:87
#23 0x00007ffff7033c27 in Thread::_entry_func (
arg=<optimized out>)
at /root/ceph/src/common/Thread.cc:74
#24 0x00007ffff58ae19d in start_thread (
arg=<optimized out>) at pthread_create.c:442
#25 0x00007ffff592fc60 in clone3 ()
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Updated by Soumya Koduri about 2 months ago
Two more crashes with similar backtraces are seen -
http://qa-proxy.ceph.com/teuthology/soumyakoduri-2024-03-19_02:37:28-rgw:cloud-transition-wip-skoduri-lc-distro-default-smithi/7610307/teuthology.log
2024-03-19T03:09:59.640 INFO:tasks.rgw.client.0.smithi148.stdout:*** Caught signal (Segmentation fault)
2024-03-19T03:09:59.640 INFO:tasks.rgw.client.0.smithi148.stdout: in thread 7fc7f2910640 thread_name:wp_thrd: 0, 0
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: ceph version 19.0.0-2098-ge0537c1b (e0537c1b936b5049f35101185806ff470c74a520) squid (dev)
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: 1: /lib64/libc.so.6(+0x54db0) [0x7fc918654db0]
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: 2: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int)+0xeb) [0x7fc91b82977b]
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: 3: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned int)+0x30) [0x7fc91b829b20]
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: 4: radosgw(+0x488928) [0x555c4cb78928]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 5: (RGWSI_Bucket_SObj::read_bucket_instance_info(ptr_wrapper<RGWSI_MetaBackend::Context, 4>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWBucketInfo*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >, optional_yield, DoutPrefixProvider const*, rgw_cache_entry_info*, boost::optional<obj_version>)+0x22a) [0x555c4d0369ca]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 6: radosgw(+0x9f8069) [0x555c4d0e8069]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 7: (std::_Function_handler<int (RGWSI_MetaBackend_Handler::Op*), RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}>::_M_invoke(std::_Any_data const&, RGWSI_MetaBackend_Handler::Op*&&)+0x34) [0x555c4d0e5594]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 8: radosgw(+0x953ed0) [0x555c4d043ed0]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 9: (RGWSI_MetaBackend_SObj::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend::Context*)>)+0x5b) [0x555c4d043f6b]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 10: (RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend_Handler::Op*)>)+0x77) [0x555c4d0430f7]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 11: (RGWBucketCtl::read_bucket_instance_info(rgw_bucket const&, RGWBucketInfo*, optional_yield, DoutPrefixProvider const*, RGWBucketCtl::BucketInstance::GetParams const&)+0x11b) [0x555c4d0e89db]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 12: (rgw::sal::RadosBucket::load_bucket(DoutPrefixProvider const*, optional_yield)+0x211) [0x555c4cf3aee1]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 13: (rgw::notify::publish_reserve(DoutPrefixProvider const*, rgw::SiteConfig const&, rgw::notify::EventType, rgw::notify::reservation_t&, RGWObjTags const*)+0xb38) [0x555c4ceb4958]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 14: radosgw(+0x99e9f0) [0x555c4d08e9f0]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 15: radosgw(+0x99f7ad) [0x555c4d08f7ad]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 16: (LCOpRule::process(rgw_bucket_dir_entry&, DoutPrefixProvider const*, WorkQ*)+0x423) [0x555c4d091183]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 17: radosgw(+0x9a1b96) [0x555c4d091b96]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 18: radosgw(+0x99fac1) [0x555c4d08fac1]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 19: /lib64/libc.so.6(+0x9f802) [0x7fc91869f802]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 20: /lib64/libc.so.6(+0x3f450) [0x7fc91863f450]
2024-03-19T03:09:59.836 INFO:tasks.rgw.client.0.smithi148.stderr:daemon-helper: command crashed with signal 11
2024-03-19T03:10:04.263 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-03-19T03:10:09.867 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-03-19T03:10:15.472 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-03-19T03:10:21.076 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-03-19T03:10:24.602 INFO:tasks.rgw.client.1.smithi148.stdout:*** Caught signal (Segmentation fault)
2024-03-19T03:10:24.602 INFO:tasks.rgw.client.1.smithi148.stdout: in thread 7fc10624a640 thread_name:wp_thrd: 0, 2
2024-03-19T03:10:24.603 INFO:tasks.rgw.client.1.smithi148.stdout: ceph version 19.0.0-2098-ge0537c1b (e0537c1b936b5049f35101185806ff470c74a520) squid (dev)
2024-03-19T03:10:24.603 INFO:tasks.rgw.client.1.smithi148.stdout: 1: /lib64/libc.so.6(+0x54db0) [0x7fc22d054db0]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 2: (tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)+0x3c) [0x7fc2302249bc]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 3: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)+0x20) [0x7fc23022aef0]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x80) [0x7fc23022afb0]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))+0x76) [0x7fc23022b126]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 6: (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xaf) [0x5578dccdbcef]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 7: (rgw_get_rados_ref(DoutPrefixProvider const*, librados::v14_2_0::Rados*, rgw_raw_obj, rgw_rados_ref*)+0x3d) [0x5578dd1a43bd]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 8: (RGWRados::get_obj_head_ref(DoutPrefixProvider const*, rgw_placement_rule const&, rgw_obj const&, rgw_rados_ref*)+0x23a) [0x5578dd0f605a]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 9: (RGWRados::Object::Delete::delete_obj(optional_yield, DoutPrefixProvider const*, bool)+0x519) [0x5578dd11aed9]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 10: (rgw::sal::RadosObject::RadosDeleteOp::delete_obj(DoutPrefixProvider const*, optional_yield, unsigned int)+0x15a) [0x5578dd16155a]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 11: radosgw(+0x99cbf2) [0x5578dd29ebf2]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 12: radosgw(+0x99d149) [0x5578dd29f149]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 13: (LCOpRule::process(rgw_bucket_dir_entry&, DoutPrefixProvider const*, WorkQ*)+0x423) [0x5578dd2a3183]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 14: radosgw(+0x9a1b96) [0x5578dd2a3b96]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 15: radosgw(+0x99fac1) [0x5578dd2a1ac1]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 16: /lib64/libc.so.6(+0x9f802) [0x7fc22d09f802]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 17: /lib64/libc.so.6(+0x3f450) [0x7fc22d03f450]
Updated by Soumya Koduri about 2 months ago
- File ceph.client.0.log added
With Valgrind enabled and lc_debug_interval set to `1`, the crash is not observed; however, in one of the runs a valgrind error "Leak_PossiblyLost" is reported, possibly during cleanup -
Attaching valgrind log file of that RGW service
Updated by Casey Bodley about 1 month ago
spotted a different lifecycle crash in the rgw/dbstore suite, which is interesting because no expiration/transition is involved:
2024-03-20T23:59:59.506 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_invalid_status PASSED [ 79%]
2024-03-20T23:59:59.746 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_date PASSED [ 79%]
2024-03-20T23:59:59.994 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_invalid_date PASSED [ 79%]
2024-03-21T00:00:00.415 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent PASSED [ 79%]
2024-03-21T00:00:00.619 INFO:tasks.rgw.client.0.smithi032.stdout:*** Caught signal (Segmentation fault) **
2024-03-21T00:00:00.619 INFO:tasks.rgw.client.0.smithi032.stdout: in thread 7f04230c1640 thread_name:lifecycle_thr_0
2024-03-21T00:00:00.620 INFO:tasks.rgw.client.0.smithi032.stdout: ceph version 19.0.0-2246-g7793a10a (7793a10a2a6cd1bd96a28b9ecab476964200946d) squid (dev)
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f052baad520]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 2: radosgw(+0xa6d465) [0x562391280465]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 3: (RGWLC::process(int, int, RGWLC::LCWorker*, bool)+0x96d) [0x56239128cf4d]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 4: (RGWLC::process(RGWLC::LCWorker*, std::unique_ptr<rgw::sal::Bucket, std::default_delete<rgw::sal::Bucket> > const&, bool)+0x16d) [0x56239128f9fd]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 5: (RGWLC::LCWorker::entry()+0x539) [0x56239128ffc9]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 6: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f052baffb43]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 7: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f052bb91a00]
2024-03-21T00:00:00.622 INFO:tasks.rgw.client.0.smithi032.stdout: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2024-03-21T00:00:00.673 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_deletemarker PASSED [ 80%]
2024-03-21T00:00:00.783 INFO:tasks.rgw.client.0.smithi032.stderr:daemon-helper: command crashed with signal 11
2024-03-21T00:00:03.650 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-03-21T00:00:09.154 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-03-21T00:00:14.658 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-03-21T00:00:20.162 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-03-21T00:00:20.251 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_filter FAILED [ 80%]
2024-03-21T00:00:20.251 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_filter ERROR [ 80%]
from https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-20_23:26:29-rgw-wip-cbodley-testing-distro-default-smithi/7614326/teuthology.log
Updated by Casey Bodley about 1 month ago
- Priority changed from Urgent to Immediate
Updated by Casey Bodley about 1 month ago
i opened https://github.com/ceph/ceph/pull/56368 with a partial revert of the rgw tracing changes from https://github.com/ceph/ceph/pull/52114
i consider that a last resort for squid if we aren't able to find a root cause. in the meantime, i'd like to verify that lifecycle runs are green against that branch. we can potentially reintroduce that commit in pieces as a way to bisect the issue
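reintroducing the commit in pieces can be sketched in git like this (a hypothetical sketch in a throwaway repo; the file names, commit messages, and hunk boundaries are placeholders, not the actual ceph commits):

```shell
set -e
work=$(mktemp -d) && cd "$work" && git init -q .
git config user.email dev@example.com && git config user.name dev

# stand-ins for the files touched by the big tracing commit
echo a > tracing.cc && echo b > lifecycle.cc
git add . && git commit -qm 'base'
echo a2 > tracing.cc && echo b2 > lifecycle.cc
git add . && git commit -qm 'big tracing commit'   # the suspect change
big=$(git rev-parse HEAD)

# revert the whole suspect commit (analogous to the partial-revert PR)
git revert -n "$big" && git commit -qm 'partial revert'

# reintroduce only one piece of the suspect commit, then rerun the tests;
# repeat file by file (or hunk by hunk with `git checkout -p`) to bisect
git checkout "$big" -- tracing.cc
git commit -qm 'reintroduce tracing.cc piece'
git show --stat --oneline HEAD
```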
Updated by Casey Bodley about 1 month ago
even with the rgw tracing parts reverted in https://github.com/ceph/ceph/pull/56368, i still saw a lifecycle crash in https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-21_17:31:42-rgw:lifecycle-wip-60682-distro-default-smithi/7615035/teuthology.log
2024-03-21T17:55:47.365 INFO:teuthology.orchestra.run.smithi144.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_get_no_id PASSED [ 72%]
2024-03-21T17:56:41.499 INFO:tasks.rgw.client.0.smithi144.stdout:*** Caught signal (Segmentation fault) **
2024-03-21T17:56:41.500 INFO:tasks.rgw.client.0.smithi144.stdout: in thread 7fac95e39640 thread_name:lifecycle_thr_0
2024-03-21T17:56:41.688 INFO:tasks.rgw.client.0.smithi144.stderr:daemon-helper: command crashed with signal 11
2024-03-21T17:56:42.352 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-03-21T17:56:47.855 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-03-21T17:56:53.359 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-03-21T17:56:58.863 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-03-21T17:57:04.366 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~22s
2024-03-21T17:57:09.870 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~28s
2024-03-21T17:57:13.350 INFO:teuthology.orchestra.run.smithi144.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_expiration FAILED [ 72%]
Updated by Soumya Koduri about 1 month ago
Casey Bodley wrote:
spotted a different lifecycle crash in the rgw/dbstore suite, which is interesting because no expiration/transition is involved:
[...]
from https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-20_23:26:29-rgw-wip-cbodley-testing-distro-default-smithi/7614326/teuthology.log
There are some lifecycle tests (which include expiration) not marked as 'fails_on_dbstore'. I will skip those tests too and check if the issue occurs even with just applying lifecycle policy rules.
Updated by Casey Bodley about 1 month ago
Casey Bodley wrote:
even with the rgw tracing parts reverted in https://github.com/ceph/ceph/pull/56368, i still saw a lifecycle crash in https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-21_17:31:42-rgw:lifecycle-wip-60682-distro-default-smithi/7615035/teuthology.log
i pushed another branch that reverts the rest of the tracing changes in librados and messenger, and still hit the crash in http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-22_01:09:02-rgw:lifecycle-wip-60682-all-distro-default-smithi/7615848/teuthology.log
Updated by Casey Bodley about 1 month ago
i managed to catch a crash under gdb with a local vstart cluster. since this seems timing-related and didn't reproduce under valgrind in teuthology, i'm testing a release build (CMAKE_BUILD_TYPE=RelWithDebInfo) and started vstart.sh without the -d flag that enables debug logging
here's the backtrace from gdb:
Thread 111 "lifecycle_thr_1" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f889d4006c0 (LWP 470554)]
ceph::buffer::v15_2_0::ptr::release (this=this@entry=0x55ea38fd7988) at /home/cbodley/ceph/src/common/buffer.cc:423
423         (1 == cached_raw->nref.load(std::memory_order_acquire));
(gdb) bt
#0  ceph::buffer::v15_2_0::ptr::release (this=this@entry=0x55ea38fd7988) at /home/cbodley/ceph/src/common/buffer.cc:423
#1  0x000055ea34677d23 in ceph::buffer::v15_2_0::ptr::~ptr (this=0x55ea38fd7988, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/include/buffer.h:247
#2  ceph::buffer::v15_2_0::ptr_node::~ptr_node (this=0x55ea38fd7980, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/include/buffer.h:371
#3  ceph::buffer::v15_2_0::ptr_node::disposer::operator() (this=<optimized out>, delete_this=0x55ea38fd7980) at /home/cbodley/ceph/src/include/buffer.h:366
#4  ceph::buffer::v15_2_0::ptr_node::disposer::operator() (this=<synthetic pointer>, delete_this=0x55ea38fd7980) at /home/cbodley/ceph/src/include/buffer.h:364
#5  ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55ea3ae178a0) at /home/cbodley/ceph/src/include/buffer.h:600
#6  ceph::buffer::v15_2_0::list::~list (this=0x55ea3ae178a0, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/include/buffer.h:959
#7  std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>::~pair (this=0x55ea3ae17880, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_pair.h:187
#8  std::destroy_at<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > (__location=0x55ea3ae17880) at /usr/include/c++/13/bits/stl_construct.h:88
#9  std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > (__p=0x55ea3ae17880, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:560
#10 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_destroy_node (__p=0x55ea3ae17860, this=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:625
#11 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_drop_node (this=<optimized out>, __p=0x55ea3ae17860) at /usr/include/c++/13/bits/stl_tree.h:633
#12 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_erase (__x=0x55ea3ae17860, this=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:1938
#13 0x000055ea34677cf4 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_erase (__x=0x55ea3a69fda0, this=0x55ea3aabd2b8) at /usr/include/c++/13/bits/stl_tree.h:1936
#14 0x000055ea346803a1 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::~_Rb_tree (this=0x55ea3aabd2b8, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:736
#15 std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::~map (this=0x55ea3aabd2b8, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_map.h:314
#16 rgw::sal::StoreBucket::~StoreBucket (this=0x55ea3aabce00, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/rgw/rgw_sal_store.h:134
#17 rgw::sal::RadosBucket::~RadosBucket (this=0x55ea3aabce00, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:273
#18 0x000055ea346803c9 in rgw::sal::RadosBucket::~RadosBucket (this=0x55ea3aabce00, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:273
#19 0x000055ea3485fca3 in std::default_delete<rgw::sal::Bucket>::operator() (__ptr=<optimized out>, this=<optimized out>) at /usr/include/c++/13/bits/unique_ptr.h:93
#20 std::unique_ptr<rgw::sal::Bucket, std::default_delete<rgw::sal::Bucket> >::~unique_ptr (this=0x7f889d3fc278, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/unique_ptr.h:404
#21 RGWLC::bucket_lc_process (this=this@entry=0x55ea37d4cd20, shard_id=..., worker=worker@entry=0x55ea37d3b590, stop_at=stop_at@entry=1711123005, once=once@entry=false) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1833
#22 0x000055ea34862358 in RGWLC::process (this=this@entry=0x55ea37d4cd20, index=16, max_lock_secs=max_lock_secs@entry=90, worker=worker@entry=0x55ea37d3b590, once=once@entry=false) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:2400
#23 0x000055ea3486c65e in RGWLC::process (this=0x55ea37d4cd20, worker=worker@entry=0x55ea37d3b590, optional_bucket=std::unique_ptr<rgw::sal::Bucket> = {...}, once=once@entry=false) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:2019
#24 0x000055ea3486c8c9 in RGWLC::LCWorker::entry (this=0x55ea37d3b590) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:207
#25 0x00007f88e48ac897 in start_thread (arg=<optimized out>) at pthread_create.c:444
#26 0x00007f88e493380c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
digging into ptr::release():
#0  ceph::buffer::v15_2_0::ptr::release (this=this@entry=0x55ea38fd7988) at /home/cbodley/ceph/src/common/buffer.cc:423
423         (1 == cached_raw->nref.load(std::memory_order_acquire));
(gdb) p *this
$13 = {_raw = 0x0, _off = 947337538, _len = 1383608423}
(gdb) list
418       // only a single reference. Altogether with initializing `nref` of
419       // freshly fabricated one with `1` through the std::atomic's ctor
420       // (which doesn't impose a memory barrier on the strongly-ordered
421       // x86), this allows to avoid all atomical operations in such case.
422       const bool last_one = \
423         (1 == cached_raw->nref.load(std::memory_order_acquire));
424       if (likely(last_one) || --cached_raw->nref == 0) {
425         bdout << "deleting raw " << static_cast<void*>(cached_raw)
426               << " len " << cached_raw->get_len() << bendl;
427         ANNOTATE_HAPPENS_AFTER(&cached_raw->nref);
(gdb) p cached_raw
$14 = (ceph::buffer::v15_2_0::raw * const) 0x3330336d51545f6d
(gdb) p *cached_raw
Cannot access memory at address 0x3330336d51545f6d
the buffer::ptr itself holds garbage. cached_raw here holds the original value of _raw, whose dereference leads to the segfault
Updated by Casey Bodley about 1 month ago
caught another crash that looks more like the others we've seen:
Thread 114 "wp_thrd: 2, 2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7faafd8006c0 (LWP 475870)]
0x0000562cb14e818b in ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>::operator++ (this=<synthetic pointer>) at /home/cbodley/ceph/src/include/buffer.h:461
461         cur = cur->next;
(gdb) bt
#0  0x0000562cb14e818b in ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>::operator++ (this=<synthetic pointer>) at /home/cbodley/ceph/src/include/buffer.h:461
#1  ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>::operator++ (this=<synthetic pointer>) at /home/cbodley/ceph/src/include/buffer.h:466
#2  ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x562cb6bfbde0) at /home/cbodley/ceph/src/include/buffer.h:599
#3  ceph::buffer::v15_2_0::list::~list (this=0x562cb6bfbde0, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/include/buffer.h:959
#4  std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>::~pair (this=0x562cb6bfbdc0, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_pair.h:187
#5  std::destroy_at<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > (__location=0x562cb6bfbdc0) at /usr/include/c++/13/bits/stl_construct.h:88
#6  std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > (__p=0x562cb6bfbdc0, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:560
#7  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_destroy_node (__p=0x562cb6bfbda0, this=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:625
#8  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_drop_node (this=<optimized out>, __p=0x562cb6bfbda0) at /usr/include/c++/13/bits/stl_tree.h:633
#9  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_erase (__x=0x562cb6bfbda0, this=0x7faafd7faad0) at /usr/include/c++/13/bits/stl_tree.h:1938
#10 0x0000562cb14ebdd9 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::~_Rb_tree (this=0x7faafd7faad0, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:736
#11 std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::~map (this=0x7faafd7faad0, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_map.h:314
#12 RGWSI_Bucket_SObj::bucket_info_cache_entry::~bucket_info_cache_entry (this=0x7faafd7fa618, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/rgw/services/svc_bucket_sobj.h:40
#13 boost::optional_detail::optional_base<RGWSI_Bucket_SObj::bucket_info_cache_entry>::destroy_impl (this=0x7faafd7fa610) at /home/cbodley/ceph/build/boost/include/boost/optional/optional.hpp:779
#14 boost::optional_detail::optional_base<RGWSI_Bucket_SObj::bucket_info_cache_entry>::destroy (this=0x7faafd7fa610) at /home/cbodley/ceph/build/boost/include/boost/optional/optional.hpp:765
#15 boost::optional_detail::optional_base<RGWSI_Bucket_SObj::bucket_info_cache_entry>::~optional_base (this=0x7faafd7fa610, __in_chrg=<optimized out>) at /home/cbodley/ceph/build/boost/include/boost/optional/optional.hpp:263
#16 boost::optional<RGWSI_Bucket_SObj::bucket_info_cache_entry>::~optional (this=0x7faafd7fa610, __in_chrg=<optimized out>) at /home/cbodley/ceph/build/boost/include/boost/optional/optional.hpp:909
#17 RGWSI_Bucket_SObj::read_bucket_instance_info (this=0x562cb44c58c0, ctx=..., key="yournamehere-44osugk2129zmf6z-4:200daa33-e350-46bf-8c6f-3476d1d9d0d6.4141.25", info=0x7faafd7fbe28, pmtime=0x0, pattrs=0x0, y=..., dpp=0x562cb4544cb0, cache_info=0x0, refresh_version=...) at /home/cbodley/ceph/src/rgw/services/svc_bucket_sobj.cc:299
#18 0x0000562cb15f0845 in operator() (ctx=..., __closure=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.cc:2855
#19 std::__invoke_impl<int, RGWBucketCtl::read_bucket_instance_info(const rgw_bucket&, RGWBucketInfo*, optional_yield, const DoutPrefixProvider*, const BucketInstance::GetParams&)::<lambda(RGWSI_Bucket_BI_Ctx&)>&, ptr_wrapper<RGWSI_MetaBackend::Context, 4>&> (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#20 std::__invoke_r<int, RGWBucketCtl::read_bucket_instance_info(const rgw_bucket&, RGWBucketInfo*, optional_yield, const DoutPrefixProvider*, const BucketInstance::GetParams&)::<lambda(RGWSI_Bucket_BI_Ctx&)>&, ptr_wrapper<RGWSI_MetaBackend::Context, 4>&> (__fn=...) at /usr/include/c++/13/bits/invoke.h:114
#21 std::_Function_handler<int(ptr_wrapper<RGWSI_MetaBackend::Context, 4>&), RGWBucketCtl::read_bucket_instance_info(const rgw_bucket&, RGWBucketInfo*, optional_yield, const DoutPrefixProvider*, const BucketInstance::GetParams&)::<lambda(RGWSI_Bucket_BI_Ctx&)> >::_M_invoke(const std::_Any_data &, ptr_wrapper<RGWSI_MetaBackend::Context, 4> &) (__functor=..., __args#0=...) at /usr/include/c++/13/bits/std_function.h:290
#22 0x0000562cb160e290 in std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>::operator()(ptr_wrapper<RGWSI_MetaBackend::Context, 4>&) const (__args#0=..., this=<optimized out>) at /usr/include/c++/13/bits/std_function.h:591
#23 RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}::operator()(RGWSI_MetaBackend_Handler::Op*) const (op=<optimized out>, __closure=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.cc:2528
#24 std::__invoke_impl<int, RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}&, RGWSI_MetaBackend_Handler::Op*>(std::__invoke_other, RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}&, RGWSI_MetaBackend_Handler::Op*&&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#25 std::__invoke_r<int, RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}&, RGWSI_MetaBackend_Handler::Op*>(RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}&, RGWSI_MetaBackend_Handler::Op*&&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:114
#26 std::_Function_handler<int (RGWSI_MetaBackend_Handler::Op*), RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}>::_M_invoke(std::_Any_data const&, RGWSI_MetaBackend_Handler::Op*&&) (__functor=..., __args#0=<optimized out>) at /usr/include/c++/13/bits/std_function.h:290
#27 0x0000562cb151326f in std::function<int (RGWSI_MetaBackend_Handler::Op*)>::operator()(RGWSI_MetaBackend_Handler::Op*) const (__args#0=<optimized out>, this=<optimized out>) at /usr/include/c++/13/bits/std_function.h:591
#28 operator() (ctx=0x7faafd7face0, __closure=0x7faafd7fae00) at /home/cbodley/ceph/src/rgw/services/svc_meta_be.cc:183
#29 std::__invoke_impl<int, RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int(Op*)>)::<lambda(RGWSI_MetaBackend::Context*)>&, RGWSI_MetaBackend::Context*> (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#30 std::__invoke_r<int, RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int(Op*)>)::<lambda(RGWSI_MetaBackend::Context*)>&, RGWSI_MetaBackend::Context*> (__fn=...) at /usr/include/c++/13/bits/invoke.h:114
#31 std::_Function_handler<int(RGWSI_MetaBackend::Context*), RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int(Op*)>)::<lambda(RGWSI_MetaBackend::Context*)> >::_M_invoke(const std::_Any_data &, RGWSI_MetaBackend::Context *&&) (__functor=..., __args#0=<optimized out>) at /usr/include/c++/13/bits/std_function.h:290
#32 0x0000562cb1514b3c in std::function<int (RGWSI_MetaBackend::Context*)>::operator()(RGWSI_MetaBackend::Context*) const (__args#0=<optimized out>, this=<optimized out>) at /usr/include/c++/13/bits/std_function.h:591
#33 RGWSI_MetaBackend_SObj::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend::Context*)>) (this=<optimized out>, opt=std::optional containing std::variant [no contained value], f=...) at /home/cbodley/ceph/src/rgw/services/svc_meta_be_sobj.cc:106
#34 0x0000562cb151362c in RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend_Handler::Op*)>) (this=<optimized out>, bectx_params=std::optional [no contained value], f=...) at /home/cbodley/ceph/src/rgw/services/svc_meta_be.cc:180
#35 0x0000562cb15efc77 in RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>) (this=<optimized out>, f=..., bectx_params=std::optional [no contained value]) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.cc:2526
#36 RGWBucketCtl::read_bucket_instance_info (this=<optimized out>, bucket=..., info=<optimized out>, info@entry=0x7faafd7fbe28, y=..., dpp=<optimized out>, dpp@entry=0x562cb4544cb0, params=...) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.cc:2854
#37 0x0000562cb12ec311 in RGWRados::get_bucket_instance_info (this=<optimized out>, bucket=..., info=..., pmtime=pmtime@entry=0x0, pattrs=pattrs@entry=0x0, y=..., dpp=dpp@entry=0x562cb4544cb0) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.h:555
#38 0x0000562cb130602c in RGWRados::BucketShard::init (this=this@entry=0x7faafd7fbc38, _bucket=..., obj=..., bucket_info_out=<optimized out>, dpp=dpp@entry=0x562cb4544cb0, y=...) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:2735
#39 0x0000562cb131d3b8 in RGWRados::Bucket::UpdateIndex::init_bs (y=..., dpp=0x562cb4544cb0, this=0x7faafd7fba40) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.h:907
#40 RGWRados::Bucket::UpdateIndex::get_bucket_shard (y=..., dpp=<optimized out>, pbs=<synthetic pointer>, this=0x7faafd7fba40) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.h:954
#41 RGWRados::Bucket::UpdateIndex::guard_reshard(DoutPrefixProvider const*, rgw_obj const&, RGWRados::BucketShard**, std::function<int (RGWRados::BucketShard*)>, optional_yield) (this=this@entry=0x7faafd7fba40, dpp=0x562cb4544cb0, obj_instance=..., pbs=pbs@entry=0x0, call=..., y=...) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:6852
#42 0x0000562cb131da33 in RGWRados::Bucket::UpdateIndex::prepare (this=this@entry=0x7faafd7fba40, dpp=<optimized out>, dpp@entry=0x562cb4544cb0, op=<optimized out>, op@entry=CLS_RGW_OP_DEL, write_tag=<optimized out>, y=..., log_op=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:6917
#43 0x0000562cb131ecdd in RGWRados::Object::Delete::delete_obj (this=this@entry=0x562cb6f8da30, y=..., dpp=dpp@entry=0x562cb4544cb0, log_op=log_op@entry=false) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:5806
#44 0x0000562cb1387b86 in rgw::sal::RadosObject::RadosDeleteOp::delete_obj (this=0x562cb6f8d000, dpp=0x562cb4544cb0, y=..., flags=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2172
#45 0x0000562cb156a0ba in remove_expired_obj (dpp=0x562cb4544cb0, oc=..., remove_indeed=<optimized out>, event_types=std::vector of length 2, capacity 2 = {...}) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:619
#46 0x0000562cb1586649 in LCOpAction_CurrentExpiration::process (this=<optimized out>, oc=...) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1192
#47 0x0000562cb156ba8f in LCOpRule::process (this=this@entry=0x7faafd7fcd28, o=..., dpp=0x562cb4544cb0, wq=wq@entry=0x562cb4596910) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1666
#48 0x0000562cb156c2b3 in operator() (__closure=0x562cb45969f0, wk=0x562cb4529680, wq=0x562cb4596910, wi=...) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1752
#49 0x0000562cb157e6bf in std::function<void (RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&)>::operator()(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&) const (__args#2=..., __args#1=0x562cb4596910, __args#0=<optimized out>, this=<optimized out>) at /usr/include/c++/13/bits/std_function.h:591
#50 WorkQ::entry (this=0x562cb4596910) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:804
#51 0x00007fab46aac897 in start_thread (arg=<optimized out>) at pthread_create.c:444
#52 0x00007fab46b3380c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
this time it's destroying bucket_info_cache_entry::attrs
Updated by Soumya Koduri about 1 month ago
Soumya Koduri wrote:
Casey Bodley wrote:
spotted a different lifecycle crash in the rgw/dbstore suite, which is interesting because no expiration/transition is involved:
[...]
from https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-20_23:26:29-rgw-wip-cbodley-testing-distro-default-smithi/7614326/teuthology.log
There are some lifecycle tests (which include expiration) that are not marked as 'fails_on_dbstore'. I will skip those tests too and check whether the issue occurs even when only applying lifecycle policy rules.
The lc crashes seen when running the dbstore suite are due to a bug in dbstore (https://github.com/soumyakoduri/ceph/commit/c20c2d2499eeb182ebffb1675950e42719735b5f) and are unrelated to the original issue. I will post the fix for review.
Updated by Soumya Koduri about 1 month ago
Casey Bodley wrote:
caught another crash that looks more like the others we've seen:
[...] this time it's destroying bucket_info_cache_entry::attrs
Most of these crashes seem to be related to the bucket object.
Could you please clarify what bucket_version will be in the else case below (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_sal_rados.cc#L546)?
int RadosBucket::load_bucket(const DoutPrefixProvider* dpp, optional_yield y)
{
int ret;
RGWSI_MetaBackend_CtxParams bectx_params = RGWSI_MetaBackend_CtxParams_SObj();
RGWObjVersionTracker ep_ot;
..
..
} else {
ret = store->ctl()->bucket->read_bucket_instance_info(info.bucket, &info, y, dpp,
RGWBucketCtl::BucketInstance::GetParams()
.set_mtime(&mtime)
.set_attrs(&attrs)
.set_bectx_params(bectx_params));
}
if (ret != 0) {
return ret;
}
bucket_version = ep_ot.read_version; <<<< Here ep_ot.read_version is uninitialized in case of the else{} path.. right?
}
The new code added in 'lc' publish_reserve()->load_bucket(..) enters the else{} code path since info.bucket.bucket_id is not empty. Could it be that bucket_version here gets updated with an incorrect read_version, which may result in a possible use-after-free of the bucket object in the cache at a later point?
Updated by Casey Bodley about 1 month ago
another different tcmalloc segfault, this time just trying to copy a 32-byte std::string:
Thread 110 "wp_thrd: 1, 2" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fb712a006c0 (LWP 491581)] tcmalloc::SLL_TryPop (rv=<synthetic pointer>, list=0x5638bf05c060) at src/linked_list.h:69 Downloading source file /usr/src/debug/gperftools-2.9.1-6.fc39.x86_64/src/linked_list.h 69 void *next = SLL_Next(*list); (gdb) bt #0 tcmalloc::SLL_TryPop (rv=<synthetic pointer>, list=0x5638bf05c060) at src/linked_list.h:69 #1 tcmalloc::ThreadCache::FreeList::TryPop (rv=<synthetic pointer>, this=0x5638bf05c060) at src/thread_cache.h:220 #2 tcmalloc::ThreadCache::Allocate (oom_handler=0x7fb75cc38d40 <tcmalloc::cpp_throw_oom(unsigned long)>, cl=3, size=32, this=0x5638bf05c000) at src/thread_cache.h:379 #3 malloc_fast_path<tcmalloc::cpp_throw_oom> (size=32) at src/tcmalloc.cc:1898 #4 tc_new (size=32) at src/tcmalloc.cc:2019 #5 0x00005638bc2eb4e3 in std::__new_allocator<char>::allocate (this=<optimized out>, __n=<optimized out>) at /usr/include/c++/13/bits/new_allocator.h:151 #6 std::allocator<char>::allocate (__n=<optimized out>, this=<optimized out>) at /usr/include/c++/13/bits/allocator.h:198 #7 std::allocator_traits<std::allocator<char> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:482 #8 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_allocate (__n=<optimized out>, __a=...) 
at /usr/include/c++/13/bits/basic_string.h:126 #9 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create (this=0x7fb7129fbc60, __old_capacity=<optimized out>, __capacity=<synthetic pointer>: <optimized out>) at /usr/include/c++/13/bits/basic_string.tcc:159 #10 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign (this=0x7fb7129fbc60, __str="yournamehere-jepsg3qinlo81byd-1") at /usr/include/c++/13/bits/basic_string.tcc:288 #11 0x00005638bc842f5e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign (__str="yournamehere-jepsg3qinlo81byd-1", this=0x7fb7129fbc60) at /usr/include/c++/13/bits/basic_string.h:1608 #12 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (__str="yournamehere-jepsg3qinlo81byd-1", this=0x7fb7129fbc60) at /usr/include/c++/13/bits/basic_string.h:814 #13 rgw_bucket::operator= (this=0x7fb7129fbc40) at /home/cbodley/ceph/src/rgw/rgw_bucket_types.h:166 #14 RGWRados::BucketShard::init (this=this@entry=0x7fb7129fbc38, _bucket=..., obj=..., bucket_info_out=0x7fb7129fbe28, dpp=dpp@entry=0x5638c03bccb0, y=...) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:2729 #15 0x00005638bc85a3b8 in RGWRados::Bucket::UpdateIndex::init_bs (y=..., dpp=0x5638c03bccb0, this=0x7fb7129fba40) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.h:907 #16 RGWRados::Bucket::UpdateIndex::get_bucket_shard (y=..., dpp=<optimized out>, pbs=<synthetic pointer>, this=0x7fb7129fba40) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.h:954 #17 RGWRados::Bucket::UpdateIndex::guard_reshard(DoutPrefixProvider const*, rgw_obj const&, RGWRados::BucketShard**, std::function<int (RGWRados::BucketShard*)>, optional_yield) (this=this@entry=0x7fb7129fba40, dpp=0x5638c03bccb0, obj_instance=..., pbs=pbs@entry=0x0, call=..., y=...) 
at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:6852 #18 0x00005638bc85aa33 in RGWRados::Bucket::UpdateIndex::prepare (this=this@entry=0x7fb7129fba40, dpp=<optimized out>, dpp@entry=0x5638c03bccb0, op=<optimized out>, op@entry=CLS_RGW_OP_DEL, write_tag=<optimized out>, y=..., log_op=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:6917 #19 0x00005638bc85bcdd in RGWRados::Object::Delete::delete_obj (this=this@entry=0x5638c2f0aa30, y=..., dpp=dpp@entry=0x5638c03bccb0, log_op=log_op@entry=false) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:5806 #20 0x00005638bc8c4b86 in rgw::sal::RadosObject::RadosDeleteOp::delete_obj (this=0x5638c2f0a000, dpp=0x5638c03bccb0, y=..., flags=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2172 #21 0x00005638bcaa70ba in remove_expired_obj (dpp=0x5638c03bccb0, oc=..., remove_indeed=<optimized out>, event_types=std::vector of length 2, capacity 2 = {...}) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:619 #22 0x00005638bcac3649 in LCOpAction_CurrentExpiration::process (this=<optimized out>, oc=...) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1192 #23 0x00005638bcaa8a8f in LCOpRule::process (this=this@entry=0x7fb7129fcd28, o=..., dpp=0x5638c03bccb0, wq=wq@entry=0x5638c0410590) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1666
Updated by Shreyansh Sancheti about 1 month ago
Casey Bodley wrote:
another different tcmalloc segfault, this time just trying to copy a 32-byte std::string:
[...]
I found a different one today while doing the Cloud transition tests with s3tests. I don't think there's any relation to the above ones.
(gdb) bt
#0 __pthread_kill_implementation (
threadid=<optimized out>, signo=signo@entry=6,
no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007fdfbf8aff33 in __pthread_kill_internal (
signo=6, threadid=<optimized out>)
at pthread_kill.c:78
#2 0x00007fdfbf85fab6 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/posix/raise.c:26
#3 0x00007fdfbf849886 in __GI_abort () at abort.c:100
#4 0x00007fdfbfad7e80 in std::__glibcxx_assert_fail (
file=file@entry=0x55e60f92f800 "/usr/include/c++/12/bits/basic_string.h", line=line@entry=1229,
function=function@entry=0x55e60f92f6e0 "constexpr std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::reference std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::operator[](size_type) [with _CharT = char; _Traits = std::char_traits<char"...,
condition=condition@entry=0x55e60f92f6cb "__pos <= size()")
at ../../../../../libstdc++-v3/src/c++11/debug.cc:60
#5 0x000055e60eac92bc in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator[] (this=this@entry=0x7fde9519d0a0,
__pos=<optimized out>)
at /usr/include/c++/12/bits/basic_string.h:1229
#6 0x000055e60ed26511 in RGWRESTStreamS3PutObj::send_init (this=this@entry=0x55e618ddd000, obj=...)
at /usr/include/c++/12/bits/basic_string.h:1071
#7 0x000055e60ed2d464 in RGWRESTConn::put_obj_send_init
--Type <RET> for more, q to quit, c to continue without paging--c
(this=0x7fde9519dbb0, obj=...,
extra_params=extra_params@entry=0x0,
req=req@entry=0x55e618969ad0)
at /root/ceph/src/rgw/rgw_rest_conn.cc:238
#8 0x000055e60f12431f in RGWLCCloudStreamPut::init (
this=this@entry=0x55e618969a40)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:498
#9 0x000055e60f127599 in cloud_tier_transfer_object (
dpp=<optimized out>, readf=0x55e618da1080,
writef=0x55e618969a40)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:722
#10 0x000055e60f127c77 in cloud_tier_plain_transfer (
tier_ctx=...)
at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#11 0x000055e60f12d22b in rgw_cloud_tier_transfer_object
(tier_ctx=...,
cloud_targets=std::set with 1 element = {...})
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:1298
#12 0x000055e60ef1c7e1 in rgw::sal::RadosObject::transition_to_cloud (this=0x55e618c7c000,
bucket=0x55e618de8e00, tier=0x55e61667bd40, o=...,
cloud_targets=std::set with 1 element = {...},
cct=0x55e6115e8000, update_object=false,
dpp=<optimized out>, y=...)
at /root/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:1905
#13 0x000055e60f053b2c in LCOpAction_Transition::transition_obj_to_cloud (this=this@entry=0x55e61677d9e0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1397
#14 0x000055e60f0543e1 in LCOpAction_Transition::process
(this=0x55e61677d9e0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1454
#15 0x000055e60f054d8c in LCOpAction_CurrentTransition::process (this=<optimized out>, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1504
#16 0x000055e60f03d7b9 in LCOpRule::process (
this=this@entry=0x7fde9519e6a8, o=...,
dpp=<optimized out>, wq=wq@entry=0x55e6115add10)
at /root/ceph/src/rgw/rgw_lc.cc:1619
#17 0x000055e60f03dd81 in operator() (
__closure=0x55e6115addf0, wk=0x55e616387590,
wq=0x55e6115add10, wi=...)
at /root/ceph/src/rgw/rgw_lc.cc:1705
#18 0x000055e60f03dfb5 in std::__invoke_impl<void, RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)>&, RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&> (__f=...)
at /usr/include/c++/12/bits/invoke.h:61
#19 std::__invoke_r<void, RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)>&, RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&> (__fn=...)
at /usr/include/c++/12/bits/invoke.h:111
#20 std::_Function_handler<void(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&), RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)> >::_M_invoke(const std::_Any_data &, RGWLC::LCWorker *&&, WorkQ *&&, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry> &) (
__functor=..., __args#0=<optimized out>,
__args#1=<optimized out>, __args#2=...)
at /usr/include/c++/12/bits/std_function.h:290
#21 0x000055e60f048f27 in std::function<void (RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&)>::operator()(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&) const (
this=this@entry=0x55e6115addf0,
__args#0=<optimized out>, __args#1=<optimized out>,
__args#1@entry=0x55e6115add10, __args#2=...)
at /usr/include/c++/12/bits/std_function.h:591
#22 0x000055e60f04d6cf in WorkQ::entry (
this=0x55e6115add10)
at /root/ceph/src/rgw/rgw_lc.cc:779
#23 0x00007fdfc1033c0f in Thread::entry_wrapper (
this=0x55e6115add10)
at /root/ceph/src/common/Thread.cc:87
#24 0x00007fdfc1033c27 in Thread::_entry_func (
arg=<optimized out>)
at /root/ceph/src/common/Thread.cc:74
#25 0x00007fdfbf8ae19d in start_thread (
arg=<optimized out>) at pthread_create.c:442
#26 0x00007fdfbf92fc60 in clone3 ()
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Updated by Soumya Koduri about 1 month ago
Soumya Koduri wrote:
#0 __pthread_kill_implementation (
threadid=<optimized out>, signo=signo@entry=11,
no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff58aff33 in __pthread_kill_internal (
signo=11, threadid=<optimized out>)
at pthread_kill.c:78
#2 0x00007ffff585fab6 in __GI_raise (sig=11)
at ../sysdeps/posix/raise.c:26
#3 0x0000555556d7bcc0 in reraise_fatal (
signum=signum@entry=11)
at /root/ceph/src/global/signal_handler.cc:88
#4 0x0000555556d7d351 in handle_oneshot_fatal_signal (
signum=11)
--Type <RET> for more, q to quit, c to continue without paging--c
at /root/ceph/src/global/signal_handler.cc:367
#5 <signal handler called>
#6 RGWHTTPClient::set_send_length (len=4036, this=0x0)
at /root/ceph/src/rgw/rgw_http_client.h:128
#7 RGWLCCloudStreamPut::send_ready (
this=this@entry=0x55555837b260, dpp=0x5555583e0460,
rest_obj=...)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:660
#8 0x0000555556befef5 in cloud_tier_transfer_object (
dpp=<optimized out>, readf=0x555558437b80,
writef=0x55555837b260)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:728
#9 0x0000555556bf04e7 in cloud_tier_plain_transfer (
tier_ctx=...)
at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#10 0x0000555556bf5a9b in rgw_cloud_tier_transfer_object
(tier_ctx=...,
cloud_targets=std::set with 1 element = {...})
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:1298
#11 0x00005555569eed13 in rgw::sal::RadosObject::transition_to_cloud (this=0x55555fa84000,
bucket=0x55555d5db100, tier=0x55555d7ca340, o=...,
cloud_targets=std::set with 1 element = {...},
cct=0x5555583d6000, update_object=true,
dpp=<optimized out>, y=...)
at /root/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:1766
#12 0x0000555556b1f61c in LCOpAction_Transition::transition_obj_to_cloud (this=this@entry=0x55555fa0ffc0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1396
#13 0x0000555556b1fed1 in LCOpAction_Transition::process
(this=0x55555fa0ffc0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1453
This crash in cloud-transition code is being tracked as part of https://tracker.ceph.com/issues/65251
Updated by Casey Bodley about 1 month ago
The lifecycle changes in https://github.com/ceph/ceph/pull/55657 seem to be responsible for the crash, but it's not yet clear exactly how. Discussion continues in https://github.com/ceph/ceph/pull/55657#discussion_r1541310624
Updated by Krunal Chheda 30 days ago
- Subject changed from lifecycle transition crashes since merge end-to-end tracing to lifecycle transition crashes since reloading bucket attrs for notification
As discussed in https://github.com/ceph/ceph/pull/55657#discussion_r1541310624, the fix removes the logic that reloads the bucket inside publish_reserve() and instead expects the caller to reload the bucket before calling it.
Updated by Casey Bodley 30 days ago
- Status changed from New to Fix Under Review
- Assignee deleted (Shreyansh Sancheti)
Updated by Casey Bodley 26 days ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 26 days ago
- Copied to Backport #65375: squid: lifecycle transition crashes since reloading bucket attrs for notification added
Updated by Backport Bot 26 days ago
- Tags changed from lifecycle to lifecycle backport_processed
Updated by Soumya Koduri 24 days ago
- File valgrind_log_8000.xml valgrind_log_8000.xml added
Updated by Soumya Koduri 19 days ago
The cause of this issue seems to be multiple LC worker threads updating the same `bucket` handle, which is not thread-safe. When an additional `load_bucket()` call was added in `remove_expired_obj` as part of https://github.com/ceph/ceph/pull/55657, multiple threads could update and overwrite the same ceph::buffer::list stored in `bucket->attrs`, resulting in corruption at times.