Bug #64571

lifecycle transition crashes since reloading bucket attrs for notification

Added by Casey Bodley 2 months ago. Updated 13 days ago.

Status: Pending Backport
Priority: Immediate
Assignee: -
Target version: -
% Done: 0%
Source:
Tags: lifecycle backport_processed
Backport: squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Regression from https://github.com/ceph/ceph/pull/52114, whose test results included these failures: https://pulpito.ceph.com/yuvalif-2024-01-30_08:46:48-rgw-wip-end2end-tracing-distro-default-smithi/

Example rgw/lifecycle job, from http://qa-proxy.ceph.com/teuthology/yuriw-2024-02-23_19:55:04-rgw-main-distro-default-smithi/7572634/teuthology.log:

2024-02-24T03:30:13.667 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_set_invalid_date PASSED [ 76%]
2024-02-24T03:30:23.431 INFO:tasks.rgw.client.0.smithi171.stderr:daemon-helper: command crashed with signal 11
2024-02-24T03:30:28.964 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-02-24T03:30:34.468 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-02-24T03:30:39.972 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-02-24T03:30:45.475 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-02-24T03:30:50.979 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~22s
2024-02-24T03:30:56.483 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~28s
2024-02-24T03:31:01.986 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~33s
2024-02-24T03:31:05.290 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition FAILED    [ 76%]
2024-02-24T03:31:05.291 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition ERROR     [ 76%]
2024-02-24T03:31:07.490 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~39s
2024-02-24T03:31:12.994 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~44s
2024-02-24T03:31:18.497 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~50s
2024-02-24T03:31:20.174 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_single_rule_multi_trans FAILED [ 76%]
2024-02-24T03:31:20.174 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_single_rule_multi_trans ERROR [ 76%]
2024-02-24T03:31:24.001 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~55s
2024-02-24T03:31:29.506 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~61s
2024-02-24T03:31:35.009 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~66s
2024-02-24T03:31:36.263 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent_transition FAILED [ 76%]
2024-02-24T03:31:36.263 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent_transition ERROR [ 76%]
2024-02-24T03:31:40.513 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~72s
2024-02-24T03:31:46.017 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~77s
2024-02-24T03:31:48.736 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_transition FAILED [ 77%]
2024-02-24T03:31:48.736 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_transition ERROR [ 77%]

Example rgw/cloud-transition job, from http://qa-proxy.ceph.com/teuthology/yuriw-2024-02-23_19:55:04-rgw-main-distro-default-smithi/7572619/teuthology.log:

2024-02-24T03:12:02.182 INFO:tasks.rgw.client.0.smithi106.stderr:daemon-helper: command crashed with signal 11
2024-02-24T03:12:05.128 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-02-24T03:12:10.732 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-02-24T03:12:16.337 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-02-24T03:12:21.947 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-02-24T03:12:27.553 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~22s
2024-02-24T03:12:33.157 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~28s
2024-02-24T03:12:38.761 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~34s
2024-02-24T03:12:44.365 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~39s
2024-02-24T03:12:49.968 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~45s
2024-02-24T03:12:55.572 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~50s
2024-02-24T03:13:01.177 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~56s
2024-02-24T03:13:06.781 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~62s
2024-02-24T03:13:12.386 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~67s
2024-02-24T03:13:17.990 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~73s
2024-02-24T03:13:23.594 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~78s
2024-02-24T03:13:29.198 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~84s
2024-02-24T03:13:34.803 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~90s
2024-02-24T03:13:40.407 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~95s
2024-02-24T03:13:42.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition FAILED [ 25%]
2024-02-24T03:13:42.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition ERROR [ 25%]
2024-02-24T03:13:46.011 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~101s
2024-02-24T03:13:51.615 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~106s
2024-02-24T03:13:57.219 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~112s
2024-02-24T03:13:59.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_multiple_transition FAILED [ 50%]
2024-02-24T03:13:59.824 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_multiple_transition ERROR [ 50%]
2024-02-24T03:14:02.823 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~118s
2024-02-24T03:14:08.427 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~123s
2024-02-24T03:14:14.031 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~129s
2024-02-24T03:14:19.635 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~135s
2024-02-24T03:14:21.200 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_cloud_transition FAILED [ 75%]
2024-02-24T03:14:21.200 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_cloud_transition ERROR [ 75%]
2024-02-24T03:14:25.240 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~140s
2024-02-24T03:14:30.844 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~146s
2024-02-24T03:14:35.192 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition_large_obj FAILED [100%]
2024-02-24T03:14:35.193 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition_large_obj ERROR [100%]
2024-02-24T03:14:35.193 INFO:teuthology.orchestra.run.smithi106.stdout:


Files

ceph.client.0.log (117 KB), Soumya Koduri, 03/19/2024 06:37 PM
valgrind_log_8000.xml (485 KB), New valgrind log, Soumya Koduri, 04/10/2024 06:36 PM

Related issues 1 (0 open, 1 closed)

Copied to rgw - Backport #65375: squid: lifecycle transition crashes since reloading bucket attrs for notification (Resolved)
Actions #1

Updated by Casey Bodley 2 months ago

Wasn't able to reproduce under vstart; tested with both a debug build and a release build.

Running the lifecycle transition tests required extra storage-class configuration:

~/ceph/build $ bin/radosgw-admin zonegroup placement add --placement-id default-placement --storage-class FROZEN
~/ceph/build $ bin/radosgw-admin zonegroup placement add --placement-id default-placement --storage-class LUKEWARM
~/ceph/build $ bin/radosgw-admin zone placement add --placement-id default-placement --storage-class FROZEN --data-pool default.rgw.frozen.data
~/ceph/build $ bin/radosgw-admin zone placement add --placement-id default-placement --storage-class LUKEWARM --data-pool default.rgw.lukewarm.data

(restart radosgw)

then uncomment storage_classes in s3tests.conf.SAMPLE:

## Storage classes
storage_classes = "LUKEWARM, FROZEN"

~/s3-tests $ S3TEST_CONF=s3tests.conf.SAMPLE tox -- s3tests_boto3/functional/test_s3.py -v -m 'not (fails_on_rgw or sse_s3)'
...
s3tests_boto3/functional/test_s3.py::test_lifecycle_multipart_expiration PASSED                                                                        [ 74%] 
s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_set_invalid_date PASSED                                                                 [ 74%] 
s3tests_boto3/functional/test_s3.py::test_lifecycle_transition PASSED                                                                                  [ 74%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_single_rule_multi_trans PASSED                                                          [ 74%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent_transition PASSED                                                                   [ 75%] 
s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_transition PASSED                                                                           [ 75%] 
s3tests_boto3/functional/test_s3.py::test_lifecycle_plain_null_version_current_transition PASSED                                                       [ 75%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition SKIPPED (no cloud_storage_class configured)                                       [ 75%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_multiple_transition SKIPPED ([s3 cloud] section missing cloud_storage_class)                 [ 75%]
s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_cloud_transition SKIPPED ([s3 cloud] section missing cloud_storage_class)                   [ 76%] 
s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition_large_obj SKIPPED ([s3 cloud] section missing cloud_storage_class)                [ 76%] 
s3tests_boto3/functional/test_s3.py::test_encrypted_transfer_1b PASSED                                                                                 [ 76%]
...
Actions #2

Updated by Casey Bodley 2 months ago

  • Subject changed from "lifecycle crashes since merge end-to-end tracing" to "lifecycle transition crashes since merge end-to-end tracing"
Actions #4

Updated by Soumya Koduri about 2 months ago

  • Assignee set to Shreyansh Sancheti
Actions #6

Updated by Casey Bodley about 1 month ago

Hey Shreyansh and Soumya, any updates here? This one is urgent for the squid release.

Actions #7

Updated by Soumya Koduri about 1 month ago

Casey Bodley wrote:

Hey Shreyansh and Soumya, any updates here? This one is urgent for the squid release.

Hi Casey, I have started looking into it. I couldn't reproduce this issue locally with simple lc/cloud transition tests. Shreyansh mentioned that the crashes were frequently observed when using s3tests. I will run those now and update.

Actions #8

Updated by Soumya Koduri about 1 month ago

Not able to reproduce this issue locally. Pasting the backtraces of the two crashes reported so far:

2024-03-13T13:33:20.678 INFO:tasks.rgw.client.0.smithi106.stdout:*** Caught signal (Segmentation fault)
2024-03-13T13:33:20.679 INFO:tasks.rgw.client.0.smithi106.stdout: in thread 7fb15d0b9640 thread_name:wp_thrd: 1, 1
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: ceph version 19.0.0-2011-g06f9ec2c (06f9ec2cb3c57d3b59cb10ddd4c321a82bb853f4) squid (dev)
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fb285ae5520]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 2: tc_new()
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 3: radosgw(+0x10d45c3) [0x55806f3b75c3]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 4: radosgw(+0x4a64d4) [0x55806e7894d4]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 5: radosgw(+0x6a6213) [0x55806e989213]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 6: (RGWSI_Bucket_SObj::read_bucket_instance_info(ptr_wrapper<RGWSI_MetaBackend::Context, 4>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWBucketInfo*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >, optional_yield, DoutPrefixProvider const*, rgw_cache_entry_info*, boost::optional<obj_version>)+0x13e) [0x55806ece616e]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 7: radosgw(+0xaa5ba9) [0x55806ed88ba9]
2024-03-13T13:33:20.690 INFO:tasks.rgw.client.0.smithi106.stdout: 8: (std::_Function_handler<int (RGWSI_MetaBackend_Handler::Op*), RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}>::_M_invoke(std::_Any_data const&, RGWSI_MetaBackend_Handler::Op*&&)+0x34) [0x55806ed84134]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 9: radosgw(+0xa110a0) [0x55806ecf40a0]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 10: (RGWSI_MetaBackend_SObj::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend::Context*)>)+0x5b) [0x55806ecf413b]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 11: (RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend_Handler::Op*)>)+0x77) [0x55806ecf32a7]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 12: (RGWBucketCtl::read_bucket_instance_info(rgw_bucket const&, RGWBucketInfo*, optional_yield, DoutPrefixProvider const*, RGWBucketCtl::BucketInstance::GetParams const&)+0x11b) [0x55806ed86f5b]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 13: (RGWRados::get_bucket_instance_info(rgw_bucket const&, RGWBucketInfo&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >, optional_yield, DoutPrefixProvider const*)+0x55) [0x55806ebc5915]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 14: (RGWRados::BucketShard::init(rgw_bucket const&, rgw_obj const&, RGWBucketInfo*, DoutPrefixProvider const*, optional_yield)+0x8a) [0x55806eba034a]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 15: (RGWRados::Bucket::UpdateIndex::guard_reshard(DoutPrefixProvider const*, rgw_obj const&, RGWRados::BucketShard**, std::function<int (RGWRados::BucketShard*)>, optional_yield)+0x368) [0x55806ebad4e8]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 16: (RGWRados::Bucket::UpdateIndex::prepare(DoutPrefixProvider const*, RGWModifyOp, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, optional_yield, bool)+0x113) [0x55806ebadad3]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 17: (RGWRados::Object::Delete::delete_obj(optional_yield, DoutPrefixProvider const*, bool)+0x6fc) [0x55806ebc000c]
2024-03-13T13:33:20.691 INFO:tasks.rgw.client.0.smithi106.stdout: 18: (rgw::sal::RadosObject::RadosDeleteOp::delete_obj(DoutPrefixProvider const*, optional_yield, unsigned int)+0x15a) [0x55806ebfbd9a]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 19: radosgw(+0xa53e42) [0x55806ed36e42]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 20: radosgw(+0xa54339) [0x55806ed37339]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 21: (LCOpRule::process(rgw_bucket_dir_entry&, DoutPrefixProvider const*, WorkQ*)+0x433) [0x55806ed3aef3]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 22: radosgw(+0xa588a9) [0x55806ed3b8a9]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 23: radosgw(+0xa56926) [0x55806ed39926]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 24: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fb285b37b43]
2024-03-13T13:33:20.692 INFO:tasks.rgw.client.0.smithi106.stdout: 25: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fb285bc9a00]

Actions #9

Updated by Soumya Koduri about 1 month ago

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff58aff33 in __pthread_kill_internal (signo=11, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff585fab6 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x0000555556d7bcc0 in reraise_fatal (signum=signum@entry=11) at /root/ceph/src/global/signal_handler.cc:88
#4  0x0000555556d7d351 in handle_oneshot_fatal_signal (signum=11) at /root/ceph/src/global/signal_handler.cc:367
#5  <signal handler called>
#6  RGWHTTPClient::set_send_length (len=4036, this=0x0) at /root/ceph/src/rgw/rgw_http_client.h:128
#7  RGWLCCloudStreamPut::send_ready (this=this@entry=0x55555837b260, dpp=0x5555583e0460, rest_obj=...) at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:660
#8  0x0000555556befef5 in cloud_tier_transfer_object (dpp=<optimized out>, readf=0x555558437b80, writef=0x55555837b260) at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:728
#9  0x0000555556bf04e7 in cloud_tier_plain_transfer (tier_ctx=...) at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#10 0x0000555556bf5a9b in rgw_cloud_tier_transfer_object (tier_ctx=..., cloud_targets=std::set with 1 element = {...}) at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:1298
#11 0x00005555569eed13 in rgw::sal::RadosObject::transition_to_cloud (this=0x55555fa84000, bucket=0x55555d5db100, tier=0x55555d7ca340, o=..., cloud_targets=std::set with 1 element = {...}, cct=0x5555583d6000, update_object=true, dpp=<optimized out>, y=...) at /root/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:1766
#12 0x0000555556b1f61c in LCOpAction_Transition::transition_obj_to_cloud (this=this@entry=0x55555fa0ffc0, oc=...) at /root/ceph/src/rgw/rgw_lc.cc:1396
#13 0x0000555556b1fed1 in LCOpAction_Transition::process (this=0x55555fa0ffc0, oc=...) at /root/ceph/src/rgw/rgw_lc.cc:1453
#14 0x0000555556b2087c in LCOpAction_CurrentTransition::process (this=<optimized out>, oc=...) at /root/ceph/src/rgw/rgw_lc.cc:1503
#15 0x0000555556b0bf7f in LCOpRule::process (this=this@entry=0x7ffecf9d26a8, o=..., dpp=<optimized out>, wq=wq@entry=0x55555839f510) at /root/ceph/src/rgw/rgw_lc.cc:1618
#16 0x0000555556b0c547 in operator() (__closure=0x55555839f5f0, wk=0x55555d1792c0, wq=0x55555839f510, wi=...) at /root/ceph/src/rgw/rgw_lc.cc:1704
#17 0x0000555556b0c77b in std::__invoke_impl<void, RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)>&, RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&> (__f=...) at /usr/include/c++/12/bits/invoke.h:61
#18 std::__invoke_r<void, RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)>&, RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&> (__fn=...) at /usr/include/c++/12/bits/invoke.h:111
#19 std::_Function_handler<void(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&), RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)> >::_M_invoke(const std::_Any_data &, RGWLC::LCWorker *&&, WorkQ *&&, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry> &) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>, __args#2=...) at /usr/include/c++/12/bits/std_function.h:290
#20 0x0000555556b170d9 in std::function<void (RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&)>::operator()(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&) const (this=this@entry=0x55555839f5f0, __args#0=<optimized out>, __args#1=<optimized out>, __args#1@entry=0x55555839f510, __args#2=...) at /usr/include/c++/12/bits/std_function.h:591
#21 0x0000555556b1b815 in WorkQ::entry (this=0x55555839f510) at /root/ceph/src/rgw/rgw_lc.cc:778
#22 0x00007ffff7033c0f in Thread::entry_wrapper (this=0x55555839f510) at /root/ceph/src/common/Thread.cc:87
#23 0x00007ffff7033c27 in Thread::_entry_func (arg=<optimized out>) at /root/ceph/src/common/Thread.cc:74
#24 0x00007ffff58ae19d in start_thread (arg=<optimized out>) at pthread_create.c:442
#25 0x00007ffff592fc60 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
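
Worth noting in this backtrace: frame #6 shows RGWHTTPClient::set_send_length entered with this=0x0, i.e. the RGWLCCloudStreamPut path invoked a member function through a null pointer. A minimal standalone sketch of that failure shape (hypothetical class, not the RGW code):

#include <cstddef>

// Hypothetical stand-in for an HTTP client object; illustration only.
struct Client {
  std::size_t send_len = 0;
  // The store goes through `this`, so with this == nullptr the fault is
  // raised inside the method body, which is why the segfault is reported
  // at the callee (frame #6) rather than at the call site.
  void set_send_length(std::size_t len) { send_len = len; }
};

int main() {
  Client* put = nullptr;       // e.g. a pointer left unset or already torn down
  put->set_send_length(4036);  // undefined behavior: segfaults here
}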

Actions #10

Updated by Soumya Koduri about 1 month ago

Two more crashes with similar backtraces were seen:
http://qa-proxy.ceph.com/teuthology/soumyakoduri-2024-03-19_02:37:28-rgw:cloud-transition-wip-skoduri-lc-distro-default-smithi/7610307/teuthology.log

2024-03-19T03:09:59.640 INFO:tasks.rgw.client.0.smithi148.stdout:*** Caught signal (Segmentation fault)
2024-03-19T03:09:59.640 INFO:tasks.rgw.client.0.smithi148.stdout: in thread 7fc7f2910640 thread_name:wp_thrd: 0, 0
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: ceph version 19.0.0-2098-ge0537c1b (e0537c1b936b5049f35101185806ff470c74a520) squid (dev)
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: 1: /lib64/libc.so.6(+0x54db0) [0x7fc918654db0]
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: 2: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int)+0xeb) [0x7fc91b82977b]
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: 3: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned int)+0x30) [0x7fc91b829b20]
2024-03-19T03:09:59.645 INFO:tasks.rgw.client.0.smithi148.stdout: 4: radosgw(+0x488928) [0x555c4cb78928]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 5: (RGWSI_Bucket_SObj::read_bucket_instance_info(ptr_wrapper<RGWSI_MetaBackend::Context, 4>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWBucketInfo*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >, optional_yield, DoutPrefixProvider const*, rgw_cache_entry_info*, boost::optional<obj_version>)+0x22a) [0x555c4d0369ca]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 6: radosgw(+0x9f8069) [0x555c4d0e8069]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 7: (std::_Function_handler<int (RGWSI_MetaBackend_Handler::Op*), RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}>::_M_invoke(std::_Any_data const&, RGWSI_MetaBackend_Handler::Op*&&)+0x34) [0x555c4d0e5594]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 8: radosgw(+0x953ed0) [0x555c4d043ed0]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 9: (RGWSI_MetaBackend_SObj::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend::Context*)>)+0x5b) [0x555c4d043f6b]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 10: (RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend_Handler::Op*)>)+0x77) [0x555c4d0430f7]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 11: (RGWBucketCtl::read_bucket_instance_info(rgw_bucket const&, RGWBucketInfo*, optional_yield, DoutPrefixProvider const*, RGWBucketCtl::BucketInstance::GetParams const&)+0x11b) [0x555c4d0e89db]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 12: (rgw::sal::RadosBucket::load_bucket(DoutPrefixProvider const*, optional_yield)+0x211) [0x555c4cf3aee1]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 13: (rgw::notify::publish_reserve(DoutPrefixProvider const*, rgw::SiteConfig const&, rgw::notify::EventType, rgw::notify::reservation_t&, RGWObjTags const*)+0xb38) [0x555c4ceb4958]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 14: radosgw(+0x99e9f0) [0x555c4d08e9f0]
2024-03-19T03:09:59.646 INFO:tasks.rgw.client.0.smithi148.stdout: 15: radosgw(+0x99f7ad) [0x555c4d08f7ad]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 16: (LCOpRule::process(rgw_bucket_dir_entry&, DoutPrefixProvider const*, WorkQ*)+0x423) [0x555c4d091183]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 17: radosgw(+0x9a1b96) [0x555c4d091b96]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 18: radosgw(+0x99fac1) [0x555c4d08fac1]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 19: /lib64/libc.so.6(+0x9f802) [0x7fc91869f802]
2024-03-19T03:09:59.647 INFO:tasks.rgw.client.0.smithi148.stdout: 20: /lib64/libc.so.6(+0x3f450) [0x7fc91863f450]
2024-03-19T03:09:59.836 INFO:tasks.rgw.client.0.smithi148.stderr:daemon-helper: command crashed with signal 11
2024-03-19T03:10:04.263 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-03-19T03:10:09.867 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-03-19T03:10:15.472 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-03-19T03:10:21.076 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-03-19T03:10:24.602 INFO:tasks.rgw.client.1.smithi148.stdout:*** Caught signal (Segmentation fault) **
2024-03-19T03:10:24.602 INFO:tasks.rgw.client.1.smithi148.stdout: in thread 7fc10624a640 thread_name:wp_thrd: 0, 2
2024-03-19T03:10:24.603 INFO:tasks.rgw.client.1.smithi148.stdout: ceph version 19.0.0-2098-ge0537c1b (e0537c1b936b5049f35101185806ff470c74a520) squid (dev)
2024-03-19T03:10:24.603 INFO:tasks.rgw.client.1.smithi148.stdout: 1: /lib64/libc.so.6(+0x54db0) [0x7fc22d054db0]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 2: (tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)+0x3c) [0x7fc2302249bc]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 3: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)+0x20) [0x7fc23022aef0]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x80) [0x7fc23022afb0]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))+0x76) [0x7fc23022b126]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 6: (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xaf) [0x5578dccdbcef]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 7: (rgw_get_rados_ref(DoutPrefixProvider const*, librados::v14_2_0::Rados*, rgw_raw_obj, rgw_rados_ref*)+0x3d) [0x5578dd1a43bd]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 8: (RGWRados::get_obj_head_ref(DoutPrefixProvider const*, rgw_placement_rule const&, rgw_obj const&, rgw_rados_ref*)+0x23a) [0x5578dd0f605a]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 9: (RGWRados::Object::Delete::delete_obj(optional_yield, DoutPrefixProvider const*, bool)+0x519) [0x5578dd11aed9]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 10: (rgw::sal::RadosObject::RadosDeleteOp::delete_obj(DoutPrefixProvider const*, optional_yield, unsigned int)+0x15a) [0x5578dd16155a]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 11: radosgw(+0x99cbf2) [0x5578dd29ebf2]
2024-03-19T03:10:24.604 INFO:tasks.rgw.client.1.smithi148.stdout: 12: radosgw(+0x99d149) [0x5578dd29f149]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 13: (LCOpRule::process(rgw_bucket_dir_entry&, DoutPrefixProvider const*, WorkQ*)+0x423) [0x5578dd2a3183]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 14: radosgw(+0x9a1b96) [0x5578dd2a3b96]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 15: radosgw(+0x99fac1) [0x5578dd2a1ac1]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 16: /lib64/libc.so.6(+0x9f802) [0x7fc22d09f802]
2024-03-19T03:10:24.605 INFO:tasks.rgw.client.1.smithi148.stdout: 17: /lib64/libc.so.6(+0x3f450) [0x7fc22d03f450]

Actions #11

Updated by Soumya Koduri about 1 month ago

With Valgrind enabled and lc_debug_interval set to `1`, the crash is not observed; however, in one of the runs a valgrind error "Leak_PossiblyLost" is reported, possibly during cleanup:

http://qa-proxy.ceph.com/teuthology/soumyakoduri-2024-03-19_16:14:48-rgw:cloud-transition-wip-skoduri-lc-distro-default-smithi/7610946/teuthology.log

Attaching the valgrind log file of that RGW service.
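
For reference, the debug interval mentioned above is the rgw_lc_debug_interval option, which scales a lifecycle "day" down to that many seconds so rules fire quickly under test. A minimal ceph.conf sketch (section name assumed; adjust for the deployment):

# treat one lifecycle day as 1 second, so transitions run almost immediately
[client.rgw]
        rgw lc debug interval = 1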

Actions #12

Updated by Casey Bodley about 1 month ago

spotted a different lifecycle crash in the rgw/dbstore suite, which is interesting because no expiration/transition is involved:

2024-03-20T23:59:59.506 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_invalid_status PASSED [ 79%]
2024-03-20T23:59:59.746 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_date PASSED      [ 79%]
2024-03-20T23:59:59.994 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_invalid_date PASSED [ 79%]
2024-03-21T00:00:00.415 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent PASSED [ 79%]
2024-03-21T00:00:00.619 INFO:tasks.rgw.client.0.smithi032.stdout:*** Caught signal (Segmentation fault) **
2024-03-21T00:00:00.619 INFO:tasks.rgw.client.0.smithi032.stdout: in thread 7f04230c1640 thread_name:lifecycle_thr_0
2024-03-21T00:00:00.620 INFO:tasks.rgw.client.0.smithi032.stdout: ceph version 19.0.0-2246-g7793a10a (7793a10a2a6cd1bd96a28b9ecab476964200946d) squid (dev)
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f052baad520]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 2: radosgw(+0xa6d465) [0x562391280465]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 3: (RGWLC::process(int, int, RGWLC::LCWorker*, bool)+0x96d) [0x56239128cf4d]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 4: (RGWLC::process(RGWLC::LCWorker*, std::unique_ptr<rgw::sal::Bucket, std::default_delete<rgw::sal::Bucket> > const&, bool)+0x16d) [0x56239128f9fd]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 5: (RGWLC::LCWorker::entry()+0x539) [0x56239128ffc9]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 6: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f052baffb43]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 7: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f052bb91a00]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout:2024-03-21T00:00:00.603+0000 7f04230c1640 -1 *** Caught signal (Segmentation fault) **
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: in thread 7f04230c1640 thread_name:lifecycle_thr_0
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout:
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: ceph version 19.0.0-2246-g7793a10a (7793a10a2a6cd1bd96a28b9ecab476964200946d) squid (dev)
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f052baad520]
2024-03-21T00:00:00.621 INFO:tasks.rgw.client.0.smithi032.stdout: 2: radosgw(+0xa6d465) [0x562391280465]
2024-03-21T00:00:00.622 INFO:tasks.rgw.client.0.smithi032.stdout: 3: (RGWLC::process(int, int, RGWLC::LCWorker*, bool)+0x96d) [0x56239128cf4d]
2024-03-21T00:00:00.622 INFO:tasks.rgw.client.0.smithi032.stdout: 4: (RGWLC::process(RGWLC::LCWorker*, std::unique_ptr<rgw::sal::Bucket, std::default_delete<rgw::sal::Bucket> > const&, bool)+0x16d) [0x56239128f9fd]
2024-03-21T00:00:00.622 INFO:tasks.rgw.client.0.smithi032.stdout: 5: (RGWLC::LCWorker::entry()+0x539) [0x56239128ffc9]
2024-03-21T00:00:00.622 INFO:tasks.rgw.client.0.smithi032.stdout: 6: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f052baffb43]
2024-03-21T00:00:00.622 INFO:tasks.rgw.client.0.smithi032.stdout: 7: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f052bb91a00]
2024-03-21T00:00:00.622 INFO:tasks.rgw.client.0.smithi032.stdout: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2024-03-21T00:00:00.622 INFO:tasks.rgw.client.0.smithi032.stdout:
2024-03-21T00:00:00.635 INFO:tasks.rgw.client.0.smithi032.stdout:     0> 2024-03-21T00:00:00.603+0000 7f04230c1640 -1 *** Caught signal (Segmentation fault) **
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: in thread 7f04230c1640 thread_name:lifecycle_thr_0
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout:
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: ceph version 19.0.0-2246-g7793a10a (7793a10a2a6cd1bd96a28b9ecab476964200946d) squid (dev)
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f052baad520]
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: 2: radosgw(+0xa6d465) [0x562391280465]
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: 3: (RGWLC::process(int, int, RGWLC::LCWorker*, bool)+0x96d) [0x56239128cf4d]
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: 4: (RGWLC::process(RGWLC::LCWorker*, std::unique_ptr<rgw::sal::Bucket, std::default_delete<rgw::sal::Bucket> > const&, bool)+0x16d) [0x56239128f9fd]
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: 5: (RGWLC::LCWorker::entry()+0x539) [0x56239128ffc9]
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: 6: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f052baffb43]
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: 7: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f052bb91a00]
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2024-03-21T00:00:00.636 INFO:tasks.rgw.client.0.smithi032.stdout:
2024-03-21T00:00:00.652 INFO:tasks.rgw.client.0.smithi032.stdout:  -178> 2024-03-21T00:00:00.603+0000 7f04230c1640 -1 *** Caught signal (Segmentation fault) **
2024-03-21T00:00:00.652 INFO:tasks.rgw.client.0.smithi032.stdout: in thread 7f04230c1640 thread_name:lifecycle_thr_0
2024-03-21T00:00:00.652 INFO:tasks.rgw.client.0.smithi032.stdout:
2024-03-21T00:00:00.652 INFO:tasks.rgw.client.0.smithi032.stdout: ceph version 19.0.0-2246-g7793a10a (7793a10a2a6cd1bd96a28b9ecab476964200946d) squid (dev)
2024-03-21T00:00:00.652 INFO:tasks.rgw.client.0.smithi032.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f052baad520]
2024-03-21T00:00:00.653 INFO:tasks.rgw.client.0.smithi032.stdout: 2: radosgw(+0xa6d465) [0x562391280465]
2024-03-21T00:00:00.653 INFO:tasks.rgw.client.0.smithi032.stdout: 3: (RGWLC::process(int, int, RGWLC::LCWorker*, bool)+0x96d) [0x56239128cf4d]
2024-03-21T00:00:00.653 INFO:tasks.rgw.client.0.smithi032.stdout: 4: (RGWLC::process(RGWLC::LCWorker*, std::unique_ptr<rgw::sal::Bucket, std::default_delete<rgw::sal::Bucket> > const&, bool)+0x16d) [0x56239128f9fd]
2024-03-21T00:00:00.653 INFO:tasks.rgw.client.0.smithi032.stdout: 5: (RGWLC::LCWorker::entry()+0x539) [0x56239128ffc9]
2024-03-21T00:00:00.653 INFO:tasks.rgw.client.0.smithi032.stdout: 6: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f052baffb43]
2024-03-21T00:00:00.653 INFO:tasks.rgw.client.0.smithi032.stdout: 7: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f052bb91a00]
2024-03-21T00:00:00.653 INFO:tasks.rgw.client.0.smithi032.stdout: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2024-03-21T00:00:00.653 INFO:tasks.rgw.client.0.smithi032.stdout:
2024-03-21T00:00:00.673 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_deletemarker PASSED [ 80%]
2024-03-21T00:00:00.783 INFO:tasks.rgw.client.0.smithi032.stderr:daemon-helper: command crashed with signal 11
2024-03-21T00:00:03.650 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-03-21T00:00:09.154 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-03-21T00:00:14.658 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-03-21T00:00:20.162 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-03-21T00:00:20.251 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_filter FAILED    [ 80%]
2024-03-21T00:00:20.251 INFO:teuthology.orchestra.run.smithi032.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_filter ERROR     [ 80%]

from https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-20_23:26:29-rgw-wip-cbodley-testing-distro-default-smithi/7614326/teuthology.log

Actions #13

Updated by Casey Bodley about 1 month ago

  • Priority changed from Urgent to Immediate
Actions #14

Updated by Casey Bodley about 1 month ago

I opened https://github.com/ceph/ceph/pull/56368 with a partial revert of the rgw tracing changes from https://github.com/ceph/ceph/pull/52114.

I consider that a last resort for squid if we aren't able to find a root cause. In the meantime, I'd like to verify that lifecycle runs are green against that branch. We can potentially reintroduce that commit in pieces as a way to bisect the issue.
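
For illustration, reintroducing the commit in pieces could look something like this (branch and commit names hypothetical; plain git):

$ git checkout -b wip-reintroduce-tracing wip-revert-tracing   # start from the green revert branch
$ git cherry-pick -n <tracing-commit>                          # stage the original change without committing
$ git restore --staged --worktree -- src/rgw/rgw_lc.cc         # drop one piece at a time
$ git commit -m 'reintroduce tracing, minus the rgw_lc changes'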

Actions #15

Updated by Casey Bodley about 1 month ago

Even with the rgw tracing parts reverted in https://github.com/ceph/ceph/pull/56368, I still saw a lifecycle crash in https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-21_17:31:42-rgw:lifecycle-wip-60682-distro-default-smithi/7615035/teuthology.log:

2024-03-21T17:55:47.365 INFO:teuthology.orchestra.run.smithi144.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_get_no_id PASSED     [ 72%]
2024-03-21T17:56:41.499 INFO:tasks.rgw.client.0.smithi144.stdout:*** Caught signal (Segmentation fault) **
2024-03-21T17:56:41.500 INFO:tasks.rgw.client.0.smithi144.stdout: in thread 7fac95e39640 thread_name:lifecycle_thr_0
2024-03-21T17:56:41.688 INFO:tasks.rgw.client.0.smithi144.stderr:daemon-helper: command crashed with signal 11
2024-03-21T17:56:42.352 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-03-21T17:56:47.855 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-03-21T17:56:53.359 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-03-21T17:56:58.863 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-03-21T17:57:04.366 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~22s
2024-03-21T17:57:09.870 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~28s
2024-03-21T17:57:13.350 INFO:teuthology.orchestra.run.smithi144.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_expiration FAILED    [ 72%]

Actions #16

Updated by Soumya Koduri about 1 month ago

Casey Bodley wrote:

spotted a different lifecycle crash in the rgw/dbstore suite, which is interesting because no expiration/transition is involved:
[...]
from https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-20_23:26:29-rgw-wip-cbodley-testing-distro-default-smithi/7614326/teuthology.log

There are some lifecycle tests (which include expiration) not marked as 'fails_on_dbstore'. I will skip those tests too and check if the issue occurs even with just applying lifecycle policy rules.
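
As a sketch, the exclusion would follow the same marker filtering used in the tox invocation earlier in this ticket (marker names from this comment; exact expression assumed):

~/s3-tests $ S3TEST_CONF=s3tests.conf.SAMPLE tox -- s3tests_boto3/functional/test_s3.py -v -m 'not (fails_on_rgw or fails_on_dbstore or lifecycle_expiration)'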

Actions #17

Updated by Casey Bodley about 1 month ago

Casey Bodley wrote:

Even with the rgw tracing parts reverted in https://github.com/ceph/ceph/pull/56368, I still saw a lifecycle crash in https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-21_17:31:42-rgw:lifecycle-wip-60682-distro-default-smithi/7615035/teuthology.log

I pushed another branch that reverts the rest of the tracing changes in librados and messenger, and still hit the crash in http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-22_01:09:02-rgw:lifecycle-wip-60682-all-distro-default-smithi/7615848/teuthology.log

Actions #18

Updated by Casey Bodley about 1 month ago

I managed to catch a crash under gdb with a local vstart cluster. Since this seems timing-related and didn't reproduce under valgrind in teuthology, I'm testing a release build (CMAKE_BUILD_TYPE=RelWithDebInfo) and started vstart.sh without the -d flag for debug logging.
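
A sketch of that build/run workflow, for reference (assuming the usual do_cmake.sh/vstart.sh knobs):

~/ceph $ ARGS="-DCMAKE_BUILD_TYPE=RelWithDebInfo" ./do_cmake.sh
~/ceph $ cd build && ninja
~/ceph/build $ MON=1 OSD=1 RGW=1 ../src/vstart.sh -n    # no -d, so default (non-debug) logging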

Here's the backtrace from gdb:

Thread 111 "lifecycle_thr_1" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f889d4006c0 (LWP 470554)]
ceph::buffer::v15_2_0::ptr::release (this=this@entry=0x55ea38fd7988) at /home/cbodley/ceph/src/common/buffer.cc:423
423             (1 == cached_raw->nref.load(std::memory_order_acquire));
(gdb) bt
#0  ceph::buffer::v15_2_0::ptr::release (this=this@entry=0x55ea38fd7988) at /home/cbodley/ceph/src/common/buffer.cc:423
#1  0x000055ea34677d23 in ceph::buffer::v15_2_0::ptr::~ptr (this=0x55ea38fd7988, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/include/buffer.h:247
#2  ceph::buffer::v15_2_0::ptr_node::~ptr_node (this=0x55ea38fd7980, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/include/buffer.h:371
#3  ceph::buffer::v15_2_0::ptr_node::disposer::operator() (this=<optimized out>, delete_this=0x55ea38fd7980) at /home/cbodley/ceph/src/include/buffer.h:366
#4  ceph::buffer::v15_2_0::ptr_node::disposer::operator() (this=<synthetic pointer>, delete_this=0x55ea38fd7980)
    at /home/cbodley/ceph/src/include/buffer.h:364
#5  ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x55ea3ae178a0) at /home/cbodley/ceph/src/include/buffer.h:600
#6  ceph::buffer::v15_2_0::list::~list (this=0x55ea3ae178a0, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/include/buffer.h:959
#7  std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>::~pair (
    this=0x55ea3ae17880, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_pair.h:187
#8  std::destroy_at<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > (
    __location=0x55ea3ae17880) at /usr/include/c++/13/bits/stl_construct.h:88
#9  std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > (__p=0x55ea3ae17880, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:560
#10 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_destroy_node (
    __p=0x55ea3ae17860, this=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:625
#11 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_drop_node (
    this=<optimized out>, __p=0x55ea3ae17860) at /usr/include/c++/13/bits/stl_tree.h:633
#12 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_erase (
    __x=0x55ea3ae17860, this=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:1938
#13 0x000055ea34677cf4 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_erase (__x=0x55ea3a69fda0, this=0x55ea3aabd2b8) at /usr/include/c++/13/bits/stl_tree.h:1936
#14 0x000055ea346803a1 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::~_Rb_tree (this=0x55ea3aabd2b8, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:736
#15 std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::~map (this=0x55ea3aabd2b8, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_map.h:314
#16 rgw::sal::StoreBucket::~StoreBucket (this=0x55ea3aabce00, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/rgw/rgw_sal_store.h:134
#17 rgw::sal::RadosBucket::~RadosBucket (this=0x55ea3aabce00, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:273
#18 0x000055ea346803c9 in rgw::sal::RadosBucket::~RadosBucket (this=0x55ea3aabce00, __in_chrg=<optimized out>)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:273
#19 0x000055ea3485fca3 in std::default_delete<rgw::sal::Bucket>::operator() (__ptr=<optimized out>, this=<optimized out>)
    at /usr/include/c++/13/bits/unique_ptr.h:93
#20 std::unique_ptr<rgw::sal::Bucket, std::default_delete<rgw::sal::Bucket> >::~unique_ptr (this=0x7f889d3fc278, __in_chrg=<optimized out>)
    at /usr/include/c++/13/bits/unique_ptr.h:404
#21 RGWLC::bucket_lc_process (this=this@entry=0x55ea37d4cd20, shard_id=..., worker=worker@entry=0x55ea37d3b590, stop_at=stop_at@entry=1711123005,
    once=once@entry=false) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1833
#22 0x000055ea34862358 in RGWLC::process (this=this@entry=0x55ea37d4cd20, index=16, max_lock_secs=max_lock_secs@entry=90,
    worker=worker@entry=0x55ea37d3b590, once=once@entry=false) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:2400
#23 0x000055ea3486c65e in RGWLC::process (this=0x55ea37d4cd20, worker=worker@entry=0x55ea37d3b590,
    optional_bucket=std::unique_ptr<rgw::sal::Bucket> = {...}, once=once@entry=false) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:2019
#24 0x000055ea3486c8c9 in RGWLC::LCWorker::entry (this=0x55ea37d3b590) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:207
#25 0x00007f88e48ac897 in start_thread (arg=<optimized out>) at pthread_create.c:444
#26 0x00007f88e493380c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

digging into ptr::release():

#0  ceph::buffer::v15_2_0::ptr::release (this=this@entry=0x55ea38fd7988) at /home/cbodley/ceph/src/common/buffer.cc:423
423             (1 == cached_raw->nref.load(std::memory_order_acquire));
(gdb) p *this
$13 = {_raw = 0x0, _off = 947337538, _len = 1383608423}
(gdb) list
418           // only a single reference. Altogether with initializing `nref` of
419           // freshly fabricated one with `1` through the std::atomic's ctor
420           // (which doesn't impose a memory barrier on the strongly-ordered
421           // x86), this allows to avoid all atomical operations in such case.
422           const bool last_one = \
423             (1 == cached_raw->nref.load(std::memory_order_acquire));
424           if (likely(last_one) || --cached_raw->nref == 0) {
425             bdout << "deleting raw " << static_cast<void*>(cached_raw)
426                   << " len " << cached_raw->get_len() << bendl;
427             ANNOTATE_HAPPENS_AFTER(&cached_raw->nref);
(gdb) p cached_raw
$14 = (ceph::buffer::v15_2_0::raw * const) 0x3330336d51545f6d
(gdb) p *cached_raw
Cannot access memory at address 0x3330336d51545f6d

The buffer::ptr itself holds garbage; cached_raw holds the original value of _raw, and dereferencing it leads to the segfault.
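
That picture is consistent with concurrent access to a shared attrs map: another thread frees or replaces the bufferlists while this one is still destroying them. A minimal standalone sketch of the shape of that race (hypothetical stand-in types, not the actual buffer code):

#include <map>
#include <string>
#include <thread>
#include <utility>

// Crude stand-ins for buffer::raw / buffer::ptr; illustration only.
struct raw { int nref = 1; };
struct ptr {
  raw* _raw = nullptr;
  ~ptr() { if (_raw && --_raw->nref == 0) delete _raw; }  // like ptr::release()
};
using attrs_t = std::map<std::string, ptr>;  // like a bucket's attrs map

int main() {
  attrs_t attrs;
  attrs["user.rgw.acl"]._raw = new raw;

  // Lifecycle worker thread: still walking the shared map...
  std::thread lc([&attrs] {
    for (auto& kv : attrs) { (void)kv.second._raw; }
  });

  // ...while the notification path reloads bucket attrs, destroying the old
  // entries in place. With no synchronization, the worker can read freed or
  // reused node memory: the "cached_raw holds garbage" picture in gdb above.
  attrs_t fresh;
  fresh["user.rgw.acl"]._raw = new raw;
  attrs = std::move(fresh);  // data race with the reader thread (intentional)

  lc.join();
}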

Actions #19

Updated by Casey Bodley about 1 month ago

caught another crash that looks more like the others we've seen:

Thread 114 "wp_thrd: 2, 2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7faafd8006c0 (LWP 475870)]
0x0000562cb14e818b in ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>::operator++ (this=<synthetic pointer>)
    at /home/cbodley/ceph/src/include/buffer.h:461
461               cur = cur->next;
(gdb) bt
#0  0x0000562cb14e818b in ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>::operator++ (this=<synthetic pointer>)
    at /home/cbodley/ceph/src/include/buffer.h:461
#1  ceph::buffer::v15_2_0::list::buffers_t::buffers_iterator<ceph::buffer::v15_2_0::ptr_node>::operator++ (this=<synthetic pointer>)
    at /home/cbodley/ceph/src/include/buffer.h:466
#2  ceph::buffer::v15_2_0::list::buffers_t::clear_and_dispose (this=0x562cb6bfbde0) at /home/cbodley/ceph/src/include/buffer.h:599
#3  ceph::buffer::v15_2_0::list::~list (this=0x562cb6bfbde0, __in_chrg=<optimized out>) at /home/cbodley/ceph/src/include/buffer.h:959
#4  std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>::~pair (
    this=0x562cb6bfbdc0, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_pair.h:187
#5  std::destroy_at<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > (
    __location=0x562cb6bfbdc0) at /usr/include/c++/13/bits/stl_construct.h:88
#6  std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > (__p=0x562cb6bfbdc0, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:560
#7  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_destroy_node (
    __p=0x562cb6bfbda0, this=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:625
#8  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_drop_node (
    this=<optimized out>, __p=0x562cb6bfbda0) at /usr/include/c++/13/bits/stl_tree.h:633
#9  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_erase (
    __x=0x562cb6bfbda0, this=0x7faafd7faad0) at /usr/include/c++/13/bits/stl_tree.h:1938
#10 0x0000562cb14ebdd9 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::~_Rb_tree (this=0x7faafd7faad0, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_tree.h:736
#11 std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::~map (this=0x7faafd7faad0, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/stl_map.h:314
#12 RGWSI_Bucket_SObj::bucket_info_cache_entry::~bucket_info_cache_entry (this=0x7faafd7fa618, __in_chrg=<optimized out>)
    at /home/cbodley/ceph/src/rgw/services/svc_bucket_sobj.h:40
#13 boost::optional_detail::optional_base<RGWSI_Bucket_SObj::bucket_info_cache_entry>::destroy_impl (this=0x7faafd7fa610)
    at /home/cbodley/ceph/build/boost/include/boost/optional/optional.hpp:779
#14 boost::optional_detail::optional_base<RGWSI_Bucket_SObj::bucket_info_cache_entry>::destroy (this=0x7faafd7fa610)
    at /home/cbodley/ceph/build/boost/include/boost/optional/optional.hpp:765
#15 boost::optional_detail::optional_base<RGWSI_Bucket_SObj::bucket_info_cache_entry>::~optional_base (this=0x7faafd7fa610, __in_chrg=<optimized out>)
    at /home/cbodley/ceph/build/boost/include/boost/optional/optional.hpp:263
#16 boost::optional<RGWSI_Bucket_SObj::bucket_info_cache_entry>::~optional (this=0x7faafd7fa610, __in_chrg=<optimized out>)
    at /home/cbodley/ceph/build/boost/include/boost/optional/optional.hpp:909
#17 RGWSI_Bucket_SObj::read_bucket_instance_info (this=0x562cb44c58c0, ctx=...,
    key="yournamehere-44osugk2129zmf6z-4:200daa33-e350-46bf-8c6f-3476d1d9d0d6.4141.25", info=0x7faafd7fbe28, pmtime=0x0, pattrs=0x0, y=...,
    dpp=0x562cb4544cb0, cache_info=0x0, refresh_version=...) at /home/cbodley/ceph/src/rgw/services/svc_bucket_sobj.cc:299
#18 0x0000562cb15f0845 in operator() (ctx=..., __closure=<optimized out>) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.cc:2855
#19 std::__invoke_impl<int, RGWBucketCtl::read_bucket_instance_info(const rgw_bucket&, RGWBucketInfo*, optional_yield, const DoutPrefixProvider*, const BucketInstance::GetParams&)::<lambda(RGWSI_Bucket_BI_Ctx&)>&, ptr_wrapper<RGWSI_MetaBackend::Context, 4>&> (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#20 std::__invoke_r<int, RGWBucketCtl::read_bucket_instance_info(const rgw_bucket&, RGWBucketInfo*, optional_yield, const DoutPrefixProvider*, const BucketInstance::GetParams&)::<lambda(RGWSI_Bucket_BI_Ctx&)>&, ptr_wrapper<RGWSI_MetaBackend::Context, 4>&> (__fn=...) at /usr/include/c++/13/bits/invoke.h:114
#21 std::_Function_handler<int(ptr_wrapper<RGWSI_MetaBackend::Context, 4>&), RGWBucketCtl::read_bucket_instance_info(const rgw_bucket&, RGWBucketInfo*, optional_yield, const DoutPrefixProvider*, const BucketInstance::GetParams&)::<lambda(RGWSI_Bucket_BI_Ctx&)> >::_M_invoke(const std::_Any_data &, ptr_wrapper<RGWSI_MetaBackend::Context, 4> &) (__functor=..., __args#0=...) at /usr/include/c++/13/bits/std_function.h:290
#22 0x0000562cb160e290 in std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>::operator()(ptr_wrapper<RGWSI_MetaBackend::Context, 4>&) const (
    __args#0=..., this=<optimized out>) at /usr/include/c++/13/bits/std_function.h:591
#23 RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}::operator()(RGWSI_MetaBackend_Handler::Op*) const (op=<optimized out>, __closure=<optimized out>)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.cc:2528
#24 std::__invoke_impl<int, RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}&, RGWSI_MetaBackend_Handler::Op*>(std::__invoke_other, RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}&, RGWSI_MetaBackend_Handler::Op*&&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#25 std::__invoke_r<int, RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}&, RGWSI_MetaBackend_Handler::Op*>(RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}&, RGWSI_MetaBackend_Handler::Op*&&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:114
#26 std::_Function_handler<int (RGWSI_MetaBackend_Handler::Op*), RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>)::{lambda(RGWSI_MetaBackend_Handler::Op*)#1}>::_M_invoke(std::_Any_data const&, RGWSI_MetaBackend_Handler::Op*&&) (__functor=..., __args#0=<optimized out>) at /usr/include/c++/13/bits/std_function.h:290
#27 0x0000562cb151326f in std::function<int (RGWSI_MetaBackend_Handler::Op*)>::operator()(RGWSI_MetaBackend_Handler::Op*) const (__args#0=<optimized out>,
    this=<optimized out>) at /usr/include/c++/13/bits/std_function.h:591
#28 operator() (ctx=0x7faafd7face0, __closure=0x7faafd7fae00) at /home/cbodley/ceph/src/rgw/services/svc_meta_be.cc:183
#29 std::__invoke_impl<int, RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int(Op*)>)::<lambda(RGWSI_MetaBackend::Context*)>&, RGWSI_MetaBackend::Context*> (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#30 std::__invoke_r<int, RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int(Op*)>)::<lambda(RGWSI_MetaBackend::Context*)>&, RGWSI_MetaBackend::Context*> (__fn=...) at /usr/include/c++/13/bits/invoke.h:114
#31 std::_Function_handler<int(RGWSI_MetaBackend::Context*), RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int(Op*)>)::<lambda(RGWSI_MetaBackend::Context*)> >::_M_invoke(const std::_Any_data &, RGWSI_MetaBackend::Context *&&) (__functor=...,
    __args#0=<optimized out>) at /usr/include/c++/13/bits/std_function.h:290
#32 0x0000562cb1514b3c in std::function<int (RGWSI_MetaBackend::Context*)>::operator()(RGWSI_MetaBackend::Context*) const (__args#0=<optimized out>,
    this=<optimized out>) at /usr/include/c++/13/bits/std_function.h:591
#33 RGWSI_MetaBackend_SObj::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend::Context*)>) (
    this=<optimized out>, opt=std::optional containing std::variant [no contained value], f=...)
    at /home/cbodley/ceph/src/rgw/services/svc_meta_be_sobj.cc:106
#34 0x0000562cb151362c in RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend_Handler::Op*)>) (this=<optimized out>, bectx_params=std::optional [no contained value], f=...) at /home/cbodley/ceph/src/rgw/services/svc_meta_be.cc:180
#35 0x0000562cb15efc77 in RGWBucketInstanceMetadataHandler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (ptr_wrapper<RGWSI_MetaBackend::Context, 4>&)>) (this=<optimized out>, f=..., bectx_params=std::optional [no contained value])
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.cc:2526
#36 RGWBucketCtl::read_bucket_instance_info (this=<optimized out>, bucket=..., info=<optimized out>, info@entry=0x7faafd7fbe28, y=..., dpp=<optimized out>,
    dpp@entry=0x562cb4544cb0, params=...) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.cc:2854
#37 0x0000562cb12ec311 in RGWRados::get_bucket_instance_info (this=<optimized out>, bucket=..., info=..., pmtime=pmtime@entry=0x0, pattrs=pattrs@entry=0x0,
    y=..., dpp=dpp@entry=0x562cb4544cb0) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_bucket.h:555
#38 0x0000562cb130602c in RGWRados::BucketShard::init (this=this@entry=0x7faafd7fbc38, _bucket=..., obj=..., bucket_info_out=<optimized out>,
    dpp=dpp@entry=0x562cb4544cb0, y=...) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:2735
#39 0x0000562cb131d3b8 in RGWRados::Bucket::UpdateIndex::init_bs (y=..., dpp=0x562cb4544cb0, this=0x7faafd7fba40)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.h:907
#40 RGWRados::Bucket::UpdateIndex::get_bucket_shard (y=..., dpp=<optimized out>, pbs=<synthetic pointer>, this=0x7faafd7fba40)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.h:954
#41 RGWRados::Bucket::UpdateIndex::guard_reshard(DoutPrefixProvider const*, rgw_obj const&, RGWRados::BucketShard**, std::function<int (RGWRados::BucketShard*)>, optional_yield) (this=this@entry=0x7faafd7fba40, dpp=0x562cb4544cb0, obj_instance=..., pbs=pbs@entry=0x0, call=..., y=...)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:6852
#42 0x0000562cb131da33 in RGWRados::Bucket::UpdateIndex::prepare (this=this@entry=0x7faafd7fba40, dpp=<optimized out>, dpp@entry=0x562cb4544cb0,
    op=<optimized out>, op@entry=CLS_RGW_OP_DEL, write_tag=<optimized out>, y=..., log_op=<optimized out>)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:6917
#43 0x0000562cb131ecdd in RGWRados::Object::Delete::delete_obj (this=this@entry=0x562cb6f8da30, y=..., dpp=dpp@entry=0x562cb4544cb0,
    log_op=log_op@entry=false) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:5806
#44 0x0000562cb1387b86 in rgw::sal::RadosObject::RadosDeleteOp::delete_obj (this=0x562cb6f8d000, dpp=0x562cb4544cb0, y=..., flags=<optimized out>)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2172
#45 0x0000562cb156a0ba in remove_expired_obj (dpp=0x562cb4544cb0, oc=..., remove_indeed=<optimized out>,
    event_types=std::vector of length 2, capacity 2 = {...}) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:619
#46 0x0000562cb1586649 in LCOpAction_CurrentExpiration::process (this=<optimized out>, oc=...) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1192
#47 0x0000562cb156ba8f in LCOpRule::process (this=this@entry=0x7faafd7fcd28, o=..., dpp=0x562cb4544cb0, wq=wq@entry=0x562cb4596910)
    at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1666
#48 0x0000562cb156c2b3 in operator() (__closure=0x562cb45969f0, wk=0x562cb4529680, wq=0x562cb4596910, wi=...) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1752
#49 0x0000562cb157e6bf in std::function<void (RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&)>::operator()(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&) const (__args#2=..., __args#1=0x562cb4596910, __args#0=<optimized out>, this=<optimized out>)
    at /usr/include/c++/13/bits/std_function.h:591
#50 WorkQ::entry (this=0x562cb4596910) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:804
#51 0x00007fab46aac897 in start_thread (arg=<optimized out>) at pthread_create.c:444
#52 0x00007fab46b3380c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

this time it's destroying bucket_info_cache_entry::attrs

Actions #20

Updated by Soumya Koduri about 1 month ago

Soumya Koduri wrote:

Casey Bodley wrote:

spotted a different lifecycle crash in the rgw/dbstore suite, which is interesting because no expiration/transition is involved:
[...]
from https://qa-proxy.ceph.com/teuthology/cbodley-2024-03-20_23:26:29-rgw-wip-cbodley-testing-distro-default-smithi/7614326/teuthology.log

There are some lifecycle tests (which include expiration) not marked as 'fails_on_dbstore'. I will skip those tests too and check if the issue occurs even with just applying lifecycle policy rules.

The lc crashes seen when running the dbstore suite are due to a bug in dbstore (https://github.com/soumyakoduri/ceph/commit/c20c2d2499eeb182ebffb1675950e42719735b5f) and are unrelated to the original issue. I will post the fix for review.

Actions #21

Updated by Soumya Koduri about 1 month ago

Casey Bodley wrote:

caught another crash that looks more like the others we've seen:
[...]

this time it's destroying bucket_info_cache_entry::attrs

Most of these crashes seem to be related to the bucket object.

@Casey Bodley,

Could you please clarify what bucket_version will be in the else case below (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_sal_rados.cc#L546)?

int RadosBucket::load_bucket(const DoutPrefixProvider* dpp, optional_yield y) {
  int ret;

  RGWSI_MetaBackend_CtxParams bectx_params = RGWSI_MetaBackend_CtxParams_SObj();
  RGWObjVersionTracker ep_ot;
  ..
  ..
  } else {
    ret = store->ctl()->bucket->read_bucket_instance_info(info.bucket, &info, y, dpp,
                                  RGWBucketCtl::BucketInstance::GetParams()
                                    .set_mtime(&mtime)
                                    .set_attrs(&attrs)
                                    .set_bectx_params(bectx_params));
  }
  if (ret != 0) {
    return ret;
  }
  bucket_version = ep_ot.read_version;   <<<< Here ep_ot.read_version is never populated in the else{} case.. right?
}

The new code added in the 'lc' path, publish_reserve()->load_bucket(..), enters the else{} code path since info.bucket.bucket_id is not empty. Could it be that bucket_version gets updated with an incorrect read_version, which may then result in a use-after-free of the bucket object in the cache at a later point?
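
To make the concern concrete, a minimal sketch with hypothetical simplified stand-ins (obj_version and ObjVersionTracker below are toy types, not the real RGW ones): if only the if-branch fills in the tracker, the else{} path hands back a default-constructed version.

#include <cstdint>
#include <iostream>
#include <string>

// hypothetical simplified stand-ins -- not the real RGW types
struct obj_version {
  std::uint64_t ver = 0;
  std::string tag;
};

struct ObjVersionTracker {
  obj_version read_version;          // default-constructed: {0, ""}
};

int main() {
  ObjVersionTracker ep_ot;           // only the endpoint-read path fills this in
  bool reads_endpoint = false;       // models taking the else{} branch
  if (reads_endpoint) {
    ep_ot.read_version = {42, "v"};  // the if-branch populates it
  }
  // read_bucket_instance_info() never touches ep_ot, so in the else case
  // bucket_version silently takes the default value
  obj_version bucket_version = ep_ot.read_version;
  std::cout << bucket_version.ver << " '" << bucket_version.tag << "'\n";  // prints: 0 ''
}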

Actions #22

Updated by Casey Bodley about 1 month ago

another different tcmalloc segfault, this time just trying to copy a 32-byte std::string:

Thread 110 "wp_thrd: 1, 2" received signal SIGSEGV, Segmentation fault.                                                                                       
[Switching to Thread 0x7fb712a006c0 (LWP 491581)]                                                                                                             
tcmalloc::SLL_TryPop (rv=<synthetic pointer>, list=0x5638bf05c060) at src/linked_list.h:69                                           
Downloading source file /usr/src/debug/gperftools-2.9.1-6.fc39.x86_64/src/linked_list.h    
69        void *next = SLL_Next(*list);                                                                                                                       
(gdb) bt                                                                                                                                                      
#0  tcmalloc::SLL_TryPop (rv=<synthetic pointer>, list=0x5638bf05c060) at src/linked_list.h:69           
#1  tcmalloc::ThreadCache::FreeList::TryPop (rv=<synthetic pointer>, this=0x5638bf05c060) at src/thread_cache.h:220
#2  tcmalloc::ThreadCache::Allocate (oom_handler=0x7fb75cc38d40 <tcmalloc::cpp_throw_oom(unsigned long)>, cl=3, size=32, this=0x5638bf05c000)
    at src/thread_cache.h:379
#3  malloc_fast_path<tcmalloc::cpp_throw_oom> (size=32) at src/tcmalloc.cc:1898
#4  tc_new (size=32) at src/tcmalloc.cc:2019
#5  0x00005638bc2eb4e3 in std::__new_allocator<char>::allocate (this=<optimized out>, __n=<optimized out>) at /usr/include/c++/13/bits/new_allocator.h:151
#6  std::allocator<char>::allocate (__n=<optimized out>, this=<optimized out>) at /usr/include/c++/13/bits/allocator.h:198
#7  std::allocator_traits<std::allocator<char> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:482
#8  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_allocate (__n=<optimized out>, __a=...)
    at /usr/include/c++/13/bits/basic_string.h:126
#9  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create (this=0x7fb7129fbc60, __old_capacity=<optimized out>, 
    __capacity=<synthetic pointer>: <optimized out>) at /usr/include/c++/13/bits/basic_string.tcc:159
#10 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign (this=0x7fb7129fbc60, __str="yournamehere-jepsg3qinlo81byd-1")
    at /usr/include/c++/13/bits/basic_string.tcc:288
#11 0x00005638bc842f5e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign (__str="yournamehere-jepsg3qinlo81byd-1", 
    this=0x7fb7129fbc60) at /usr/include/c++/13/bits/basic_string.h:1608
#12 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (__str="yournamehere-jepsg3qinlo81byd-1", this=0x7fb7129fbc60)
    at /usr/include/c++/13/bits/basic_string.h:814
#13 rgw_bucket::operator= (this=0x7fb7129fbc40) at /home/cbodley/ceph/src/rgw/rgw_bucket_types.h:166
#14 RGWRados::BucketShard::init (this=this@entry=0x7fb7129fbc38, _bucket=..., obj=..., bucket_info_out=0x7fb7129fbe28, dpp=dpp@entry=0x5638c03bccb0, y=...)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:2729
#15 0x00005638bc85a3b8 in RGWRados::Bucket::UpdateIndex::init_bs (y=..., dpp=0x5638c03bccb0, this=0x7fb7129fba40)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.h:907
#16 RGWRados::Bucket::UpdateIndex::get_bucket_shard (y=..., dpp=<optimized out>, pbs=<synthetic pointer>, this=0x7fb7129fba40)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.h:954
#17 RGWRados::Bucket::UpdateIndex::guard_reshard(DoutPrefixProvider const*, rgw_obj const&, RGWRados::BucketShard**, std::function<int (RGWRados::BucketShard*)>, optional_yield) (this=this@entry=0x7fb7129fba40, dpp=0x5638c03bccb0, obj_instance=..., pbs=pbs@entry=0x0, call=..., y=...)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:6852
#18 0x00005638bc85aa33 in RGWRados::Bucket::UpdateIndex::prepare (this=this@entry=0x7fb7129fba40, dpp=<optimized out>, dpp@entry=0x5638c03bccb0, 
    op=<optimized out>, op@entry=CLS_RGW_OP_DEL, write_tag=<optimized out>, y=..., log_op=<optimized out>)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:6917
#19 0x00005638bc85bcdd in RGWRados::Object::Delete::delete_obj (this=this@entry=0x5638c2f0aa30, y=..., dpp=dpp@entry=0x5638c03bccb0, 
    log_op=log_op@entry=false) at /home/cbodley/ceph/src/rgw/driver/rados/rgw_rados.cc:5806
#20 0x00005638bc8c4b86 in rgw::sal::RadosObject::RadosDeleteOp::delete_obj (this=0x5638c2f0a000, dpp=0x5638c03bccb0, y=..., flags=<optimized out>)
    at /home/cbodley/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:2172
#21 0x00005638bcaa70ba in remove_expired_obj (dpp=0x5638c03bccb0, oc=..., remove_indeed=<optimized out>, 
    event_types=std::vector of length 2, capacity 2 = {...}) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:619
#22 0x00005638bcac3649 in LCOpAction_CurrentExpiration::process (this=<optimized out>, oc=...) at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1192
#23 0x00005638bcaa8a8f in LCOpRule::process (this=this@entry=0x7fb7129fcd28, o=..., dpp=0x5638c03bccb0, wq=wq@entry=0x5638c0410590)
    at /home/cbodley/ceph/src/rgw/rgw_lc.cc:1666
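
a fault inside tcmalloc's freelist while copying an ordinary string usually means the heap was corrupted earlier, not that the crash site is at fault. a minimal sketch of that pattern (illustration only, not Ceph code): a write through a dangling pointer scribbles over freed memory where tcmalloc keeps its freelist links, and an innocent later allocation pays for it. AddressSanitizer pinpoints the original bad write instead:

// illustration only: use-after-free corrupting allocator metadata
// build with: g++ -fsanitize=address -g uaf.cc && ./a.out
#include <cstring>
#include <string>

int main() {
  char* p = new char[32];
  delete[] p;
  std::memset(p, 'x', 32);   // use-after-free: scribbles over freed memory
  std::string s = "yournamehere-0123456789abcdefghij";  // long enough to heap-allocate
  std::string copy = s;      // an innocent copy may now crash inside the allocator
  return copy.empty() ? 1 : 0;
}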

Actions #23

Updated by Shreyansh Sancheti about 1 month ago

Casey Bodley wrote:

another different tcmalloc segfault, this time just trying to copy a 32-byte std::string:
[...]

I found a different one today while doing the Cloud transition tests with s3tests. I don't think it's related to the ones above.

(gdb) bt
#0  __pthread_kill_implementation (
    threadid=<optimized out>, signo=signo@entry=6,
    no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fdfbf8aff33 in __pthread_kill_internal (
    signo=6, threadid=<optimized out>)
    at pthread_kill.c:78
#2  0x00007fdfbf85fab6 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/posix/raise.c:26
#3  0x00007fdfbf849886 in __GI_abort () at abort.c:100
#4  0x00007fdfbfad7e80 in std::__glibcxx_assert_fail (
    file=file@entry=0x55e60f92f800 "/usr/include/c++/12/bits/basic_string.h", line=line@entry=1229,
    function=function@entry=0x55e60f92f6e0 "constexpr std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::reference std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::operator[](size_type) [with _CharT = char; _Traits = std::char_traits<char"...,
    condition=condition@entry=0x55e60f92f6cb "__pos <= size()")
    at ../../../../../libstdc++-v3/src/c++11/debug.cc:60
#5  0x000055e60eac92bc in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator[] (this=this@entry=0x7fde9519d0a0,
    __pos=<optimized out>)
    at /usr/include/c++/12/bits/basic_string.h:1229
#6  0x000055e60ed26511 in RGWRESTStreamS3PutObj::send_init (this=this@entry=0x55e618ddd000, obj=...)
    at /usr/include/c++/12/bits/basic_string.h:1071
#7  0x000055e60ed2d464 in RGWRESTConn::put_obj_send_init
--Type <RET> for more, q to quit, c to continue without paging--c
    (this=0x7fde9519dbb0, obj=...,
    extra_params=extra_params@entry=0x0,
    req=req@entry=0x55e618969ad0)
    at /root/ceph/src/rgw/rgw_rest_conn.cc:238
#8  0x000055e60f12431f in RGWLCCloudStreamPut::init (
    this=this@entry=0x55e618969a40)
    at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:498
#9  0x000055e60f127599 in cloud_tier_transfer_object (
    dpp=<optimized out>, readf=0x55e618da1080,
    writef=0x55e618969a40)
    at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:722
#10 0x000055e60f127c77 in cloud_tier_plain_transfer (
    tier_ctx=...)
    at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#11 0x000055e60f12d22b in rgw_cloud_tier_transfer_object
    (tier_ctx=...,
    cloud_targets=std::set with 1 element = {...})
    at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:1298
#12 0x000055e60ef1c7e1 in rgw::sal::RadosObject::transition_to_cloud (this=0x55e618c7c000,
    bucket=0x55e618de8e00, tier=0x55e61667bd40, o=...,
    cloud_targets=std::set with 1 element = {...},
    cct=0x55e6115e8000, update_object=false,
    dpp=<optimized out>, y=...)
    at /root/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:1905
#13 0x000055e60f053b2c in LCOpAction_Transition::transition_obj_to_cloud (this=this@entry=0x55e61677d9e0, oc=...)
    at /root/ceph/src/rgw/rgw_lc.cc:1397
#14 0x000055e60f0543e1 in LCOpAction_Transition::process
    (this=0x55e61677d9e0, oc=...)
    at /root/ceph/src/rgw/rgw_lc.cc:1454
#15 0x000055e60f054d8c in LCOpAction_CurrentTransition::process (this=<optimized out>, oc=...)
    at /root/ceph/src/rgw/rgw_lc.cc:1504
#16 0x000055e60f03d7b9 in LCOpRule::process (
    this=this@entry=0x7fde9519e6a8, o=...,
    dpp=<optimized out>, wq=wq@entry=0x55e6115add10)
    at /root/ceph/src/rgw/rgw_lc.cc:1619
#17 0x000055e60f03dd81 in operator() (
    __closure=0x55e6115addf0, wk=0x55e616387590,
    wq=0x55e6115add10, wi=...)
    at /root/ceph/src/rgw/rgw_lc.cc:1705
#18 0x000055e60f03dfb5 in std::__invoke_impl<void, RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)>&, RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&> (__f=...)
    at /usr/include/c++/12/bits/invoke.h:61
#19 std::__invoke_r<void, RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)>&, RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&> (__fn=...)
    at /usr/include/c++/12/bits/invoke.h:111
#20 std::_Function_handler<void(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&), RGWLC::bucket_lc_process(std::string&, LCWorker*, time_t, bool)::<lambda(RGWLC::LCWorker*, WorkQ*, WorkItem&)> >::_M_invoke(const std::_Any_data &, RGWLC::LCWorker *&&, WorkQ *&&, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry> &) (
    __functor=..., __args#0=<optimized out>,
    __args#1=<optimized out>, __args#2=...)
    at /usr/include/c++/12/bits/std_function.h:290
#21 0x000055e60f048f27 in std::function<void (RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&)>::operator()(RGWLC::LCWorker*, WorkQ*, boost::variant<void*, std::tuple<LCOpRule, rgw_bucket_dir_entry>, std::tuple<lc_op, rgw_bucket_dir_entry>, rgw_bucket_dir_entry>&) const (
    this=this@entry=0x55e6115addf0,
    __args#0=<optimized out>, __args#1=<optimized out>,
    __args#1@entry=0x55e6115add10, __args#2=...)
    at /usr/include/c++/12/bits/std_function.h:591
#22 0x000055e60f04d6cf in WorkQ::entry (
    this=0x55e6115add10)
    at /root/ceph/src/rgw/rgw_lc.cc:779
#23 0x00007fdfc1033c0f in Thread::entry_wrapper (
    this=0x55e6115add10)
    at /root/ceph/src/common/Thread.cc:87
#24 0x00007fdfc1033c27 in Thread::_entry_func (
    arg=<optimized out>)
    at /root/ceph/src/common/Thread.cc:74
#25 0x00007fdfbf8ae19d in start_thread (
    arg=<optimized out>) at pthread_create.c:442
#26 0x00007fdfbf92fc60 in clone3 ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Actions #24

Updated by Soumya Koduri 27 days ago

Soumya Koduri wrote:

#0 __pthread_kill_implementation (
threadid=<optimized out>, signo=signo@entry=11,
no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff58aff33 in __pthread_kill_internal (
signo=11, threadid=<optimized out>)
at pthread_kill.c:78
#2 0x00007ffff585fab6 in __GI_raise (sig=11)
at ../sysdeps/posix/raise.c:26
#3 0x0000555556d7bcc0 in reraise_fatal (
signum=signum@entry=11)
at /root/ceph/src/global/signal_handler.cc:88
#4 0x0000555556d7d351 in handle_oneshot_fatal_signal (
signum=11)
--Type <RET> for more, q to quit, c to continue without paging--c
at /root/ceph/src/global/signal_handler.cc:367
#5 <signal handler called>
#6 RGWHTTPClient::set_send_length (len=4036, this=0x0)
at /root/ceph/src/rgw/rgw_http_client.h:128
#7 RGWLCCloudStreamPut::send_ready (
this=this@entry=0x55555837b260, dpp=0x5555583e0460,
rest_obj=...)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:660
#8 0x0000555556befef5 in cloud_tier_transfer_object (
dpp=<optimized out>, readf=0x555558437b80,
writef=0x55555837b260)
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:728
#9 0x0000555556bf04e7 in cloud_tier_plain_transfer (
tier_ctx=...)
at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#10 0x0000555556bf5a9b in rgw_cloud_tier_transfer_object
(tier_ctx=...,
cloud_targets=std::set with 1 element = {...})
at /root/ceph/src/rgw/driver/rados/rgw_lc_tier.cc:1298
#11 0x00005555569eed13 in rgw::sal::RadosObject::transition_to_cloud (this=0x55555fa84000,
bucket=0x55555d5db100, tier=0x55555d7ca340, o=...,
cloud_targets=std::set with 1 element = {...},
cct=0x5555583d6000, update_object=true,
dpp=<optimized out>, y=...)
at /root/ceph/src/rgw/driver/rados/rgw_sal_rados.cc:1766
#12 0x0000555556b1f61c in LCOpAction_Transition::transition_obj_to_cloud (this=this@entry=0x55555fa0ffc0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1396
#13 0x0000555556b1fed1 in LCOpAction_Transition::process
(this=0x55555fa0ffc0, oc=...)
at /root/ceph/src/rgw/rgw_lc.cc:1453

This crash in the cloud-transition code is being tracked as part of https://tracker.ceph.com/issues/65251.

Actions #25

Updated by Casey Bodley 25 days ago

the lifecycle changes in https://github.com/ceph/ceph/pull/55657 seem to be responsible for the crash, but it's not clear exactly how. discussion continues in https://github.com/ceph/ceph/pull/55657#discussion_r1541310624

Actions #26

Updated by Krunal Chheda 24 days ago

  • Pull request ID set to 56712
Actions #27

Updated by Krunal Chheda 24 days ago

  • Subject changed from lifecycle transition crashes since merge end-to-end tracing to lifecycle transition crashes since reloading bucket attrs for notification

As discussed in https://github.com/ceph/ceph/pull/55657#discussion_r1541310624, the fix removes the logic that reloads the bucket inside publish_reserve() and instead expects the caller to reload the bucket before calling it.
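
The shape of that change, sketched with hypothetical stub types (illustration only; the real signatures in src/rgw differ):

#include <iostream>

// toy stand-ins: Bucket models the shared sal handle, publish_reserve()
// the notification reservation after the fix
struct Bucket {
  int load_bucket() { std::cout << "reload bucket attrs\n"; return 0; }
};

int publish_reserve(const Bucket&) {  // read-only consumer of bucket state
  std::cout << "reserve notification\n";
  return 0;
}

int caller_path(Bucket& bucket) {     // e.g. the LC path that needs fresh attrs
  if (int r = bucket.load_bucket(); r < 0)  // reload is the caller's job now
    return r;
  return publish_reserve(bucket);     // no hidden reload inside
}

int main() {
  Bucket b;
  return caller_path(b);
}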

Actions #28

Updated by Casey Bodley 24 days ago

  • Status changed from New to Fix Under Review
  • Assignee deleted (Shreyansh Sancheti)
Actions #29

Updated by Casey Bodley 20 days ago

  • Status changed from Fix Under Review to Pending Backport
Actions #30

Updated by Backport Bot 20 days ago

  • Copied to Backport #65375: squid: lifecycle transition crashes since reloading bucket attrs for notification added
Actions #31

Updated by Backport Bot 20 days ago

  • Tags changed from lifecycle to lifecycle backport_processed
Actions #33

Updated by Soumya Koduri 13 days ago

The cause of this issue appears to be multiple LC worker threads updating the same `bucket` handle, which is not thread-safe. When another `load_bucket()` call was added in `remove_expired_obj` as part of https://github.com/ceph/ceph/pull/55657, multiple threads could update and overwrite the same ceph::buffer::list stored in `bucket->attrs`, resulting in intermittent corruption.
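
A minimal sketch of that failure mode, with simplified stand-ins (a std::map<std::string, std::string> in place of the real map of bufferlists, and a toy Bucket in place of the sal handle): several worker threads share one bucket handle, and each reload clears and repopulates attrs while other threads are still reading it.

#include <map>
#include <string>
#include <thread>
#include <vector>

// illustration only: Bucket stands in for the shared sal::Bucket handle,
// load_bucket() for the reload added to the LC path
struct Bucket {
  std::map<std::string, std::string> attrs;    // stands in for map<string, bufferlist>
  void load_bucket() {
    attrs.clear();                             // frees the old nodes...
    attrs["user.rgw.lc"] = "lifecycle config"; // ...while readers may still hold them
  }
};

int main() {
  Bucket bucket;                               // one handle shared by all workers
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) {
    workers.emplace_back([&bucket] {
      for (int j = 0; j < 10000; ++j) {
        bucket.load_bucket();                  // concurrent, unsynchronized mutation
        auto it = bucket.attrs.find("user.rgw.lc");
        if (it != bucket.attrs.end())
          (void)it->second.size();             // racy read of a node another thread may free
      }
    });
  }
  for (auto& t : workers) t.join();            // ThreadSanitizer flags this immediately
}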
