Bug #57195 (closed)
terminate called after throwing an instance of 'std::bad_variant_access'
Description
rgw crashes on startup in a lot of centos8 jobs: http://qa-proxy.ceph.com/teuthology/cbodley-2022-08-18_23:33:15-rgw-wip-cbodley-testing-distro-default-smithi/6979615/teuthology.log
terminate called after throwing an instance of 'std::bad_variant_access'
  what():  std::get: wrong index for variant
*** Caught signal (Aborted) **
 in thread ee24540 thread_name:memcheck-amd64-
 ceph version 17.0.0-14379-ga8b84acb (a8b84acb87be6574934d5b5cc860020487d73e7a) quincy (dev)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x8de2ce0]
 2: gsignal()
 3: abort()
 4: /lib64/libstdc++.so.6(+0x9009b) [0x962209b]
 5: /lib64/libstdc++.so.6(+0x9653c) [0x962853c]
 6: /lib64/libstdc++.so.6(+0x96597) [0x9628597]
 7: /lib64/libstdc++.so.6(+0x967f8) [0x96287f8]
 8: (std::__throw_bad_variant_access(bool)+0) [0x610506]
 9: (void boost::throw_exception<boost::bad_function_call>(boost::bad_function_call const&)+0) [0x61052a]
 10: radosgw(+0x55bd19) [0x663d19]
 11: main()
 12: __libc_start_main()
 13: _start()
first saw on august 5th in https://tracker.ceph.com/issues/57050#note-2
frames 8 and 9 show two different exceptions being thrown. in that other tracker issue, the exceptions were:
 8: (std::__throw_bad_variant_access(bool)+0) [0x7f4c203a6020]
 9: (void boost::throw_exception<boost::gregorian::bad_day_of_month>(boost::gregorian::bad_day_of_month const&)+0) [0x7f4c203a6044
Updated by Casey Bodley over 1 year ago
tried but was unable to reproduce in a centos stream 8 vm with the following cmake config:
cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DWITH_MGR=OFF -DWITH_CEPHFS=OFF -DWITH_KRBD=OFF -DWITH_RBD=OFF -DWITH_MGR_DASHBOARD_FRONTEND=OFF -DWITH_RDMA=OFF -DWITH_FUSE=OFF ..
Updated by Casey Bodley over 1 year ago
this seems to only crash in our valgrind jobs, ex https://pulpito.ceph.com/amaredia-2022-08-30_18:13:58-rgw:verify-main-distro-default-smithi/
i'll try to reproduce manually under valgrind
Updated by Casey Bodley over 1 year ago
Casey Bodley wrote:
this seems to only crash in our valgrind jobs, ex https://pulpito.ceph.com/amaredia-2022-08-30_18:13:58-rgw:verify-main-distro-default-smithi/
i'll try to reproduce manually under valgrind
scratch that, it's the rgw-datacache jobs that fail consistently, and the no-datacache ones that pass
Updated by Casey Bodley over 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 47907
reproduced after configuring rgw d3n l1 local datacache enabled = true:
 8: (std::__throw_bad_variant_access(bool)+0) [0x55ccf24b07f2]
 9: (ceph::version_1_0::spin_lock(std::atomic_flag&)+0) [0x55ccf24b0813]
 10: (unsigned long const md_config_t::get_val<unsigned long>(ConfigValues const&, std::basic_string_view<char, std::char_traits<char> >) const+0x97) [0x55ccf2604a17]
 11: (StoreManager::get_config(bool, ceph::common::CephContext*)+0x291) [0x55ccf2bd1cbf]
this is a regression from https://github.com/ceph/ceph/pull/47362, which switched from using these legacy config variables:
  bool rgw_d3n_datacache_enabled = cct->_conf->rgw_d3n_l1_local_datacache_enabled;
  if (rgw_d3n_datacache_enabled && (cct->_conf->rgw_max_chunk_size != cct->_conf->rgw_obj_stripe_size)) {
to lookups with get_val<T>():
  const auto& d3n = g_conf().get_val<bool>("rgw_d3n_l1_local_datacache_enabled");
  if (!admin && d3n) {
    if (g_conf().get_val<size_t>("rgw_max_chunk_size") != g_conf().get_val<size_t>("rgw_obj_stripe_size")) {
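for anyone reading the trace: frame 10 shows get_val<unsigned long> throwing std::bad_variant_access, which is what std::get<T> does when the variant currently holds a different alternative than T. a minimal standalone sketch of that failure mode follows. SizeOpt, ConfigValue, get_val and lookup_throws here are hypothetical stand-ins written for illustration, not ceph's actual definitions, and the sketch assumes the option is stored as a distinct wrapper type rather than a plain integer (this report does not show how the options are declared):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical "size" option type: a distinct wrapper, not a plain integer.
// Assumption for illustration only; not taken from the ceph source.
struct SizeOpt {
  std::size_t value;
};

// Hypothetical config value: one variant alternative per supported type.
using ConfigValue =
    std::variant<std::monostate, std::string, std::uint64_t, SizeOpt>;

// Typed lookup in the style of get_val<T>(). std::get<T> throws
// std::bad_variant_access if the variant holds a different alternative.
template <typename T>
T get_val(const ConfigValue& v) {
  return std::get<T>(v);
}

// Returns true if get_val<T> throws std::bad_variant_access for this value.
template <typename T>
bool lookup_throws(const ConfigValue& v) {
  try {
    (void)get_val<T>(v);
    return false;
  } catch (const std::bad_variant_access&) {
    return true;
  }
}

// Small self-check: the stored alternative is SizeOpt, so only a lookup
// with T = SizeOpt succeeds.
inline void self_check() {
  const ConfigValue chunk{SizeOpt{4194304}};
  assert(lookup_throws<std::uint64_t>(chunk));  // wrong alternative: throws
  assert(!lookup_throws<SizeOpt>(chunk));       // matching alternative: ok
}
```

in this sketch the lookup only succeeds when T names the alternative that is actually stored, even though both types carry an unsigned integer. direct member access like cct->_conf->rgw_max_chunk_size has no such runtime type check, which would explain why the legacy code never hit this.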
Updated by Casey Bodley over 1 year ago
- Status changed from Fix Under Review to Resolved