Bug #63867
Segfault in CommonSafeTimer::cancel_all_events due to uninitialized data
Status: open
Description
Reported downstream here: https://bugzilla.redhat.com/show_bug.cgi?id=2241339
`ceph status` crashes with a SIGSEGV on Fedora 39 packages, both x86_64 and aarch64. The backtrace suggests a junk pointer:
std::_Rb_tree_rebalance_for_erase (__z=0xe0073e70, __header=...) at ../../../../../libstdc++-v3/src/c++98/tree.cc:296
296         if (__y->_M_left == 0)     // __z has at most one non-null child. y == z.
(gdb) bt
#0  std::_Rb_tree_rebalance_for_erase (__z=0xe0073e70, __header=...) at ../../../../../libstdc++-v3/src/c++98/tree.cc:296
#1  0x0000ffffe8f17a94 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, unsigned int>, std::_Select1st<std::pair<unsigned long const, unsigned int> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned int> > >::_M_erase_aux (__position=..., this=0xffffe6a6e068) at /usr/include/c++/13/bits/stl_tree.h:2489
#2  std::_Rb_tree<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*>, std::_Select1st<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> >, std::less<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > >, std::allocator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >::erase[abi:cxx11](std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> >) (__position=..., this=0xffffe6a6e068) at /usr/include/c++/13/bits/stl_tree.h:1210
#3  std::multimap<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, Context*, std::less<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > >, std::allocator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >::erase[abi:cxx11](std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> >) (__position=..., this=0xffffe6a6e068) at /usr/include/c++/13/bits/stl_multimap.h:715
#4  CommonSafeTimer<std::mutex>::cancel_all_events (this=this@entry=0xffffe6a6e000) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/common/Timer.cc:206
#5  0x0000ffffe8f1412c [PAC] in CommonSafeTimer<std::mutex>::shutdown (this=0xffffe6a6e000) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/common/Timer.cc:66
#6  0x0000ffffe9156678 [PAC] in MonClient::shutdown (this=0xffffe6a6dc00) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/mon/MonClient.cc:562
#7  0x0000ffffe91575cc [PAC] in MonClient::get_monmap_and_config (this=this@entry=0xffffe6a6dc00) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/mon/MonClient.cc:199
#8  0x0000ffffe96d0a48 [PAC] in librados::v14_2_0::RadosClient::connect (this=0xffffe0063bb0) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/librados/RadosClient.cc:232
#9  0x0000ffffe9667464 [PAC] in _rados_connect (cluster=0xffffe0063bb0) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/librados/librados_c.cc:221
#10 0x0000ffffe97e912c [PAC] in __pyx_pf_5rados_5Rados_28connect () from /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so
#11 0x0000ffffe97e8d9c in __pyx_pw_5rados_5Rados_29connect () from /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so
#12 0x0000ffffe9915e0c in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS () from /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so
#13 0x0000fffff7b18e40 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0xffffe6a6e5f8, callable=0xffffe9973c60, tstate=0xaaaaaaed6f70) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Include/internal/pycore_call.h:92
#14 method_vectorcall (method=<optimized out>, args=0xfffff7f3e298 <_PyRuntime+76288>, nargsf=<optimized out>, kwnames=0x0) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Objects/classobject.c:69
#15 0x0000fffff7ac9e90 [PAC] in PyCFunction_Call (kwargs=0xffffe6b165c0, args=0xfffff7f3e280 <_PyRuntime+76264>, callable=0xffffe6edac80) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Objects/call.c:387
#16 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0xfffff7fb0110, throwflag=<optimized out>) at Python/bytecodes.c:3254
#17 0x0000fffff7b18d58 [PAC] in _PyFunction_Vectorcall (kwnames=0x0, nargsf=1, stack=0xffffe6a6e858, func=0xffffe99c0fe0) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Objects/call.c:419
#18 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0xffffe6a6e858, callable=0xffffe99c0fe0, tstate=0xaaaaaaed6f70) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Include/internal/pycore_call.h:92
#19 method_vectorcall (method=<optimized out>, args=0xfffff7f3e298 <_PyRuntime+76288>, nargsf=<optimized out>, kwnames=0x0) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Objects/classobject.c:69
#20 0x0000fffff7c20f24 [PAC] in thread_run (boot_raw=0xffffe0066050) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Modules/_threadmodule.c:1114
#21 0x0000fffff7bd1ef0 [PAC] in pythread_wrapper (arg=<optimized out>) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Python/thread_pthread.h:233
#22 0x0000fffff7800584 [PAC] in start_thread (arg=0xfffff7fa4760) at pthread_create.c:444
#23 0x0000fffff786fc4c [PAC] in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76
Stepping through with gdb, it seems the Ceph code passes a garbage iterator to `schedule.erase`:
#1  CommonSafeTimer<std::mutex>::cancel_all_events (this=this@entry=0xffffe6a6e000) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/common/Timer.cc:206
206         schedule.erase(p->second);
(gdb) print p
$3 = {first = 0xffffe01100f0, second = <error reading variable: Cannot access memory at address 0xe0073e70>}
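For context, the crashing code follows a common pattern: a `schedule` multimap ordered by expiry time, plus an `events` map from each `Context*` callback back to its iterator into `schedule`; `cancel_all_events` walks `events` and erases the stored iterator. The sketch below is a simplified stand-in based only on the types visible in the backtrace (names and the `TimePoint` type are hypothetical, not the actual Ceph source), but it shows why this crashes: `schedule.erase(p->second)` dereferences the iterator's red-black tree node directly, so if the stored iterator is junk, `_Rb_tree_rebalance_for_erase` chases garbage child pointers, matching frames #0 through #4 above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Stand-in for Ceph's Context callback type (hypothetical simplification).
struct Context { virtual ~Context() = default; };

using TimePoint = uint64_t;  // stand-in for ceph::mono_clock::time_point

// Mirrors the containers seen in the backtrace: schedule orders pending
// events by expiry time; events maps a callback to its slot in schedule.
std::multimap<TimePoint, Context*> schedule;
std::map<Context*, std::multimap<TimePoint, Context*>::iterator> events;

void add_event_at(TimePoint when, Context* c) {
  auto i = schedule.insert({when, c});
  events[c] = i;  // this stored iterator must remain valid and initialized
}

void cancel_all_events() {
  while (!events.empty()) {
    auto p = events.begin();
    delete p->first;
    // If p->second holds uninitialized garbage (what valgrind reports
    // below), this erase faults inside _Rb_tree_rebalance_for_erase.
    schedule.erase(p->second);
    events.erase(p);
  }
}
```

With a well-formed `events` entry the erase is fine; the crash implies the iterator stored by `add_event_at` was never properly written, which is consistent with valgrind tracing the uninitialized value back to the `emplace` in `add_event_at` (Timer.cc:152).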
Running with `--debug-timer=20/20` actually makes it crash earlier, while formatting the debug log message for that value. Valgrind immediately reports use of uninitialized data:
$ valgrind --track-origins=yes ceph --debug-timer=20/20 status
==19564== Memcheck, a memory error detector
==19564== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==19564== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==19564== Command: /usr/bin/ceph --debug-timer=20/20 status
==19564==
2023-12-20T18:33:04.106+0900 19033180 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2023-12-20T18:33:04.109+0900 19033180 -1 AuthRegistry(0x1852c758) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2023-12-20T18:33:04.334+0900 19033180 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2023-12-20T18:33:04.335+0900 19033180 -1 AuthRegistry(0x19031c20) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2023-12-20T18:33:04.336+0900 19033180 10 timer(0x19032000).init
2023-12-20T18:33:04.381+0900 19033180 10 timer(0x19032000).add_event_at 5213.494829s -> 0x13d4ff80
2023-12-20T18:33:04.482+0900 1b2c7180 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2023-12-20T18:33:04.504+0900 1d307180 10 timer(0x19032000).timer_thread starting
2023-12-20T18:33:04.506+0900 1d307180 20 timer(0x19032000).timer_thread going to sleep
==19564== Thread 2:
==19564== Use of uninitialised value of size 8
==19564==    at 0x14C0AB54: UnknownInlinedFun (chrono.h:212)
==19564==    by 0x14C0AB54: UnknownInlinedFun (chrono.h:287)
==19564==    by 0x14C0AB54: UnknownInlinedFun (chrono.h:582)
==19564==    by 0x14C0AB54: std::ostream& ceph::operator<< <ceph::mono_clock, (void*)0>(std::ostream&, std::chrono::time_point<ceph::mono_clock, ceph::mono_clock::duration> const&) (ceph_time.cc:103)
==19564==    by 0x14BD7BCB: CommonSafeTimer<std::mutex>::cancel_all_events() (Timer.cc:204)
==19564==    by 0x14BD412B: CommonSafeTimer<std::mutex>::shutdown() (Timer.cc:67)
==19564==    by 0x14E16677: MonClient::shutdown() (MonClient.cc:562)
==19564==    by 0x14E17893: MonClient::get_monmap_and_config() (MonClient.cc:199)
==19564==    by 0x14910A47: librados::v14_2_0::RadosClient::connect() (RadosClient.cc:232)
==19564==    by 0x148A7463: rados_connect@@ (librados_c.cc:221)
==19564==    by 0x146C912B: __pyx_pf_5rados_5Rados_28connect (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x146C8D9B: __pyx_pw_5rados_5Rados_29connect (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x147F5E0B: __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x4AF8E3F: UnknownInlinedFun (pycore_call.h:92)
==19564==    by 0x4AF8E3F: method_vectorcall (classobject.c:69)
==19564==    by 0x4AA9E8F: UnknownInlinedFun (call.c:387)
==19564==    by 0x4AA9E8F: _PyEval_EvalFrameDefault (bytecodes.c:3254)
==19564== Uninitialised value was created by a heap allocation
==19564==    at 0x48C1040: operator new(unsigned long) (vg_replace_malloc.c:483)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (new_allocator.h:147)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (allocator.h:198)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (alloc_traits.h:482)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (stl_tree.h:563)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (stl_tree.h:613)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (stl_tree.h:1637)
==19564==    by 0x14BD4F3B: std::_Rb_tree_iterator<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > > > std::_Rb_tree<Context*, std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >, std::_Select1st<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > > >, std::less<Context*>, std::allocator<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > > > >::_M_emplace_hint_unique<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >&>(std::_Rb_tree_const_iterator<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > > >, std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >&) (stl_tree.h:2462)
==19564==    by 0x14BD34F3: UnknownInlinedFun (stl_map.h:640)
==19564==    by 0x14BD34F3: UnknownInlinedFun (stl_map.h:860)
==19564==    by 0x14BD34F3: CommonSafeTimer<std::mutex>::add_event_at(std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, Context*) (Timer.cc:152)
==19564==    by 0x14E15FBB: MonClient::init() (MonClient.cc:515)
==19564==    by 0x14E172BF: MonClient::get_monmap_and_config() (MonClient.cc:153)
==19564==    by 0x14910A47: librados::v14_2_0::RadosClient::connect() (RadosClient.cc:232)
==19564==    by 0x148A7463: rados_connect@@ (librados_c.cc:221)
==19564==    by 0x146C912B: __pyx_pf_5rados_5Rados_28connect (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x146C8D9B: __pyx_pw_5rados_5Rados_29connect (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x147F5E0B: __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x4AF8E3F: UnknownInlinedFun (pycore_call.h:92)
==19564==    by 0x4AF8E3F: method_vectorcall (classobject.c:69)
==19564==    by 0x4AA9E8F: UnknownInlinedFun (call.c:387)
==19564==    by 0x4AA9E8F: _PyEval_EvalFrameDefault (bytecodes.c:3254)
Running under valgrind doesn't actually crash, probably because the uninitialized data is all NULL, so NULL checks prevent it from being dereferenced. That is also likely why this bug doesn't reproduce on all builds.
The bug doesn't happen on F38 with 17.2.6, nor does valgrind complain on that version (it does complain about some Python internals, but I suspect those are unrelated false positives), so this seems to be a regression.
Updated by Hector Martin 4 months ago
Confirmed compiler bug, please close this. I hate computers... :(
Updated by Nitzan Mordechai 4 months ago
- Has duplicate Bug #61968: rados::connect() gets segment fault added
Updated by Samuel Just about 2 months ago
- Has duplicate Bug #64804: gcc-13 apparently breaks SafeTimer added