Project

General

Profile

Actions

Bug #63867

open

Segfault in CommonSafeTimer::cancel_all_events due to uninitialized data

Added by Hector Martin 5 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
common
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Reported downstream here: https://bugzilla.redhat.com/show_bug.cgi?id=2241339

`ceph status` crashes with a SIGSEGV on Fedora 39 packages, both x86_64 and aarch64. The backtrace suggests a junk pointer:

std::_Rb_tree_rebalance_for_erase (__z=0xe0073e70, __header=...) at ../../../../../libstdc++-v3/src/c++98/tree.cc:296
296         if (__y->_M_left == 0)     // __z has at most one non-null child. y == z.                                                                                               
(gdb) bt
#0  std::_Rb_tree_rebalance_for_erase (__z=0xe0073e70, __header=...) at ../../../../../libstdc++-v3/src/c++98/tree.cc:296
#1  0x0000ffffe8f17a94 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, unsigned int>, std::_Select1st<std::pair<unsigned long const, unsigned int> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned int> > >::_M_erase_aux (__position=..., this=0xffffe6a6e068) at /usr/include/c++/13/bits/stl_tree.h:2489
#2  std::_Rb_tree<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*>, std::_Select1st<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> >, std::less<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > >, std::allocator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >::erase[abi:cxx11](std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> >) (
    __position=..., this=0xffffe6a6e068) at /usr/include/c++/13/bits/stl_tree.h:1210
#3  std::multimap<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, Context*, std::less<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > >, std::allocator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >::erase[abi:cxx11](std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> >) (__position=..., this=0xffffe6a6e068) at /usr/include/c++/13/bits/stl_multimap.h:715
#4  CommonSafeTimer<std::mutex>::cancel_all_events (this=this@entry=0xffffe6a6e000) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/common/Timer.cc:206
#5  0x0000ffffe8f1412c [PAC] in CommonSafeTimer<std::mutex>::shutdown (this=0xffffe6a6e000) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/common/Timer.cc:66
#6  0x0000ffffe9156678 [PAC] in MonClient::shutdown (this=0xffffe6a6dc00) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/mon/MonClient.cc:562
#7  0x0000ffffe91575cc [PAC] in MonClient::get_monmap_and_config (this=this@entry=0xffffe6a6dc00) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/mon/MonClient.cc:199
#8  0x0000ffffe96d0a48 [PAC] in librados::v14_2_0::RadosClient::connect (this=0xffffe0063bb0) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/librados/RadosClient.cc:232
#9  0x0000ffffe9667464 [PAC] in _rados_connect (cluster=0xffffe0063bb0) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/librados/librados_c.cc:221
#10 0x0000ffffe97e912c [PAC] in __pyx_pf_5rados_5Rados_28connect () from /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so
#11 0x0000ffffe97e8d9c in __pyx_pw_5rados_5Rados_29connect () from /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so
#12 0x0000ffffe9915e0c in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS () from /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so
#13 0x0000fffff7b18e40 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0xffffe6a6e5f8, callable=0xffffe9973c60, tstate=0xaaaaaaed6f70)
    at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Include/internal/pycore_call.h:92
#14 method_vectorcall (method=<optimized out>, args=0xfffff7f3e298 <_PyRuntime+76288>, nargsf=<optimized out>, kwnames=0x0)
    at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Objects/classobject.c:69
#15 0x0000fffff7ac9e90 [PAC] in PyCFunction_Call (kwargs=0xffffe6b165c0, args=0xfffff7f3e280 <_PyRuntime+76264>, callable=0xffffe6edac80)
    at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Objects/call.c:387
#16 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0xfffff7fb0110, throwflag=<optimized out>) at Python/bytecodes.c:3254
#17 0x0000fffff7b18d58 [PAC] in _PyFunction_Vectorcall (kwnames=0x0, nargsf=1, stack=0xffffe6a6e858, func=0xffffe99c0fe0)
    at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Objects/call.c:419
#18 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0xffffe6a6e858, callable=0xffffe99c0fe0, tstate=0xaaaaaaed6f70)
    at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Include/internal/pycore_call.h:92
#19 method_vectorcall (method=<optimized out>, args=0xfffff7f3e298 <_PyRuntime+76288>, nargsf=<optimized out>, kwnames=0x0)
    at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Objects/classobject.c:69
--Type <RET> for more, q to quit, c to continue without paging--
#20 0x0000fffff7c20f24 [PAC] in thread_run (boot_raw=0xffffe0066050) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Modules/_threadmodule.c:1114
#21 0x0000fffff7bd1ef0 [PAC] in pythread_wrapper (arg=<optimized out>) at /usr/src/debug/python3.12-3.12.1-1.fc39.aarch64/Python/thread_pthread.h:233
#22 0x0000fffff7800584 [PAC] in start_thread (arg=0xfffff7fa4760) at pthread_create.c:444
#23 0x0000fffff786fc4c [PAC] in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76

Poking around stepping with gdb, it seems the Ceph code passes in a bad pointer to `schedule.erase`:

#1  CommonSafeTimer<std::mutex>::cancel_all_events (this=this@entry=0xffffe6a6e000) at /usr/src/debug/ceph-18.2.1-1.fc39.aarch64/src/common/Timer.cc:206
206         schedule.erase(p->second);
(gdb) print p
$3 = {first = 0xffffe01100f0, second = <error reading variable: Cannot access memory at address 0xe0073e70>}

`--debug-timer=20/20` actually makes it crash earlier, trying to print the debug log for that value. Valgrind immediately reports uninitialized data:

$ valgrind --track-origins=yes ceph --debug-timer=20/20 status
==19564== Memcheck, a memory error detector
==19564== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==19564== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==19564== Command: /usr/bin/ceph --debug-timer=20/20 status
==19564== 
2023-12-20T18:33:04.106+0900 19033180 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2023-12-20T18:33:04.109+0900 19033180 -1 AuthRegistry(0x1852c758) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2023-12-20T18:33:04.334+0900 19033180 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2023-12-20T18:33:04.335+0900 19033180 -1 AuthRegistry(0x19031c20) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2023-12-20T18:33:04.336+0900 19033180 10 timer(0x19032000).init
2023-12-20T18:33:04.381+0900 19033180 10 timer(0x19032000).add_event_at 5213.494829s -> 0x13d4ff80
2023-12-20T18:33:04.482+0900 1b2c7180 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2023-12-20T18:33:04.504+0900 1d307180 10 timer(0x19032000).timer_thread starting
2023-12-20T18:33:04.506+0900 1d307180 20 timer(0x19032000).timer_thread going to sleep
==19564== Thread 2:
==19564== Use of uninitialised value of size 8
==19564==    at 0x14C0AB54: UnknownInlinedFun (chrono.h:212)
==19564==    by 0x14C0AB54: UnknownInlinedFun (chrono.h:287)
==19564==    by 0x14C0AB54: UnknownInlinedFun (chrono.h:582)
==19564==    by 0x14C0AB54: std::ostream& ceph::operator<< <ceph::mono_clock, (void*)0>(std::ostream&, std::chrono::time_point<ceph::mono_clock, ceph::mono_clock::duration> const&) (ceph_time.cc:103)
==19564==    by 0x14BD7BCB: CommonSafeTimer<std::mutex>::cancel_all_events() (Timer.cc:204)
==19564==    by 0x14BD412B: CommonSafeTimer<std::mutex>::shutdown() (Timer.cc:67)
==19564==    by 0x14E16677: MonClient::shutdown() (MonClient.cc:562)
==19564==    by 0x14E17893: MonClient::get_monmap_and_config() (MonClient.cc:199)
==19564==    by 0x14910A47: librados::v14_2_0::RadosClient::connect() (RadosClient.cc:232)
==19564==    by 0x148A7463: rados_connect@@ (librados_c.cc:221)
==19564==    by 0x146C912B: __pyx_pf_5rados_5Rados_28connect (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x146C8D9B: __pyx_pw_5rados_5Rados_29connect (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x147F5E0B: __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x4AF8E3F: UnknownInlinedFun (pycore_call.h:92)
==19564==    by 0x4AF8E3F: method_vectorcall (classobject.c:69)
==19564==    by 0x4AA9E8F: UnknownInlinedFun (call.c:387)
==19564==    by 0x4AA9E8F: _PyEval_EvalFrameDefault (bytecodes.c:3254)
==19564==  Uninitialised value was created by a heap allocation
==19564==    at 0x48C1040: operator new(unsigned long) (vg_replace_malloc.c:483)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (new_allocator.h:147)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (allocator.h:198)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (alloc_traits.h:482)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (stl_tree.h:563)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (stl_tree.h:613)
==19564==    by 0x14BD4F3B: UnknownInlinedFun (stl_tree.h:1637)
==19564==    by 0x14BD4F3B: std::_Rb_tree_iterator<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > > > std::_Rb_tree<Context*, std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >, std::_Select1st<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > > >, std::less<Context*>, std::allocator<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > > > >::_M_emplace_hint_unique<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >&>(std::_Rb_tree_const_iterator<std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > > >, std::pair<Context* const, std::_Rb_tree_iterator<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> > >&) (stl_tree.h:2462)
==19564==    by 0x14BD34F3: UnknownInlinedFun (stl_map.h:640)
==19564==    by 0x14BD34F3: UnknownInlinedFun (stl_map.h:860)
==19564==    by 0x14BD34F3: CommonSafeTimer<std::mutex>::add_event_at(std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, Context*) (Timer.cc:152)
==19564==    by 0x14E15FBB: MonClient::init() (MonClient.cc:515)
==19564==    by 0x14E172BF: MonClient::get_monmap_and_config() (MonClient.cc:153)
==19564==    by 0x14910A47: librados::v14_2_0::RadosClient::connect() (RadosClient.cc:232)
==19564==    by 0x148A7463: rados_connect@@ (librados_c.cc:221)
==19564==    by 0x146C912B: __pyx_pf_5rados_5Rados_28connect (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x146C8D9B: __pyx_pw_5rados_5Rados_29connect (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x147F5E0B: __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS (in /usr/lib64/python3.12/site-packages/rados.cpython-312-aarch64-linux-gnu.so)
==19564==    by 0x4AF8E3F: UnknownInlinedFun (pycore_call.h:92)
==19564==    by 0x4AF8E3F: method_vectorcall (classobject.c:69)
==19564==    by 0x4AA9E8F: UnknownInlinedFun (call.c:387)
==19564==    by 0x4AA9E8F: _PyEval_EvalFrameDefault (bytecodes.c:3254)

Running under valgrind actually doesn't crash, probably because the uninitialized data is all NULL so it avoids being dereferenced due to NULL checks. This is also likely why presumably this bug doesn't happen on all builds.

The bug doesn't happen on F38/17.2.6, nor does valgrind complain on that version (well, it does about Python stuff, but I suspect those are unrelated false positives), so this seems to be a regression.


Related issues 2 (0 open2 closed)

Has duplicate RADOS - Bug #61968: rados::connect() gets segement faultClosedNitzan Mordechai

Actions
Has duplicate RADOS - Bug #64804: gcc-13 apparently breaks SafeTimerDuplicateSamuel Just

Actions
Actions #1

Updated by Hector Martin 4 months ago

This is looking like a GCC bug now :(

Actions #2

Updated by Hector Martin 4 months ago

Confirmed compiler bug, please close this. I hate computers... :(

Actions #3

Updated by Nitzan Mordechai 4 months ago

  • Has duplicate Bug #61968: rados::connect() gets segement fault added
Actions #4

Updated by Samuel Just about 2 months ago

  • Has duplicate Bug #64804: gcc-13 apparently breaks SafeTimer added
Actions

Also available in: Atom PDF