Bug #50659
Segmentation fault under Pacific 16.2.1 when using a custom crush location hook
Status: closed
Description
I feel like if this wasn't somehow just my problem, there'd be an issue open on it already, but I'm not seeing one, and I feel like I've dug about as deep as I can without checking in with you all.
I was testing an upgrade (via ceph orch with cephadm) from 15.2.9 to 16.2.1 and found that my OSDs were crashing with a segmentation fault on startup under 16.2.1. A relevant snippet of the output in the logs is:
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: debug 0> 2021-05-04T15:40:07.914+0000 7f4e61e54080 -1 *** Caught signal (Segmentation fault) **
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: in thread 7f4e61e54080 thread_name:ceph-osd
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f4e5fbbbb20]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 2: /lib64/libc.so.6(+0x9a3da) [0x7f4e5e8863da]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 3: (SubProcess::add_cmd_arg(char const*)+0x4c) [0x56504e693b2c]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 4: (SubProcess::add_cmd_args(char const*, ...)+0x75) [0x56504e693cc5]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 5: (ceph::crush::CrushLocation::update_from_hook()+0x2d4) [0x56504e883304]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 6: (ceph::crush::CrushLocation::init_on_startup()+0x3f5) [0x56504e884455]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 7: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0xcd1) [0x56504e5305b1]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 8: main()
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 9: __libc_start_main()
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: 10: _start()
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
If I remove my custom crush location hook configuration (i.e. do not specify one), the OSD starts successfully, because the following early return skips whatever code is crashing:
int CrushLocation::update_from_hook()
{
  if (cct->_conf->crush_location_hook.length() == 0)
    return 0;
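As a temporary workaround until this is sorted out, I can unset the hook so that the early return above is taken. Assuming the hook was set centrally via `ceph config` (rather than in a local ceph.conf), something like:

```shell
# Remove the custom crush location hook for all OSDs, so the OSD falls back
# to the default (hostname-based) location logic. Adjust the target ("osd")
# if the option was set on a specific daemon instead.
ceph config rm osd crush_location_hook
```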
Best I can tell, this segfault happens before my script (simple bash, which I can run manually inside the container just fine) is ever executed. If I point the hook at something I know runs cleanly, such as /bin/ls (even though it won't produce sensible location output), I get the same segfault in seemingly the same place, and as far as I can tell the alternate executable is never run either, same as when my script is specified.
I am not a C++ developer, but based on my understanding of what I think is the relevant code, plus some web searching, I suspect the segfault is coming from the push_back on the cmd_args vector in add_cmd_arg. I could be totally wrong about that, but that's where I'm at. strace indicated the SIGSEGV was of type SEGV_MAPERR, and I believe the faulting address was 0x3 (I no longer have that output handy, however).
I am running all Ceph daemons in containers pulled from Docker Hub, under Docker on Ubuntu 20.04 systems. I have tried docker 19.03.8-0ubuntu1.20.04 and 19.03.8-0ubuntu1.20.04.2, and kernels 5.4.0-42-generic, 5.4.0-71-generic, and HWE 5.8.0-49-generic. The dev cluster I was testing the upgrade in is built from KVM instances, but I was able to reproduce this on a bare-metal host as well.
I am attaching the full logs of such a failed start.
Please let me know what else I can provide to help here. Thanks.