Bug #50659


Segmentation fault under Pacific 16.2.1 when using a custom crush location hook

Added by Andrew Davidoff about 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: pacific
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): OSD
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I feel like if this weren't somehow just my problem, there'd be an issue open on it already, but I'm not seeing one, and I've dug about as deep as I can without checking in with you all.

I was testing an upgrade (via ceph orch with cephadm) from 15.2.9 to 16.2.1 and found that my OSDs were crashing with a segmentation fault on startup under 16.2.1. A relevant snippet of the output in the logs is:

May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: debug      0> 2021-05-04T15:40:07.914+0000 7f4e61e54080 -1 *** Caught signal (Segmentation fault) **
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  in thread 7f4e61e54080 thread_name:ceph-osd
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  1: /lib64/libpthread.so.0(+0x12b20) [0x7f4e5fbbbb20]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  2: /lib64/libc.so.6(+0x9a3da) [0x7f4e5e8863da]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  3: (SubProcess::add_cmd_arg(char const*)+0x4c) [0x56504e693b2c]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  4: (SubProcess::add_cmd_args(char const*, ...)+0x75) [0x56504e693cc5]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  5: (ceph::crush::CrushLocation::update_from_hook()+0x2d4) [0x56504e883304]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  6: (ceph::crush::CrushLocation::init_on_startup()+0x3f5) [0x56504e884455]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  7: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0xcd1) [0x56504e5305b1]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  8: main() 
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  9: __libc_start_main()
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  10: _start() 
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

If I remove my custom crush location hook configuration (i.e. do not specify one), the OSD starts successfully, because the following early return short-circuits whatever is blowing up:

int CrushLocation::update_from_hook()
{
  if (cct->_conf->crush_location_hook.length() == 0)
    return 0;
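
For reference, by "removing" the hook configuration I just mean unsetting the option. Assuming it was set centrally via ceph config rather than in ceph.conf, that's something like:

ceph config rm osd crush_location_hook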

Best I can tell, this segfault happens before my script (simple bash, which I can run manually inside the container just fine) is ever executed. If I change the hook to something that I know should execute fine, like /bin/ls (even though that won't produce reasonable location output), I get the same segfault in seemingly the same place, and as best I can tell the alternate executable (/bin/ls in this case) is never run, same as when my script is specified.
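To be concrete, swapping the hook in and out was just a config change, along the lines of (the script path here is illustrative):

ceph config set osd crush_location_hook /usr/local/bin/my-location-hook.sh
ceph config set osd crush_location_hook /bin/ls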

I am not a C++ developer, but based on my understanding of what I think is the relevant code, plus some web searches, I suspect the segfault is coming from the push_back on the cmd_args vector in add_cmd_arg. I could be totally wrong about that, but that's where I'm at. strace indicated the SIGSEGV was of type SEGV_MAPERR, and I believe the faulting address was 0x3 (I no longer have this output handy, however).
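To illustrate the kind of failure I have in mind (this is a rough sketch of my own, not the actual Ceph SubProcess code): if a varargs helper like add_cmd_args walks its argument list until a NULL sentinel, and that sentinel were ever missing, va_arg would return whatever garbage is next on the stack, and constructing a std::string from that pointer inside add_cmd_arg would fault at a nonsense address like 0x3:

#include <cstdarg>
#include <string>
#include <vector>

struct FakeSubProcess {                  // stand-in, not Ceph's SubProcess
  std::vector<std::string> cmd_args;

  void add_cmd_arg(const char *arg) {
    cmd_args.push_back(arg);             // std::string ctor reads through arg;
  }                                      // a junk pointer faults here

  void add_cmd_args(const char *arg, ...) {
    va_list ap;
    va_start(ap, arg);
    while (arg != nullptr) {             // walk until a NULL sentinel
      add_cmd_arg(arg);
      arg = va_arg(ap, const char *);
    }
    va_end(ap);
  }
};

int main() {
  FakeSubProcess p;
  p.add_cmd_args("--cluster", "ceph", (char *)nullptr);  // fine: sentinel present
  p.add_cmd_args("--cluster", "ceph");                   // no sentinel: va_arg walks
  return 0;                                              // off the end and returns a
}                                                        // garbage pointer, e.g. 0x3
                                                         // -> SIGSEGV / SEGV_MAPERR

Again, that's just a guess at the failure mode that would match the stack trace and the SEGV_MAPERR at a tiny address; the real cause may be something else entirely.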

I am running all Ceph daemons in containers as pulled from Docker Hub, under Docker on Ubuntu 20.04 systems. I have tried docker 19.03.8-0ubuntu1.20.04 and 19.03.8-0ubuntu1.20.04.2, and kernels 5.4.0-42-generic, 5.4.0-71-generic, and HWE 5.8.0-49-generic. The dev cluster I was testing the upgrade in is built from KVM instances, but I was able to reproduce this on a bare-metal host as well.

I am attaching the full logs of such a failed start.

Please let me know what else I can provide to help here. Thanks.


Files

osd-segfault-when-crush-location-hook-configured.log (66.6 KB) - systemd OSD logs of the segfault (Andrew Davidoff, 05/05/2021 02:20 PM)
core.ceph-osd.1620430233.gz (936 KB) (Andrew Davidoff, 05/07/2021 11:39 PM)

Related issues 1 (0 open, 1 closed)

Copied to RADOS - Backport #53480: pacific: Segmentation fault under Pacific 16.2.1 when using a custom crush location hook (Resolved; assignee: Adam Kupczyk)
