
Bug #50659

Segmentation fault under Pacific 16.2.1 when using a custom crush location hook

Added by Andrew Davidoff 6 months ago. Updated about 2 months ago.

Status:
New
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I feel like if this weren't somehow just my problem, there'd be an issue open on it already, but I'm not seeing one, and I've dug about as deep as I can without checking in with you all.

I was testing an upgrade (via ceph orch with cephadm) from 15.2.9 to 16.2.1 and found that my OSDs were crashing with a segmentation fault on startup under 16.2.1. A relevant snippet of the log output is:

May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]: debug      0> 2021-05-04T15:40:07.914+0000 7f4e61e54080 -1 *** Caught signal (Segmentation fault) **
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  in thread 7f4e61e54080 thread_name:ceph-osd
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  1: /lib64/libpthread.so.0(+0x12b20) [0x7f4e5fbbbb20]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  2: /lib64/libc.so.6(+0x9a3da) [0x7f4e5e8863da]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  3: (SubProcess::add_cmd_arg(char const*)+0x4c) [0x56504e693b2c]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  4: (SubProcess::add_cmd_args(char const*, ...)+0x75) [0x56504e693cc5]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  5: (ceph::crush::CrushLocation::update_from_hook()+0x2d4) [0x56504e883304]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  6: (ceph::crush::CrushLocation::init_on_startup()+0x3f5) [0x56504e884455]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  7: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0xcd1) [0x56504e5305b1]
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  8: main() 
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  9: __libc_start_main()
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  10: _start() 
May 04 15:40:07 02.ceph-kubernetes.dev.lax1.REDACTED.net bash[32744]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

If I remove my custom crush location hook configuration (i.e. do not specify one), the OSD starts successfully, because the following early return short-circuits whatever is blowing up:

int CrushLocation::update_from_hook()
{
  if (cct->_conf->crush_location_hook.length() == 0)
    return 0;
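
  // ... (rest of the function omitted: per the stack trace above, it builds a
  // SubProcess for the hook via add_cmd_args, which is where the crash occurs)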

As best I can tell, this segfault happens before my script (simple bash, which I can run manually inside the container just fine) is ever executed. If I change the hook to something I know should execute fine, like /bin/ls (even though that won't produce reasonable location output), I get the same segfault in seemingly the same place, and as best I can tell the alternate executable (/bin/ls in this case) is never run, same as when my script is specified.

I am not a C++ developer, but based on my understanding of what I think is the relevant code, and on web searches, I suspect the segfault is coming from a push_back on the cmd_args vector in add_cmd_arg. I could be totally wrong about that, but that's where I'm at. strace indicated the SIGSEGV was of type SEGV_MAPERR, and I believe the faulting address was 0x3 (I no longer have that output handy, however).
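
For reference, here is a minimal sketch of the sentinel-terminated variadic pattern I believe add_cmd_args follows (the names come from the stack trace; the bodies are my guess at the shape of the code, not the actual Ceph source):

#include <cstdarg>
#include <string>
#include <vector>

struct SubProcessSketch {
  std::vector<std::string> cmd_args;

  void add_cmd_arg(const char *arg) {
    // If arg is a garbage pointer (e.g. 0x3), constructing the std::string
    // inside push_back dereferences it and faults.
    cmd_args.push_back(arg);
  }

  // The argument list must end with a null pointer; if the sentinel is
  // missing, or an earlier argument is an invalid pointer, va_arg hands
  // add_cmd_arg a junk address.
  void add_cmd_args(const char *arg, ...) {
    va_list ap;
    va_start(ap, arg);
    for (const char *p = arg; p != nullptr; p = va_arg(ap, const char *))
      add_cmd_arg(p);
    va_end(ap);
  }
};

Reading a junk address like 0x3 out of an unterminated or corrupted va_list would square with the SEGV_MAPERR strace reported.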

I am running all Ceph daemons in containers pulled from Docker Hub, under Docker on Ubuntu 20.04 systems. I have tried docker 19.03.8-0ubuntu1.20.04 and 19.03.8-0ubuntu1.20.04.2, and kernels 5.4.0-42-generic, 5.4.0-71-generic, and HWE 5.8.0-49-generic. The dev cluster I was testing the upgrade in is built from KVM instances, but I was able to reproduce this on a bare-metal host as well.

I am attaching the full logs of such a failed start.

Please let me know what else I can provide to help here. Thanks.

osd-segfault-when-crush-location-hook-configured.log - systemd OSD logs of segfault (66.6 KB) Andrew Davidoff, 05/05/2021 02:20 PM

core.ceph-osd.1620430233.gz (936 KB) Andrew Davidoff, 05/07/2021 11:39 PM

History

#1 Updated by Andrew Davidoff 6 months ago

I forgot to add: I diffed the code I thought was relevant between tags v15.2.9 and v16.2.1. I saw some win32-related changes that looked "close" to the potentially problematic code, but nothing that stood out as a change that would have broken this. That makes me wonder if it's a compiler issue, which I only suggest because I found bug reports for segfaults on push_back that appeared to be caused by buggy compilers. I know that may be a long shot; I don't normally suggest it's the compiler's fault.

#2 Updated by Neha Ojha 6 months ago

  • Status changed from New to Need More Info

Is it possible for you to capture a coredump? Did the same crush_location_hook work fine on your 15.2.9 cluster?

#3 Updated by Andrew Davidoff 6 months ago

I have attached a coredump. This hook works fine in 15.2.9. I can also run it fine manually from inside a launched OSD container under 16.2.1. I don't think the OSD is actually getting to the point of execing the location hook. Please let me know if I can provide anything else.

#4 Updated by Andrew Davidoff 5 months ago

Here's a bit more info that may be useful. The crush location hook I am using lives under what the container sees as /var/log/ceph (on the host it's /var/log/ceph/$FSID), only because that volume is already exported to the container out of the box. Maybe something about that location is problematic? Though, as I noted earlier, trying an executable under /bin, which is part of the container image, produced the same result.

#5 Updated by Andrew Davidoff 4 months ago

FYI, I tried ceph/daemon-base:master-24e1f91-pacific-centos-8-x86_64 (the latest non-devel build at this time), just to see if somehow something was different there since that build is newer (even though it should be, and is, still 16.2.4). The problem persists there too.

#6 Updated by Neha Ojha 4 months ago

  • Priority changed from Normal to Urgent

#7 Updated by Andrew Davidoff 4 months ago

I just wanted to note that the status is listed as "Need More Info", but I think I have provided everything I have been asked for, plus anything else I could think of. This is not me being a nag; I just wanted to be clear about my perspective on the status of this ticket as it pertains to my input.

#8 Updated by Neha Ojha 4 months ago

  • Status changed from Need More Info to New

#9 Updated by Andrew Davidoff 3 months ago

I saw that 16.2.5 was released. Though I didn't expect it to address this issue, I tested with it anyway just to be sure. The issue persists with 16.2.5.

#10 Updated by Andrew Davidoff 3 months ago

Based on the progress here, it seems like I'm probably the only person to have reported this, and I still can't figure out why that'd be. Have you had a chance to look at the core dump and/or reproduce this, and do you have an idea of what's going on? It might help me mitigate on my end if nothing else. Thanks.

#11 Updated by Neha Ojha 2 months ago

  • Assignee set to Adam Kupczyk

Adam, can you take a look at this?

#12 Updated by Andrew Davidoff about 2 months ago

I dug into this more today and I am wondering if it has something to do with `_conf->cluster` not being set correctly (to the default of `ceph`). Unfortunately, editing the OSD's `unit.run` to include `--cluster ceph` in the arg list didn't change the behavior, so no additional clue there.
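
If that theory were right, the hook invocation presumably feeds the cluster name into the variadic argument list along these lines (a hypothetical reconstruction against the SubProcessSketch above; the option names and conf fields are assumptions, not verified Ceph source):

#include <string>

struct ConfSketch {                  // stand-in for cct->_conf
  std::string crush_location_hook;
  std::string cluster = "ceph";      // the default the OSD should fall back to
  std::string id = "0";
  std::string type = "osd";
};

void update_from_hook_sketch(const ConfSketch &conf) {
  if (conf.crush_location_hook.empty())
    return;                          // the early return quoted in the description
  SubProcessSketch hook;
  hook.add_cmd_args("--cluster", conf.cluster.c_str(),
                    "--id", conf.id.c_str(),
                    "--type", conf.type.c_str(),
                    static_cast<const char *>(nullptr));  // required sentinel
}

Notably, an unset cluster name would just yield an empty string here (c_str() on an empty std::string is still a valid pointer), so the crash points more toward an invalid pointer somewhere in that argument list than toward the value of --cluster itself.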
