Bug #23431

OSD Segmentation fault in thread_name:safe_timer

Added by Dietmar Rieder over 1 year ago. Updated about 1 year ago.

Status: Duplicate
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 03/21/2018
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Description

I noticed an OSD segmentation fault in the log of one of our OSDs.
See the attached log entries. There is no core file that I could provide.

Best
Dietmar

    -2> 2018-03-21 05:09:31.310950 7fc69ba1c700  1 -- 10.0.2.10:6828/4210 <== osd.230 10.0.2.10:0/4168 14138 ==== osd_ping(ping e3328 stamp 2018-03-21 05:09:31.307794) v4 ==== 2004+0+0 (603415365 0 0) 0x5589fe22cc00 con 0x558a107ec000
    -1> 2018-03-21 05:09:31.310960 7fc69ba1c700  1 -- 10.0.2.10:6828/4210 --> 10.0.2.10:0/4168 -- osd_ping(ping_reply e3328 stamp 2018-03-21 05:09:31.307794) v4 -- 0x558a0645c600 con 0
     0> 2018-03-21 05:09:31.330524 7fc698258700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc698258700 thread_name:safe_timer
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa3c611) [0x5589e73f7611]
 2: (()+0xf5e0) [0x7fc69f5185e0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ceph-osd.239_segfault.log.gz (242 KB) Dietmar Rieder, 03/21/2018 09:09 AM


Related issues

Related to RADOS - Bug #24023: Segfault on OSD in 12.2.5 Duplicate 05/05/2018
Related to RADOS - Bug #23585: osd: safe_timer segfault Duplicate 04/08/2018
Related to RADOS - Bug #23564: OSD Segfaults Duplicate 04/05/2018
Duplicates RADOS - Bug #23352: osd: segfaults under normal operation Resolved 03/14/2018

History

#1 Updated by Brad Hubbard over 1 year ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Status changed from New to Need More Info
  • Assignee set to Brad Hubbard
  • Component(RADOS) OSD added

What's the exact version of ceph-osd you are using (the exact package URL, if possible, please)?

You could try 'objdump -rdS /path/to/ceph-osd' but you may need the relevant debuginfo packages installed.

If you can capture a coredump and sosreport, please upload them using ceph-post-file and let us know the UID here.
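
For reference, a rough sketch of what that might look like on an EL7 host (the paths, the debuginfo package source, and the systemd drop-in are assumptions, not exact instructions):

    # fetch debug symbols and produce the disassembly (needs yum-utils and the
    # matching ceph debuginfo repository enabled)
    sudo debuginfo-install ceph-osd
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump

    # let the OSD dump core on the next crash: add LimitCORE=infinity under
    # [Service] in a drop-in for the unit
    sudo systemctl edit ceph-osd@.service

    # upload artifacts; ceph-post-file prints the UID to quote back in this tracker
    ceph-post-file ceph-osd.objdump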

#2 Updated by Dietmar Rieder over 1 year ago

The ceph-osd package comes from https://download.ceph.com/rpm-luminous/el7/x86_64/
I verified via md5sum that the local copy is the same as the one on download.ceph.com:

    # md5sum ceph-osd_local ceph-osd_ceph.com
    5ec58a32c9ac909fe7b094e1df39c3c0  ceph-osd_local
    5ec58a32c9ac909fe7b094e1df39c3c0  ceph-osd_ceph.com

ceph was installed via ceph-deploy using the yum repo with baseurl=http://download.ceph.com/rpm-luminous/el7/$basearch

    # ceph -v
    ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)

I produced an objdump (see UID 6609a6e3-c22b-4a5d-8bea-5b40b24e9e73); however, I'm not familiar with that, so I'm not sure which the "relevant debuginfo packages" are.

I still have no core file.

#3 Updated by Marcin Gibula over 1 year ago

I have seen this as well, on our cluster. We're using BlueStore, Ubuntu 16, latest luminous.
The crashes were totally random; they happened with no load and on empty OSDs from both replicated and EC pools.

There is also no core dump here and no backtrace in the logs, so I guess the stack is smashed.
I wonder if it could be related to http://tracker.ceph.com/issues/21259?

#4 Updated by Kjetil Joergensen over 1 year ago

There's a coredump-in-apport on Google Drive in http://tracker.ceph.com/issues/23352; on the face of it, it looks similar at least.

#5 Updated by Brad Hubbard over 1 year ago

  • Description updated (diff)
  • Status changed from Need More Info to Verified

I agree these are similar and the cause may indeed be the same; however, there are only two stack frames in this instance and they both appear to be in a library rather than in ceph (probably libgcc/glibc spawning the new thread). This is reinforced by the following log output posted in #23352.

[dmesg]
[35103471.167728] safe_timer[491476]: segfault at 21300080000 ip 00007f9f3f7bfccb sp 00007f9f31c7df70 error 4 in libgcc_s.so.1[7f9f3f7b1000+16000]

This shows the crash happening in libgcc. Rather than suspecting a bug in libgcc itself, this is most likely due to some significant memory corruption that is hit when the new thread is being created; at least, that's a theory with no evidence to back it at this stage.

That memory address, '21300080000' looks bogus as well.
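
As a rough sanity check (assuming a bash shell, and taking the ip and mapping base straight from the dmesg line above), the faulting instruction pointer does land inside the reported libgcc mapping:

    printf '%#x\n' $((0x7f9f3f7bfccb - 0x7f9f3f7b1000))
    # -> 0xeccb, well within the 0x16000-byte mapping of libgcc_s.so.1

With the matching libgcc debuginfo installed, something like 'addr2line -f -e /lib64/libgcc_s.so.1 0xeccb' might resolve that offset to a function; without it the frame stays anonymous.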

#6 Updated by Brad Hubbard over 1 year ago

  • Description updated (diff)

#7 Updated by Brad Hubbard over 1 year ago

  • Related to Bug #23352: osd: segfaults under normal operation added

#8 Updated by Aleksei Zakharov over 1 year ago

Hi.
We have the same issue:

[dmesg]
[1408519.211602] safe_timer[4265]: segfault at 1000000c9 ip 00007f2453fb8ccb sp 00007f244d105830 error 4 in libgcc_s.so.1[7f2453faa000+16000]

[ceph -v]
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

This suddenly happens on random OSDs, and we don't see any dependence on activity or load. It looks like a random bug. There are no errors in the ceph-osd log.

Ubuntu 16.04, BlueStore OSDs, kernels 4.4 and 4.13.

#9 Updated by Josh Durgin over 1 year ago

  • Related to Bug #24023: Segfault on OSD in 12.2.5 added

#10 Updated by Josh Durgin over 1 year ago

  • Related to Bug #23585: osd: safe_timer segfault added

#11 Updated by Josh Durgin over 1 year ago

  • Related to Bug #23564: OSD Segfaults added

#12 Updated by Brad Hubbard over 1 year ago

  • Status changed from Verified to Duplicate

Closing as a duplicate of #23352, where we are focusing.

#13 Updated by Nathan Cutler over 1 year ago

  • Related to deleted (Bug #23352: osd: segfaults under normal operation)

#14 Updated by Nathan Cutler over 1 year ago

  • Duplicates Bug #23352: osd: segfaults under normal operation added

#15 Updated by Kevin Tibi about 1 year ago

Hi,

Same issue with ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

My OSDs are in Docker, so the containers fail with a memory issue:

Sep 14 10:01:03 ceph02 kernel: safe_timer54506: segfault at 2000 ip 00007efc1c7ee0b8 sp 00007efc1657c870 error 4 in libgcc_s-4.8.5-20150702.so.1[7efc1c7df000+15000]
Sep 14 10:01:03 ceph02 dockerd: time="2018-09-14T10:01:03+02:00" level=info msg="shim reaped" id=60b872b1e9034d8166f20e08678f0da4793f6409a06b6f00b0a43ae9df5deae4 module="containerd/tasks"

This is very random and happens even without activity.

#16 Updated by Brad Hubbard about 1 year ago

See #23352

The fix is in 12.2.8
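
If it helps, one quick way to confirm which releases the daemons are actually running before and after upgrading (assuming a luminous cluster reachable from an admin node):

    ceph versions               # per-daemon-type summary of the running versions
    ceph tell osd.239 version   # version reported by a single OSD

osd.239 here is just the OSD from the attached log; substitute any OSD id.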
