Bug #23431 (closed): OSD Segmentation fault in thread_name:safe_timer

Added by Dietmar Rieder about 6 years ago. Updated over 5 years ago.

Status: Duplicate
Priority: Normal
Category: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor
Component(RADOS): OSD

Description

I noticed a segmentation fault in the log of one of our OSDs.
See the attached log entries. There is no core file that I could provide.

Best
Dietmar

    -2> 2018-03-21 05:09:31.310950 7fc69ba1c700  1 -- 10.0.2.10:6828/4210 <== osd.230 10.0.2.10:0/4168 14138 ==== osd_ping(ping e3328 stamp 2018-03-21 05:09:31.307794) v4 ==== 2004+0+0 (603415365 0 0) 0x5589fe22cc00 con 0x558a107ec000
    -1> 2018-03-21 05:09:31.310960 7fc69ba1c700  1 -- 10.0.2.10:6828/4210 --> 10.0.2.10:0/4168 -- osd_ping(ping_reply e3328 stamp 2018-03-21 05:09:31.307794) v4 -- 0x558a0645c600 con 0
     0> 2018-03-21 05:09:31.330524 7fc698258700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc698258700 thread_name:safe_timer
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa3c611) [0x5589e73f7611]
 2: (()+0xf5e0) [0x7fc69f5185e0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Files

ceph-osd.239_segfault.log.gz (242 KB) - Dietmar Rieder, 03/21/2018 09:09 AM

Related issues (4 total: 0 open, 4 closed)

Related to RADOS - Bug #24023: Segfault on OSD in 12.2.5 (Duplicate, 05/05/2018)
Related to RADOS - Bug #23585: osd: safe_timer segfault (Duplicate, 04/08/2018)
Related to RADOS - Bug #23564: OSD Segfaults (Duplicate, 04/05/2018)
Is duplicate of RADOS - Bug #23352: osd: segfaults under normal operation (Resolved, Brad Hubbard, 03/14/2018)
#1 Updated by Brad Hubbard about 6 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Status changed from New to Need More Info
  • Assignee set to Brad Hubbard
  • Component(RADOS) OSD added

What's the exact version of ceph-osd you are using? (Exact package URL if possible, please.)

You could try 'objdump -rdS /path/to/ceph-osd', but you may need the relevant debuginfo packages installed.
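
For reference, a hedged sketch of what that might look like on the el7 packages used here (the debuginfo package name and availability are assumptions, not verified against this exact install):

    # install debug symbols first (CentOS/RHEL 7; assumes a ceph-debuginfo
    # package matching the installed ceph version is available)
    yum install ceph-debuginfo-12.2.4
    # disassemble with interleaved source; the output is large, so save it
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump.txt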

If you can capture a coredump and an sosreport, please upload them using ceph-post-file and let us know the UID here.
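
For anyone following along, a minimal sketch of one way to capture and upload such a core (the unit name, core pattern, and paths are assumptions for a typical el7 systemd deployment, not the only way to do this):

    # let the OSD daemon dump core: add LimitCORE=infinity under [Service]
    systemctl edit ceph-osd@239
    # write cores to a known location
    echo '/var/core/core.%e.%p' > /proc/sys/kernel/core_pattern
    systemctl restart ceph-osd@239
    # after the next crash, upload the core; the command prints a UID to share
    ceph-post-file -d "osd.239 safe_timer segfault" /var/core/core.ceph-osd.<pid>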

#2 Updated by Dietmar Rieder about 6 years ago

The ceph-osd binary comes from https://download.ceph.com/rpm-luminous/el7/x86_64/
I verified via md5sum that the local copy is the same as the one on download.ceph.com:

    # md5sum ceph-osd_local ceph-osd_ceph.com
    5ec58a32c9ac909fe7b094e1df39c3c0  ceph-osd_local
    5ec58a32c9ac909fe7b094e1df39c3c0  ceph-osd_ceph.com
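
(For context, the ceph.com copy above can be extracted from the published RPM; a sketch of one way to do that, with the exact RPM filename being an assumption:)

    # fetch the published package and extract the binary for comparison
    curl -O https://download.ceph.com/rpm-luminous/el7/x86_64/ceph-osd-12.2.4-0.el7.x86_64.rpm
    rpm2cpio ceph-osd-12.2.4-0.el7.x86_64.rpm | cpio -idmv ./usr/bin/ceph-osd
    md5sum usr/bin/ceph-osd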

ceph was installed via ceph-deploy using the yum repo with baseurl=http://download.ceph.com/rpm-luminous/el7/$basearch

    # ceph -v
    ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)

I produced an objdump (see UID 6609a6e3-c22b-4a5d-8bea-5b40b24e9e73); however, I'm not familiar with that tool, so I'm not sure what the "relevant debuginfo packages" are.

I still have no core file.

#3 Updated by Marcin Gibula about 6 years ago

I have seen this as well, on our cluster. We're using BlueStore, Ubuntu 16, and the latest Luminous.
The crashes were totally random; they happened with no load and on empty OSDs from both replicated and EC pools.

There is also no core dump here and no backtrace in the logs, so I guess the stack is smashed.
I wonder if it could be related to http://tracker.ceph.com/issues/21259?

#4 Updated by Kjetil Joergensen about 6 years ago

There's a coredump (in apport format) on Google Drive in http://tracker.ceph.com/issues/23352; on the face of it, it looks similar at least.

#5 Updated by Brad Hubbard about 6 years ago

  • Description updated (diff)
  • Status changed from Need More Info to 12

I agree these are similar and the cause may indeed be the same; however, there are only two stack frames in this instance, and they both appear to be in a library rather than in ceph (probably libgcc/glibc spawning the new thread). This is reinforced by the following log output posted in #23352.

[dmesg]
[35103471.167728] safe_timer[491476]: segfault at 21300080000 ip 00007f9f3f7bfccb sp 00007f9f31c7df70 error 4 in libgcc_s.so.1[7f9f3f7b1000+16000]

This shows the crash being in libgcc. Rather than suspecting a bug in libgcc, this is most likely due to some significant memory corruption that is being hit when the new thread is created; at least, that's a theory with no evidence to back it at this stage.
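
As a sanity check on that reading: the faulting ip minus the mapping base gives an offset inside the library, which can be resolved to a symbol if the matching binary is at hand (the library path below is an assumption for a typical install):

    # 0x7f9f3f7bfccb - 0x7f9f3f7b1000 = 0xeccb, which is inside the
    # 0x16000-byte mapping, so the fault really is within libgcc_s.so.1
    addr2line -f -e /lib64/libgcc_s.so.1 0xeccb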

That memory address, '21300080000', looks bogus as well.

#6 Updated by Brad Hubbard about 6 years ago

  • Description updated (diff)
#7 Updated by Brad Hubbard about 6 years ago

  • Related to Bug #23352: osd: segfaults under normal operation added
#8 Updated by Aleksei Zakharov almost 6 years ago

Hi.
We have the same issue:

[dmesg]
[1408519.211602] safe_timer[4265]: segfault at 1000000c9 ip 00007f2453fb8ccb sp 00007f244d105830 error 4 in libgcc_s.so.1[7f2453faa000+16000]

[ceph -v]
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

This suddenly happens with random OSDs, and we don't see a dependence on any activity or load. It looks like a random bug. There are no errors in the ceph-osd log.

Ubuntu 16.04, BlueStore OSDs, kernels 4.4 and 4.13.

#9 Updated by Josh Durgin almost 6 years ago

  • Related to Bug #24023: Segfault on OSD in 12.2.5 added
#10 Updated by Josh Durgin almost 6 years ago

  • Related to Bug #23585: osd: safe_timer segfault added
#11 Updated by Josh Durgin almost 6 years ago

#12 Updated by Brad Hubbard almost 6 years ago

  • Status changed from 12 to Duplicate

Closing as a duplicate of #23352, where we are focusing the investigation.

#13 Updated by Nathan Cutler almost 6 years ago

  • Related to deleted (Bug #23352: osd: segfaults under normal operation)
#14 Updated by Nathan Cutler almost 6 years ago

  • Is duplicate of Bug #23352: osd: segfaults under normal operation added
#15 Updated by Kevin Tibi over 5 years ago

Hi,

Same issue with ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable).

My OSDs run in Docker, so the containers die on a memory issue:

Sep 14 10:01:03 ceph02 kernel: safe_timer[54506]: segfault at 2000 ip 00007efc1c7ee0b8 sp 00007efc1657c870 error 4 in libgcc_s-4.8.5-20150702.so.1[7efc1c7df000+15000]
Sep 14 10:01:03 ceph02 dockerd: time="2018-09-14T10:01:03+02:00" level=info msg="shim reaped" id=60b872b1e9034d8166f20e08678f0da4793f6409a06b6f00b0a43ae9df5deae4 module="containerd/tasks"

This happens very randomly, even without activity.

#16 Updated by Brad Hubbard over 5 years ago

See #23352.

The fix is in 12.2.8.
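
A quick way to confirm that every daemon has actually picked up the fixed release (assuming a Luminous or later cluster, where this command is available):

    # summarize the versions the running daemons report
    ceph versions
    # or ask the OSDs directly
    ceph tell osd.* version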
