Bug #23431 (closed): OSD Segmentation fault in thread_name:safe_timer

Added by Dietmar Rieder about 6 years ago. Updated over 5 years ago.

Status: Duplicate
Priority: Normal
Category: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor
Component(RADOS): OSD

Description

I noticed a segmentation fault in the log of one of our OSDs.
See the attached log entries. There is no core file that I could provide.

Best
Dietmar

    -2> 2018-03-21 05:09:31.310950 7fc69ba1c700  1 -- 10.0.2.10:6828/4210 <== osd.230 10.0.2.10:0/4168 14138 ==== osd_ping(ping e3328 stamp 2018-03-21 05:09:31.307794) v4 ==== 2004+0+0 (603415365 0 0) 0x5589fe22cc00 con 0x558a107ec000
    -1> 2018-03-21 05:09:31.310960 7fc69ba1c700  1 -- 10.0.2.10:6828/4210 --> 10.0.2.10:0/4168 -- osd_ping(ping_reply e3328 stamp 2018-03-21 05:09:31.307794) v4 -- 0x558a0645c600 con 0
     0> 2018-03-21 05:09:31.330524 7fc698258700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc698258700 thread_name:safe_timer
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa3c611) [0x5589e73f7611]
 2: (()+0xf5e0) [0x7fc69f5185e0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Files

ceph-osd.239_segfault.log.gz (242 KB) - Dietmar Rieder, 03/21/2018 09:09 AM

Related issues (4 total: 0 open, 4 closed)

Related to RADOS - Bug #24023: Segfault on OSD in 12.2.5 (Duplicate, 05/05/2018)
Related to RADOS - Bug #23585: osd: safe_timer segfault (Duplicate, 04/08/2018)
Related to RADOS - Bug #23564: OSD Segfaults (Duplicate, 04/05/2018)
Is duplicate of RADOS - Bug #23352: osd: segfaults under normal operation (Resolved, Brad Hubbard, 03/14/2018)
#1 Updated by Brad Hubbard about 6 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Status changed from New to Need More Info
  • Assignee set to Brad Hubbard
  • Component(RADOS) OSD added

What's the exact version of ceph-osd you are using? (Exact package URL if possible, please.)

You could try 'objdump -rdS /path/to/ceph-osd', but you may need the relevant debuginfo packages installed.
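
For reference, a hedged sketch of what that might look like on the el7 packages used here (the debuginfo package name and availability are assumptions, not verified against this exact install):

    # install debug symbols first (CentOS/RHEL 7; assumes a ceph-debuginfo
    # package matching the installed ceph version is available)
    yum install ceph-debuginfo-12.2.4
    # disassemble with interleaved source; the output is large, so save it
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump.txt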

If you can capture a coredump and an sosreport, please upload them using ceph-post-file and let us know the UID here.
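
For anyone following along, a minimal sketch of one way to capture and upload such a core (the unit name, core pattern, and paths are assumptions for a typical el7 systemd deployment, not the only way to do this):

    # let the OSD daemon dump core: add LimitCORE=infinity under [Service]
    systemctl edit ceph-osd@239
    # write cores to a known location
    echo '/var/core/core.%e.%p' > /proc/sys/kernel/core_pattern
    systemctl restart ceph-osd@239
    # after the next crash, upload the core; the command prints a UID to share
    ceph-post-file -d "osd.239 safe_timer segfault" /var/core/core.ceph-osd.<pid>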

#2 Updated by Dietmar Rieder about 6 years ago

The ceph-osd binary comes from https://download.ceph.com/rpm-luminous/el7/x86_64/
I verified via md5sum that the local copy is the same as the one on download.ceph.com:

    # md5sum ceph-osd_local ceph-osd_ceph.com
    5ec58a32c9ac909fe7b094e1df39c3c0  ceph-osd_local
    5ec58a32c9ac909fe7b094e1df39c3c0  ceph-osd_ceph.com
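
(For context, the ceph.com copy above can be extracted from the published RPM; a sketch of one way to do that, with the exact RPM filename being an assumption:)

    # fetch the published package and extract the binary for comparison
    curl -O https://download.ceph.com/rpm-luminous/el7/x86_64/ceph-osd-12.2.4-0.el7.x86_64.rpm
    rpm2cpio ceph-osd-12.2.4-0.el7.x86_64.rpm | cpio -idmv ./usr/bin/ceph-osd
    md5sum usr/bin/ceph-osd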

ceph was installed via ceph-deploy using the yum repo with baseurl=http://download.ceph.com/rpm-luminous/el7/$basearch

    # ceph -v
    ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)

I produced an objdump (see UID 6609a6e3-c22b-4a5d-8bea-5b40b24e9e73); however, I'm not familiar with that tool, so I'm not sure what the "relevant debuginfo packages" are.

I still have no core file.

#3 Updated by Marcin Gibula about 6 years ago

I have seen this as well, on our cluster. We're using BlueStore, Ubuntu 16, and the latest Luminous.
The crashes were totally random; they happened with no load and on empty OSDs from both replicated and EC pools.

There is also no core dump here and no backtrace in the logs, so I guess the stack is smashed.
I wonder if it could be related to http://tracker.ceph.com/issues/21259?

#4 Updated by Kjetil Joergensen about 6 years ago

There's a coredump (in apport format) on Google Drive in http://tracker.ceph.com/issues/23352; on the face of it, it looks similar at least.

#5 Updated by Brad Hubbard about 6 years ago

  • Description updated (diff)
  • Status changed from Need More Info to 12

I agree these are similar and the cause may indeed be the same; however, there are only two stack frames in this instance, and they both appear to be in a library rather than in ceph (probably libgcc/glibc spawning the new thread). This is reinforced by the following log output posted in #23352.

[dmesg]
[35103471.167728] safe_timer[491476]: segfault at 21300080000 ip 00007f9f3f7bfccb sp 00007f9f31c7df70 error 4 in libgcc_s.so.1[7f9f3f7b1000+16000]

This shows the crash being in libgcc. Rather than suspecting a bug in libgcc, this is most likely due to some significant memory corruption that is being hit when the new thread is created; at least, that's a theory with no evidence to back it at this stage.
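
As a sanity check on that reading: the faulting ip minus the mapping base gives an offset inside the library, which can be resolved to a symbol if the matching binary is at hand (the library path below is an assumption for a typical install):

    # 0x7f9f3f7bfccb - 0x7f9f3f7b1000 = 0xeccb, which is inside the
    # 0x16000-byte mapping, so the fault really is within libgcc_s.so.1
    addr2line -f -e /lib64/libgcc_s.so.1 0xeccb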

That memory address, '21300080000', looks bogus as well.

#6 Updated by Brad Hubbard about 6 years ago

  • Description updated (diff)
#7 Updated by Brad Hubbard about 6 years ago

  • Related to Bug #23352: osd: segfaults under normal operation added
#8 Updated by Aleksei Zakharov almost 6 years ago

Hi.
We have the same issue:

[dmesg]
[1408519.211602] safe_timer[4265]: segfault at 1000000c9 ip 00007f2453fb8ccb sp 00007f244d105830 error 4 in libgcc_s.so.1[7f2453faa000+16000]

[ceph -v]
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

This suddenly happens with random OSDs, and we don't see a dependence on any activity or load. It looks like a random bug. There are no errors in the ceph-osd log.

Ubuntu 16.04, BlueStore OSDs, kernels 4.4 and 4.13.

#9 Updated by Josh Durgin almost 6 years ago

  • Related to Bug #24023: Segfault on OSD in 12.2.5 added
#10 Updated by Josh Durgin almost 6 years ago

  • Related to Bug #23585: osd: safe_timer segfault added
#11 Updated by Josh Durgin almost 6 years ago

#12 Updated by Brad Hubbard almost 6 years ago

  • Status changed from 12 to Duplicate

Closing as a duplicate of #23352, where we are focusing the investigation.

#13 Updated by Nathan Cutler almost 6 years ago

  • Related to deleted (Bug #23352: osd: segfaults under normal operation)
#14 Updated by Nathan Cutler almost 6 years ago

  • Is duplicate of Bug #23352: osd: segfaults under normal operation added
#15 Updated by Kevin Tibi over 5 years ago

Hi,

Same issue with ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable).

My OSDs run in Docker, so the containers die on a memory issue:

Sep 14 10:01:03 ceph02 kernel: safe_timer[54506]: segfault at 2000 ip 00007efc1c7ee0b8 sp 00007efc1657c870 error 4 in libgcc_s-4.8.5-20150702.so.1[7efc1c7df000+15000]
Sep 14 10:01:03 ceph02 dockerd: time="2018-09-14T10:01:03+02:00" level=info msg="shim reaped" id=60b872b1e9034d8166f20e08678f0da4793f6409a06b6f00b0a43ae9df5deae4 module="containerd/tasks"

This happens very randomly, even without activity.

#16 Updated by Brad Hubbard over 5 years ago

See #23352.

The fix is in 12.2.8.
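
A quick way to confirm that every daemon has actually picked up the fixed release (assuming a Luminous or later cluster, where this command is available):

    # summarize the versions the running daemons report
    ceph versions
    # or ask the OSDs directly
    ceph tell osd.* version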
