OSD Segmentation fault in thread_name:safe_timer
I noticed a segmentation fault in the log of one of our OSDs.
See the attached log entries. There is no core file that I could provide.
-2> 2018-03-21 05:09:31.310950 7fc69ba1c700 1 -- 10.0.2.10:6828/4210 <== osd.230 10.0.2.10:0/4168 14138 ==== osd_ping(ping e3328 stamp 2018-03-21 05:09:31.307794) v4 ==== 2004+0+0 (603415365 0 0) 0x5589fe22cc00 con 0x558a107ec000
-1> 2018-03-21 05:09:31.310960 7fc69ba1c700 1 -- 10.0.2.10:6828/4210 --> 10.0.2.10:0/4168 -- osd_ping(ping_reply e3328 stamp 2018-03-21 05:09:31.307794) v4 -- 0x558a0645c600 con 0
0> 2018-03-21 05:09:31.330524 7fc698258700 -1 *** Caught signal (Segmentation fault) **
in thread 7fc698258700 thread_name:safe_timer
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0xa3c611) [0x5589e73f7611]
2: (()+0xf5e0) [0x7fc69f5185e0]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#1 Updated by Brad Hubbard over 1 year ago
- Project changed from Ceph to RADOS
- Category deleted (
- Status changed from New to Need More Info
- Assignee set to Brad Hubbard
- Component(RADOS) OSD added
What's the exact version of ceph-osd you are using (exact package URL if possible, please)?
You could try 'objdump -rdS /path/to/ceph-osd' but you may need the relevant debuginfo packages installed.
If you can capture a coredump and sosreport please upload them using ceph-post-file and let us know the UID here.
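As a rough aid for reading the two backtrace frames in the original report, the sketch below splits each frame into its in-module offset and absolute address, then subtracts one from the other to estimate where the module was loaded. The function name parse_frames and the regular expression are illustrative assumptions, not part of Ceph. Resolving an offset to a function name would still need the matching binary and debuginfo (for example via objdump or addr2line), which this sketch does not attempt.

```python
import re

# Matches frames such as "1: (()+0xa3c611) [0x5589e73f7611]" as they appear
# in the OSD log. The pattern is an assumption based on that one log excerpt.
FRAME_RE = re.compile(r"(\d+): \((.*?)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]")


def parse_frames(text):
    """Return (frame_no, symbol, offset, absolute_address) for each frame found."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append(
            (int(m.group(1)), m.group(2), int(m.group(3), 16), int(m.group(4), 16))
        )
    return frames


# The two frames copied from the log in the report.
log = "1: (()+0xa3c611) [0x5589e73f7611] 2: (()+0xf5e0) [0x7fc69f5185e0]"

for no, symbol, offset, addr in parse_frames(log):
    # absolute address minus in-module offset gives the module's load base
    base = addr - offset
    print(f"frame {no}: offset {offset:#x}, estimated load base {base:#x}")
```

The two frames resolve to different load bases, so they belong to two different mapped objects; which objects those are cannot be told from the log alone.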
#2 Updated by Dietmar Rieder over 1 year ago
The ceph-osd comes from https://download.ceph.com/rpm-luminous/el7/x86_64/
I verified via md5sum that the local copy is the same as the one on download.ceph.com:
- md5sum ceph-osd_local ceph-osd_ceph.com
ceph was installed via ceph-deploy using the yum repo with baseurl=http://download.ceph.com/rpm-luminous/el7/$basearch
- ceph -v
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
I produced an objdump (see UID 6609a6e3-c22b-4a5d-8bea-5b40b24e9e73). However, I'm not familiar with it, so I'm not sure what the "relevant debuginfo packages" are.
I still have no core file.
#3 Updated by Marcin Gibula over 1 year ago
I have seen this on our cluster as well. We're using bluestore, Ubuntu 16, latest luminous.
The crashes were totally random and happened with no load, on empty OSDs from both replicated and EC pools.
There is also no core dump here and no backtrace in the logs, so I guess the stack is smashed.
I wonder if it could be related to http://tracker.ceph.com/issues/21259 maybe?
#5 Updated by Brad Hubbard over 1 year ago
- Description updated (diff)
- Status changed from Need More Info to Verified
I agree these are similar and the cause may indeed be the same. However, there are only two stack frames in this instance, and they both appear to be in a library rather than in ceph (probably libgcc/glibc spawning the new thread). This is reinforced by the following log output posted in #23352.
[dmesg] [35103471.167728] safe_timer: segfault at 21300080000 ip 00007f9f3f7bfccb sp 00007f9f31c7df70 error 4 in libgcc_s.so.1[7f9f3f7b1000+16000]
This shows the crash occurring in libgcc. Rather than suspecting a bug in libgcc, I think this is most likely due to significant memory corruption that is hit when the new thread is being created. At this stage that is only a theory, with no evidence to back it.
That memory address, '21300080000', looks bogus as well.
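To make the dmesg line above easier to read, here is a minimal sketch that pulls out the faulting address, instruction pointer, stack pointer and error code, and decodes the low three bits of the x86 page-fault error code (bit 0: protection violation rather than page not present, bit 1: write, bit 2: user mode). The function name decode_segfault and the regular expression are illustrative assumptions based on this one line, and the kernel is assumed to print the error code in hex.

```python
import re

# Pattern for the kernel's "segfault at ... ip ... sp ... error ..." message.
SEGV_RE = re.compile(
    r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+) error ([0-9a-f]+)"
)


def decode_segfault(line):
    """Decode the address fields and x86 page-fault error bits of a dmesg line."""
    m = SEGV_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    addr, ip, sp, err = (int(m.group(i), 16) for i in (1, 2, 3, 4))
    return {
        "addr": addr,
        "ip": ip,
        "sp": sp,
        "protection_violation": bool(err & 1),  # False: page was not present
        "write": bool(err & 2),                 # False: the access was a read
        "user_mode": bool(err & 4),             # True: fault came from user space
    }


line = ("safe_timer: segfault at 21300080000 ip 00007f9f3f7bfccb "
        "sp 00007f9f31c7df70 error 4 in libgcc_s.so.1[7f9f3f7b1000+16000]")

info = decode_segfault(line)
for key, value in info.items():
    print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

For this line, error 4 decodes to a user-mode read of a page that is not mapped, which fits the idea of a wild pointer being dereferenced, although it does not say where the pointer was corrupted.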
#8 Updated by Aleksei Zakharov over 1 year ago
We have the same issue:
[dmesg] [1408519.211602] safe_timer: segfault at 1000000c9 ip 00007f2453fb8ccb sp 00007f244d105830 error 4 in libgcc_s.so.1[7f2453faa000+16000]
[ceph -v] ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
This happens suddenly on random OSDs, and we don't see any dependence on activity or load. It looks like a random bug. There are no errors in the ceph-osd log.
Ubuntu 16.04, bluestore OSDs, kernels 4.4 and 4.13.
#15 Updated by Kevin Tibi about 1 year ago
Same issue with ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
My OSDs run in Docker, so the containers fail because of a memory issue:
Sep 14 10:01:03 ceph02 kernel: safe_timer54506: segfault at 2000 ip 00007efc1c7ee0b8 sp 00007efc1657c870 error 4 in libgcc_s-4.8.5-20150702.so.1[7efc1c7df000+15000]
Sep 14 10:01:03 ceph02 dockerd: time="2018-09-14T10:01:03+02:00" level=info msg="shim reaped" id=60b872b1e9034d8166f20e08678f0da4793f6409a06b6f00b0a43ae9df5deae4 module="containerd/tasks"
It happens at random, even without any activity.
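One way to compare the three kernel messages quoted in this thread is to subtract the library's mapping base from the instruction pointer, which gives the offset of the faulting instruction inside libgcc regardless of where the library was loaded. The sketch below does that; the function name library_offset and the regular expression are illustrative assumptions, and the lines are copied from the comments above.

```python
import re

# Captures the instruction pointer, the library name, and the mapping base
# from "... ip <hex> ... in <library>[<base>+<size>]".
MAP_RE = re.compile(r"ip ([0-9a-f]+) .*? in ([^\s\[]+)\[([0-9a-f]+)\+[0-9a-f]+\]")


def library_offset(line):
    """Return (library_name, ip - mapping_base) for a kernel segfault line."""
    m = MAP_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m.group(1), 16)
    base = int(m.group(3), 16)
    return m.group(2), ip - base


reports = [
    "safe_timer: segfault at 21300080000 ip 00007f9f3f7bfccb sp 00007f9f31c7df70 "
    "error 4 in libgcc_s.so.1[7f9f3f7b1000+16000]",
    "safe_timer: segfault at 1000000c9 ip 00007f2453fb8ccb sp 00007f244d105830 "
    "error 4 in libgcc_s.so.1[7f2453faa000+16000]",
    "safe_timer54506: segfault at 2000 ip 00007efc1c7ee0b8 sp 00007efc1657c870 "
    "error 4 in libgcc_s-4.8.5-20150702.so.1[7efc1c7df000+15000]",
]

for report in reports:
    name, offset = library_offset(report)
    print(f"{name}: faulting instruction at offset {offset:#x}")
```

The first two reports land on the same offset, 0xeccb, in libgcc_s.so.1. If both hosts run the same libgcc build (an assumption, since the thread does not confirm it), that would mean the same instruction is faulting each time. The third report comes from a differently named libgcc build, so its offset is not directly comparable.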