Bug #20628
closed
ceph-osd deadlock in ?simple messenger?
Description
Hi,
We have a jewel 10.2.8 osd that just deadlocked. The osd was marked failed due to no PG stats after 60s:
2017-07-14 12:27:24.869733 mon.0 128.142.35.220:6789/0 161437 : cluster [INF] osd.331 marked down after no pg stats for 61.085540seconds
(Note that we use mon osd report timeout = 60 because we've seen this deadlock before, and in this scenario the deadlocked OSD's peers do not mark it as failed. In other words, OSDs that deadlock this way generate slow requests until the PG stats time out.)
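For reference, the setting mentioned above might look like this in ceph.conf. This is a minimal sketch: the report gives only the option name and value, so the [global] section placement is an assumption.

```ini
[global]
# Assumption: section placement is illustrative; the report states only the option and its value.
# OSDs that send no PG stats to the monitors for this many seconds are marked down.
mon osd report timeout = 60
```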
The OSD and cluster logs are attached and I've ceph-post-file'd the coredump with tag 57a63b32-b3c8-4c40-a2f2-7f205ff475ad.
This is 10.2.8 on centos 7, installed from downloads.ceph.com.
# rpm -q ceph-osd
ceph-osd-10.2.8-0.el7.x86_64
# ceph --version
ceph version 10.2.8 (f5b1f1fd7c0be0506ba73502a675de9d048b744e)
Cheers, Dan
Files
Updated by Dan van der Ster almost 7 years ago
Forgot to mention: there are no IO errors on this drive. And the daemon does not respond to socket commands: 'ceph daemon osd.331 help' hangs.
Updated by Greg Farnum almost 7 years ago
Okay, so to read that core file we'll need to know your distro, please? :)
I did extract it though and am a bit confused about the time stamps I'm seeing. That says it was generated at 12:08, but the "user" file ceph-post-file sets up was created at 12:05. The mon log snippet you put above marked the osd down at 12:27:24, and I can't find any evidence of a crash in the OSD log, although it does terminate at circa 13:23 after showing some timed out osd_op_tp messages — and a bunch of Pipe reconnects and faults at 12:26:25 and 12:27:25, respectively. Can you walk me through the timeline a bit?