Bug #20628

ceph-osd deadlock in ?simple messenger?

Added by Dan van der Ster almost 7 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
We have a Jewel 10.2.8 OSD that just deadlocked. The OSD was marked down after reporting no PG stats for 60s:

2017-07-14 12:27:24.869733 mon.0 128.142.35.220:6789/0 161437 : cluster [INF] osd.331 marked down after no pg stats for 61.085540seconds

(Note that we use mon osd report timeout = 60 because we've seen this deadlock before, and the peers of a deadlocked OSD do not mark it as failed in this scenario. IOW, OSDs that deadlock in this way generate slow requests until the PG stats time out.)
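For anyone who wants to watch a cluster log for the same symptom, here is a minimal sketch of a matcher for the monitor message quoted above. The regular expression is derived only from that single example line, so the exact wording in other Ceph releases is an assumption; `find_stats_timeouts` is a hypothetical helper name, not part of any Ceph tooling.

```python
import re

# Pattern for the monitor's "marked down after no pg stats" message.
# Assumption: built from the one log line quoted in this report; other
# releases may word the message differently.
NO_PG_STATS = re.compile(
    r"osd\.(?P<osd>\d+) marked down after no pg stats for "
    r"(?P<secs>\d+(?:\.\d+)?)\s*seconds"
)


def find_stats_timeouts(lines):
    """Yield (osd_id, seconds_without_stats) for each matching log line."""
    for line in lines:
        match = NO_PG_STATS.search(line)
        if match:
            yield int(match.group("osd")), float(match.group("secs"))


# The cluster log line from this report, used as sample input.
sample = (
    "2017-07-14 12:27:24.869733 mon.0 128.142.35.220:6789/0 161437 : "
    "cluster [INF] osd.331 marked down after no pg stats for 61.085540seconds"
)
print(list(find_stats_timeouts([sample])))
```

Run over a decompressed ceph.log, this would list every OSD that the monitor marked down because of the report timeout, which may help separate this deadlock from ordinary peer-reported failures.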

The OSD and cluster logs are attached and I've ceph-post-file'd the coredump with tag 57a63b32-b3c8-4c40-a2f2-7f205ff475ad.

This is 10.2.8 on CentOS 7, installed from downloads.ceph.com.

# rpm -q ceph-osd
ceph-osd-10.2.8-0.el7.x86_64
# ceph --version
ceph version 10.2.8 (f5b1f1fd7c0be0506ba73502a675de9d048b744e)

Cheers, Dan


Files

ceph-osd.331.log.gz (118 KB), Dan van der Ster, 07/14/2017 12:00 PM
ceph.log.gz (307 KB), Dan van der Ster, 07/14/2017 12:02 PM
Actions #1

Updated by Dan van der Ster almost 7 years ago

Forgot to mention: there are no IO errors on this drive. And the daemon does not respond to socket commands: 'ceph daemon osd.331 help' hangs.
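Since the hung admin socket is one of the few externally visible signs of this state, a probe with a timeout could flag it. The following is only a rough sketch: `admin_socket_responds` is a hypothetical helper, the 10-second budget is an arbitrary assumption, and the only Ceph command used is the `ceph daemon osd.331 help` call mentioned above.

```python
import subprocess


def admin_socket_responds(cmd, timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds.

    Returns False if the command hangs past the timeout, cannot be
    started, or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # osd.331 is the OSD from this report; substitute the ID to check.
    ok = admin_socket_responds(["ceph", "daemon", "osd.331", "help"])
    print("admin socket responsive" if ok else "admin socket hung or unavailable")
```

A False result does not distinguish a deadlocked daemon from one that is simply not running, so it would need to be combined with a process check before drawing conclusions.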

Actions #2

Updated by Greg Farnum almost 7 years ago

Okay, so to read that core file we'll need to know your distro, please? :)

I did extract it though, and I'm a bit confused about the timestamps I'm seeing. That says it was generated at 12:08, but the "user" file that ceph-post-file sets up was created at 12:05. The mon log snippet you put above marked the OSD down at 12:27:24, and I can't find any evidence of a crash in the OSD log, although it does terminate at circa 13:23 after showing some timed-out osd_op_tp messages, plus a bunch of Pipe reconnects and faults at 12:26:25 and 12:27:25, respectively. Can you walk me through the timeline a bit?

Actions #3

Updated by Sage Weil almost 3 years ago

  • Status changed from New to Closed