Bug #45796

open

Ceph mons sporadically report slow ops

Added by David Hows almost 4 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have recently upgraded our cluster to 14.2.9 from 10.2.6 and are in the process of a rolling rebuild of many of the OSDs.

Our cluster has started going into HEALTH_WARN sporadically because of slow ops on the mons.

Looking into the log, the operations are always osd_pgtemp ops. Following them further, I can see that the op in question comes from an OSD; there seems to be a race between several OSDs over which one will take the PG. The OSD in question then loses this race and marks itself as "Stray".

My belief is that the op on the mon is no longer needed or valid, because the epoch has moved forward, and so the slow op on the monitor should be discarded rather than reported as a slow op.
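The proposed policy (drop an `osd_pgtemp` op once the OSD map epoch it refers to is out of date) can be sketched as a triage step over a dump of the monitor's in-flight ops. This is a minimal illustration under stated assumptions, not how the Ceph monitor actually handles these ops. It assumes that the dump is a dict with an `ops` list whose entries carry a `description` string, that `osd_pgtemp` descriptions begin with `osd_pgtemp(e<epoch>`, and that the monitor's current OSD map epoch is known (for example from `ceph osd dump`). The function name `split_stale_pgtemp_ops` and the sample ops are hypothetical.

```python
import re
from typing import Any

# Assumed description format: "osd_pgtemp(e<epoch> ...". Not verified against Ceph source.
PGTEMP_EPOCH_RE = re.compile(r"^osd_pgtemp\(e(\d+)\b")


def split_stale_pgtemp_ops(
    ops_dump: dict[str, Any], current_epoch: int
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split in-flight osd_pgtemp ops into (stale, live) lists.

    An op counts as stale when the OSD map epoch it was sent against is older
    than the monitor's current epoch. This is the policy proposed in this
    report (hypothetical), not Ceph's real behaviour.
    """
    stale: list[dict[str, Any]] = []
    live: list[dict[str, Any]] = []
    for op in ops_dump.get("ops", []):
        match = PGTEMP_EPOCH_RE.match(op.get("description", ""))
        if match is None:
            continue  # not an osd_pgtemp op: leave it to normal slow-op reporting
        epoch = int(match.group(1))
        (stale if epoch < current_epoch else live).append(op)
    return stale, live


if __name__ == "__main__":
    # Hypothetical sample dump; descriptions and ages are illustrative only.
    dump = {
        "ops": [
            {"description": "osd_pgtemp(e1200 {2.1f=[43,12]} v1200)", "age": 45.2},
            {"description": "osd_pgtemp(e1210 {2.3a=[7,43]} v1210)", "age": 1.1},
            {"description": "mon_command(...)", "age": 0.2},
        ],
        "num_ops": 3,
    }
    stale, live = split_stale_pgtemp_ops(dump, current_epoch=1210)
    print(len(stale), len(live))  # prints "1 1"
```

Under this policy, only the op sent against epoch 1200 would be dropped. The op at the current epoch would still be tracked, and any genuinely slow op would still count toward HEALTH_WARN.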

I have attached ceph-versions for our cluster, a dump of the monitors' in-flight ops, and a snippet from the OSD log showing the request to the mon. If I can gather any more diagnostic details, please let me know.


Files

ceph-osd.43.log (6.13 KB), David Hows, 06/01/2020 12:11 AM
ceph-versions.txt (598 Bytes), David Hows, 06/01/2020 12:11 AM
mon-ops.txt (2.33 KB), David Hows, 06/01/2020 12:11 AM
#1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)
  • Component(RADOS) Monitor added