Project

General

Profile

Backport #40993

mimic: Ceph status in some cases does not report slow ops

Added by Theofilos Mouratidis 3 months ago. Updated about 1 month ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
Release:
mimic
Crash signature:

Description

We had 2 instances when running 13.2.6 they didn't report the slow ops of failing disks.
This is from 1 cluster:

2019-07-25 09:16:45.118 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 738 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:46.077 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:47.117 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:48.082 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:49.045 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:50.007 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:51.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 743 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:52.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:53.022 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:54.051 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:55.005 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 746 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.003 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.998 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 749 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:57.976 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 440 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:58.980 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 442 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)

This was going for hours and we were getting HEALTH_OK

[10:15][root@p05972678u11018 (production:ceph/erin/osd*28) ~]# zgrep 'slow ops' /var/log/ceph/ceph-osd.16.log-20190726.gz | cut -c 1-13 | sort -u
2019-07-25 00
2019-07-25 01
2019-07-25 02
2019-07-25 03
2019-07-25 04
2019-07-25 05
2019-07-25 06
2019-07-25 07
2019-07-25 08
2019-07-25 09

In the other instance, we had an SSD (rocksdb) failing, and we were getting slow ops for the OSDs that used that SSD. But the ceph status was reporting HEALTH_OK.

Logs sent to:

ceph-post-file: 161fda4e-8339-4217-a10c-77ff79043d7d


Related issues

Copied from RADOS - Bug #41758: Ceph status in some cases does not report slow ops Duplicate 09/11/2019

History

#1 Updated by Neha Ojha 3 months ago

  • Priority changed from Normal to High

#2 Updated by Josh Durgin 2 months ago

  • Assignee set to Sridhar Seshasayee

#3 Updated by Sridhar Seshasayee about 2 months ago

  • Status changed from New to In Progress

The logs relating to this tracker didn't indicate anything obvious upon analysis. The issue was reproduced locally on mimic code base and the following is the analysis so far on why the slow ops didn't get reported.

The DaemonServer as part of DaemonServer::send_report() iterates across all the available mons and osd daemons using the DaemonHealthMetricCollector object interfaces to collect health information. As part of this, DaemonHealthMetricCollector::update() indicates whether the daemon in question has any issue by adding it to a vector of daemons. As a part of this operation, a flag called "reported" is set within the DaemonHealthMetricCollector object to indicate that the daemon is reporting a health issue.

At a later point after all the daemons are done with the update step, send_report() calls summarize() on all the accumulated DaemonHealthMetricCollector objects that have the "reported" flag set which results in the summary string to be passed to the monitor.

However, in this case the "reported" flag for object is reset somehow. As a result, the summary string is not passed in the message to the ceph monitor. Additional analysis is in progress to find out how the "reported" flag is getting reset.

In this particular reproduction, logs indicate that the osd daemon that reported slow ops was added to the list of daemons. But after the update, the summarize step did not add the summary string to the message since the "reported" flag got reset.

Here's the code in question in DaemonHealthMetricCollector.h that updates and creates the summary,

https://github.com/ceph/ceph/blob/mimic/src/mgr/DaemonHealthMetricCollector.h#L13
https://github.com/ceph/ceph/blob/mimic/src/mgr/DaemonHealthMetricCollector.h#L18

I will update further on my findings.

#4 Updated by Sridhar Seshasayee about 1 month ago

Further to my findings earlier, I confirmed that the "reported" flag is being reset in case ONLY an osd daemon reports slow ops. The 'accumulated' map is updated with this collector. As other daemons are processed, in this case the first 'mon' daemon for e.g. resets this flag as part of the 'slow ops' metric type check. This reset happens since a 'slow ops' metric type is already in the accumulated map, and an update on the existing osd daemon collector with new 'mon' specific metric is called. Since there are no slow ops in the 'mon' metric, the "reported" flag gets reset. Due to this when "summarize" is later invoked and since the "reported" flag is reset, the summary string doesn't make it to the monitor.

This happens within the loop in function send_report() within DaemonServer.cc as it iterates through all the daemons shown below,
https://github.com/ceph/ceph/blob/mimic/src/mgr/DaemonServer.cc#L1835

Upon further investigation, this bug is present in the master as well. Therefore, the fix will first be made on the master and then back-ported to mimic.

#5 Updated by Sridhar Seshasayee about 1 month ago

  • Related to Bug #41758: Ceph status in some cases does not report slow ops added

#6 Updated by Sridhar Seshasayee about 1 month ago

  • Tracker changed from Bug to Backport

Converting this to track backport from master where the fix is under review.

#7 Updated by Nathan Cutler about 1 month ago

  • Subject changed from Ceph Mimic status does not report slow ops to mimic: Ceph status in some cases does not report slow ops
  • Status changed from In Progress to New
  • Assignee deleted (Sridhar Seshasayee)
  • Priority changed from High to Normal
  • Target version deleted (v13.2.6)
  • Release set to mimic

#8 Updated by Nathan Cutler about 1 month ago

  • Status changed from New to Need More Info

backport ticket opened prematurely - setting "Need More Info" pending:

1. opening of PR fixing the issue in master
2. merge of said master PR
3. setting of the master tracker issue status to "Pending Backport"

#9 Updated by Nathan Cutler about 1 month ago

  • Related to deleted (Bug #41758: Ceph status in some cases does not report slow ops)

#10 Updated by Nathan Cutler about 1 month ago

  • Copied from Bug #41758: Ceph status in some cases does not report slow ops added

#11 Updated by Neha Ojha about 1 month ago

Nathan Cutler wrote:

backport ticket opened prematurely - setting "Need More Info" pending:

1. opening of PR fixing the issue in master
2. merge of said master PR
3. setting of the master tracker issue status to "Pending Backport"

To clarify things, the bug was reported in mimic using this tracker issue. Later it was found that the same bug existed in master, for which https://tracker.ceph.com/issues/41758 was created. It is fine to mark one of these dup of the other and use one tracker issue for backports.

#12 Updated by Nathan Cutler about 1 month ago

  • Status changed from Need More Info to Rejected

backports will be pursued in https://tracker.ceph.com/issues/41741

#13 Updated by Nathan Cutler about 1 month ago

just for completeness - the mimic fix is (I think): https://github.com/ceph/ceph/pull/30391

Also available in: Atom PDF