Backport #40993: mimic: Ceph status in some cases does not report slow ops - RADOS - Ceph

Actions

Copy link

Backport #40993

closed

mimic: Ceph status in some cases does not report slow ops

Added by Theofilos Mouratidis over 4 years ago. Updated over 4 years ago.

Status:

Rejected

Priority:

Normal

Assignee:

Target version:

Release:

mimic

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

We had 2 instances when running 13.2.6 they didn't report the slow ops of failing disks.
This is from 1 cluster:

2019-07-25 09:16:45.118 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 738 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:46.077 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:47.117 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:48.082 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:49.045 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:50.007 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:51.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 743 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:52.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:53.022 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:54.051 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:55.005 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 746 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.003 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.998 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 749 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:57.976 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 440 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:58.980 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 442 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)

This was going for hours and we were getting HEALTH_OK

[10:15][root@p05972678u11018 (production:ceph/erin/osd*28) ~]# zgrep 'slow ops' /var/log/ceph/ceph-osd.16.log-20190726.gz | cut -c 1-13 | sort -u
2019-07-25 00
2019-07-25 01
2019-07-25 02
2019-07-25 03
2019-07-25 04
2019-07-25 05
2019-07-25 06
2019-07-25 07
2019-07-25 08
2019-07-25 09

In the other instance, we had an SSD (rocksdb) failing, and we were getting slow ops for the OSDs that used that SSD. But the ceph status was reporting HEALTH_OK.

Logs sent to:

ceph-post-file: 161fda4e-8339-4217-a10c-77ff79043d7d

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Neha Ojha over 4 years ago

Priority changed from Normal to High

Actions

Copy link

Updated by Josh Durgin over 4 years ago

Assignee set to Sridhar Seshasayee

Actions

Copy link

Updated by Sridhar Seshasayee over 4 years ago

Status changed from New to In Progress

The logs relating to this tracker didn't indicate anything obvious upon analysis. The issue was reproduced locally on mimic code base and the following is the analysis so far on why the slow ops didn't get reported.

The DaemonServer as part of DaemonServer::send_report() iterates across all the available mons and osd daemons using the DaemonHealthMetricCollector object interfaces to collect health information. As part of this, DaemonHealthMetricCollector::update() indicates whether the daemon in question has any issue by adding it to a vector of daemons. As a part of this operation, a flag called "reported" is set within the DaemonHealthMetricCollector object to indicate that the daemon is reporting a health issue.

At a later point after all the daemons are done with the update step, send_report() calls summarize() on all the accumulated DaemonHealthMetricCollector objects that have the "reported" flag set which results in the summary string to be passed to the monitor.

However, in this case the "reported" flag for object is reset somehow. As a result, the summary string is not passed in the message to the ceph monitor. Additional analysis is in progress to find out how the "reported" flag is getting reset.

In this particular reproduction, logs indicate that the osd daemon that reported slow ops was added to the list of daemons. But after the update, the summarize step did not add the summary string to the message since the "reported" flag got reset.

Here's the code in question in DaemonHealthMetricCollector.h that updates and creates the summary,

https://github.com/ceph/ceph/blob/mimic/src/mgr/DaemonHealthMetricCollector.h#L13
https://github.com/ceph/ceph/blob/mimic/src/mgr/DaemonHealthMetricCollector.h#L18

I will update further on my findings.

Actions

Copy link

Updated by Sridhar Seshasayee over 4 years ago

Further to my findings earlier, I confirmed that the "reported" flag is being reset in case ONLY an osd daemon reports slow ops. The 'accumulated' map is updated with this collector. As other daemons are processed, in this case the first 'mon' daemon for e.g. resets this flag as part of the 'slow ops' metric type check. This reset happens since a 'slow ops' metric type is already in the accumulated map, and an update on the existing osd daemon collector with new 'mon' specific metric is called. Since there are no slow ops in the 'mon' metric, the "reported" flag gets reset. Due to this when "summarize" is later invoked and since the "reported" flag is reset, the summary string doesn't make it to the monitor.

This happens within the loop in function send_report() within DaemonServer.cc as it iterates through all the daemons shown below,
https://github.com/ceph/ceph/blob/mimic/src/mgr/DaemonServer.cc#L1835

Upon further investigation, this bug is present in the master as well. Therefore, the fix will first be made on the master and then back-ported to mimic.

Actions

Copy link

Updated by Sridhar Seshasayee over 4 years ago

Related to Bug #41758: Ceph status in some cases does not report slow ops added

Actions

Copy link

Updated by Sridhar Seshasayee over 4 years ago

Tracker changed from Bug to Backport

Converting this to track backport from master where the fix is under review.

Actions

Copy link

Updated by Nathan Cutler over 4 years ago

Subject changed from Ceph Mimic status does not report slow ops to mimic: Ceph status in some cases does not report slow ops
Status changed from In Progress to New
Assignee deleted (~~Sridhar Seshasayee~~)
Priority changed from High to Normal
Target version deleted (~~v13.2.6~~)
Release set to mimic

Actions

Copy link

Updated by Nathan Cutler over 4 years ago

Status changed from New to Need More Info

backport ticket opened prematurely - setting "Need More Info" pending:

1. opening of PR fixing the issue in master
2. merge of said master PR
3. setting of the master tracker issue status to "Pending Backport"

Actions

Copy link

Updated by Nathan Cutler over 4 years ago

Related to deleted (Bug #41758: Ceph status in some cases does not report slow ops)

Actions

Copy link

#10

Updated by Nathan Cutler over 4 years ago

Copied from Bug #41758: Ceph status in some cases does not report slow ops added

Actions

Copy link

#11

Updated by Neha Ojha over 4 years ago

Nathan Cutler wrote:

backport ticket opened prematurely - setting "Need More Info" pending:

1. opening of PR fixing the issue in master
2. merge of said master PR
3. setting of the master tracker issue status to "Pending Backport"

To clarify things, the bug was reported in mimic using this tracker issue. Later it was found that the same bug existed in master, for which https://tracker.ceph.com/issues/41758 was created. It is fine to mark one of these dup of the other and use one tracker issue for backports.

Actions

Copy link

#12