Bug #40993

Ceph Mimic status does not report slow ops

Added by Theofilos Mouratidis 23 days ago. Updated 9 days ago.

Status: New
Priority: High
Category: Administration/Usability
Target version:
Start date: 07/29/2019
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor, OSD
Pull request ID:

Description

We had two instances while running 13.2.6 where the cluster status did not report the slow ops caused by failing disks.
This is from one cluster:

2019-07-25 09:16:45.118 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 738 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:46.077 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:47.117 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:48.082 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:49.045 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:50.007 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:51.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 743 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:52.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:53.022 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:54.051 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:55.005 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 746 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.003 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.998 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 749 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:57.976 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 440 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:58.980 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 442 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
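
For reference, the slow ops can also be confirmed on the daemon itself through its admin socket (a sketch; run on the OSD host, and dump_historic_slow_ops may not be available in every release):

# Ops currently tracked by osd.16, straight from the daemon's admin socket
ceph daemon osd.16 dump_ops_in_flight
ceph daemon osd.16 dump_blocked_ops
ceph daemon osd.16 dump_historic_slow_ops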

This went on for hours while the cluster kept reporting HEALTH_OK:

[10:15][root@p05972678u11018 (production:ceph/erin/osd*28) ~]# zgrep 'slow ops' /var/log/ceph/ceph-osd.16.log-20190726.gz | cut -c 1-13 | sort -u
2019-07-25 00
2019-07-25 01
2019-07-25 02
2019-07-25 03
2019-07-25 04
2019-07-25 05
2019-07-25 06
2019-07-25 07
2019-07-25 08
2019-07-25 09
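
Our expectation (which is exactly what this report questions) is that these daemon-reported slow ops should surface on the monitor side as a SLOW_OPS health warning, i.e. something visible in:

ceph -s
ceph health detail

Instead the cluster stayed HEALTH_OK for the whole period above.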

In the other instance, an SSD used for RocksDB was failing, and the OSDs backed by that SSD were reporting slow ops, but ceph status was still reporting HEALTH_OK.
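
For the SSD case, the OSDs sharing the failing DB device can be listed on the host, e.g. (paths are illustrative for a standard BlueStore layout):

# Show each OSD on this host together with its data and DB devices
ceph-volume lvm list

# Or check which OSDs point their block.db at the failing SSD
ls -l /var/lib/ceph/osd/ceph-*/block.db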

Logs sent to:

ceph-post-file: 161fda4e-8339-4217-a10c-77ff79043d7d

History

#1 Updated by Neha Ojha 21 days ago

  • Priority changed from Normal to High

#2 Updated by Josh Durgin 9 days ago

  • Assignee set to Sridhar Seshasayee
