Backport #40993

mimic: Ceph status in some cases does not report slow ops

Added by Theofilos Mouratidis almost 5 years ago. Updated over 4 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
Release: mimic
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We had two instances, both running 13.2.6, where ceph status did not report the slow ops of failing disks.
This is from one of the clusters:

2019-07-25 09:16:45.118 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 738 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:46.077 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:47.117 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:48.082 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:49.045 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:50.007 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:51.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 743 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:52.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:53.022 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:54.051 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:55.005 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 746 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.003 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.998 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 749 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:57.976 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 440 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:58.980 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 442 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)

This had been going on for hours while ceph status kept reporting HEALTH_OK:

[10:15][root@p05972678u11018 (production:ceph/erin/osd*28) ~]# zgrep 'slow ops' /var/log/ceph/ceph-osd.16.log-20190726.gz | cut -c 1-13 | sort -u
2019-07-25 00
2019-07-25 01
2019-07-25 02
2019-07-25 03
2019-07-25 04
2019-07-25 05
2019-07-25 06
2019-07-25 07
2019-07-25 08
2019-07-25 09
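
For reference, a quick way to cross-check this from the monitor side (the exact health check name may differ between releases, so the grep below is deliberately loose):

ceph -s
ceph health detail | grep -i slow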

In the other instance, an SSD (holding rocksdb) was failing and we were getting slow ops for the OSDs that used that SSD, but ceph status was still reporting HEALTH_OK.
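
The per-OSD op tracker can also be queried directly over the admin socket on the OSD host (assuming the usual mimic admin socket commands; osd.16 is just the example from the log above):

ceph daemon osd.16 dump_blocked_ops
ceph daemon osd.16 dump_historic_slow_ops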

Logs sent to:

ceph-post-file: 161fda4e-8339-4217-a10c-77ff79043d7d


Related issues 1 (0 open, 1 closed)

Copied from RADOS - Bug #41758: Ceph status in some cases does not report slow ops (Duplicate, Sridhar Seshasayee, 09/11/2019)
