Backport #40993
mimic: Ceph status in some cases does not report slow ops
Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
Release:
mimic
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We had two instances where clusters running 13.2.6 did not report the slow ops caused by failing disks.
This is from 1 cluster:
2019-07-25 09:16:45.118 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 738 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:46.077 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:47.117 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:48.082 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:49.045 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:50.007 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 741 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:51.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 743 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:52.001 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:53.022 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:54.051 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:55.005 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 746 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.003 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 745 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:56.998 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 749 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:57.976 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 440 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
2019-07-25 09:16:58.980 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 442 slow ops, oldest is osd_op(client.3968958731.0:24500158 68.3232s0 68.79723232 (undecoded) ondisk+retry+read+known_if_redirected e324859)
This went on for hours while the cluster kept reporting HEALTH_OK:
[10:15][root@p05972678u11018 (production:ceph/erin/osd*28) ~]# zgrep 'slow ops' /var/log/ceph/ceph-osd.16.log-20190726.gz | cut -c 1-13 | sort -u
2019-07-25 00
2019-07-25 01
2019-07-25 02
2019-07-25 03
2019-07-25 04
2019-07-25 05
2019-07-25 06
2019-07-25 07
2019-07-25 08
2019-07-25 09
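The pipeline above can be extended to count how many slow-op reports fall in each hour, which makes the duration of the problem easier to see. A minimal sketch, using a hypothetical sample log written to a temp file rather than a real OSD log under /var/log/ceph/:

```shell
#!/bin/sh
# Hypothetical sample log for illustration; timestamps/counts are made up.
log=$(mktemp)
cat > "$log" <<'EOF'
2019-07-25 09:16:45.118 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 738 slow ops
2019-07-25 09:16:46.077 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 739 slow ops
2019-07-25 10:02:01.000 7f99f787d700 -1 osd.16 324862 get_health_metrics reporting 12 slow ops
EOF
# First 13 characters of each line are "YYYY-MM-DD HH";
# uniq -c turns the sorted hour keys into per-hour counts.
grep 'slow ops' "$log" | cut -c 1-13 | sort | uniq -c
rm -f "$log"
```

On a real cluster the same `grep … | cut -c 1-13 | sort | uniq -c` tail can be applied to the `zgrep` output shown above (swapping `grep` for `zgrep` on rotated, compressed logs).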
In the other instance, an SSD holding RocksDB was failing, and the OSDs backed by that SSD were logging slow ops, yet ceph status still reported HEALTH_OK.
Logs sent to:
ceph-post-file: 161fda4e-8339-4217-a10c-77ff79043d7d