Bug #45871

Incorrect (0) number of slow requests in health check

Added by Eugen Block almost 4 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 14.2.9-899-gc02349c600 (c02349c60052aaa6c7bd0c2270c7f7be16fab632) nautilus (stable)

Our cluster shows slow requests during deep-scrubs; that is a known issue and not relevant here. But since our upgrade to 14.2.9, the reports in ceph.log (and therefore in the dashboard) show wrong numbers (0 slow ops):

ceph02:~ # zgrep "slow ops" /var/log/ceph/ceph.log-20200604.xz 
2020-06-03 21:38:54.707315 mon.ceph01 (mon.0) 380808 : cluster [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.15 has slow ops (SLOW_OPS)
2020-06-03 21:38:58.651470 mon.ceph01 (mon.0) 380817 : cluster [INF] Health check cleared: SLOW_OPS (was: 0 slow ops, oldest one blocked for 30 sec, osd.15 has slow ops)
2020-06-03 21:50:07.029901 mon.ceph01 (mon.0) 381676 : cluster [WRN] Health check failed: 2 slow ops, oldest one blocked for 30 sec, osd.25 has slow ops (SLOW_OPS)
2020-06-03 21:50:13.081132 mon.ceph01 (mon.0) 381685 : cluster [INF] Health check cleared: SLOW_OPS (was: 2 slow ops, oldest one blocked for 30 sec, osd.25 has slow ops)
2020-06-03 21:55:49.691150 mon.ceph01 (mon.0) 382183 : cluster [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.15 has slow ops (SLOW_OPS)
2020-06-03 21:55:54.931092 mon.ceph01 (mon.0) 382191 : cluster [WRN] Health check update: 8 slow ops, oldest one blocked for 35 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:03.868408 mon.ceph01 (mon.0) 382196 : cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 45 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:09.008182 mon.ceph01 (mon.0) 382201 : cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 47 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:14.008945 mon.ceph01 (mon.0) 382202 : cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 52 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:19.009696 mon.ceph01 (mon.0) 382204 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 57 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:24.012114 mon.ceph01 (mon.0) 382207 : cluster [WRN] Health check update: 1 slow ops, oldest one blocked for 60 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:29.012734 mon.ceph01 (mon.0) 382208 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 65 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:34.013428 mon.ceph01 (mon.0) 382211 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 75 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:39.078084 mon.ceph01 (mon.0) 382213 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 75 sec, osd.28 has slow ops (SLOW_OPS)
2020-06-03 21:56:41.444775 mon.ceph01 (mon.0) 382215 : cluster [INF] Health check cleared: SLOW_OPS (was: 0 slow ops, oldest one blocked for 75 sec, osd.28 has slow ops)
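The inconsistency can be pulled out of these lines mechanically: the count at the start of the SLOW_OPS message is sometimes 0 even though the same message still names daemons with slow ops. A minimal parsing sketch (log lines abbreviated from the excerpt above):

```python
import re

# Cluster-log lines, abbreviated from the excerpt above.
lines = [
    "cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 52 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)",
    "cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 57 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)",
    "cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 75 sec, osd.28 has slow ops (SLOW_OPS)",
]

# Extract the reported count and blocked-for time from each SLOW_OPS line.
pattern = re.compile(r"(\d+) slow ops, oldest one blocked for (\d+) sec")

for line in lines:
    m = pattern.search(line)
    count, blocked = int(m.group(1)), int(m.group(2))
    # A count of 0 while the check is still active is the inconsistency:
    # the message names daemons with slow ops but reports none.
    print(f"count={count} blocked={blocked}s")
```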

Looking into the logs of osd.15, I see no reports of "0 slow ops"; the OSD itself always reports a value > 0:

ceph01:~ # zgrep "slow ops" /var/log/ceph/ceph-osd.15.log-20200604.xz | cut -d "," -f 1
2020-06-03 21:55:48.680 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 3 slow ops
2020-06-03 21:55:49.680 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:55:50.660 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:55:51.668 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 8 slow ops
2020-06-03 21:55:52.660 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 8 slow ops
2020-06-03 21:55:53.664 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:55:54.692 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:55:55.736 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:55:56.688 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:00.696 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:01.664 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:02.632 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:03.660 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:04.692 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:06.700 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:07.687 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:08.659 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:09.631 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:11.627 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:14.607 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:16.575 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:17.591 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:18.567 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops

And for osd.28 I don't see any slow ops in the osd.log at all:

ceph02:~ # zgrep "slow ops" /var/log/ceph/ceph-osd.28.log-20200604.xz
ceph02:~ #
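To make the mismatch concrete, the mon's health-check count can be put next to the count osd.15 itself logged a fraction of a second earlier. A minimal sketch, using timestamps and counts copied from the two logs above (osd.28, the other daemon named in the mon message, logged nothing at all):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (timestamp, reported slow-op count), copied from the logs above.
mon_report = (datetime.strptime("2020-06-03 21:56:19.009696", FMT), 0)
osd15_report = (datetime.strptime("2020-06-03 21:56:18.567000", FMT), 1)

gap = (mon_report[0] - osd15_report[0]).total_seconds()
# Less than a second apart, yet the mon reports 0 slow ops while
# osd.15 itself reported 1.
print(f"gap={gap:.3f}s mon={mon_report[1]} osd.15={osd15_report[1]}")
```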
#1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from Incorrect number of slow requests in health check to Incorrect (0) number of slow requests in health check
#2

Updated by Nico Schottelius over 2 years ago

On a

ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6) nautilus (stable)

cluster we see

  cluster:
    id:     ...
    health: HEALTH_WARN
            0 slow ops, oldest one blocked for 33 sec, osd.117 has slow ops

where osd.117 had been down for some hours. The reported "blocked for" time never changes.
