Bug #45871

Incorrect (0) number of slow requests in health check

Added by Eugen Block almost 4 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 14.2.9-899-gc02349c600 (c02349c60052aaa6c7bd0c2270c7f7be16fab632) nautilus (stable)

Our cluster shows slow requests during deep-scrubs; that is a known issue and not relevant here. But since our upgrade to 14.2.9, the reports in ceph.log (and therefore in the dashboard) show wrong numbers (0 slow ops):

ceph02:~ # zgrep "slow ops" /var/log/ceph/ceph.log-20200604.xz 
2020-06-03 21:38:54.707315 mon.ceph01 (mon.0) 380808 : cluster [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.15 has slow ops (SLOW_OPS)
2020-06-03 21:38:58.651470 mon.ceph01 (mon.0) 380817 : cluster [INF] Health check cleared: SLOW_OPS (was: 0 slow ops, oldest one blocked for 30 sec, osd.15 has slow ops)
2020-06-03 21:50:07.029901 mon.ceph01 (mon.0) 381676 : cluster [WRN] Health check failed: 2 slow ops, oldest one blocked for 30 sec, osd.25 has slow ops (SLOW_OPS)
2020-06-03 21:50:13.081132 mon.ceph01 (mon.0) 381685 : cluster [INF] Health check cleared: SLOW_OPS (was: 2 slow ops, oldest one blocked for 30 sec, osd.25 has slow ops)
2020-06-03 21:55:49.691150 mon.ceph01 (mon.0) 382183 : cluster [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.15 has slow ops (SLOW_OPS)
2020-06-03 21:55:54.931092 mon.ceph01 (mon.0) 382191 : cluster [WRN] Health check update: 8 slow ops, oldest one blocked for 35 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:03.868408 mon.ceph01 (mon.0) 382196 : cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 45 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:09.008182 mon.ceph01 (mon.0) 382201 : cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 47 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:14.008945 mon.ceph01 (mon.0) 382202 : cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 52 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:19.009696 mon.ceph01 (mon.0) 382204 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 57 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:24.012114 mon.ceph01 (mon.0) 382207 : cluster [WRN] Health check update: 1 slow ops, oldest one blocked for 60 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:29.012734 mon.ceph01 (mon.0) 382208 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 65 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:34.013428 mon.ceph01 (mon.0) 382211 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 75 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)
2020-06-03 21:56:39.078084 mon.ceph01 (mon.0) 382213 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 75 sec, osd.28 has slow ops (SLOW_OPS)
2020-06-03 21:56:41.444775 mon.ceph01 (mon.0) 382215 : cluster [INF] Health check cleared: SLOW_OPS (was: 0 slow ops, oldest one blocked for 75 sec, osd.28 has slow ops)
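The inconsistency can be pulled out of these lines mechanically: the count at the start of the SLOW_OPS message is sometimes 0 even though the same message still names daemons with slow ops. A minimal parsing sketch (log lines abbreviated from the excerpt above):

```python
import re

# Cluster-log lines, abbreviated from the excerpt above.
lines = [
    "cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 52 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)",
    "cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 57 sec, daemons [osd.15,osd.28] have slow ops. (SLOW_OPS)",
    "cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 75 sec, osd.28 has slow ops (SLOW_OPS)",
]

# Extract the reported count and blocked-for time from each SLOW_OPS line.
pattern = re.compile(r"(\d+) slow ops, oldest one blocked for (\d+) sec")

for line in lines:
    m = pattern.search(line)
    count, blocked = int(m.group(1)), int(m.group(2))
    # A count of 0 while the check is still active is the inconsistency:
    # the message names daemons with slow ops but reports none.
    print(f"count={count} blocked={blocked}s")
```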

Looking into the logs of osd.15, I see no reports of "0 slow ops"; the OSD itself always reports a value > 0:

ceph01:~ # zgrep "slow ops" /var/log/ceph/ceph-osd.15.log-20200604.xz | cut -d "," -f 1
2020-06-03 21:55:48.680 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 3 slow ops
2020-06-03 21:55:49.680 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:55:50.660 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:55:51.668 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 8 slow ops
2020-06-03 21:55:52.660 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 8 slow ops
2020-06-03 21:55:53.664 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:55:54.692 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:55:55.736 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:55:56.688 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:00.696 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:01.664 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:02.632 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:03.660 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:04.692 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:06.700 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:07.687 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 2 slow ops
2020-06-03 21:56:08.659 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:09.631 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:11.627 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:14.607 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:16.575 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:17.591 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops
2020-06-03 21:56:18.567 7fcdcae79700 -1 osd.15 289241 get_health_metrics reporting 1 slow ops

And for osd.28 I don't see any slow ops in the osd.log at all:

ceph02:~ # zgrep "slow ops" /var/log/ceph/ceph-osd.28.log-20200604.xz
ceph02:~ #
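To make the mismatch concrete, the mon's health-check count can be put next to the count osd.15 itself logged a fraction of a second earlier. A minimal sketch, using timestamps and counts copied from the two logs above (osd.28, the other daemon named in the mon message, logged nothing at all):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (timestamp, reported slow-op count), copied from the logs above.
mon_report = (datetime.strptime("2020-06-03 21:56:19.009696", FMT), 0)
osd15_report = (datetime.strptime("2020-06-03 21:56:18.567000", FMT), 1)

gap = (mon_report[0] - osd15_report[0]).total_seconds()
# Less than a second apart, yet the mon reports 0 slow ops while
# osd.15 itself reported 1.
print(f"gap={gap:.3f}s mon={mon_report[1]} osd.15={osd15_report[1]}")
```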
#1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from Incorrect number of slow requests in health check to Incorrect (0) number of slow requests in health check
#2

Updated by Nico Schottelius over 2 years ago

On a

ceph version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6) nautilus (stable)

cluster we see

  cluster:
    id:     ...
    health: HEALTH_WARN
            0 slow ops, oldest one blocked for 33 sec, osd.117 has slow ops

where osd.117 had been down for some hours. The reported "blocked for" time never changes.
