Project

General

Profile

Actions

Bug #50637

closed

OSD slow ops warning stuck after OSD fail

Added by Frank Schilder almost 3 years ago. Updated 6 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
medium-hanging-fruit
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We had a disk fail with 2 OSDs deployed on it, ids=580, 581. Since then, the health warning 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops is not cleared despite the OSD being down+out. I include the relevant portions of the ceph log directly below. A similar problem for MON slow ops has been observed in #47380 . Restarting the MONs did not help.

2021-05-03 00:00:00.000111 mon.ceph-01 mon.0 192.168.32.65:6789/0 139302 : cluster [INF] overall HEALTH_OK
2021-05-03 01:00:00.000095 mon.ceph-01 mon.0 192.168.32.65:6789/0 139349 : cluster [INF] overall HEALTH_OK
2021-05-03 01:27:56.169402 mon.ceph-01 mon.0 192.168.32.65:6789/0 139365 : cluster [WRN] Health check failed: 10 slow ops, oldest one blocked for 31 sec, daemons [osd.246,osd.248,osd.286,osd.561] have slow ops. (SLOW_OPS)
2021-05-03 01:28:03.197307 mon.ceph-01 mon.0 192.168.32.65:6789/0 139366 : cluster [WRN] Health check update: 709 slow ops, oldest one blocked for 36 sec, daemons [osd.242,osd.243,osd.244,osd.246,osd.247,osd.248,osd.252,osd.253,osd.254,osd.255]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:03.157603 osd.254 osd.254 192.168.32.68:6826/15398 735 : cluster [WRN] Error(s) ignored for 2:d28d1b1b:::rbd_data.1.ac384d6b8b4567.00000000000022c4:head enough copies available

[zap]

2021-05-03 01:28:03.156392 osd.251 osd.251 192.168.32.69:6814/15081 509 : cluster [WRN] Error(s) ignored for 2:e05cfc48:::rbd_data.1.3ca7946b8b4567.000000000000035c:head enough copies available
2021-05-03 01:28:08.323933 mon.ceph-01 mon.0 192.168.32.65:6789/0 139368 : cluster [WRN] Health check update: 883 slow ops, oldest one blocked for 43 sec, daemons [osd.242,osd.243,osd.244,osd.245,osd.246,osd.247,osd.248,osd.251,osd.252,osd.253]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:03.157479 osd.243 osd.243 192.168.32.68:6800/16984 503 : cluster [WRN] Error(s) ignored for 2:c55cfe83:::rbd_data.1.a9b9a76b8b4567.00000000000001c6:head enough copies available
2021-05-03 01:28:03.158878 osd.243 osd.243 192.168.32.68:6800/16984 504 : cluster [WRN] Error(s) ignored for 2:c56cdd5f:::rbd_data.1.d1fdaf6b8b4567.0000000000000f88:head enough copies available
2021-05-03 01:28:03.124495 osd.577 osd.577 192.168.32.85:6828/11420 387 : cluster [WRN] Error(s) ignored for 2:a4027c44:::rbd_data.1.a5083f6b8b4567.0000000000001936:head enough copies available
2021-05-03 01:28:14.197488 mon.ceph-01 mon.0 192.168.32.65:6789/0 139370 : cluster [WRN] Health check update: 975 slow ops, oldest one blocked for 47 sec, daemons [osd.242,osd.243,osd.244,osd.245,osd.246,osd.247,osd.248,osd.251,osd.252,osd.253]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:17.418279 mon.ceph-01 mon.0 192.168.32.65:6789/0 139372 : cluster [INF] osd.581 failed (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) (connection refused reported by osd.561)
2021-05-03 01:28:17.470635 mon.ceph-01 mon.0 192.168.32.65:6789/0 139469 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2021-05-03 01:28:18.503047 mon.ceph-01 mon.0 192.168.32.65:6789/0 139471 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2021-05-03 01:28:19.198113 mon.ceph-01 mon.0 192.168.32.65:6789/0 139473 : cluster [WRN] Health check update: 1099 slow ops, oldest one blocked for 53 sec, daemons [osd.242,osd.243,osd.244,osd.245,osd.246,osd.247,osd.248,osd.251,osd.252,osd.253]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:20.504696 mon.ceph-01 mon.0 192.168.32.65:6789/0 139474 : cluster [WRN] Health check failed: Degraded data redundancy: 1370/4907354275 objects degraded (0.000%), 3 pgs degraded (PG_DEGRADED)
2021-05-03 01:28:21.308849 mon.ceph-01 mon.0 192.168.32.65:6789/0 139476 : cluster [INF] osd.580 failed (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) (connection refused reported by osd.280)
2021-05-03 01:28:24.198568 mon.ceph-01 mon.0 192.168.32.65:6789/0 139605 : cluster [WRN] Health check update: 2 osds down (OSD_DOWN)
2021-05-03 01:28:24.198628 mon.ceph-01 mon.0 192.168.32.65:6789/0 139606 : cluster [WRN] Health check update: 1167 slow ops, oldest one blocked for 57 sec, daemons [osd.242,osd.243,osd.244,osd.245,osd.246,osd.247,osd.248,osd.251,osd.252,osd.253]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:26.791790 mon.ceph-01 mon.0 192.168.32.65:6789/0 139608 : cluster [WRN] Health check update: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)
2021-05-03 01:28:26.791821 mon.ceph-01 mon.0 192.168.32.65:6789/0 139609 : cluster [WRN] Health check update: Degraded data redundancy: 132987/4907352928 objects degraded (0.003%), 24 pgs degraded (PG_DEGRADED)
2021-05-03 01:28:29.199179 mon.ceph-01 mon.0 192.168.32.65:6789/0 139613 : cluster [WRN] Health check update: 1241 slow ops, oldest one blocked for 63 sec, daemons [osd.242,osd.243,osd.244,osd.245,osd.246,osd.247,osd.248,osd.251,osd.252,osd.253]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:33.013884 mon.ceph-01 mon.0 192.168.32.65:6789/0 139616 : cluster [WRN] Health check update: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2021-05-03 01:28:33.013960 mon.ceph-01 mon.0 192.168.32.65:6789/0 139617 : cluster [WRN] Health check update: Degraded data redundancy: 340284/4907351717 objects degraded (0.007%), 36 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:28:34.199870 mon.ceph-01 mon.0 192.168.32.65:6789/0 139619 : cluster [WRN] Health check update: 1249 slow ops, oldest one blocked for 67 sec, daemons [osd.243,osd.244,osd.245,osd.246,osd.247,osd.248,osd.251,osd.252,osd.253,osd.254]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:39.200558 mon.ceph-01 mon.0 192.168.32.65:6789/0 139624 : cluster [WRN] Health check update: Degraded data redundancy: 464680/4907351219 objects degraded (0.009%), 43 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:28:39.200619 mon.ceph-01 mon.0 192.168.32.65:6789/0 139625 : cluster [WRN] Health check update: 1299 slow ops, oldest one blocked for 72 sec, daemons [osd.243,osd.244,osd.245,osd.246,osd.247,osd.248,osd.251,osd.252,osd.253,osd.254]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:44.201112 mon.ceph-01 mon.0 192.168.32.65:6789/0 139630 : cluster [WRN] Health check update: Degraded data redundancy: 882272/4907350011 objects degraded (0.018%), 68 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:28:44.201145 mon.ceph-01 mon.0 192.168.32.65:6789/0 139631 : cluster [WRN] Health check update: 1258 slow ops, oldest one blocked for 77 sec, daemons [osd.242,osd.243,osd.247,osd.251,osd.252,osd.253,osd.254,osd.255,osd.256,osd.257]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:49.201579 mon.ceph-01 mon.0 192.168.32.65:6789/0 139635 : cluster [WRN] Health check update: Degraded data redundancy: 1484634/4907349129 objects degraded (0.030%), 98 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:28:49.201641 mon.ceph-01 mon.0 192.168.32.65:6789/0 139636 : cluster [WRN] Health check update: 978 slow ops, oldest one blocked for 83 sec, daemons [osd.247,osd.251,osd.252,osd.253,osd.255,osd.256,osd.264,osd.267,osd.268,osd.270]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:54.202313 mon.ceph-01 mon.0 192.168.32.65:6789/0 139642 : cluster [WRN] Health check update: Degraded data redundancy: 1662096/4907348015 objects degraded (0.034%), 108 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:28:54.202368 mon.ceph-01 mon.0 192.168.32.65:6789/0 139643 : cluster [WRN] Health check update: 788 slow ops, oldest one blocked for 87 sec, daemons [osd.247,osd.251,osd.252,osd.256,osd.264,osd.271,osd.275,osd.278,osd.279,osd.283]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:59.203162 mon.ceph-01 mon.0 192.168.32.65:6789/0 139701 : cluster [WRN] Health check update: Degraded data redundancy: 1745285/4907347295 objects degraded (0.036%), 113 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:28:59.203236 mon.ceph-01 mon.0 192.168.32.65:6789/0 139702 : cluster [WRN] Health check update: 744 slow ops, oldest one blocked for 91 sec, daemons [osd.247,osd.251,osd.252,osd.256,osd.264,osd.265,osd.270,osd.271,osd.275,osd.280]... have slow ops. (SLOW_OPS)
2021-05-03 01:28:59.891340 mon.ceph-01 mon.0 192.168.32.65:6789/0 139708 : cluster [WRN] Health check update: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2021-05-03 01:29:00.992129 mon.ceph-01 mon.0 192.168.32.65:6789/0 139726 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2021-05-03 01:29:04.205507 mon.ceph-01 mon.0 192.168.32.65:6789/0 139750 : cluster [WRN] Health check update: Degraded data redundancy: 1975171/4907346245 objects degraded (0.040%), 127 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:29:04.205561 mon.ceph-01 mon.0 192.168.32.65:6789/0 139751 : cluster [WRN] Health check update: 562 slow ops, oldest one blocked for 96 sec, daemons [osd.247,osd.251,osd.252,osd.264,osd.265,osd.270,osd.271,osd.272,osd.273,osd.283]... have slow ops. (SLOW_OPS)
2021-05-03 01:29:09.206145 mon.ceph-01 mon.0 192.168.32.65:6789/0 139752 : cluster [WRN] Health check update: Degraded data redundancy: 2007214/4907345166 objects degraded (0.041%), 129 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:29:09.206197 mon.ceph-01 mon.0 192.168.32.65:6789/0 139753 : cluster [WRN] Health check update: 551 slow ops, oldest one blocked for 101 sec, daemons [osd.252,osd.264,osd.270,osd.271,osd.273,osd.278,osd.552,osd.553,osd.559,osd.566]... have slow ops. (SLOW_OPS)
2021-05-03 01:29:14.206806 mon.ceph-01 mon.0 192.168.32.65:6789/0 139755 : cluster [WRN] Health check update: Degraded data redundancy: 2028021/4907344620 objects degraded (0.041%), 130 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:29:14.206857 mon.ceph-01 mon.0 192.168.32.65:6789/0 139756 : cluster [WRN] Health check update: 558 slow ops, oldest one blocked for 106 sec, daemons [osd.252,osd.264,osd.271,osd.278,osd.282,osd.552,osd.559,osd.563,osd.566,osd.568]... have slow ops. (SLOW_OPS)
2021-05-03 01:29:19.207409 mon.ceph-01 mon.0 192.168.32.65:6789/0 139758 : cluster [WRN] Health check update: Degraded data redundancy: 2070420/4907343920 objects degraded (0.042%), 132 pgs degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 01:29:19.207467 mon.ceph-01 mon.0 192.168.32.65:6789/0 139759 : cluster [WRN] Health check update: 469 slow ops, oldest one blocked for 109 sec, daemons [osd.252,osd.278,osd.282,osd.559,osd.566,osd.568,osd.580] have slow ops. (SLOW_OPS)
2021-05-03 01:29:19.254238 mon.ceph-01 mon.0 192.168.32.65:6789/0 139760 : cluster [WRN] Health check failed: Reduced data availability: 1 pg inactive (PG_AVAILABILITY)
2021-05-03 01:29:21.283434 mon.ceph-01 mon.0 192.168.32.65:6789/0 139761 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive)
2021-05-03 01:29:24.208116 mon.ceph-01 mon.0 192.168.32.65:6789/0 139762 : cluster [WRN] Health check update: Degraded data redundancy: 2070424/4907343125 objects degraded (0.042%), 132 pgs degraded, 26 pgs undersized (PG_DEGRADED)
2021-05-03 01:29:24.208172 mon.ceph-01 mon.0 192.168.32.65:6789/0 139763 : cluster [WRN] Health check update: 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops (SLOW_OPS)
2021-05-03 01:29:29.208719 mon.ceph-01 mon.0 192.168.32.65:6789/0 139764 : cluster [WRN] Health check update: Degraded data redundancy: 2070425/4907342447 objects degraded (0.042%), 132 pgs degraded, 32 pgs undersized (PG_DEGRADED)

[zap]

2021-05-03 01:38:19.278118 mon.ceph-01 mon.0 192.168.32.65:6789/0 139881 : cluster [WRN] Health check update: Degraded data redundancy: 2070848/4907267025 objects degraded (0.042%), 132 pgs degraded, 132 pgs undersized (PG_DEGRADED)
2021-05-03 01:38:19.278286 mon.ceph-01 mon.0 192.168.32.65:6789/0 139882 : cluster [INF] Marking osd.581 out (has been down for 601 seconds)
2021-05-03 01:38:19.281669 mon.ceph-01 mon.0 192.168.32.65:6789/0 139883 : cluster [WRN] Health check update: 1 osds down (OSD_DOWN)
2021-05-03 01:38:21.314150 mon.ceph-01 mon.0 192.168.32.65:6789/0 139886 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2021-05-03 01:38:24.282902 mon.ceph-01 mon.0 192.168.32.65:6789/0 139888 : cluster [WRN] Health check update: Degraded data redundancy: 1705383/4907266464 objects degraded (0.035%), 108 pgs degraded, 92 pgs undersized (PG_DEGRADED)
2021-05-03 01:38:24.283069 mon.ceph-01 mon.0 192.168.32.65:6789/0 139889 : cluster [INF] Marking osd.580 out (has been down for 602 seconds)
2021-05-03 01:38:24.285727 mon.ceph-01 mon.0 192.168.32.65:6789/0 139890 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2021-05-03 01:38:27.431693 mon.ceph-01 mon.0 192.168.32.65:6789/0 139894 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2021-05-03 01:38:29.286939 mon.ceph-01 mon.0 192.168.32.65:6789/0 139895 : cluster [WRN] Health check update: Degraded data redundancy: 1627190/4907265732 objects degraded (0.033%), 104 pgs degraded, 35 pgs undersized (PG_DEGRADED)
2021-05-03 01:38:34.287435 mon.ceph-01 mon.0 192.168.32.65:6789/0 139896 : cluster [WRN] Health check update: Degraded data redundancy: 2065248/4907264472 objects degraded (0.042%), 132 pgs degraded (PG_DEGRADED)
2021-05-03 01:38:39.287894 mon.ceph-01 mon.0 192.168.32.65:6789/0 139901 : cluster [WRN] Health check update: Degraded data redundancy: 2062695/4907263870 objects degraded (0.042%), 131 pgs degraded (PG_DEGRADED)
2021-05-03 01:38:44.299924 mon.ceph-01 mon.0 192.168.32.65:6789/0 139906 : cluster [WRN] Health check update: Degraded data redundancy: 2058678/4907262836 objects degraded (0.042%), 128 pgs degraded (PG_DEGRADED)
2021-05-03 01:38:49.300652 mon.ceph-01 mon.0 192.168.32.65:6789/0 139910 : cluster [WRN] Health check update: Degraded data redundancy: 2055963/4907262128 objects degraded (0.042%), 126 pgs degraded (PG_DEGRADED)
2021-05-03 01:38:54.301313 mon.ceph-01 mon.0 192.168.32.65:6789/0 139913 : cluster [WRN] Health check update: Degraded data redundancy: 2051400/4907261226 objects degraded (0.042%), 124 pgs degraded (PG_DEGRADED)
2021-05-03 01:38:59.306048 mon.ceph-01 mon.0 192.168.32.65:6789/0 139922 : cluster [WRN] Health check update: Degraded data redundancy: 2048351/4907260506 objects degraded (0.042%), 122 pgs degraded (PG_DEGRADED)
2021-05-03 01:39:04.306542 mon.ceph-01 mon.0 192.168.32.65:6789/0 139928 : cluster [WRN] Health check update: Degraded data redundancy: 2045397/4907259809 objects degraded (0.042%), 116 pgs degraded (PG_DEGRADED)
2021-05-03 01:39:09.307168 mon.ceph-01 mon.0 192.168.32.65:6789/0 139931 : cluster [WRN] Health check update: Degraded data redundancy: 2041840/4907263228 objects degraded (0.042%), 112 pgs degraded (PG_DEGRADED)
2021-05-03 01:39:14.307638 mon.ceph-01 mon.0 192.168.32.65:6789/0 139934 : cluster [WRN] Health check update: Degraded data redundancy: 2039421/4907266565 objects degraded (0.042%), 112 pgs degraded (PG_DEGRADED)
2021-05-03 01:39:19.308110 mon.ceph-01 mon.0 192.168.32.65:6789/0 139936 : cluster [WRN] Health check update: Degraded data redundancy: 2035934/4907272731 objects degraded (0.041%), 111 pgs degraded (PG_DEGRADED)
2021-05-03 01:39:24.308669 mon.ceph-01 mon.0 192.168.32.65:6789/0 139938 : cluster [WRN] Health check update: Degraded data redundancy: 2032539/4907277811 objects degraded (0.041%), 110 pgs degraded, 39 pgs undersized (PG_DEGRADED)
2021-05-03 01:39:29.309138 mon.ceph-01 mon.0 192.168.32.65:6789/0 139939 : cluster [WRN] Health check update: Degraded data redundancy: 2029968/4907278512 objects degraded (0.041%), 110 pgs degraded, 110 pgs undersized (PG_DEGRADED)
2021-05-03 01:39:34.309637 mon.ceph-01 mon.0 192.168.32.65:6789/0 139940 : cluster [WRN] Health check update: Degraded data redundancy: 2026498/4907277594 objects degraded (0.041%), 110 pgs degraded, 110 pgs undersized (PG_DEGRADED)
2021-05-03 01:39:39.310047 mon.ceph-01 mon.0 192.168.32.65:6789/0 139943 : cluster [WRN] Health check update: Degraded data redundancy: 2024069/4907276954 objects degraded (0.041%), 110 pgs degraded, 110 pgs undersized (PG_DEGRADED)
2021-05-03 01:39:44.731147 mon.ceph-01 mon.0 192.168.32.65:6789/0 139944 : cluster [WRN] Health check update: Degraded data redundancy: 2021629/4907276492 objects degraded (0.041%), 109 pgs degraded, 109 pgs undersized (PG_DEGRADED)
2021-05-03 01:39:49.731564 mon.ceph-01 mon.0 192.168.32.65:6789/0 139945 : cluster [WRN] Health check update: Degraded data redundancy: 2018099/4907275534 objects degraded (0.041%), 109 pgs degraded, 109 pgs undersized (PG_DEGRADED)
2021-05-03 01:39:54.732116 mon.ceph-01 mon.0 192.168.32.65:6789/0 139946 : cluster [WRN] Health check update: Degraded data redundancy: 2014558/4907274743 objects degraded (0.041%), 109 pgs degraded, 109 pgs undersized (PG_DEGRADED)

[zap]

2021-05-03 01:59:55.290489 mon.ceph-01 mon.0 192.168.32.65:6789/0 140241 : cluster [WRN] Health check update: Degraded data redundancy: 1314892/4907114507 objects degraded (0.027%), 83 pgs degraded, 83 pgs undersized (PG_DEGRADED)
2021-05-03 02:00:00.000099 mon.ceph-01 mon.0 192.168.32.65:6789/0 140242 : cluster [WRN] overall HEALTH_WARN Degraded data redundancy: 1311896/4907113779 objects degraded (0.027%), 82 pgs degraded, 82 pgs undersized; 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
2021-05-03 02:00:00.290987 mon.ceph-01 mon.0 192.168.32.65:6789/0 140243 : cluster [WRN] Health check update: Degraded data redundancy: 1311896/4907113779 objects degraded (0.027%), 82 pgs degraded, 82 pgs undersized (PG_DEGRADED)

[zap]

2021-05-03 02:55:27.933256 mon.ceph-01 mon.0 192.168.32.65:6789/0 141211 : cluster [WRN] Health check update: Degraded data redundancy: 199/4906695595 objects degraded (0.000%), 1 pg degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 02:55:32.933679 mon.ceph-01 mon.0 192.168.32.65:6789/0 141214 : cluster [WRN] Health check update: Degraded data redundancy: 75/4906699543 objects degraded (0.000%), 1 pg degraded, 1 pg undersized (PG_DEGRADED)
2021-05-03 02:55:34.421192 mon.ceph-01 mon.0 192.168.32.65:6789/0 141215 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 75/4906699543 objects degraded (0.000%), 1 pg degraded, 1 pg undersized)
2021-05-03 03:00:00.000103 mon.ceph-01 mon.0 192.168.32.65:6789/0 141240 : cluster [WRN] overall HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
2021-05-03 04:00:00.000106 mon.ceph-01 mon.0 192.168.32.65:6789/0 141537 : cluster [WRN] overall HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
2021-05-03 05:00:00.000140 mon.ceph-01 mon.0 192.168.32.65:6789/0 141788 : cluster [WRN] overall HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
Actions #1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
Actions #2

Updated by Frank Schilder almost 3 years ago

I now zapped and re-created the OSDs on this disk. As expected, purging OSD 580 from the cluster cleared the health warning.

Actions #3

Updated by Neha Ojha almost 3 years ago

  • Priority changed from Normal to High

This sounds like a bug, we shouldn't be accounting for down+out osds when counting slow ops.

Actions #4

Updated by Neha Ojha about 2 years ago

  • Priority changed from High to Normal
Actions #5

Updated by Christian Rohmann about 1 year ago

We just observed this exact behavior with a dying server and its OSDs down:

ceph -s
  cluster:
    id:     
    health: HEALTH_WARN
            49 slow ops, oldest one blocked for 525 sec, daemons [osd.37,osd.43] have slow ops.
[...]
-17          12.22543      host crashedmachine                            
 37    ssd    1.74649          osd.37              down         0  1.00000
 41    ssd    1.74649          osd.41              down         0  1.00000
 43    ssd    1.74649          osd.43              down         0  1.00000
 45    ssd    1.74649          osd.45              down         0  1.00000
 47    ssd    1.74649          osd.47              down         0  1.00000
 60    ssd    1.74649          osd.60              down         0  1.00000
 67    ssd    1.74649          osd.67              down         0  1.00000
[...]

But on Ceph Pacific:

ceph version 16.2.10

Actions #6

Updated by Radoslaw Zarzynski about 1 year ago

  • Tags set to medium-hanging-fruit
  • Backport set to pacific,quincy

I think the problem is that we lack a machinery for cleaning the slow-ops status when a monitor marks on OSD down.

Actions #7

Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee set to Prashant D
Actions #8

Updated by Radoslaw Zarzynski about 1 year ago

Prashant, would you mind taking a look at time?

Sure Radoslaw. I will have a look at this.

Actions #9

Updated by Prashant D about 1 year ago

I tried to reproduce this issue on (latest main) vstart cluster by setting osd_op_complaint_time to 1 second and running rados bench runs. Even though there were many slow ops on OSDs but stopping/killing or making osd down/out does not get slow ops stuck on the OSD. All slow ops were getting cleared as soon as osd is in down+out state.

I will try to reproduce this issue on 16.2.10 version.

Actions #10

Updated by Radoslaw Zarzynski about 1 year ago

Bug scrub comment: bump up.

Actions #11

Updated by Radoslaw Zarzynski about 1 year ago

Bump up + ping.

Actions #12

Updated by Prashant D about 1 year ago

I am able to reproduce this issue on latest main branch. Kindly allow me sometime to investigate it further and fix this issue.

Actions #13

Updated by Christian Rohmann about 1 year ago

Prashant D wrote:

I am able to reproduce this issue on latest main branch. Kindly allow me sometime to investigate it further and fix this issue.

That be awesome - Thanks for looking into this!

This likely is a small thing to fix, but very annoying if you run into it.

Actions #14

Updated by Prashant D about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 50543
Actions #15

Updated by Prashant D 6 months ago

  • Status changed from Fix Under Review to Duplicate
Actions

Also available in: Atom PDF