Bug #56479

open

Cannot automatically recover from slow ops warning on ceph-mon

Added by Yao Ning almost 2 years ago. Updated almost 2 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

cluster:
id: b8e99618-7d99-4b50-b4cc-b633292c0fe3
health: HEALTH_WARN
9 slow ops, oldest one blocked for 196 sec, mon.ceph-test-2 has slow ops

mon.ceph-test-2 keeps showing 9 slow ops until the OSD that was reported as failed is restarted, as seen in the ceph-mon log:

2022-07-06 18:09:32.068 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:37.069 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:42.235 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:47.235 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:52.236 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:57.237 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:10:02.237 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)

How to reproduce?

1) 3 hosts, with 3 OSDs on each host:
hostA: 192.168.0.104
hostB: 192.168.0.22
hostC: 192.168.0.41

2) ceph.conf settings to prevent an OSD from being marked down without enough reporters:
[global]
mon_osd_reporter_subtree_level = host
mon_osd_min_down_reporters = 3

3) On hostA, use iptables to drop network packets from hostC
iptables -I INPUT -s 192.168.0.41/32 -j DROP

4) wait long enough for slow ops to appear in ceph -s

5) stop all osds on hostC

6) remove the iptables rules on hostA
iptables -D INPUT -s 192.168.0.41/32 -j DROP

7) start all osds on hostC

8) the slow ops warning then stays in ceph -s indefinitely

I think the main cause of this problem is that, in OSDMonitor.cc, failure_info is recorded when some OSDs report another OSD's failure, but the reported OSD is not marked down immediately because not enough OSDs have reported it down. Eventually, the reporting OSDs may restart or reboot and lose track of the failed peer; once they come back up, their heartbeats with the previously failed OSD are normal again. So no further event (neither another failure report nor a cancellation) is ever sent to the ceph-mon, and the stale failure_info entry stays on the monitor forever, which keeps the slow-ops warning alive.
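To make that bookkeeping easier to follow, here is a minimal standalone sketch of the mechanism as I understand it. It is not the real OSDMonitor.cc code; the names (PendingFailures, report_failure, cancel_report) and the reporter threshold are illustrative assumptions that mirror mon_osd_min_down_reporters = 3 from the reproducer's ceph.conf.

#include <cstddef>
#include <iostream>
#include <map>
#include <set>

// Simplified stand-in for the monitor's failure_info bookkeeping: pending
// failure reports are kept per target OSD and only go away when enough
// reporters accumulate (target marked down) or when every reporter
// explicitly withdraws its report. If the reporters restart and come back
// with healthy heartbeats, they never send a cancellation, so the entry
// (and the associated slow op) is never erased.
struct PendingFailures {
    // target osd id -> set of osd ids that reported it as failed
    std::map<int, std::set<int>> reports;
    // stand-in for mon_osd_min_down_reporters (3 in the reproducer's ceph.conf)
    const std::size_t min_down_reporters = 3;

    // An OSD reports that `target` failed its heartbeats.
    void report_failure(int target, int reporter) {
        auto& r = reports[target];
        r.insert(reporter);
        if (r.size() >= min_down_reporters) {
            std::cout << "osd." << target << " marked down, entry erased\n";
            reports.erase(target);          // the only "success" path
        }
    }

    // A reporter's heartbeats with `target` recovered, so it withdraws its report.
    void cancel_report(int target, int reporter) {
        auto it = reports.find(target);
        if (it == reports.end()) return;
        it->second.erase(reporter);
        if (it->second.empty())
            reports.erase(it);              // the only "cleanup" path
    }

    bool stale() const { return !reports.empty(); }
};

int main() {
    PendingFailures f;
    // Two OSDs on hostA report osd.1 as failed; 2 < 3, so osd.1 is not marked down.
    f.report_failure(1, 4);
    f.report_failure(1, 5);
    // The reporters restart; afterwards their heartbeats with osd.1 are healthy
    // again, so they never call cancel_report() for the old reports.
    // Nothing ever erases the entry:
    std::cout << "stale failure_info left behind: " << std::boolalpha
              << f.stale() << "\n";   // prints: true
}

In other words, the entry has exactly two exit paths (enough reporters, or all reporters cancelling), and the reproduction steps above arrange for neither to ever happen.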

Actions #1

Updated by Yao Ning almost 2 years ago

  • Subject changed from Cannot recover from blocked ops on ceph-mon to Cannot automatically recover from slow ops warning on ceph-mon
Actions #2

Updated by Yao Ning almost 2 years ago

  • Affected Versions v16.2.9 added
  • Affected Versions deleted (v14.2.20)