Bug #56479

open

Cannot automatically recover from slow ops warning on ceph-mon

Added by Yao Ning almost 2 years ago. Updated almost 2 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

cluster:
id: b8e99618-7d99-4b50-b4cc-b633292c0fe3
health: HEALTH_WARN
9 slow ops, oldest one blocked for 196 sec, mon.ceph-test-2 has slow ops

mon.ceph-test-2 keeps showing 9 slow ops until the OSD that was reported as failed is restarted, as seen in the ceph-mon log:

2022-07-06 18:09:32.068 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:37.069 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:42.235 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:47.235 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:52.236 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:09:57.237 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)
2022-07-06 18:10:02.237 7f9ce2212700 -1 mon.ceph-test-2@0(leader) e1 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.1 [v2:192.168.0.104:6800/815752,v1:192.168.0.104:6801/815752] for 15sec e1863 v1863)

How to reproduce?

1) 3 hosts, with 3 OSDs on each host:
hostA: 192.168.0.104
hostB: 192.168.0.22
hostC: 192.168.0.41

2) ceph.conf settings to prevent an OSD from being marked down without enough reporters:
[global]
mon_osd_reporter_subtree_level = host
mon_osd_min_down_reporters = 3

3) On hostA, use iptables to drop network packets from hostC
iptables -I INPUT -s 192.168.0.41/32 -j DROP

4) wait long enough for slow ops to appear in ceph -s

5) stop all osds on hostC

6) remove the iptables rules on hostA
iptables -D INPUT -s 192.168.0.41/32 -j DROP

7) start all osds on hostC

8) the slow ops warning then stays in ceph -s indefinitely

I think the main cause of this problem is that, in OSDMonitor.cc, failure_info is recorded when some OSDs report another OSD's failure, but the reported OSD is not marked down immediately because not enough OSDs have reported it down. Eventually, the reporting OSDs may restart or reboot and lose track of the failed peer; once they come back up, their heartbeats with the previously failed OSD are normal again. So no further event (neither another failure report nor a cancellation) is ever sent to the ceph-mon, and the stale failure_info entry stays on the monitor forever, which keeps the slow-ops warning alive.
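To make that bookkeeping easier to follow, here is a minimal standalone sketch of the mechanism as I understand it. It is not the real OSDMonitor.cc code; the names (PendingFailures, report_failure, cancel_report) and the reporter threshold are illustrative assumptions that mirror mon_osd_min_down_reporters = 3 from the reproducer's ceph.conf.

#include <cstddef>
#include <iostream>
#include <map>
#include <set>

// Simplified stand-in for the monitor's failure_info bookkeeping: pending
// failure reports are kept per target OSD and only go away when enough
// reporters accumulate (target marked down) or when every reporter
// explicitly withdraws its report. If the reporters restart and come back
// with healthy heartbeats, they never send a cancellation, so the entry
// (and the associated slow op) is never erased.
struct PendingFailures {
    // target osd id -> set of osd ids that reported it as failed
    std::map<int, std::set<int>> reports;
    // stand-in for mon_osd_min_down_reporters (3 in the reproducer's ceph.conf)
    const std::size_t min_down_reporters = 3;

    // An OSD reports that `target` failed its heartbeats.
    void report_failure(int target, int reporter) {
        auto& r = reports[target];
        r.insert(reporter);
        if (r.size() >= min_down_reporters) {
            std::cout << "osd." << target << " marked down, entry erased\n";
            reports.erase(target);          // the only "success" path
        }
    }

    // A reporter's heartbeats with `target` recovered, so it withdraws its report.
    void cancel_report(int target, int reporter) {
        auto it = reports.find(target);
        if (it == reports.end()) return;
        it->second.erase(reporter);
        if (it->second.empty())
            reports.erase(it);              // the only "cleanup" path
    }

    bool stale() const { return !reports.empty(); }
};

int main() {
    PendingFailures f;
    // Two OSDs on hostA report osd.1 as failed; 2 < 3, so osd.1 is not marked down.
    f.report_failure(1, 4);
    f.report_failure(1, 5);
    // The reporters restart; afterwards their heartbeats with osd.1 are healthy
    // again, so they never call cancel_report() for the old reports.
    // Nothing ever erases the entry:
    std::cout << "stale failure_info left behind: " << std::boolalpha
              << f.stale() << "\n";   // prints: true
}

In other words, the entry has exactly two exit paths (enough reporters, or all reporters cancelling), and the reproduction steps above arrange for neither to ever happen.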

Actions #1

Updated by Yao Ning almost 2 years ago

  • Subject changed from Cannot recover from blocked ops on ceph-mon to Cannot automatically recover from slow ops warning on ceph-mon
Actions #2

Updated by Yao Ning almost 2 years ago

  • Affected Versions v16.2.9 added
  • Affected Versions deleted (v14.2.20)