Bug #43893
lingering osd_failure ops (due to failure_info holding references?) [closed]
Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport: nautilus, octopus
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
On Nautilus v14.2.6 we see osd_failure ops that linger:
# ceph --cluster=erin health detail
HEALTH_WARN 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops
SLOW_OPS 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops
The lingering op is:
{
    "description": "osd_failure(failed timeout osd.170 [v2:128.142.162.136:6973/1335903,v1:128.142.162.136:6976/1335903] for 23sec e728185 v728185)",
    "initiated_at": "2020-01-30 01:15:34.939494",
    "age": 25454.252783782002,
    "duration": 25454.252815929001,
    "type_data": {
        "events": [
            { "time": "2020-01-30 01:15:34.939494", "event": "initiated" },
            { "time": "2020-01-30 01:15:34.939494", "event": "header_read" },
            { "time": "0.000000", "event": "throttled" },
            { "time": "0.000000", "event": "all_read" },
            { "time": "0.000000", "event": "dispatched" },
            { "time": "2020-01-30 01:15:34.939641", "event": "mon:_ms_dispatch" },
            { "time": "2020-01-30 01:15:34.939641", "event": "mon:dispatch_op" },
            { "time": "2020-01-30 01:15:34.939642", "event": "psvc:dispatch" },
            { "time": "2020-01-30 01:15:34.939676", "event": "osdmap:preprocess_query" },
            { "time": "2020-01-30 01:15:34.939678", "event": "osdmap:preprocess_failure" },
            { "time": "2020-01-30 01:15:34.939701", "event": "osdmap:prepare_update" },
            { "time": "2020-01-30 01:15:34.939702", "event": "osdmap:prepare_failure" },
            { "time": "2020-01-30 01:15:34.939739", "event": "no_reply: send routed request" }
        ],
        "info": {
            "seq": 10919358,
            "src_is_mon": false,
            "source": "osd.297 v2:128.142.25.100:6898/3120619",
            "forwarded_to_leader": false
        }
    }
}
We can clear that slow op either by restarting mon.cepherin-mon-7cb9b591e1 or with `ceph osd fail osd.170`.
The symptoms are identical to https://tracker.ceph.com/issues/24531, so presumably that fix was incomplete.