Bug #43893 (closed): lingering osd_failure ops (due to failure_info holding references?)

Added by Dan van der Ster over 4 years ago. Updated over 3 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags: -
Backport: nautilus,octopus
Regression: Yes
Severity: 3 - minor
Reviewed: -
ceph-qa-suite: -
Component(RADOS): Monitor
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

On Nautilus v14.2.6 we see osd_failure ops which linger:

# ceph --cluster=erin health detail
HEALTH_WARN 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops
SLOW_OPS 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops
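
The op details can be dumped from the monitor's op tracker; a minimal sketch, assuming the mon name from the health output above and that the monitor's `ops` admin-socket command is available:

# ceph --cluster=erin daemon mon.cepherin-mon-7cb9b591e1 ops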

The lingering op is:

        {
            "description": "osd_failure(failed timeout osd.170 [v2:128.142.162.136:6973/1335903,v1:128.142.162.136:6976/1335903] for 23sec e728185 v728185)",
            "initiated_at": "2020-01-30 01:15:34.939494",
            "age": 25454.252783782002,
            "duration": 25454.252815929001,
            "type_data": {
                "events": [
                    {
                        "time": "2020-01-30 01:15:34.939494",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939494",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939641",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939641",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939642",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939676",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939678",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939701",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939702",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939739",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 10919358,
                    "src_is_mon": false,
                    "source": "osd.297 v2:128.142.25.100:6898/3120619",
                    "forwarded_to_leader": false
                }
            }
        },

We can clear that slow op either by restarting mon.cepherin-mon-7cb9b591e1 or by running `ceph osd fail osd.170`.
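
For reference, a sketch of those two workarounds. Restart the affected mon (assuming a systemd deployment; the unit name is an assumption):

# systemctl restart ceph-mon@cepherin-mon-7cb9b591e1

or mark the OSD failed again, as above:

# ceph --cluster=erin osd fail osd.170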

This issue is identical to https://tracker.ceph.com/issues/24531, so presumably that fix was incomplete.


Related issues 2 (0 open, 2 closed)

Related to RADOS - Bug #24531: Mimic MONs have slow/long running ops (Resolved, Joao Eduardo Luis)
Is duplicate of RADOS - Bug #47380: mon: slow ops due to osd_failure (Resolved, Kefu Chai)
