Bug #43893

lingering osd_failure ops (due to failure_info holding references?)

Added by Dan van der Ster over 1 year ago. Updated 9 months ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
nautilus,octopus
Regression:
Yes
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On Nautilus v14.2.6 we see osd_failure ops which linger:

# ceph --cluster=erin health detail
HEALTH_WARN 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops
SLOW_OPS 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops

The lingering op is:

        {
            "description": "osd_failure(failed timeout osd.170 [v2:128.142.162.136:6973/1335903,v1:128.142.162.136:6976/1335903] for 23sec e728185 v728185)",
            "initiated_at": "2020-01-30 01:15:34.939494",
            "age": 25454.252783782002,
            "duration": 25454.252815929001,
            "type_data": {
                "events": [
                    {
                        "time": "2020-01-30 01:15:34.939494",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939494",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939641",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939641",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939642",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939676",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939678",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939701",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939702",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939739",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 10919358,
                    "src_is_mon": false,
                    "source": "osd.297 v2:128.142.25.100:6898/3120619",
                    "forwarded_to_leader": false
                }
            }
        },

We can clear that slow op either by restarting mon.cepherin-mon-7cb9b591e1 or with `ceph osd fail osd.170`.

This issue is identical to https://tracker.ceph.com/issues/24531, so presumably that fix was incomplete.


Related issues

Related to RADOS - Bug #24531: Mimic MONs have slow/long running ops (Resolved)
Duplicates RADOS - Bug #47380: mon: slow ops due to osd_failure (Resolved)

History

#1 Updated by Dan van der Ster over 1 year ago

We can clear that slow op either by restarting mon.cepherin-mon-7cb9b591e1 or with `ceph osd fail osd.170`.

too early in the morning.. it's `ceph osd down osd.170`.

#2 Updated by Neha Ojha over 1 year ago

  • Related to Bug #24531: Mimic MONs have slow/long running ops added

#3 Updated by Greg Farnum over 1 year ago

Hmm, that prepare_failure() does look like it's behaving a little differently than some of the regular op flow; we must be losing track of it in there somewhere.

#4 Updated by Greg Farnum about 1 year ago

  • Subject changed from lingering osd_failure ops to lingering osd_failure ops (due to failure_info holding references?)

Looking at this ticket again it's not a no_reply issue at all -- the last event is it getting marked as no_reply, but somehow it's still sticking around. That means there's some dangling reference keeping the op alive, and...

Ah-ha! We keep around the failure-reporting ops in the failure_info structure, which I guess keeps them alive in the optracker. Now, normally you'd expect them to not be long-lived, but I'm not certain anything guarantees that and I'm not seeing an actual timeout.
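
To make the reference-holding concrete, here is a minimal C++ sketch of the pattern being described. This is not the actual Ceph source; the names TrackedOp, FailureReport, OsdFailureTracker, add_report and resolve are illustrative assumptions. The point is only that a long-lived map entry holding a shared reference to a tracked op keeps that op registered with the op tracker until the entry is explicitly erased:

#include <chrono>
#include <map>
#include <memory>
#include <utility>

struct TrackedOp {                        // stands in for a tracked MonOpRequest
  std::chrono::steady_clock::time_point initiated =
      std::chrono::steady_clock::now();   // the op tracker would report this op
                                          // as "slow" once it is old enough
};
using TrackedOpRef = std::shared_ptr<TrackedOp>;

struct FailureReport {                    // stands in for a per-reporter record
  TrackedOpRef op;                        // this reference keeps the op alive
};

struct FailureInfo {                      // stands in for per-target failure state
  std::map<int, FailureReport> reporters; // keyed by reporting OSD id
};

class OsdFailureTracker {                 // stands in for part of the monitor
  std::map<int, FailureInfo> failure_info; // keyed by reported (target) OSD id

public:
  // prepare_failure()-style path: stash the reporting op and return without
  // replying; nothing here ever times the stored reference out on its own.
  void add_report(int target_osd, int reporter_osd, TrackedOpRef op) {
    failure_info[target_osd].reporters[reporter_osd].op = std::move(op);
  }

  // The stored references are only dropped once the pending failure is
  // resolved or cancelled, e.g. when the target OSD is actually marked down,
  // which matches the workaround of `ceph osd down` or a mon restart.
  void resolve(int target_osd) {
    failure_info.erase(target_osd);       // releases the TrackedOpRefs
  }
};

Under this model the op stays visible as a slow op for as long as the failure report is pending, which is consistent with the workaround noted in the description and in comment #1.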

#5 Updated by Neha Ojha about 1 year ago

  • Related to Bug #47380: mon: slow ops due to osd_failure added

#6 Updated by Neha Ojha about 1 year ago

  • Priority changed from Normal to High

#7 Updated by Rafal Wadolowski 12 months ago

Still exists in 14.2.11. When you have some network issues, you eventually end up with SLOW_OPS from osd_failure.
Restarting the affected OSDs or the mon is a good workaround :)

#8 Updated by Neha Ojha 12 months ago

  • Priority changed from High to Urgent

#9 Updated by Neha Ojha 12 months ago

  • Backport set to nautilus,octopus

#10 Updated by Neha Ojha 9 months ago

  • Status changed from New to Duplicate

Let's use https://tracker.ceph.com/issues/47380 to track this.

#11 Updated by Neha Ojha 9 months ago

  • Related to deleted (Bug #47380: mon: slow ops due to osd_failure)

#12 Updated by Neha Ojha 9 months ago

  • Duplicates Bug #47380: mon: slow ops due to osd_failure added
