Bug #43893

lingering osd_failure ops (due to failure_info holding references?)

Added by Dan van der Ster over 1 year ago. Updated 9 months ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
nautilus,octopus
Regression:
Yes
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On Nautilus v14.2.6 we see osd_failure ops which linger:

# ceph --cluster=erin health detail
HEALTH_WARN 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops
SLOW_OPS 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops

The lingering op is:

        {
            "description": "osd_failure(failed timeout osd.170 [v2:128.142.162.136:6973/1335903,v1:128.142.162.136:6976/1335903] for 23sec e728185 v728185)",
            "initiated_at": "2020-01-30 01:15:34.939494",
            "age": 25454.252783782002,
            "duration": 25454.252815929001,
            "type_data": {
                "events": [
                    {
                        "time": "2020-01-30 01:15:34.939494",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939494",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939641",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939641",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939642",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939676",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939678",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939701",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939702",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939739",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 10919358,
                    "src_is_mon": false,
                    "source": "osd.297 v2:128.142.25.100:6898/3120619",
                    "forwarded_to_leader": false
                }
            }
        },

We can clear that slow op either by restarting mon.cepherin-mon-7cb9b591e1 or with `ceph osd fail osd.170`.

This issue is identical to https://tracker.ceph.com/issues/24531, so presumably that fix was incomplete.


Related issues

Related to RADOS - Bug #24531: Mimic MONs have slow/long running ops (Resolved)
Duplicates RADOS - Bug #47380: mon: slow ops due to osd_failure (Resolved)

History

#1 Updated by Dan van der Ster over 1 year ago

We can clear that slow op either by restarting mon.cepherin-mon-7cb9b591e1 or with `ceph osd fail osd.170`.

too early in the morning.. it's `ceph osd down osd.170`.

#2 Updated by Neha Ojha over 1 year ago

  • Related to Bug #24531: Mimic MONs have slow/long running ops added

#3 Updated by Greg Farnum over 1 year ago

Hmm, that prepare_failure() does look like it's behaving a little differently than some of the regular op flow; we must be losing track of it in there somewhere.

#4 Updated by Greg Farnum about 1 year ago

  • Subject changed from lingering osd_failure ops to lingering osd_failure ops (due to failure_info holding references?)

Looking at this ticket again it's not a no_reply issue at all -- the last event is it getting marked as no_reply, but somehow it's still sticking around. That means there's some dangling reference keeping the op alive, and...

Ah-ha! We keep around the failure-reporting ops in the failure_info structure, which I guess keeps them alive in the optracker. Now, normally you'd expect them to not be long-lived, but I'm not certain anything guarantees that and I'm not seeing an actual timeout.
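
To make the reference-holding concrete, here is a minimal C++ sketch of the pattern being described. This is not the actual Ceph source; the names TrackedOp, FailureReport, OsdFailureTracker, add_report and resolve are illustrative assumptions. The point is only that a long-lived map entry holding a shared reference to a tracked op keeps that op registered with the op tracker until the entry is explicitly erased:

#include <chrono>
#include <map>
#include <memory>
#include <utility>

struct TrackedOp {                        // stands in for a tracked MonOpRequest
  std::chrono::steady_clock::time_point initiated =
      std::chrono::steady_clock::now();   // the op tracker would report this op
                                          // as "slow" once it is old enough
};
using TrackedOpRef = std::shared_ptr<TrackedOp>;

struct FailureReport {                    // stands in for a per-reporter record
  TrackedOpRef op;                        // this reference keeps the op alive
};

struct FailureInfo {                      // stands in for per-target failure state
  std::map<int, FailureReport> reporters; // keyed by reporting OSD id
};

class OsdFailureTracker {                 // stands in for part of the monitor
  std::map<int, FailureInfo> failure_info; // keyed by reported (target) OSD id

public:
  // prepare_failure()-style path: stash the reporting op and return without
  // replying; nothing here ever times the stored reference out on its own.
  void add_report(int target_osd, int reporter_osd, TrackedOpRef op) {
    failure_info[target_osd].reporters[reporter_osd].op = std::move(op);
  }

  // The stored references are only dropped once the pending failure is
  // resolved or cancelled, e.g. when the target OSD is actually marked down,
  // which matches the workaround of `ceph osd down` or a mon restart.
  void resolve(int target_osd) {
    failure_info.erase(target_osd);       // releases the TrackedOpRefs
  }
};

Under this model the op stays visible as a slow op for as long as the failure report is pending, which is consistent with the workaround noted in the description and in comment #1.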

#5 Updated by Neha Ojha about 1 year ago

  • Related to Bug #47380: mon: slow ops due to osd_failure added

#6 Updated by Neha Ojha about 1 year ago

  • Priority changed from Normal to High

#7 Updated by Rafal Wadolowski 12 months ago

Still exists in 14.2.11. When you have some network issues, you eventually end up with SLOW_OPS from osd_failure.
Restarting the affected OSDs or the mon is a good workaround :)

#8 Updated by Neha Ojha 12 months ago

  • Priority changed from High to Urgent

#9 Updated by Neha Ojha 12 months ago

  • Backport set to nautilus,octopus

#10 Updated by Neha Ojha 9 months ago

  • Status changed from New to Duplicate

Let's use https://tracker.ceph.com/issues/47380 to track this.

#11 Updated by Neha Ojha 9 months ago

  • Related to deleted (Bug #47380: mon: slow ops due to osd_failure)

#12 Updated by Neha Ojha 9 months ago

  • Duplicates Bug #47380: mon: slow ops due to osd_failure added
