Bug #43893 (closed)

lingering osd_failure ops (due to failure_info holding references?)

Added by Dan van der Ster about 4 years ago. Updated over 3 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
nautilus,octopus
Regression:
Yes
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On Nautilus v14.2.6 we see osd_failure ops which linger:

# ceph --cluster=erin health detail
HEALTH_WARN 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops
SLOW_OPS 1 slow ops, oldest one blocked for 25202 sec, mon.cepherin-mon-7cb9b591e1 has slow ops

The lingering op is:

        {
            "description": "osd_failure(failed timeout osd.170 [v2:128.142.162.136:6973/1335903,v1:128.142.162.136:6976/1335903] for 23sec e728185 v728185)",
            "initiated_at": "2020-01-30 01:15:34.939494",
            "age": 25454.252783782002,
            "duration": 25454.252815929001,
            "type_data": {
                "events": [
                    {
                        "time": "2020-01-30 01:15:34.939494",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939494",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939641",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939641",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939642",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939676",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939678",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939701",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939702",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-01-30 01:15:34.939739",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 10919358,
                    "src_is_mon": false,
                    "source": "osd.297 v2:128.142.25.100:6898/3120619",
                    "forwarded_to_leader": false
                }
            }
        },
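
For reference, in-flight ops like the one above can be dumped from the affected monitor's admin socket:

# ceph daemon mon.cepherin-mon-7cb9b591e1 ops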

We can clear that slow op either by restarting mon.cepherin-mon-7cb9b591e1 or with `ceph osd fail osd.170`.

This issue is identical to https://tracker.ceph.com/issues/24531, so presumably that fix was incomplete.


Related issues 2 (0 open, 2 closed)

Related to RADOS - Bug #24531: Mimic MONs have slow/long running ops (Resolved, Joao Eduardo Luis)

Is duplicate of RADOS - Bug #47380: mon: slow ops due to osd_failure (Resolved, Kefu Chai)

Actions #1

Updated by Dan van der Ster about 4 years ago

We can clear that slow op either by restarting mon.cepherin-mon-7cb9b591e1 or with `ceph osd fail osd.170`.

Too early in the morning... it's `ceph osd down osd.170`.

Actions #2

Updated by Neha Ojha about 4 years ago

  • Related to Bug #24531: Mimic MONs have slow/long running ops added
Actions #3

Updated by Greg Farnum about 4 years ago

Hmm, that prepare_failure() does look like it's behaving a little differently from the regular op flow; we must be losing track of it in there somewhere.

Actions #4

Updated by Greg Farnum over 3 years ago

  • Subject changed from lingering osd_failure ops to lingering osd_failure ops (due to failure_info holding references?)

Looking at this ticket again, it's not a no_reply issue at all -- the last event is it getting marked as no_reply, but somehow it's still sticking around. That means there's some dangling reference keeping the op alive, and...

Ah-ha! We keep the failure-reporting ops around in the failure_info structure, which I guess keeps them alive in the optracker. Now, normally you'd expect them not to be long-lived, but I'm not certain anything guarantees that, and I'm not seeing an actual timeout.
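
To illustrate that hypothesis, here is a minimal, self-contained sketch (hypothetical types -- SimpleOpTracker, Op, and the failure_info map are stand-ins, not the actual Ceph OSDMonitor/OpTracker code) of how stashing the reporting op's reference in a failure_info-style map keeps the op registered as in flight until the entry is explicitly erased:

    // Sketch only: shows how a stored shared_ptr keeps an op "in flight"
    // in a tracker until the failure_info-style entry is erased.
    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    struct Op {
        std::string desc;
    };
    using OpRef = std::shared_ptr<Op>;

    // Stand-in for the monitor's op tracker: an op is reported as in flight
    // for as long as someone still holds a strong reference to it.
    struct SimpleOpTracker {
        std::vector<std::weak_ptr<Op>> registered;

        OpRef create(const std::string& desc) {
            auto op = std::make_shared<Op>(Op{desc});
            registered.push_back(op);
            return op;
        }

        void dump_in_flight() const {
            std::cout << "--- in-flight ops ---\n";
            for (const auto& w : registered) {
                if (auto op = w.lock()) {
                    std::cout << op->desc << "\n";
                }
            }
        }
    };

    int main() {
        SimpleOpTracker tracker;

        // Analogue of the failure_info structure: failed osd id ->
        // pending failure-report ops, each holding a strong reference.
        std::map<int, std::vector<OpRef>> failure_info;

        {
            OpRef report =
                tracker.create("osd_failure(failed timeout osd.170 ... for 23sec)");
            // prepare_failure()-style bookkeeping: stash the reporting op
            // until the failure is either committed or cancelled.
            failure_info[170].push_back(report);
        }  // local reference dropped here, but the map entry still owns one

        tracker.dump_in_flight();  // the osd_failure op is still in flight

        // Only when the stored reference is dropped does the op go away.
        failure_info.erase(170);
        tracker.dump_in_flight();  // nothing left

        return 0;
    }

If that is the mechanism, the op only completes once the corresponding failure_info entry is cleared, which matches the observation that restarting the mon or marking the reported OSD down makes the slow op disappear.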

Actions #5

Updated by Neha Ojha over 3 years ago

  • Related to Bug #47380: mon: slow ops due to osd_failure added
Actions #6

Updated by Neha Ojha over 3 years ago

  • Priority changed from Normal to High
Actions #7

Updated by Rafal Wadolowski over 3 years ago

This still exists in 14.2.11. When you have network issues, you eventually end up with SLOW_OPS from osd_failure.
Restarting the affected OSDs or the mon is a good workaround :)

Actions #8

Updated by Neha Ojha over 3 years ago

  • Priority changed from High to Urgent
Actions #9

Updated by Neha Ojha over 3 years ago

  • Backport set to nautilus,octopus
Actions #10

Updated by Neha Ojha over 3 years ago

  • Status changed from New to Duplicate

Let's use https://tracker.ceph.com/issues/47380 to track this.

Actions #11

Updated by Neha Ojha over 3 years ago

  • Related to deleted (Bug #47380: mon: slow ops due to osd_failure)
Actions #12

Updated by Neha Ojha over 3 years ago

  • Is duplicate of Bug #47380: mon: slow ops due to osd_failure added