Project

General

Profile

Actions

Bug #47380

closed

mon: slow ops due to osd_failure

Added by Honggang Yang over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific, octopus, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  cluster:
    id:     fe99c552-5730-44ce-b0e8-09fac4e8b2f5
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set 
            4096 pgs not deep-scrubbed in time
            4096 pgs not scrubbed in time
            2 slow ops, oldest one blocked for 1008320 sec, mon.bjxx-h225 has slow ops 

  services:
    mon: 3 daemons, quorum bjxx-h225,bjpg-h226,bjxx-h227 (age 12d)
    mgr: bjxx-h225(active, since 3w), standbys: bjxx-h226, bjxx-h227
    osd: 48 osds: 48 up (since 12d), 48 in (since 3w) 
         flags noscrub,nodeep-scrub

  data:
    pools:   1 pools, 4096 pgs 
    objects: 22.97M objects, 1.4 TiB 
    usage:   4.3 TiB used, 344 TiB / 348 TiB avail
    pgs:     4096 active+clean

  io: 
    client:   8.5 MiB/s rd, 18 MiB/s wr, 136 op/s rd, 288 op/s wr

mon ops dump:

{
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.15 [v2:10.58.217.44:6859/385022,v1:10.58.217.44:6861/385022] for 79459sec e2386 v2386)",
            "initiated_at": "2020-08-28 22:15:54.270015",
            "age": 1012508.736519877,
            "duration": 1012508.73654485,
            "type_data": {
                "events": [
                    {
                        "time": "2020-08-28 22:15:54.270015",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270015",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270074",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270075",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270075",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270076",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270077",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270089",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270090",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270102",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 1380158,
                    "src_is_mon": false,
                    "source": "osd.12 v2:10.58.217.44:6808/385031",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.15 [v2:10.58.217.44:6859/385022,v1:10.58.217.44:6861/385022] for 722192sec e2386 v2386)",
            "initiated_at": "2020-09-05 08:48:08.826789",
            "age": 369774.17974637699,
            "duration": 369774.17983697599,
            "type_data": {
                "events": [
                    {
                        "time": "2020-09-05 08:48:08.826789",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826789",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826861",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826861",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826862",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826864",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826866",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826886",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826888",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826895",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 9388335,
                    "src_is_mon": false,
                    "source": "osd.21 v2:10.58.217.44:6833/385005",
                    "forwarded_to_leader": false
                }
            }
        }
    ],
    "num_ops": 2
}

Related issues 6 (1 open5 closed)

Related to RADOS - Bug #55750: mon: slow request of very long timeNeed More Info

Actions
Has duplicate RADOS - Bug #43893: lingering osd_failure ops (due to failure_info holding references?)Duplicate

Actions
Copied to RADOS - Backport #49917: octopus: mon: slow ops due to osd_failureResolvedsinguliere _Actions
Copied to RADOS - Backport #49918: pacific: mon: slow ops due to osd_failureResolvedNeha OjhaActions
Copied to RADOS - Backport #49919: nautilus: mon: slow ops due to osd_failureResolvedKefu ChaiActions
Copied to RADOS - Bug #50964: mon: slow ops due to osd_failureResolvedKefu Chai

Actions
Actions #1

Updated by Neha Ojha over 3 years ago

  • Related to Bug #43893: lingering osd_failure ops (due to failure_info holding references?) added
Actions #2

Updated by Neha Ojha over 3 years ago

  • Subject changed from mon.bjxx-h225 has slow ops to mon: slow ops due to osd_failure
Actions #3

Updated by Neha Ojha over 3 years ago

  • Status changed from New to Fix Under Review
  • Backport set to nautilus,octopus
  • Pull request ID set to 37068
Actions #4

Updated by Neha Ojha over 3 years ago

  • Related to deleted (Bug #43893: lingering osd_failure ops (due to failure_info holding references?))
Actions #5

Updated by Neha Ojha over 3 years ago

  • Has duplicate Bug #43893: lingering osd_failure ops (due to failure_info holding references?) added
Actions #6

Updated by Neha Ojha over 3 years ago

  • Priority changed from Normal to Urgent
  • Backport changed from nautilus,octopus to pacific, octopus, nautilus
Actions #7

Updated by Greg Farnum about 3 years ago

Copying my note from https://tracker.ceph.com/issues/43893#note-4

Looking at this ticket again it's not a no_reply issue at all -- the last event is it getting marked as no_reply, but somehow it's still sticking around. That means there's some dangling reference keeping the op alive, and...

Ah-ha! We keep around the failure-reporting ops in the failure_info structure, which I guess keeps them alive in the optracker. Now, normally you'd expect them to not be long-lived, but I'm not certain anything guarantees that and I'm not seeing an actual timeout.

So I think we need to break the reference from OSDMonitor::failure_reporter_t to a MonOpRequestRef. When I last looked, that wasn't trivial for some reason or other, but I'm sure it's achievable.

Actions #8

Updated by Sage Weil about 3 years ago

  • Assignee set to Kefu Chai
Actions #9

Updated by Kefu Chai about 3 years ago

Actions #10

Updated by Kefu Chai about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Pull request ID changed from 37068 to 40033
Actions #11

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49917: octopus: mon: slow ops due to osd_failure added
Actions #12

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49918: pacific: mon: slow ops due to osd_failure added
Actions #13

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49919: nautilus: mon: slow ops due to osd_failure added
Actions #14

Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #15

Updated by Kefu Chai almost 3 years ago

  • Copied to Bug #50964: mon: slow ops due to osd_failure added
Actions #16

Updated by Kefu Chai almost 3 years ago

https://github.com/ceph/ceph/pull/40033 failed to address this issue, i am creating another issue #50964 to track this problem

Actions #17

Updated by Radoslaw Zarzynski almost 2 years ago

  • Related to Bug #55750: mon: slow request of very long time added
Actions

Also available in: Atom PDF