Bug #47380

mon: slow ops due to osd_failure

Added by Honggang Yang 7 months ago. Updated 29 days ago.

Status: Pending Backport
Priority: Urgent
Assignee:
Category: -
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport: pacific, octopus, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  cluster:
    id:     fe99c552-5730-44ce-b0e8-09fac4e8b2f5
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set 
            4096 pgs not deep-scrubbed in time
            4096 pgs not scrubbed in time
            2 slow ops, oldest one blocked for 1008320 sec, mon.bjxx-h225 has slow ops 

  services:
    mon: 3 daemons, quorum bjxx-h225,bjpg-h226,bjxx-h227 (age 12d)
    mgr: bjxx-h225(active, since 3w), standbys: bjxx-h226, bjxx-h227
    osd: 48 osds: 48 up (since 12d), 48 in (since 3w) 
         flags noscrub,nodeep-scrub

  data:
    pools:   1 pools, 4096 pgs 
    objects: 22.97M objects, 1.4 TiB 
    usage:   4.3 TiB used, 344 TiB / 348 TiB avail
    pgs:     4096 active+clean

  io: 
    client:   8.5 MiB/s rd, 18 MiB/s wr, 136 op/s rd, 288 op/s wr

mon ops dump:

{
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.15 [v2:10.58.217.44:6859/385022,v1:10.58.217.44:6861/385022] for 79459sec e2386 v2386)",
            "initiated_at": "2020-08-28 22:15:54.270015",
            "age": 1012508.736519877,
            "duration": 1012508.73654485,
            "type_data": {
                "events": [
                    {
                        "time": "2020-08-28 22:15:54.270015",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270015",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270074",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270075",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270075",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270076",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270077",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270089",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270090",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270102",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 1380158,
                    "src_is_mon": false,
                    "source": "osd.12 v2:10.58.217.44:6808/385031",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.15 [v2:10.58.217.44:6859/385022,v1:10.58.217.44:6861/385022] for 722192sec e2386 v2386)",
            "initiated_at": "2020-09-05 08:48:08.826789",
            "age": 369774.17974637699,
            "duration": 369774.17983697599,
            "type_data": {
                "events": [
                    {
                        "time": "2020-09-05 08:48:08.826789",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826789",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826861",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826861",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826862",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826864",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826866",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826886",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826888",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826895",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 9388335,
                    "src_is_mon": false,
                    "source": "osd.21 v2:10.58.217.44:6833/385005",
                    "forwarded_to_leader": false
                }
            }
        }
    ],
    "num_ops": 2
}
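A dump like the one above is typically obtained from the monitor's admin socket (e.g. `ceph daemon mon.<name> ops`). As a minimal sketch (not part of this ticket's fix), the JSON can be filtered for ops blocked longer than the complaint threshold; the threshold value and field names below assume the dump format shown above:

```python
import json

# Monitors complain about ops older than mon_op_complaint_time,
# which defaults to 30 seconds (assumption: current default).
SLOW_OP_THRESHOLD = 30.0

def slow_ops(dump: dict, threshold: float = SLOW_OP_THRESHOLD):
    """Return (description, duration) for ops blocked longer than threshold."""
    return [(op["description"], op["duration"])
            for op in dump.get("ops", [])
            if op.get("duration", 0.0) > threshold]

# Minimal example mirroring the structure of the dump above.
dump = {
    "ops": [
        {"description": "osd_failure(failed timeout osd.15 ...)",
         "initiated_at": "2020-08-28 22:15:54.270015",
         "duration": 1012508.73654485},
    ],
    "num_ops": 1,
}

for desc, dur in slow_ops(dump):
    print(f"{dur:.0f}s  {desc}")
```

Both ops in this ticket's dump are more than four days old, far past any complaint threshold, which is why the cluster health shows `slow ops` on the monitor.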

Related issues

Duplicated by RADOS - Bug #43893: lingering osd_failure ops (due to failure_info holding references?) Duplicate
Copied to RADOS - Backport #49917: octopus: mon: slow ops due to osd_failure In Progress
Copied to RADOS - Backport #49918: pacific: mon: slow ops due to osd_failure New
Copied to RADOS - Backport #49919: nautilus: mon: slow ops due to osd_failure Need More Info

History

#1 Updated by Neha Ojha 7 months ago

  • Related to Bug #43893: lingering osd_failure ops (due to failure_info holding references?) added

#2 Updated by Neha Ojha 7 months ago

  • Subject changed from mon.bjxx-h225 has slow ops to mon: slow ops due to osd_failure

#3 Updated by Neha Ojha 5 months ago

  • Status changed from New to Fix Under Review
  • Backport set to nautilus,octopus
  • Pull request ID set to 37068

#4 Updated by Neha Ojha 3 months ago

  • Related to deleted (Bug #43893: lingering osd_failure ops (due to failure_info holding references?))

#5 Updated by Neha Ojha 3 months ago

  • Duplicated by Bug #43893: lingering osd_failure ops (due to failure_info holding references?) added

#6 Updated by Neha Ojha 3 months ago

  • Priority changed from Normal to Urgent
  • Backport changed from nautilus,octopus to pacific, octopus, nautilus

#7 Updated by Greg Farnum 2 months ago

Copying my note from https://tracker.ceph.com/issues/43893#note-4

Looking at this ticket again, it's not a no_reply issue at all -- the last event is the op getting marked no_reply, but somehow it's still sticking around. That means there's some dangling reference keeping the op alive, and...

Ah-ha! We keep the failure-reporting ops around in the failure_info structure, which I guess keeps them alive in the optracker. Now, normally you'd expect them not to be long-lived, but I'm not certain anything guarantees that, and I'm not seeing an actual timeout.

So I think we need to break the reference from OSDMonitor::failure_reporter_t to a MonOpRequestRef. When I last looked, that wasn't trivial for some reason or other, but I'm sure it's achievable.
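The mechanism Greg describes can be illustrated with a toy Python model (the real code is C++ in OSDMonitor; the class names and fields below are stand-ins, not Ceph's actual API): while failure_info holds a strong reference to the op, the optracker never sees it retire, so it is reported as a slow op indefinitely. Storing only the plain report data breaks that reference.

```python
import weakref

class Op:
    """Stand-in for a tracked monitor op (an osd_failure report)."""
    def __init__(self, desc):
        self.desc = desc

class OpTracker:
    """Toy optracker: an op counts as in-flight while any strong
    reference to it exists anywhere in the process."""
    def __init__(self):
        self._ops = weakref.WeakSet()
    def register(self, op):
        self._ops.add(op)
    def in_flight(self):
        return len(self._ops)

tracker = OpTracker()

# Buggy pattern: failure_info holds the op itself, so the tracker
# keeps reporting it forever -- the "slow op" in this ticket.
failure_info = {}
op = Op("osd_failure(failed timeout osd.15 ...)")
tracker.register(op)
failure_info["osd.15"] = op       # strong reference kept indefinitely
del op
assert tracker.in_flight() == 1   # still "in flight": surfaces as a slow op

failure_info.clear()              # dropping the reference retires the op
assert tracker.in_flight() == 0

# Fixed pattern: record only the plain report data; the op itself
# completes and leaves the tracker as soon as dispatch is done.
op = Op("osd_failure(failed timeout osd.16 ...)")
tracker.register(op)
failure_info["osd.16"] = {"reporter": "osd.12", "failed_since": "e2386"}
del op
assert tracker.in_flight() == 0
print("ok")
```

This is only a sketch of the reference-lifetime problem; the actual change landed via the pull request referenced in the history below.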

#8 Updated by Sage Weil about 2 months ago

  • Assignee set to Kefu Chai

#10 Updated by Kefu Chai 29 days ago

  • Status changed from Fix Under Review to Pending Backport
  • Pull request ID changed from 37068 to 40033

#11 Updated by Backport Bot 29 days ago

  • Copied to Backport #49917: octopus: mon: slow ops due to osd_failure added

#12 Updated by Backport Bot 29 days ago

  • Copied to Backport #49918: pacific: mon: slow ops due to osd_failure added

#13 Updated by Backport Bot 29 days ago

  • Copied to Backport #49919: nautilus: mon: slow ops due to osd_failure added
