Project

General

Profile

Bug #50964

mon: slow ops due to osd_failure

Added by Kefu Chai 6 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific, octopus, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  cluster:
    id:     fe99c552-5730-44ce-b0e8-09fac4e8b2f5
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set 
            4096 pgs not deep-scrubbed in time
            4096 pgs not scrubbed in time
            2 slow ops, oldest one blocked for 1008320 sec, mon.bjxx-h225 has slow ops 

  services:
    mon: 3 daemons, quorum bjxx-h225,bjpg-h226,bjxx-h227 (age 12d)
    mgr: bjxx-h225(active, since 3w), standbys: bjxx-h226, bjxx-h227
    osd: 48 osds: 48 up (since 12d), 48 in (since 3w) 
         flags noscrub,nodeep-scrub

  data:
    pools:   1 pools, 4096 pgs 
    objects: 22.97M objects, 1.4 TiB 
    usage:   4.3 TiB used, 344 TiB / 348 TiB avail
    pgs:     4096 active+clean

  io: 
    client:   8.5 MiB/s rd, 18 MiB/s wr, 136 op/s rd, 288 op/s wr

mon ops dump:

{
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.15 [v2:10.58.217.44:6859/385022,v1:10.58.217.44:6861/385022] for 79459sec e2386 v2386)",
            "initiated_at": "2020-08-28 22:15:54.270015",
            "age": 1012508.736519877,
            "duration": 1012508.73654485,
            "type_data": {
                "events": [
                    {
                        "time": "2020-08-28 22:15:54.270015",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270015",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270074",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270075",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270075",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270076",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270077",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270089",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270090",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-08-28 22:15:54.270102",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 1380158,
                    "src_is_mon": false,
                    "source": "osd.12 v2:10.58.217.44:6808/385031",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.15 [v2:10.58.217.44:6859/385022,v1:10.58.217.44:6861/385022] for 722192sec e2386 v2386)",
            "initiated_at": "2020-09-05 08:48:08.826789",
            "age": 369774.17974637699,
            "duration": 369774.17983697599,
            "type_data": {
                "events": [
                    {
                        "time": "2020-09-05 08:48:08.826789",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826789",
                        "event": "header_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled" 
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read" 
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826861",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826861",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826862",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826864",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826866",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826886",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826888",
                        "event": "osdmap:prepare_failure" 
                    },
                    {
                        "time": "2020-09-05 08:48:08.826895",
                        "event": "no_reply: send routed request" 
                    }
                ],
                "info": {
                    "seq": 9388335,
                    "src_is_mon": false,
                    "source": "osd.21 v2:10.58.217.44:6833/385005",
                    "forwarded_to_leader": false
                }
            }
        }
    ],
    "num_ops": 2
}

Related issues

Copied from RADOS - Bug #47380: mon: slow ops due to osd_failure Resolved
Copied to RADOS - Backport #50988: nautilus: mon: slow ops due to osd_failure Resolved
Copied to RADOS - Backport #50989: pacific: mon: slow ops due to osd_failure Resolved
Copied to RADOS - Backport #50990: octopus: mon: slow ops due to osd_failure Resolved

History

#1 Updated by Kefu Chai 6 months ago

  • Copied from Bug #47380: mon: slow ops due to osd_failure added

#2 Updated by Kefu Chai 6 months ago

  • Status changed from New to Pending Backport

#3 Updated by Backport Bot 6 months ago

  • Copied to Backport #50988: nautilus: mon: slow ops due to osd_failure added

#4 Updated by Backport Bot 6 months ago

  • Copied to Backport #50989: pacific: mon: slow ops due to osd_failure added

#5 Updated by Backport Bot 6 months ago

  • Copied to Backport #50990: octopus: mon: slow ops due to osd_failure added

#6 Updated by Loïc Dachary 5 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF