Bug #24531 (closed): Mimic MONs have slow/long running ops

Added by Wido den Hollander almost 6 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus, mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When setting up a Mimic 13.2.0 cluster I saw a message like this:

2018-06-15 00:23:55.639369 mon.dc3-f3-cl1-mon1 mon.0 10.16.5.141:6789/0 27880 : cluster [WRN] Health check update: 438 slow ops, oldest one blocked for 32476 sec, daemons [mon.dc3-f3-cl1-mon1,mon.dc3-f4-cl1-mon1,mon.dc3-f5-cl1-mon1] have slow ops. (SLOW_OPS)
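
For reference, the detail behind this warning can also be queried on demand; a minimal check, assuming the standard ceph CLI is available on a monitor node:

ceph health detail    # lists the SLOW_OPS check and which daemons are currently reporting slow ops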

Querying this mon for its ops, I see ops that have been running for a very long time:

ceph daemon mon.dc3-f3-cl1-mon1 ops
        {
            "description": "osd_failure(failed timeout osd.266 10.16.5.171:6888/30602 for 1384sec e1380 v1380)",
            "initiated_at": "2018-06-15 13:36:46.593698",
            "age": 9702.180330,
            "duration": 9702.180347,
            "type_data": {
                "events": [
                    {
                        "time": "2018-06-15 13:36:46.593698",
                        "event": "initiated" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.593698",
                        "event": "header_read" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.593699",
                        "event": "throttled" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.593703",
                        "event": "all_read" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599249",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599250",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599250",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599250",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599255",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599255",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599257",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599257",
                        "event": "osdmap:prepare_failure" 
                    }
                ],
                "info": {
                    "seq": 972,
                    "src_is_mon": false,
                    "source": "osd.247 10.16.5.172:6868/29968",
                    "forwarded_to_leader": false
                }
            }
        },
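
To pick out just the long-running entries from such a dump, a small filter helps; a minimal sketch, assuming jq is installed and that the dump exposes the usual top-level "ops" array as in the excerpt above:

# print age and description of every tracked op older than 30 seconds
ceph daemon mon.dc3-f3-cl1-mon1 ops | \
    jq -r '.ops[] | select(.age > 30) | [.age, .description] | @tsv'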

The current status of this cluster:

  cluster:
    id:     f8348af9-0d79-4ffd-a2ef-07ca83659935
    health: HEALTH_WARN
            No Zabbix server configured

  services:
    mon: 3 daemons, quorum dc3-f3-cl1-mon1,dc3-f4-cl1-mon1,dc3-f5-cl1-mon1
    mgr: dc3-f3-cl1-mon1(active), standbys: dc3-f5-cl1-mon1, dc3-f4-cl1-mon1
    mds: cephfs-2/2/2 up  {0=dc3-f3-cl1-mds1=up:active,1=dc3-f4-cl1-mds1=up:active}, 1 up:standby
    osd: 288 osds: 288 up, 288 in

  data:
    pools:   3 pools, 8256 pgs
    objects: 499.2 k objects, 1.9 TiB
    usage:   14 TiB used, 1.0 PiB / 1.0 PiB avail
    pgs:     8256 active+clean

  io:
    client:   1.1 GiB/s wr, 0 op/s rd, 547 op/s wr
[root@dc3-f3-cl1-mon1 ceph]# ceph osd dump|head -n 10
epoch 1463
fsid f8348af9-0d79-4ffd-a2ef-07ca83659935
created 2018-06-15 12:23:37.382975
modified 2018-06-15 15:50:11.132981
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 510
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
[root@dc3-f3-cl1-mon1 ceph]#

Files

ops.json (15.3 KB) Wido den Hollander, 06/15/2018 02:18 PM
mon.b.slowops (20.9 KB) jun gong, 05/22/2019 06:26 AM
mon.c.slowops (13.1 KB) jun gong, 05/22/2019 06:26 AM
mon.a.slowops (6.53 KB) jun gong, 05/22/2019 06:26 AM

Related issues 5 (0 open, 5 closed)

Related to RADOS - Bug #43893: lingering osd_failure ops (due to failure_info holding references?) (Duplicate)
Has duplicate RADOS - Bug #37768: mon gets stuck op for failing OSDs (Duplicate, 12/27/2018)
Copied to RADOS - Backport #41862: nautilus: Mimic MONs have slow/long running ops (Resolved, Prashant D)
Copied to RADOS - Backport #41863: mimic: Mimic MONs have slow/long running ops (Resolved, Prashant D)
Copied to RADOS - Backport #41864: luminous: Mimic MONs have slow/long running ops (Resolved, Nathan Cutler)
