Bug #24531

closed

Mimic MONs have slow/long running ops

Added by Wido den Hollander almost 6 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Joao Eduardo Luis
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When setting up a Mimic 13.2.0 cluster I saw a message like this:

2018-06-15 00:23:55.639369 mon.dc3-f3-cl1-mon1 mon.0 10.16.5.141:6789/0 27880 : cluster [WRN] Health check update: 438 slow ops, oldest one blocked for 32476 sec, daemons [mon.dc3-f3-cl1-mon1,mon.dc3-f4-cl1-mon1,mon.dc3-f5-cl1-mon1] have slow ops. (SLOW_OPS)

Asking this mon for its ops, I see ops that have been running for a very long time:

ceph daemon mon.dc3-f3-cl1-mon1 ops
        {
            "description": "osd_failure(failed timeout osd.266 10.16.5.171:6888/30602 for 1384sec e1380 v1380)",
            "initiated_at": "2018-06-15 13:36:46.593698",
            "age": 9702.180330,
            "duration": 9702.180347,
            "type_data": {
                "events": [
                    {
                        "time": "2018-06-15 13:36:46.593698",
                        "event": "initiated" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.593698",
                        "event": "header_read" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.593699",
                        "event": "throttled" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.593703",
                        "event": "all_read" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599249",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599250",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599250",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599250",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599255",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599255",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599257",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2018-06-15 13:36:46.599257",
                        "event": "osdmap:prepare_failure" 
                    }
                ],
                "info": {
                    "seq": 972,
                    "src_is_mon": false,
                    "source": "osd.247 10.16.5.172:6868/29968",
                    "forwarded_to_leader": false
                }
            }
        },

The current status of this cluster:

  cluster:
    id:     f8348af9-0d79-4ffd-a2ef-07ca83659935
    health: HEALTH_WARN
            No Zabbix server configured

  services:
    mon: 3 daemons, quorum dc3-f3-cl1-mon1,dc3-f4-cl1-mon1,dc3-f5-cl1-mon1
    mgr: dc3-f3-cl1-mon1(active), standbys: dc3-f5-cl1-mon1, dc3-f4-cl1-mon1
    mds: cephfs-2/2/2 up  {0=dc3-f3-cl1-mds1=up:active,1=dc3-f4-cl1-mds1=up:active}, 1 up:standby
    osd: 288 osds: 288 up, 288 in

  data:
    pools:   3 pools, 8256 pgs
    objects: 499.2 k objects, 1.9 TiB
    usage:   14 TiB used, 1.0 PiB / 1.0 PiB avail
    pgs:     8256 active+clean

  io:
    client:   1.1 GiB/s wr, 0 op/s rd, 547 op/s wr
[root@dc3-f3-cl1-mon1 ceph]# ceph osd dump|head -n 10
epoch 1463
fsid f8348af9-0d79-4ffd-a2ef-07ca83659935
created 2018-06-15 12:23:37.382975
modified 2018-06-15 15:50:11.132981
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 510
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
[root@dc3-f3-cl1-mon1 ceph]#

Files

ops.json (15.3 KB) - Wido den Hollander, 06/15/2018 02:18 PM
mon.b.slowops (20.9 KB) - jun gong, 05/22/2019 06:26 AM
mon.c.slowops (13.1 KB) - jun gong, 05/22/2019 06:26 AM
mon.a.slowops (6.53 KB) - jun gong, 05/22/2019 06:26 AM

Related issues: 5 (0 open, 5 closed)

Related to RADOS - Bug #43893: lingering osd_failure ops (due to failure_info holding references?) (Duplicate)
Has duplicate RADOS - Bug #37768: mon gets stuck op for failing OSDs (Duplicate, 12/27/2018)
Copied to RADOS - Backport #41862: nautilus: Mimic MONs have slow/long running ops (Resolved, Prashant D)
Copied to RADOS - Backport #41863: mimic: Mimic MONs have slow/long running ops (Resolved, Prashant D)
Copied to RADOS - Backport #41864: luminous: Mimic MONs have slow/long running ops (Resolved, Nathan Cutler)
Actions #1

Updated by Greg Farnum almost 6 years ago

What's the output of "ceph versions" on this cluster?

We had issues in the lab with OSD failure reports not getting cleaned up properly from that op tracker, but I don't think that particular log has turned up and it's a bit confusing how that could have happened.

Actions #2

Updated by Wido den Hollander almost 6 years ago

{
    "mon": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 3
    },
    "mgr": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 3
    },
    "osd": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 288
    },
    "mds": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 3
    },
    "overall": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 297
    }
}

This cluster was installed today and we started to do some physical tests by pulling disks, pulling power cords, and so on.

Everything recovered just fine, but I saw these messages pop up in the logs and also in 'ceph health'.

Although the message is gone from the status, the ops are still there.
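
For reference, a quick way to confirm that the ops are still sitting in each mon's op tracker is to query the local admin socket on every mon host; a minimal sketch, assuming the mon id matches the short hostname and jq is installed:

  # run on each mon host, against its local admin socket
  ceph daemon mon.$(hostname -s) ops | jq '.num_ops, .ops[].description'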

Actions #3

Updated by Josh Durgin almost 6 years ago

  • Assignee set to Joao Eduardo Luis

Joao, could you take a look at this?

Actions #4

Updated by Hector Martin over 5 years ago

I just hit this on a 13.2.1 single-host cluster with 1 mon and 8 OSDs. The log is basically identical to the one Wido reported. It seems osd.3 flapped and left behind two stuck ops with identical events.

Actions #5

Updated by Josh Durgin over 5 years ago

  • Has duplicate Bug #37768: mon gets stuck op for failing OSDs added
Actions #6

Updated by Josh Durgin over 5 years ago

  • Priority changed from Normal to High
Actions #7

Updated by Paul Emmerich over 5 years ago

I've seen this on a 13.2.2 cluster after restarting OSDs

Actions #8

Updated by Tobias Rehn about 5 years ago

I am also seeing this on the latest mimic (13.2.4). So far it seems like it's cosmetic and has no impact.

[root@mon01 ~]# ceph -s
  cluster:
    id:     7b62928c-3bb4-4196-a7c7-5da4fad63c53
    health: HEALTH_WARN
            4 slow ops, oldest one blocked for 103474 sec, daemons [mon.mon01,mon.mon02,mon.mon03] have slow ops.
Actions #9

Updated by Stig Telfer about 5 years ago

I see the same symptoms on a system running 13.2.2 - each monitor has a small number of slow ops, all initiated within a couple of minutes of one another about six weeks ago.

Actions #10

Updated by Paul Emmerich about 5 years ago

I've now encountered this on a total of 3 different clusters with 13.2.2 and 13.2.4

Actions #11

Updated by Марк Коренберг about 5 years ago

Seems to be the same:

mmarkk@mmwork:/mnt/cephfs/tg/.snap$ ceph -s
  cluster:
    id:     56ed206b-67cf-42a6-be65-9baf32334fc9
    health: HEALTH_WARN
            3 slow ops, oldest one blocked for 150263 sec, mon.node3 has slow ops
{
    "ops": [
        {
            "description": "remove_snaps({52=[2,3]} v0)",
            "initiated_at": "2019-01-29 21:48:01.206747",
            "age": 150508.902028,
            "duration": 150508.902042,
            "type_data": {
                "events": [
                    {
                        "time": "2019-01-29 21:48:01.206747",
                        "event": "initiated" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206747",
                        "event": "header_read" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206749",
                        "event": "throttled" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206754",
                        "event": "all_read" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206785",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206789",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206790",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206791",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206802",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206804",
                        "event": "osdmap:preprocess_remove_snaps" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206810",
                        "event": "forward_request_leader" 
                    },
                    {
                        "time": "2019-01-29 21:48:01.206839",
                        "event": "forwarded" 
                    }
                ],
                "info": {
                    "seq": 1364847,
                    "src_is_mon": false,
                    "source": "mds.0 10.80.20.103:6808/1421988905",
                    "forwarded_to_leader": true
                }
            }
        },
        {
            "description": "remove_snaps({52=[4,5]} v0)",
            "initiated_at": "2019-01-29 21:48:51.202066",
            "age": 150458.906709,
            "duration": 150458.908302,
            "type_data": {
                "events": [
                    {
                        "time": "2019-01-29 21:48:51.202066",
                        "event": "initiated" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202066",
                        "event": "header_read" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202068",
                        "event": "throttled" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202072",
                        "event": "all_read" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202102",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202106",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202107",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202107",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202117",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202118",
                        "event": "osdmap:preprocess_remove_snaps" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202123",
                        "event": "forward_request_leader" 
                    },
                    {
                        "time": "2019-01-29 21:48:51.202151",
                        "event": "forwarded" 
                    }
                ],
                "info": {
                    "seq": 1365062,
                    "src_is_mon": false,
                    "source": "mds.0 10.80.20.103:6808/1421988905",
                    "forwarded_to_leader": true
                }
            }
        },
        {
            "description": "remove_snaps({52=[6,7]} v0)",
            "initiated_at": "2019-01-29 21:58:01.208627",
            "age": 149908.900147,
            "duration": 149908.901793,
            "type_data": {
                "events": [
                    {
                        "time": "2019-01-29 21:58:01.208627",
                        "event": "initiated" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208627",
                        "event": "header_read" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208629",
                        "event": "throttled" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208635",
                        "event": "all_read" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208667",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208671",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208672",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208672",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208682",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208683",
                        "event": "osdmap:preprocess_remove_snaps" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208688",
                        "event": "forward_request_leader" 
                    },
                    {
                        "time": "2019-01-29 21:58:01.208736",
                        "event": "forwarded" 
                    }
                ],
                "info": {
                    "seq": 1367397,
                    "src_is_mon": false,
                    "source": "mds.0 10.80.20.103:6808/1421988905",
                    "forwarded_to_leader": true
                }
            }
        }
    ],
    "num_ops": 3
}
Actions #12

Updated by Марк Коренберг about 5 years ago

I have restarted mon.node3 and now everything is OK.

Actions #13

Updated by Dan van der Ster about 5 years ago

We had this happen twice this week on a v13.2.5 cluster. (The cluster was recently upgraded from v12.2.11, where this never happened before.)

On our cluster it happens like this:

  • some osd flapping glitch (many osds all go down and up for a few minutes)
  • osds resolve themselves, all coming back up, and PGs getting active correctly
  • a few thousand leftover slow requests on the mons that never resolve:
         health: HEALTH_WARN
                5782 slow ops, oldest one blocked for 24106 sec, daemons [mon.cepherin-mon-084bea7b06,mon.cepherin-mon-7cb9b591e1,mon.cepherin1] have slow ops.
    
  • I've dumped the ops on the peon and leader mons.
    • Peon:
                  "description": "osd_alive(want up_thru 313212 have 313212)",
                    ... a few thousand times
                  "description": "osd_alive(want up_thru 313213 have 313213)",
                  "description": "osd_alive(want up_thru 313218 have 313218)",
                  "description": "osd_failure(failed timeout osd.1283 128.142.25.117:6908/2980276 for 10sec e313238 v313238)",
      
    • Leader:
                  "description": "osd_failure(failed timeout osd.483 128.142.212.174:6878/3315556 for 21sec e313146 v313146)",
                  "description": "osd_failure(failed timeout osd.485 128.142.212.174:6825/3315534 for 21sec e313146 v313146)",
                  "description": "osd_failure(failed timeout osd.56 128.142.168.12:6876/337817 for 20sec e313159 v313159)",
                  "description": "osd_failure(failed timeout osd.527 128.142.210.141:6865/3134286 for 23sec e313176 v313176)",
                  "description": "osd_failure(failed timeout osd.264 128.142.162.136:6832/1715260 for 21sec e313159 v313159)",
                  "description": "osd_failure(failed timeout osd.5 128.142.168.12:6849/337797 for 22sec e313154 v313154)",
                  "description": "osd_failure(failed timeout osd.126 128.142.162.75:6885/4002283 for 24sec e313154 v313154)",
                  "description": "osd_failure(failed timeout osd.143 128.142.162.75:6887/4002279 for 50sec e313200 v313200)",
                  "description": "osd_failure(failed timeout osd.236 128.142.162.78:6812/2713226 for 29sec e313157 v313157)",
                  "description": "osd_failure(failed timeout osd.238 128.142.162.78:6943/2713280 for 29sec e313157 v313157)",
                  "description": "osd_failure(failed timeout osd.1248 128.142.25.108:6832/2865592 for 20sec e313140 v313140)",
                  "description": "osd_failure(failed timeout osd.1283 128.142.25.117:6908/2980276 for 10sec e313238 v313238)",
      
    • The leader ops are all stuck in:
                          {
                              "time": "2019-04-16 23:49:42.506169",
                              "event": "osdmap:prepare_failure" 
                          }
      
  • The workaround to remove these mon slow requests is to restart the mons (in any order); see the sketch at the end of this comment.

I've posted all logs here: ceph-post-file: c2082cef-f867-4084-abb8-510e865a6c6e
Leader ops are here: ceph-post-file: e48ff1e2-0256-432b-894a-d84aa7230bf6
Peon ops are here: ceph-post-file: 28e55100-1c1b-4d37-b280-5651d91f28c8
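
As a concrete illustration of the restart workaround above, a minimal sketch assuming systemd-managed mons (unit names follow the standard ceph-mon@<id> pattern; mon names are taken from the health output above; restart one mon at a time and let it rejoin quorum before moving on):

  systemctl restart ceph-mon@cepherin-mon-084bea7b06   # on that mon's host
  ceph quorum_status                                    # confirm it rejoined before restarting the next one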

Actions #14

Updated by jun gong almost 5 years ago

Same problem as Dan van der Ster, on a v13.2.5 cluster, five hours ago.
I restarted osd.0 when the monitor logs showed the oldest slow op (osd failure osd.0 ......), then the monitor logs showed the oldest slow op on osd.1.
Finally, I restarted all 3 OSDs, and then the problem was resolved.

Actions #15

Updated by jun gong almost 5 years ago

The attached files are the three mons' dump_historic_slow_ops output.

I deployed v13.2.5 Ceph with Rook in a Kubernetes cluster. Restarting all three OSDs one by one resolved this problem, so I can't find the OSDs' slow ops any more.

The problem happened at 2019-05-21 09:18, and I restarted the three OSDs in order between 10:00 and 10:30.
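
For anyone collecting the same data, a minimal sketch of how dumps like the attached ones can be produced, assuming access to each mon's local admin socket (mon ids a, b and c as in the attached file names):

  # run on the host of each mon in turn
  ceph daemon mon.a dump_historic_slow_ops > mon.a.slowops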

Actions #16

Updated by Dan van der Ster almost 5 years ago

Joao sent this as a possible fix: https://github.com/ceph/ceph/pull/28177

Actions #17

Updated by Theo O over 4 years ago

I am encountering similar issues on a cluster with all daemons running

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)

with the following ops on the leader mon (with node IPs redacted)

{
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.23 [v2:osd-node1:6817/151142,v1:osd-node1:6818/151142] for 28sec e13553 v13553)",
            "initiated_at": "2019-08-06 11:31:53.163578",
            "age": 1308.2272771569999,
            "duration": 1308.227311912,
            "type_data": {
                "events": [
                    {
                        "time": "2019-08-06 11:31:53.163578",
                        "event": "initiated" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.163578",
                        "event": "header_read" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.163577",
                        "event": "throttled" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.163580",
                        "event": "all_read" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164160",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164162",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164162",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164162",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164168",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164169",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164171",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164172",
                        "event": "osdmap:prepare_failure" 
                    }
                ],
                "info": {
                    "seq": 1290829,
                    "src_is_mon": false,
                    "source": "osd.17 v2:osd-node2:6813/149653",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.24 [v2:osd-node1:6807/151331,v1:osd-node1:6809/151331] for 28sec e13553 v13553)",
            "initiated_at": "2019-08-06 11:31:53.164483",
            "age": 1308.2263727659999,
            "duration": 1308.2265255320001,
            "type_data": {
                "events": [
                    {
                        "time": "2019-08-06 11:31:53.164483",
                        "event": "initiated" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164483",
                        "event": "header_read" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.163582",
                        "event": "throttled" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164500",
                        "event": "all_read" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164512",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164515",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164515",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164516",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164527",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164528",
                        "event": "osdmap:preprocess_failure" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164532",
                        "event": "osdmap:prepare_update" 
                    },
                    {
                        "time": "2019-08-06 11:31:53.164533",
                        "event": "osdmap:prepare_failure" 
                    }
                ],
                "info": {
                    "seq": 1290831,
                    "src_is_mon": false,
                    "source": "osd.17 v2:osd-node2:6813/149653",
                    "forwarded_to_leader": false
                }
            }
        }
    ],
    "num_ops": 2
}
Actions #18

Updated by Josh Durgin over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 28177
Actions #19

Updated by Greg Farnum over 4 years ago

  • Status changed from Fix Under Review to 7
Actions #20

Updated by Greg Farnum over 4 years ago

  • Backport set to nautilus, mimic, luminous
Actions #21

Updated by Neha Ojha over 4 years ago

  • Status changed from 7 to Pending Backport
Actions #22

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #41862: nautilus: Mimic MONs have slow/long running ops added
Actions #23

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #41863: mimic: Mimic MONs have slow/long running ops added
Actions #24

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #41864: luminous: Mimic MONs have slow/long running ops added
Actions #25

Updated by Марк Коренберг over 4 years ago

The same on Nautilus 14.2.3

Actions #26

Updated by Aaron Bassett over 4 years ago

Seeing the same on 14.2.4

Actions #27

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #28

Updated by Neha Ojha about 4 years ago

  • Related to Bug #43893: lingering osd_failure ops (due to failure_info holding references?) added
Actions #29

Updated by Aleksei Gutikov almost 4 years ago

Something similar, Ceph v14.2.0.

{
            "description": "osd_pgtemp(e49509 {2.162=[378,160,56]} v49509)",
            "initiated_at": "2020-05-07 14:01:17.837714",
            "age": 79318.903942007993,
            "duration": 79318.903961598,
            "type_data": {
                "events": [
                    {
                        "time": "2020-05-07 14:01:17.837714",
                        "event": "initiated" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837714",
                        "event": "header_read" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837716",
                        "event": "throttled" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837720",
                        "event": "all_read" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837738",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837740",
                        "event": "mon:_ms_dispatch" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837741",
                        "event": "mon:dispatch_op" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837741",
                        "event": "psvc:dispatch" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837742",
                        "event": "osdmap:preprocess_query" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837763",
                        "event": "forward_request_leader" 
                    },
                    {
                        "time": "2020-05-07 14:01:17.837778",
                        "event": "forwarded" 
                    }
                ],
                "info": {
                    "seq": 882657,
                    "src_is_mon": false,
                    "source": "osd.378 v1:10.240.2.217:6861/288088",
                    "forwarded_to_leader": true
                }
            }
        }

mon status output (for one that has slow ops):

{
    "name": "MSK-SR1-R2-CEPH-HOT1",
    "rank": 1,
    "state": "peon",
    "election_epoch": 62,
    "quorum": [
        0,
        1,
        2
    ],
    "quorum_age": 81311,
    "features": {
        "required_con": "2449958747315912708",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus" 
        ],
        "quorum_con": "4611087854031667199",
        "quorum_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus" 
        ]
    },
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 2,
        "fsid": "ad99506a-05a5-11e8-975e-74d4351a7990",
        "modified": "2020-04-26 19:22:44.160085",
        "created": "2020-04-26 19:13:21.011855",
        "min_mon_release": 14,
        "min_mon_release_name": "nautilus",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus" 
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "MSK-SR1-R1-CEPH-HOT1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v1",
                            "addr": "10.240.2.201:6789",
                            "nonce": 0
                        },
                        {
                            "type": "v2",
                            "addr": "10.240.2.201:3300",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.240.2.201:6789/0",
                "public_addr": "10.240.2.201:6789/0" 
            },
            {
                "rank": 1,
                "name": "MSK-SR1-R2-CEPH-HOT1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v1",
                            "addr": "10.240.2.202:6789",
                            "nonce": 0
                        },
                        {
                            "type": "v2",
                            "addr": "10.240.2.202:3300",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.240.2.202:6789/0",
                "public_addr": "10.240.2.202:6789/0" 
            },
            {
                "rank": 2,
                "name": "MSK-SR1-R3-CEPH-HOT1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.240.2.203:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.240.2.203:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.240.2.203:6789/0",
                "public_addr": "10.240.2.203:6789/0" 
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3ffddff8ffacffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "osd": [
            {
                "features": "0x3ffddff8ffacffff",
                "release": "luminous",
                "num": 200
            }
        ],
        "client": [
            {
                "features": "0x27018fb86aa42ada",
                "release": "jewel",
                "num": 10
            },
            {
                "features": "0x3ffddff8ffacffff",
                "release": "luminous",
                "num": 786
            }
        ]
    }
}

Actions #30

Updated by Daniel Poelzleithner almost 4 years ago

We had this issue yesterday. We had a broken mon cluster which I was able to repair by shutting down all mons, scaling the monmap down to 1, and starting one mon server. After the OSDs went back online, I zapped all mons and let them rejoin the cluster. I had to restart all MDS daemons as well, as they had strange hiccups and also caused SLOW requests.
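
For context, a minimal sketch of one way to shrink the monmap to a single mon, roughly matching the procedure described above (the surviving mon id "a" and the removed ids "b" and "c" are illustrative; all ceph-mon daemons must be stopped first):

  ceph-mon -i a --extract-monmap /tmp/monmap   # dump the current monmap
  monmaptool /tmp/monmap --rm b                # drop the other mons from it
  monmaptool /tmp/monmap --rm c
  ceph-mon -i a --inject-monmap /tmp/monmap    # write the shrunken map back
  systemctl start ceph-mon@a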

Actions #31

Updated by Kefu Chai over 3 years ago

please note, this fix was backported as f0697a9af54bf866572036bd6d582abd5299d0c8

git tag --contains f0697a9af54bf866572036bd6d582abd5299d0c8
v14.2.10
v14.2.5
v14.2.6
v14.2.7
v14.2.8
v14.2.9

It is included in nautilus 14.2.5 and up. If you are still running a release older than 14.2.5, you might want to consider upgrading at least your monitors to 14.2.5.
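
As a quick check before deciding whether to upgrade, the running release of each daemon can be confirmed from the cluster itself, as shown earlier in this thread (ceph mon versions limits the output to the monitors):

  ceph versions        # per-daemon-type release breakdown
  ceph mon versions    # monitors only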
