Bug #24531
Mimic MONs have slow/long running ops (Status: Closed)
Description
When setting up a Mimic 13.2.0 cluster I saw a message like this:
2018-06-15 00:23:55.639369 mon.dc3-f3-cl1-mon1 mon.0 10.16.5.141:6789/0 27880 : cluster [WRN] Health check update: 438 slow ops, oldest one blocked for 32476 sec, daemons [mon.dc3-f3-cl1-mon1,mon.dc3-f4-cl1-mon1,mon.dc3-f5-cl1-mon1] have slow ops. (SLOW_OPS)
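As an aside for anyone alerting on this: the op count and oldest-blocked age can be pulled out of a SLOW_OPS summary line like the one above with a simple regex. A minimal sketch, assuming the exact wording of the message above:

```python
import re

SLOW_OPS_RE = re.compile(
    r"(?P<count>\d+) slow ops, oldest one blocked for (?P<oldest>\d+) sec"
)

def parse_slow_ops(line):
    """Extract (num_slow_ops, oldest_blocked_sec) from a SLOW_OPS health line."""
    m = SLOW_OPS_RE.search(line)
    return (int(m["count"]), int(m["oldest"])) if m else None

line = ("Health check update: 438 slow ops, oldest one blocked for 32476 sec, "
        "daemons [mon.dc3-f3-cl1-mon1] have slow ops. (SLOW_OPS)")
print(parse_slow_ops(line))  # (438, 32476)
```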
Querying this mon's ops, I see ops which have been running for a very long time:
ceph daemon mon.dc3-f3-cl1-mon1 ops
{
  "description": "osd_failure(failed timeout osd.266 10.16.5.171:6888/30602 for 1384sec e1380 v1380)",
  "initiated_at": "2018-06-15 13:36:46.593698",
  "age": 9702.180330,
  "duration": 9702.180347,
  "type_data": {
    "events": [
      { "time": "2018-06-15 13:36:46.593698", "event": "initiated" },
      { "time": "2018-06-15 13:36:46.593698", "event": "header_read" },
      { "time": "2018-06-15 13:36:46.593699", "event": "throttled" },
      { "time": "2018-06-15 13:36:46.593703", "event": "all_read" },
      { "time": "2018-06-15 13:36:46.599249", "event": "dispatched" },
      { "time": "2018-06-15 13:36:46.599250", "event": "mon:_ms_dispatch" },
      { "time": "2018-06-15 13:36:46.599250", "event": "mon:dispatch_op" },
      { "time": "2018-06-15 13:36:46.599250", "event": "psvc:dispatch" },
      { "time": "2018-06-15 13:36:46.599255", "event": "osdmap:preprocess_query" },
      { "time": "2018-06-15 13:36:46.599255", "event": "osdmap:preprocess_failure" },
      { "time": "2018-06-15 13:36:46.599257", "event": "osdmap:prepare_update" },
      { "time": "2018-06-15 13:36:46.599257", "event": "osdmap:prepare_failure" }
    ],
    "info": {
      "seq": 972,
      "src_is_mon": false,
      "source": "osd.247 10.16.5.172:6868/29968",
      "forwarded_to_leader": false
    }
  }
},
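For triage, the `ops` dump can be filtered programmatically rather than by eye. A minimal sketch, assuming the JSON shape shown above (the 3600-second threshold and the trimmed sample data are arbitrary examples):

```python
def stuck_ops(ops_dump, min_age_sec=3600.0):
    """Return (description, age, last_event) for ops older than min_age_sec."""
    result = []
    for op in ops_dump.get("ops", []):
        if op["age"] < min_age_sec:
            continue
        events = op["type_data"]["events"]
        last_event = events[-1]["event"] if events else None
        result.append((op["description"], op["age"], last_event))
    return result

# Trimmed-down hypothetical op in the shape of the dump above:
sample = {
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.266 ...)",
            "age": 9702.180330,
            "type_data": {
                "events": [
                    {"event": "initiated"},
                    {"event": "osdmap:prepare_failure"},
                ]
            },
        }
    ]
}
for desc, age, last in stuck_ops(sample):
    print(f"{age:>10.1f}s  last={last}  {desc}")
```

Feeding it the real output of `ceph daemon mon.<id> ops` (parsed with `json.loads`) would list every op blocked past the threshold, together with the event it stalled at.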
The current status of this cluster:
  cluster:
    id:     f8348af9-0d79-4ffd-a2ef-07ca83659935
    health: HEALTH_WARN
            No Zabbix server configured

  services:
    mon: 3 daemons, quorum dc3-f3-cl1-mon1,dc3-f4-cl1-mon1,dc3-f5-cl1-mon1
    mgr: dc3-f3-cl1-mon1(active), standbys: dc3-f5-cl1-mon1, dc3-f4-cl1-mon1
    mds: cephfs-2/2/2 up {0=dc3-f3-cl1-mds1=up:active,1=dc3-f4-cl1-mds1=up:active}, 1 up:standby
    osd: 288 osds: 288 up, 288 in

  data:
    pools:   3 pools, 8256 pgs
    objects: 499.2 k objects, 1.9 TiB
    usage:   14 TiB used, 1.0 PiB / 1.0 PiB avail
    pgs:     8256 active+clean

  io:
    client: 1.1 GiB/s wr, 0 op/s rd, 547 op/s wr
[root@dc3-f3-cl1-mon1 ceph]# ceph osd dump | head -n 10
epoch 1463
fsid f8348af9-0d79-4ffd-a2ef-07ca83659935
created 2018-06-15 12:23:37.382975
modified 2018-06-15 15:50:11.132981
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 510
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
[root@dc3-f3-cl1-mon1 ceph]#
Updated by Greg Farnum almost 6 years ago
What's the output of "ceph versions" on this cluster?
We had issues in the lab with OSD failure reports not getting cleaned up properly from that op tracker, but I don't think that particular log has turned up and it's a bit confusing how that could have happened.
Updated by Wido den Hollander almost 6 years ago
{
    "mon": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 3
    },
    "mgr": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 3
    },
    "osd": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 288
    },
    "mds": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 3
    },
    "overall": {
        "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)": 297
    }
}
This cluster was installed today and we started doing some physical tests by pulling disks, pulling power cords, etc.
Everything recovered just fine, but I saw these messages pop up in the logs and also in 'ceph health'.
Although the message is gone from the status, the ops are still there.
Updated by Josh Durgin almost 6 years ago
- Assignee set to Joao Eduardo Luis
Joao, could you take a look at this?
Updated by Hector Martin over 5 years ago
I just hit this on a 13.2.1 single-host cluster with 1 mon and 8 OSDs. The log is basically identical to the one Wido reported. It seems osd.3 flapped and left behind two stuck ops with identical events.
Updated by Josh Durgin over 5 years ago
- Has duplicate Bug #37768: mon gets stuck op for failing OSDs added
Updated by Paul Emmerich over 5 years ago
I've seen this on a 13.2.2 cluster after restarting OSDs.
Updated by Tobias Rehn about 5 years ago
I am also seeing this on the latest Mimic (13.2.4). So far it seems to be cosmetic, with no impact.
[root@mon01 ~]# ceph -s
cluster:
id: 7b62928c-3bb4-4196-a7c7-5da4fad63c53
health: HEALTH_WARN
4 slow ops, oldest one blocked for 103474 sec, daemons [mon.mon01,mon.mon02,mon.mon03] have slow ops.
Updated by Stig Telfer about 5 years ago
I see the same symptoms on a system running 13.2.2 - each monitor has a small number of slow ops, all initiated within a couple of minutes of one another about six weeks ago.
Updated by Paul Emmerich about 5 years ago
I've now encountered this on a total of 3 different clusters with 13.2.2 and 13.2.4.
Updated by Марк Коренберг about 5 years ago
Seems to be the same issue:
mmarkk@mmwork:/mnt/cephfs/tg/.snap$ ceph -s
  cluster:
    id:     56ed206b-67cf-42a6-be65-9baf32334fc9
    health: HEALTH_WARN
            3 slow ops, oldest one blocked for 150263 sec, mon.node3 has slow ops
{
  "ops": [
    {
      "description": "remove_snaps({52=[2,3]} v0)",
      "initiated_at": "2019-01-29 21:48:01.206747",
      "age": 150508.902028,
      "duration": 150508.902042,
      "type_data": {
        "events": [
          { "time": "2019-01-29 21:48:01.206747", "event": "initiated" },
          { "time": "2019-01-29 21:48:01.206747", "event": "header_read" },
          { "time": "2019-01-29 21:48:01.206749", "event": "throttled" },
          { "time": "2019-01-29 21:48:01.206754", "event": "all_read" },
          { "time": "2019-01-29 21:48:01.206785", "event": "dispatched" },
          { "time": "2019-01-29 21:48:01.206789", "event": "mon:_ms_dispatch" },
          { "time": "2019-01-29 21:48:01.206790", "event": "mon:dispatch_op" },
          { "time": "2019-01-29 21:48:01.206791", "event": "psvc:dispatch" },
          { "time": "2019-01-29 21:48:01.206802", "event": "osdmap:preprocess_query" },
          { "time": "2019-01-29 21:48:01.206804", "event": "osdmap:preprocess_remove_snaps" },
          { "time": "2019-01-29 21:48:01.206810", "event": "forward_request_leader" },
          { "time": "2019-01-29 21:48:01.206839", "event": "forwarded" }
        ],
        "info": {
          "seq": 1364847,
          "src_is_mon": false,
          "source": "mds.0 10.80.20.103:6808/1421988905",
          "forwarded_to_leader": true
        }
      }
    },
    {
      "description": "remove_snaps({52=[4,5]} v0)",
      "initiated_at": "2019-01-29 21:48:51.202066",
      "age": 150458.906709,
      "duration": 150458.908302,
      "type_data": {
        "events": [
          { "time": "2019-01-29 21:48:51.202066", "event": "initiated" },
          { "time": "2019-01-29 21:48:51.202066", "event": "header_read" },
          { "time": "2019-01-29 21:48:51.202068", "event": "throttled" },
          { "time": "2019-01-29 21:48:51.202072", "event": "all_read" },
          { "time": "2019-01-29 21:48:51.202102", "event": "dispatched" },
          { "time": "2019-01-29 21:48:51.202106", "event": "mon:_ms_dispatch" },
          { "time": "2019-01-29 21:48:51.202107", "event": "mon:dispatch_op" },
          { "time": "2019-01-29 21:48:51.202107", "event": "psvc:dispatch" },
          { "time": "2019-01-29 21:48:51.202117", "event": "osdmap:preprocess_query" },
          { "time": "2019-01-29 21:48:51.202118", "event": "osdmap:preprocess_remove_snaps" },
          { "time": "2019-01-29 21:48:51.202123", "event": "forward_request_leader" },
          { "time": "2019-01-29 21:48:51.202151", "event": "forwarded" }
        ],
        "info": {
          "seq": 1365062,
          "src_is_mon": false,
          "source": "mds.0 10.80.20.103:6808/1421988905",
          "forwarded_to_leader": true
        }
      }
    },
    {
      "description": "remove_snaps({52=[6,7]} v0)",
      "initiated_at": "2019-01-29 21:58:01.208627",
      "age": 149908.900147,
      "duration": 149908.901793,
      "type_data": {
        "events": [
          { "time": "2019-01-29 21:58:01.208627", "event": "initiated" },
          { "time": "2019-01-29 21:58:01.208627", "event": "header_read" },
          { "time": "2019-01-29 21:58:01.208629", "event": "throttled" },
          { "time": "2019-01-29 21:58:01.208635", "event": "all_read" },
          { "time": "2019-01-29 21:58:01.208667", "event": "dispatched" },
          { "time": "2019-01-29 21:58:01.208671", "event": "mon:_ms_dispatch" },
          { "time": "2019-01-29 21:58:01.208672", "event": "mon:dispatch_op" },
          { "time": "2019-01-29 21:58:01.208672", "event": "psvc:dispatch" },
          { "time": "2019-01-29 21:58:01.208682", "event": "osdmap:preprocess_query" },
          { "time": "2019-01-29 21:58:01.208683", "event": "osdmap:preprocess_remove_snaps" },
          { "time": "2019-01-29 21:58:01.208688", "event": "forward_request_leader" },
          { "time": "2019-01-29 21:58:01.208736", "event": "forwarded" }
        ],
        "info": {
          "seq": 1367397,
          "src_is_mon": false,
          "source": "mds.0 10.80.20.103:6808/1421988905",
          "forwarded_to_leader": true
        }
      }
    }
  ],
  "num_ops": 3
}
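A recurring pattern in these reports is that the stuck ops share the same final event ("forwarded" in the dump above, "osdmap:prepare_failure" in the earlier one). Grouping an ops dump by its last event makes that pattern visible at a glance. A minimal sketch, assuming the same JSON shape as the dumps in this ticket (the sample data below is a trimmed-down stand-in):

```python
from collections import Counter

def last_event_histogram(ops_dump):
    """Count slow ops by the last event they reached before stalling."""
    hist = Counter()
    for op in ops_dump.get("ops", []):
        events = op["type_data"]["events"]
        hist[events[-1]["event"] if events else "(no events)"] += 1
    return hist

# Trimmed-down stand-in for the dump above: three remove_snaps ops,
# all ending at "forwarded".
sample = {
    "ops": [
        {"description": f"remove_snaps({{52=[{2 * i},{2 * i + 1}]}} v0)",
         "type_data": {"events": [{"event": "initiated"},
                                  {"event": "forward_request_leader"},
                                  {"event": "forwarded"}]}}
        for i in range(1, 4)
    ],
    "num_ops": 3,
}
print(last_event_histogram(sample))
```

On the dump above this prints a single bucket, `forwarded: 3`, which is the signature of ops forwarded to the leader and never cleaned up.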
Updated by Марк Коренберг about 5 years ago
I have restarted mon.node3 and now everything is OK.
Updated by Dan van der Ster about 5 years ago
We had this happen twice this week on a v13.2.5 cluster. (The cluster was recently upgraded from v12.2.11, where this never happened before.)
On our cluster it happens like this:
- some osd flapping glitch (many osds all go down and up for a few minutes)
- osds resolve themselves, all coming back up, and PGs getting active correctly
- a few thousand leftover slow requests on the mons that never resolve:
health: HEALTH_WARN 5782 slow ops, oldest one blocked for 24106 sec, daemons [mon.cepherin-mon-084bea7b06,mon.cepherin-mon-7cb9b591e1,mon.cepherin1] have slow ops.
- I've dumped the ops on the peon and leader mons.
- Peon:
"description": "osd_alive(want up_thru 313212 have 313212)",
... a few thousand more osd_alive entries like this ...
"description": "osd_alive(want up_thru 313213 have 313213)",
"description": "osd_alive(want up_thru 313218 have 313218)",
"description": "osd_failure(failed timeout osd.1283 128.142.25.117:6908/2980276 for 10sec e313238 v313238)",
- Leader:
"description": "osd_failure(failed timeout osd.483 128.142.212.174:6878/3315556 for 21sec e313146 v313146)",
"description": "osd_failure(failed timeout osd.485 128.142.212.174:6825/3315534 for 21sec e313146 v313146)",
"description": "osd_failure(failed timeout osd.56 128.142.168.12:6876/337817 for 20sec e313159 v313159)",
"description": "osd_failure(failed timeout osd.527 128.142.210.141:6865/3134286 for 23sec e313176 v313176)",
"description": "osd_failure(failed timeout osd.264 128.142.162.136:6832/1715260 for 21sec e313159 v313159)",
"description": "osd_failure(failed timeout osd.5 128.142.168.12:6849/337797 for 22sec e313154 v313154)",
"description": "osd_failure(failed timeout osd.126 128.142.162.75:6885/4002283 for 24sec e313154 v313154)",
"description": "osd_failure(failed timeout osd.143 128.142.162.75:6887/4002279 for 50sec e313200 v313200)",
"description": "osd_failure(failed timeout osd.236 128.142.162.78:6812/2713226 for 29sec e313157 v313157)",
"description": "osd_failure(failed timeout osd.238 128.142.162.78:6943/2713280 for 29sec e313157 v313157)",
"description": "osd_failure(failed timeout osd.1248 128.142.25.108:6832/2865592 for 20sec e313140 v313140)",
"description": "osd_failure(failed timeout osd.1283 128.142.25.117:6908/2980276 for 10sec e313238 v313238)",
- The leader ops are all stuck in:
{ "time": "2019-04-16 23:49:42.506169", "event": "osdmap:prepare_failure" }
- The workaround to remove these slow mon requests is to restart the mons (in any order).
I've posted all logs here: ceph-post-file: c2082cef-f867-4084-abb8-510e865a6c6e
Leader ops are here: ceph-post-file: e48ff1e2-0256-432b-894a-d84aa7230bf6
Peon ops are here: ceph-post-file: 28e55100-1c1b-4d37-b280-5651d91f28c8
Updated by jun gong almost 5 years ago
Same problem as Dan van der Ster, on a v13.2.5 cluster, five hours ago.
I restarted osd.0 when the monitor logs showed the oldest slow op (osd_failure osd.0 ......); the monitor logs then showed the oldest slow op on osd.1.
Finally, I restarted all 3 OSDs, and the problem was resolved.
Updated by jun gong almost 5 years ago
- File mon.a.slowops mon.a.slowops added
- File mon.b.slowops mon.b.slowops added
- File mon.c.slowops mon.c.slowops added
The attached files are the three mons' dump_historic_slow_ops output.
I deployed v13.2.5 Ceph with Rook in a Kubernetes cluster. Restarting all three OSDs one by one resolved the problem, so I couldn't capture the OSDs' slow ops.
The problem happened at 2019-05-21 09:18, and I restarted the three OSDs in order between 10:00 and 10:30.
Updated by Dan van der Ster almost 5 years ago
Joao sent this as a possible fix: https://github.com/ceph/ceph/pull/28177
Updated by Theo O over 4 years ago
I am encountering similar issues on a cluster with all daemons running
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
with the following ops on the leader mon (node IPs redacted):
{
"ops": [
{
"description": "osd_failure(failed timeout osd.23 [v2:osd-node1:6817/151142,v1:osd-node1:6818/151142] for 28sec e13553 v13553)",
"initiated_at": "2019-08-06 11:31:53.163578",
"age": 1308.2272771569999,
"duration": 1308.227311912,
"type_data": {
"events": [
{
"time": "2019-08-06 11:31:53.163578",
"event": "initiated"
},
{
"time": "2019-08-06 11:31:53.163578",
"event": "header_read"
},
{
"time": "2019-08-06 11:31:53.163577",
"event": "throttled"
},
{
"time": "2019-08-06 11:31:53.163580",
"event": "all_read"
},
{
"time": "2019-08-06 11:31:53.164160",
"event": "dispatched"
},
{
"time": "2019-08-06 11:31:53.164162",
"event": "mon:_ms_dispatch"
},
{
"time": "2019-08-06 11:31:53.164162",
"event": "mon:dispatch_op"
},
{
"time": "2019-08-06 11:31:53.164162",
"event": "psvc:dispatch"
},
{
"time": "2019-08-06 11:31:53.164168",
"event": "osdmap:preprocess_query"
},
{
"time": "2019-08-06 11:31:53.164169",
"event": "osdmap:preprocess_failure"
},
{
"time": "2019-08-06 11:31:53.164171",
"event": "osdmap:prepare_update"
},
{
"time": "2019-08-06 11:31:53.164172",
"event": "osdmap:prepare_failure"
}
],
"info": {
"seq": 1290829,
"src_is_mon": false,
"source": "osd.17 v2:osd-node2:6813/149653",
"forwarded_to_leader": false
}
}
},
{
"description": "osd_failure(failed timeout osd.24 [v2:osd-node1:6807/151331,v1:osd-node1:6809/151331] for 28sec e13553 v13553)",
"initiated_at": "2019-08-06 11:31:53.164483",
"age": 1308.2263727659999,
"duration": 1308.2265255320001,
"type_data": {
"events": [
{
"time": "2019-08-06 11:31:53.164483",
"event": "initiated"
},
{
"time": "2019-08-06 11:31:53.164483",
"event": "header_read"
},
{
"time": "2019-08-06 11:31:53.163582",
"event": "throttled"
},
{
"time": "2019-08-06 11:31:53.164500",
"event": "all_read"
},
{
"time": "2019-08-06 11:31:53.164512",
"event": "dispatched"
},
{
"time": "2019-08-06 11:31:53.164515",
"event": "mon:_ms_dispatch"
},
{
"time": "2019-08-06 11:31:53.164515",
"event": "mon:dispatch_op"
},
{
"time": "2019-08-06 11:31:53.164516",
"event": "psvc:dispatch"
},
{
"time": "2019-08-06 11:31:53.164527",
"event": "osdmap:preprocess_query"
},
{
"time": "2019-08-06 11:31:53.164528",
"event": "osdmap:preprocess_failure"
},
{
"time": "2019-08-06 11:31:53.164532",
"event": "osdmap:prepare_update"
},
{
"time": "2019-08-06 11:31:53.164533",
"event": "osdmap:prepare_failure"
}
],
"info": {
"seq": 1290831,
"src_is_mon": false,
"source": "osd.17 v2:osd-node2:6813/149653",
"forwarded_to_leader": false
}
}
}
],
"num_ops": 2
}
Updated by Josh Durgin over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 28177
Updated by Greg Farnum over 4 years ago
- Status changed from Fix Under Review to 7
Updated by Greg Farnum over 4 years ago
- Backport set to nautilus, mimic, luminous
Updated by Neha Ojha over 4 years ago
- Status changed from 7 to Pending Backport
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41862: nautilus: Mimic MONs have slow/long running ops added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41863: mimic: Mimic MONs have slow/long running ops added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #41864: luminous: Mimic MONs have slow/long running ops added
Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
Updated by Neha Ojha about 4 years ago
- Related to Bug #43893: lingering osd_failure ops (due to failure_info holding references?) added
Updated by Aleksei Gutikov almost 4 years ago
Something similar on Ceph v14.2.0:
{
"description": "osd_pgtemp(e49509 {2.162=[378,160,56]} v49509)",
"initiated_at": "2020-05-07 14:01:17.837714",
"age": 79318.903942007993,
"duration": 79318.903961598,
"type_data": {
"events": [
{
"time": "2020-05-07 14:01:17.837714",
"event": "initiated"
},
{
"time": "2020-05-07 14:01:17.837714",
"event": "header_read"
},
{
"time": "2020-05-07 14:01:17.837716",
"event": "throttled"
},
{
"time": "2020-05-07 14:01:17.837720",
"event": "all_read"
},
{
"time": "2020-05-07 14:01:17.837738",
"event": "dispatched"
},
{
"time": "2020-05-07 14:01:17.837740",
"event": "mon:_ms_dispatch"
},
{
"time": "2020-05-07 14:01:17.837741",
"event": "mon:dispatch_op"
},
{
"time": "2020-05-07 14:01:17.837741",
"event": "psvc:dispatch"
},
{
"time": "2020-05-07 14:01:17.837742",
"event": "osdmap:preprocess_query"
},
{
"time": "2020-05-07 14:01:17.837763",
"event": "forward_request_leader"
},
{
"time": "2020-05-07 14:01:17.837778",
"event": "forwarded"
}
],
"info": {
"seq": 882657,
"src_is_mon": false,
"source": "osd.378 v1:10.240.2.217:6861/288088",
"forwarded_to_leader": true
}
}
}
mon status output (for one that has slow ops):
{
"name": "MSK-SR1-R2-CEPH-HOT1",
"rank": 1,
"state": "peon",
"election_epoch": 62,
"quorum": [
0,
1,
2
],
"quorum_age": 81311,
"features": {
"required_con": "2449958747315912708",
"required_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus"
],
"quorum_con": "4611087854031667199",
"quorum_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus"
]
},
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 2,
"fsid": "ad99506a-05a5-11e8-975e-74d4351a7990",
"modified": "2020-04-26 19:22:44.160085",
"created": "2020-04-26 19:13:21.011855",
"min_mon_release": 14,
"min_mon_release_name": "nautilus",
"features": {
"persistent": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "MSK-SR1-R1-CEPH-HOT1",
"public_addrs": {
"addrvec": [
{
"type": "v1",
"addr": "10.240.2.201:6789",
"nonce": 0
},
{
"type": "v2",
"addr": "10.240.2.201:3300",
"nonce": 0
}
]
},
"addr": "10.240.2.201:6789/0",
"public_addr": "10.240.2.201:6789/0"
},
{
"rank": 1,
"name": "MSK-SR1-R2-CEPH-HOT1",
"public_addrs": {
"addrvec": [
{
"type": "v1",
"addr": "10.240.2.202:6789",
"nonce": 0
},
{
"type": "v2",
"addr": "10.240.2.202:3300",
"nonce": 0
}
]
},
"addr": "10.240.2.202:6789/0",
"public_addr": "10.240.2.202:6789/0"
},
{
"rank": 2,
"name": "MSK-SR1-R3-CEPH-HOT1",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "10.240.2.203:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.240.2.203:6789",
"nonce": 0
}
]
},
"addr": "10.240.2.203:6789/0",
"public_addr": "10.240.2.203:6789/0"
}
]
},
"feature_map": {
"mon": [
{
"features": "0x3ffddff8ffacffff",
"release": "luminous",
"num": 1
}
],
"osd": [
{
"features": "0x3ffddff8ffacffff",
"release": "luminous",
"num": 200
}
],
"client": [
{
"features": "0x27018fb86aa42ada",
"release": "jewel",
"num": 10
},
{
"features": "0x3ffddff8ffacffff",
"release": "luminous",
"num": 786
}
]
}
}
Updated by Daniel Poelzleithner almost 4 years ago
We had this issue yesterday. We had a broken mon cluster, which I was able to repair by shutting down all mons, scaling the monmap down to 1, and starting one mon. After the OSDs came back online, I zapped all the other mons and let them rejoin the cluster. I had to restart all MDS daemons, as they had strange hiccups and also caused SLOW requests.
Updated by Kefu Chai over 3 years ago
Please note, this fix was backported as f0697a9af54bf866572036bd6d582abd5299d0c8:
$ git tag --contains f0697a9af54bf866572036bd6d582abd5299d0c8
Thu Sep 10 12:41:54 2020
v14.2.10
v14.2.5
v14.2.6
v14.2.7
v14.2.8
v14.2.9
It is included in Nautilus 14.2.5 and up. If you are still running a version < 14.2.5, consider upgrading at least your monitors to 14.2.5.
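To check whether a monitor's reported version already carries the Nautilus backport mentioned above, a plain tuple comparison against 14.2.5 suffices. A hypothetical helper (the version-string format matches the `ceph versions` output quoted earlier in this ticket; this sketch covers only the Nautilus line, not the separate mimic/luminous backports #41863/#41864):

```python
import re

def has_mon_fix(version_string):
    """True if a 'ceph version x.y.z' string is >= 14.2.5, the first
    Nautilus release containing backport f0697a9."""
    m = re.search(r"ceph version (\d+)\.(\d+)\.(\d+)", version_string)
    if not m:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(map(int, m.groups())) >= (14, 2, 5)

print(has_mon_fix("ceph version 14.2.2 (4f8fa0a...) nautilus (stable)"))  # False
print(has_mon_fix("ceph version 14.2.10 (abc...) nautilus (stable)"))     # True
```

Running this over each key of the "mon" section of `ceph versions` output would flag any monitor still on an affected release.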