Bug #57900

open

mon/crush_ops.sh: mons out of quorum

Added by Laura Flores over 1 year ago. Updated over 1 year ago.

Status: In Progress
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/teuthology-2022-10-09_07:01:03-rados-quincy-distro-default-smithi/7059463

2022-10-09T17:24:06.314 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700  1 -- 172.21.15.78:0/2058838378 >> [v2:172.21.15.88:6824/34015,v1:172.21.15.88:6825/34015] conn(0x7fb440034f40 msgr2=0x7fb4400373f0 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2022-10-09T17:24:06.315 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700  1 --2- 172.21.15.78:0/2058838378 >> [v2:172.21.15.88:6824/34015,v1:172.21.15.88:6825/34015] conn(0x7fb440034f40 0x7fb4400373f0 secure :-1 s=READY pgs=204 cs=0 l=1 rev1=1 crypto rx=0x7fb454003bd0 tx=0x7fb4540092c0 comp rx=0 tx=0).stop
2022-10-09T17:24:06.315 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700 10 monclient: shutdown
2022-10-09T17:24:06.315 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700 20 monclient: shutdown discarding 0 pending message(s)
2022-10-09T17:24:06.315 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700  1 -- 172.21.15.78:0/2058838378 >> [v2:172.21.15.78:3302/0,v1:172.21.15.78:6791/0] conn(0x7fb460137660 msgr2=0x7fb46011cb00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2022-10-09T17:24:06.316 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700  1 --2- 172.21.15.78:0/2058838378 >> [v2:172.21.15.78:3302/0,v1:172.21.15.78:6791/0] conn(0x7fb460137660 0x7fb46011cb00 secure :-1 s=READY pgs=146 cs=0 l=1 rev1=1 crypto rx=0x7fb44800b4c0 tx=0x7fb4480090f0 comp rx=0 tx=0).stop
2022-10-09T17:24:06.316 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb465556700  1 -- 172.21.15.78:0/2058838378 reap_dead start
2022-10-09T17:24:06.316 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb465556700  1 -- 172.21.15.78:0/2058838378 reap_dead start
2022-10-09T17:24:06.316 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700  1 -- 172.21.15.78:0/2058838378 shutdown_connections
2022-10-09T17:24:06.317 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700  1 -- 172.21.15.78:0/2058838378 >> 172.21.15.78:0/2058838378 conn(0x7fb4600a7030 msgr2=0x7fb4600a4eb0 unknown :-1 s=STATE_NONE l=0).mark_down
2022-10-09T17:24:06.317 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700  1 -- 172.21.15.78:0/2058838378 shutdown_connections
2022-10-09T17:24:06.317 INFO:teuthology.orchestra.run.smithi078.stderr:2022-10-09T17:24:06.162+0000 7fb466fb9700  1 -- 172.21.15.78:0/2058838378 wait complete.
2022-10-09T17:24:06.318 DEBUG:teuthology.orchestra.run.smithi078:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell mon.c mon_status
2022-10-09T17:24:06.319 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):

Around the time of the Traceback, we can see that monitors were out of quorum:

2022-10-09T17:24:06.299+0000 7fc65e5a0700 20 mon.c@4(peon).health dump:{
    "quorum_health": {},
    "leader_health": {
        "MON_DOWN": {
            "severity": "HEALTH_WARN",
            "summary": {
                "message": "4/9 mons down, quorum a,b,c,h,e",
                "count": 4
            },
            "detail": [
                {
                    "message": "mon.f (rank 1) addr [v2:172.21.15.88:3300/0,v1:172.21.15.88:6789/0] is down (out of quorum)" 
                },
                {
                    "message": "mon.g (rank 3) addr [v2:172.21.15.88:3301/0,v1:172.21.15.88:6790/0] is down (out of quorum)" 
                },
                {
                    "message": "mon.d (rank 6) addr [v2:172.21.15.78:3303/0,v1:172.21.15.78:6792/0] is down (out of quorum)" 
                },
                {
                    "message": "mon.i (rank 7) addr [v2:172.21.15.88:3303/0,v1:172.21.15.88:6792/0] is down (out of quorum)" 
                }
            ]
        }
    }
}
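
When triaging similar runs, it can help to pull just the names of the down monitors out of a health dump like the one above. The following is a minimal sketch, not part of any Ceph tooling: `down_mons` is a hypothetical helper, and it assumes the detail messages keep the exact "mon.X (rank N) addr [...] is down" wording shown in this log.

```shell
# Hypothetical helper: read health-dump text on stdin and print the name of
# each monitor reported as down, one per line.
# Assumes the message format "mon.<id> (rank <n>) addr <addrvec> is down".
down_mons() {
    grep -o 'mon\.[A-Za-z0-9_-]* (rank [0-9]*) addr [^ ]* is down' |
        cut -d' ' -f1
}
```

For example, `down_mons < mon.c.log | sort -u` (with `mon.c.log` standing in for whichever monitor log is being inspected) would list each affected monitor once.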

Could be related to https://tracker.ceph.com/issues/45318, but I've seen two instances now where this fails specifically during mon/crush_ops.sh.

Second instance on Pacific:
/a/yuriw-2022-10-05_20:44:57-rados-wip-yuri4-testing-2022-10-05-0917-pacific-distro-default-smithi/7055682


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #45318: "Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml (New)

Actions #1

Updated by Laura Flores over 1 year ago

  • Related to Bug #45318: "Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml added
Actions #2

Updated by Laura Flores over 1 year ago

  • Backport set to quincy,pacific
Actions #3

Updated by Laura Flores over 1 year ago

  • Project changed from Ceph to RADOS
Actions #4

Updated by Radoslaw Zarzynski over 1 year ago

Just a suggestion from the bug scrub: this is a mon thrashing test. None of the mon logs seems to have a trace of a crash, but there are some network-related faults. Too little time for a mon's reboot?

Actions #5

Updated by Laura Flores over 1 year ago

@Radoslaw Smigielski so the suggestion is to give the mons more time to reboot?

This is the workunit:
https://github.com/ceph/ceph/blob/main/qa/workunits/mon/crush_ops.sh

I can check to see which command was last issued, and then perhaps have the script sleep for a few seconds.
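
One possible shape for that wait, as a minimal sketch rather than the actual fix: instead of a fixed sleep, retry a probe command until it succeeds or a time budget runs out. `wait_for` is a hypothetical helper that does not exist in crush_ops.sh, and the timeout and interval values in the usage comment are assumptions.

```shell
# Hypothetical helper: run a command repeatedly until it succeeds, sleeping
# between attempts, and give up once the accumulated sleep time reaches the
# timeout. Both timeout and interval are whole seconds. Time spent inside the
# command itself is not counted.
wait_for() {
    wf_timeout=$1
    wf_interval=$2
    shift 2
    wf_elapsed=0
    until "$@"; do
        if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
            echo "wait_for: gave up after ${wf_elapsed}s: $*" >&2
            return 1
        fi
        sleep "$wf_interval"
        wf_elapsed=$((wf_elapsed + wf_interval))
    done
}

# Assumed usage in the workunit (needs a running cluster), probing the same
# mon that the failing teuthology command targeted:
#   wait_for 60 2 ceph tell mon.c mon_status >/dev/null
```

Polling returns as soon as the mon answers, so it costs less than a fixed sleep on healthy runs while still tolerating a slow reboot.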

Actions #6

Updated by Laura Flores over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Laura Flores
Actions #7

Updated by Laura Flores over 1 year ago

/a/yuriw-2023-01-24_22:20:59-rados-wip-yuri-testing-2023-01-23-0926-distro-default-smithi/7136648

Time to revisit this one. Seems very transient.

Actions #8

Updated by Radoslaw Zarzynski over 1 year ago

> @Radoslaw Smigielski so the suggestion is to give the mons more time to reboot?

Yes, exactly that.
