Bug #45318
Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml
Description
/a/teuthology-2020-04-26_02:30:03-rados-octopus-distro-basic-smithi/4984906
The MON log shows it came back up around 09:41:11.
2020-04-26T09:41:09.547+0000 7f3e7a667540 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
2020-04-26T09:41:09.551+0000 7f3e7a667540 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2020-04-26T09:41:09.551+0000 7f3e7a667540 0 ceph-mon: created monfs at /var/lib/ceph/mon/ceph-d for mon.d
2020-04-26T09:41:11.139+0000 7f2561752540 0 ceph version 15.2.1-136-ga8c125c7d7 (a8c125c7d78f5cd973863993d258cd717ade4c99) octopus (stable), process ceph-mon, pid 12193
Around that time in the teuthology log we see:
2020-04-26T09:41:11.031 INFO:tasks.ceph.mon.d:Restarting daemon
2020-04-26T09:41:11.031 INFO:teuthology.orchestra.run.smithi101:> true
2020-04-26T09:41:11.036 INFO:teuthology.orchestra.run.smithi101:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i d
2020-04-26T09:41:11.082 INFO:tasks.ceph.mon.d:Started
So it is restarted around 09:41:11 but the warning is issued at 09:41:26. The monitor log shows that quorum wasn't achieved until 09:41:27.
2020-04-26T09:41:27.343+0000 7f254c186700 7 mon.d@2(peon).log v7 update_from_paxos applying incremental log 7 2020-04-26T09:41:26.428693+0000 mon.b (mon.0) 27 : cluster [INF] mon.b is new leader, mons b,a,d,c,f,e in quorum (ranks 0,1,2,3,4,5)
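To make the timing gap explicit, here is a minimal Python sketch that computes the intervals between the events quoted above. The timestamps are copied from the log excerpts in this report; the variable names and the seconds_between helper are hypothetical, and the "new leader" cluster log entry is used as a stand-in for when full quorum was announced.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two timestamps in FMT."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the excerpts in this report (all +0000).
restart_requested = "2020-04-26T09:41:11.031"     # teuthology: "Restarting daemon"
mon_process_start = "2020-04-26T09:41:11.139"     # mon.d log: ceph-mon started, pid 12193
new_leader_logged = "2020-04-26T09:41:26.428693"  # cluster log: mon.b is new leader
mon_d_applied_log = "2020-04-26T09:41:27.343"     # mon.d log: update_from_paxos

startup_delay = seconds_between(restart_requested, mon_process_start)
quorum_delay = seconds_between(mon_process_start, new_leader_logged)
applied_delay = seconds_between(mon_process_start, mon_d_applied_log)

print(f"restart request -> process start: {startup_delay:.3f}s")
print(f"process start -> full quorum announced: {quorum_delay:.3f}s")
print(f"process start -> mon.d applies that log entry: {applied_delay:.3f}s")
```

Under these assumptions, mon.d started within a fraction of a second of the restart request but took roughly 15 seconds to rejoin quorum, which is the window in which the MON_DOWN warning was raised.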
I don't think we can whitelist this message if we want to catch actual failures in the mons during the test.
Updated by Neha Ojha almost 4 years ago
- Backport set to octopus
rados/multimon/{clusters/21 msgr-failures/few msgr/async-v1only no_pools objectstore/bluestore-comp-zlib rados supported-random-distro$/{rhel_8} tasks/mon_clock_no_skews}
/a/yuriw-2020-06-06_21:25:23-rados-wip-yuri-master_6.6.20-distro-basic-smithi/5122668
Updated by Brad Hubbard almost 4 years ago
'msgr-failures/few', 'msgr/async-v1only', 'no_pools', 'objectstore/bluestore-comp-zlib', 'rados', 'rados/multimon/{clusters/21', 'supported-random-distro$/{rhel_latest}', 'tasks/mon_clock_no_skews}'
/a/yuriw-2020-07-13_23:00:15-rados-wip-yuri8-testing-2020-07-13-1946-octopus-distro-basic-smithi/5223949
Updated by Neha Ojha over 2 years ago
- Status changed from New to Can't reproduce
Updated by Neha Ojha over 2 years ago
- Subject changed from Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml to octopus: Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml
- Status changed from Can't reproduce to New
Octopus still has this issue /a/yuriw-2022-01-24_18:01:47-rados-wip-yuri10-testing-2022-01-24-0810-octopus-distro-default-smithi/6638290/
rados/multimon/{clusters/21 msgr-failures/many msgr/async-v1only no_pools objectstore/bluestore-comp-zstd rados supported-random-distro$/{centos_8} tasks/mon_clock_no_skews}
Updated by Laura Flores about 2 years ago
Happening in Pacific too:
/a/yuriw-2022-01-27_14:57:16-rados-wip-yuri-testing-2022-01-26-1810-pacific-distro-default-smithi/6643686
rados/multimon/{clusters/9 mon_election/classic msgr-failures/few msgr/async-v1only no_pools objectstore/bluestore-stupid rados supported-random-distro$/{rhel_8} tasks/mon_clock_no_skews}
Updated by Neha Ojha about 2 years ago
- Backport changed from octopus to octopus, pacific
Updated by Neha Ojha about 2 years ago
- Priority changed from High to Normal
Not seeing this very frequently, most likely a result of failure injection
Updated by Laura Flores almost 2 years ago
/a/yuriw-2022-06-02_00:50:42-rados-wip-yuri4-testing-2022-06-01-1350-pacific-distro-default-smithi/6859916
Updated by Radoslaw Zarzynski almost 2 years ago
- Subject changed from octopus: Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml to Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml
- Backport changed from octopus, pacific to pacific,octopus,quincy
This isn't octopus-specific, as we saw it in pacific as well.
Updated by Radoslaw Zarzynski almost 2 years ago
- Tags set to medium-hanging-fruit
Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
/a/yuriw-2022-08-04_11:58:29-rados-wip-yuri3-testing-2022-08-03-0828-pacific-distro-default-smithi/6958138
Updated by Laura Flores over 1 year ago
- Tags set to test-failure
Updated by Laura Flores over 1 year ago
- Related to Bug #57900: mon/crush_ops.sh: mons out of quorum added