Bug #45318

"Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml

Added by Brad Hubbard almost 4 years ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
medium-hanging-fruit
Backport:
pacific,octopus,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/teuthology-2020-04-26_02:30:03-rados-octopus-distro-basic-smithi/4984906

The mon.d log shows it came back up around 09:41:11.

2020-04-26T09:41:09.547+0000 7f3e7a667540  4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
2020-04-26T09:41:09.551+0000 7f3e7a667540  4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2020-04-26T09:41:09.551+0000 7f3e7a667540  0 ceph-mon: created monfs at /var/lib/ceph/mon/ceph-d for mon.d
2020-04-26T09:41:11.139+0000 7f2561752540  0 ceph version 15.2.1-136-ga8c125c7d7 (a8c125c7d78f5cd973863993d258cd717ade4c99) octopus (stable), process ceph-mon, pid 12193

Around that time, the teuthology log shows:

2020-04-26T09:41:11.031 INFO:tasks.ceph.mon.d:Restarting daemon
2020-04-26T09:41:11.031 INFO:teuthology.orchestra.run.smithi101:> true
2020-04-26T09:41:11.036 INFO:teuthology.orchestra.run.smithi101:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i d
2020-04-26T09:41:11.082 INFO:tasks.ceph.mon.d:Started

So the monitor is restarted around 09:41:11, but the warning is issued at 09:41:26. The monitor log shows that quorum wasn't regained until 09:41:27.

2020-04-26T09:41:27.343+0000 7f254c186700  7 mon.d@2(peon).log v7 update_from_paxos applying incremental log 7 2020-04-26T09:41:26.428693+0000 mon.b (mon.0) 27 : cluster [INF] mon.b is new leader, mons b,a,d,c,f,e in quorum (ranks 0,1,2,3,4,5)

I don't think we can whitelist this message if we want to catch actual failures in the mons during the test.
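For reference, silencing this warning would mean adding a `log-ignorelist` entry to the test's teuthology YAML, roughly along these lines (a hypothetical sketch; the exact patterns and override layout are assumptions, and as noted above this would also mask genuine mon failures during the test):

```yaml
# Hypothetical sketch only: ignoring MON_DOWN would also hide real
# monitor failures, which is why this report argues against it.
overrides:
  ceph:
    log-ignorelist:
      - 'MON_DOWN'
      - 'mons down'
```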


Related issues

Related to RADOS - Bug #57900: mon/crush_ops.sh: mons out of quorum (In Progress)

History

#1 Updated by Brad Hubbard almost 4 years ago

  • Project changed from Ceph to RADOS

#2 Updated by Brad Hubbard almost 4 years ago

  • Description updated (diff)

#3 Updated by Brad Hubbard almost 4 years ago

  • Description updated (diff)

#4 Updated by Neha Ojha over 3 years ago

  • Backport set to octopus

rados/multimon/{clusters/21 msgr-failures/few msgr/async-v1only no_pools objectstore/bluestore-comp-zlib rados supported-random-distro$/{rhel_8} tasks/mon_clock_no_skews}

/a/yuriw-2020-06-06_21:25:23-rados-wip-yuri-master_6.6.20-distro-basic-smithi/5122668

#5 Updated by Brad Hubbard over 3 years ago

rados/multimon/{clusters/21 msgr-failures/few msgr/async-v1only no_pools objectstore/bluestore-comp-zlib rados supported-random-distro$/{rhel_latest} tasks/mon_clock_no_skews}

/a/yuriw-2020-07-13_23:00:15-rados-wip-yuri8-testing-2020-07-13-1946-octopus-distro-basic-smithi/5223949

#6 Updated by Neha Ojha about 2 years ago

  • Status changed from New to Can't reproduce

#7 Updated by Neha Ojha about 2 years ago

  • Subject changed from Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml to octopus: Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml
  • Status changed from Can't reproduce to New

Octopus still has this issue: /a/yuriw-2022-01-24_18:01:47-rados-wip-yuri10-testing-2022-01-24-0810-octopus-distro-default-smithi/6638290/

rados/multimon/{clusters/21 msgr-failures/many msgr/async-v1only no_pools objectstore/bluestore-comp-zstd rados supported-random-distro$/{centos_8} tasks/mon_clock_no_skews}

#8 Updated by Laura Flores about 2 years ago

Happening in Pacific too:

/a/yuriw-2022-01-27_14:57:16-rados-wip-yuri-testing-2022-01-26-1810-pacific-distro-default-smithi/6643686

rados/multimon/{clusters/9 mon_election/classic msgr-failures/few msgr/async-v1only no_pools objectstore/bluestore-stupid rados supported-random-distro$/{rhel_8} tasks/mon_clock_no_skews}

#9 Updated by Neha Ojha about 2 years ago

  • Backport changed from octopus to octopus, pacific

#10 Updated by Neha Ojha almost 2 years ago

  • Priority changed from High to Normal

Not seeing this very frequently; most likely a result of failure injection.

#11 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-06-02_00:50:42-rados-wip-yuri4-testing-2022-06-01-1350-pacific-distro-default-smithi/6859916

#12 Updated by Radoslaw Zarzynski over 1 year ago

  • Subject changed from octopus: Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml to Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml
  • Backport changed from octopus, pacific to pacific,octopus,quincy

This isn't octopus-specific, as we saw it in Pacific as well.

#13 Updated by Radoslaw Zarzynski over 1 year ago

  • Tags set to medium-hanging-fruit

#14 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2022-08-04_11:58:29-rados-wip-yuri3-testing-2022-08-03-0828-pacific-distro-default-smithi/6958138

#15 Updated by Laura Flores over 1 year ago

  • Tags set to test-failure

#16 Updated by Laura Flores over 1 year ago

  • Related to Bug #57900: mon/crush_ops.sh: mons out of quorum added
