Bug #47654: test_mon_pg: mon fails to join quorum to due election strategy mismatch - RADOS - Ceph

Actions

Copy link

Bug #47654

closed

test_mon_pg: mon fails to join quorum to due election strategy mismatch

Added by Neha Ojha over 3 years ago. Updated almost 3 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Greg Farnum

Category:

Target version:

% Done:

Source:

Tags:

Backport:

pacific

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

Monitor

Pull request ID:

40483

Crash signature (v1):

Crash signature (v2):

Description

2020-09-25T09:27:25.786 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:2938: main:  test_mon_pg
2020-09-25T09:27:25.786 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:2037: test_mon_pg:  wait_for_health_ok
2020-09-25T09:27:25.787 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1717: wait_for_health_ok:  wait_for_health HEALTH_OK
2020-09-25T09:27:25.787 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1696: wait_for_health:  local grepstr=HEALTH_OK
2020-09-25T09:27:25.787 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1697: wait_for_health:  delays=($(get_timeout_delays $TIMEOUT .1))
2020-09-25T09:27:25.788 INFO:tasks.workunit.client.0.smithi158.stderr://home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1697: wait_for_health:  get_timeout_delays 300 .1
.
.
.
2020-09-25T09:32:39.895 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1700: wait_for_health:  ceph health detail
2020-09-25T09:32:39.895 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1700: wait_for_health:  grep HEALTH_OK
2020-09-25T09:32:40.405 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1701: wait_for_health:  ((  27 >= 27  ))
2020-09-25T09:32:40.406 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1702: wait_for_health:  ceph health detail
2020-09-25T09:32:40.892 INFO:tasks.workunit.client.0.smithi158.stdout:HEALTH_WARN 1/3 mons down, quorum a,b; 47 slow ops, oldest one blocked for 707 sec, mon.c has slow ops
2020-09-25T09:32:40.892 INFO:tasks.workunit.client.0.smithi158.stdout:[WRN] MON_DOWN: 1/3 mons down, quorum a,b
2020-09-25T09:32:40.892 INFO:tasks.workunit.client.0.smithi158.stdout:    mon.c (rank 2) addr [v2:172.21.15.158:3302/0,v1:172.21.15.158:6791/0] is down (out of quorum)
2020-09-25T09:32:40.893 INFO:tasks.workunit.client.0.smithi158.stdout:[WRN] SLOW_OPS: 47 slow ops, oldest one blocked for 707 sec, mon.c has slow ops
2020-09-25T09:32:40.903 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1703: wait_for_health:  return 1
2020-09-25T09:32:40.903 INFO:tasks.workunit.client.0.smithi158.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1717: wait_for_health_ok:  return 1

Looks like mon.c was down and hence we never got to HEALTH_OK.

rados/singleton-bluestore/{all/cephtool mon_election/classic msgr-failures/few msgr/async-v2only objectstore/bluestore-bitmap rados supported-random-distro$/{rhel_8}}

/a/teuthology-2020-09-25_07:01:01-rados-master-distro-basic-smithi/5466864

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Neha Ojha over 3 years ago

Priority changed from Normal to High

/a/teuthology-2020-09-28_07:01:02-rados-master-distro-basic-smithi/5476834

Actions

Copy link

Updated by Neha Ojha over 3 years ago

Priority changed from High to Urgent

/a/teuthology-2020-09-30_07:01:02-rados-master-distro-basic-smithi/5483682

Actions

Copy link

Updated by Neha Ojha over 3 years ago

/a/teuthology-2020-10-23_07:01:02-rados-master-distro-basic-smithi/5550826

Actions

Copy link

Updated by Neha Ojha over 3 years ago

sentry event: https://sentry.ceph.com/organizations/ceph/issues/10/events/2bdb1a2346cf4325b1bfaa7adf609f15/?project=2&statsPeriod=14d
fails in 1/10 times with this issue: https://pulpito.ceph.com/nojha-2020-10-24_00:17:08-rados:singleton-bluestore-master-distro-basic-smithi/

Actions

Copy link

Updated by Neha Ojha over 3 years ago

/a/teuthology-2020-10-28_07:01:02-rados-master-distro-basic-smithi/5567283

Actions

Copy link

Updated by Neha Ojha over 3 years ago

I am able to reproduce this withtout msgr failure injection.

rados:singleton-bluestore/{all/cephtool mon_election/connectivity msgr/async-v2only objectstore/bluestore-bitmap rados supported-random-distro$/{centos_8}}

/a/nojha-2020-10-28_21:12:45-rados:singleton-bluestore-master-distro-basic-smithi/5569512/

Actions

Copy link

Updated by Neha Ojha over 3 years ago

Looking at logs from /a/nojha-2020-10-28_21:12:45-rados:singleton-bluestore-master-distro-basic-smithi/5569512/

We fail because mon.c is down.

2020-10-28T22:09:49.408 INFO:tasks.ceph.mon.c.smithi109.stderr:2020-10-28T22:09:49.406+0000 7fa47aadd700 -1 mon.c@2(electing) e8 get_health_metrics reporting 5 slow ops, oldest is log(1 entries from seq 170 at 2020-10-28T21:57:53.031500+0000)
2020-10-28T22:09:49.689 INFO:tasks.workunit.client.0.smithi109.stdout:HEALTH_WARN 1/3 mons down, quorum a,b; 5 slow ops, oldest one blocked for 711 sec, mon.c has slow ops
2020-10-28T22:09:49.689 INFO:tasks.workunit.client.0.smithi109.stdout:[WRN] MON_DOWN: 1/3 mons down, quorum a,b
2020-10-28T22:09:49.690 INFO:tasks.workunit.client.0.smithi109.stdout:    mon.c (rank 2) addr [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] is down (out of quorum)
2020-10-28T22:09:49.690 INFO:tasks.workunit.client.0.smithi109.stdout:[WRN] SLOW_OPS: 5 slow ops, oldest one blocked for 711 sec, mon.c has slow ops
2020-10-28T22:09:49.700 INFO:tasks.workunit.client.0.smithi109.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1703: wait_for_health:  return 1

Looking at mon.c logs, this is where the first slow op is seen:

2020-10-28T21:58:24.070+0000 7fa47aadd700 -1 mon.c@2(electing) e8 get_health_metrics reporting 1 slow ops, oldest is log(1 entries from seq 170 at 2020-10-28T21:57:53.031500+0000)

This op arrived here:

2020-10-28T21:57:53.033+0000 7fa4782d8700  1 -- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] <== mon.2 v2:172.21.15.109:3302/0 0 ==== log(1 entries from seq 170 at 2020-10-28T21:57:53.031500+0000) v1 ==== 0+0+0 (unknown 0 0 0) 0x5584ba76f500 con 0x5584b914ac00

The problem seems to be that mon.c never gets out of the electing state, with repeated "failed to assign global_id" and connection resets.

2020-10-28T21:57:58.150+0000 7fa4772d6700 10 mon.c@2(electing) e6 handle_auth_request con 0x5584b9e93c00 (start) method 2 payload 22
2020-10-28T21:57:58.150+0000 7fa4772d6700 10 mon.c@2(electing).auth v21 _assign_global_id inactive (num_mon 0 rank 0)
2020-10-28T21:57:58.150+0000 7fa4772d6700  1 mon.c@2(electing) e6 handle_auth_request failed to assign global_id
2020-10-28T21:57:58.150+0000 7fa4772d6700  1 --2- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] >>  conn(0x5584b9e93c00 0x5584ba0daa00 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2020-10-28T21:57:58.150+0000 7fa4782d8700 10 mon.c@2(electing) e6 ms_handle_reset 0x5584b9e93c00 -
2020-10-28T21:57:58.152+0000 7fa4772d6700  1 --2- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] >>  conn(0x5584bb28e400 0x5584ba0db900 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0).accept
2020-10-28T21:57:58.152+0000 7fa4772d6700  1 --2- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] >>  conn(0x5584bb28e400 0x5584ba0db900 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2020-10-28T21:57:58.152+0000 7fa4772d6700 10 mon.c@2(electing) e6 handle_auth_request con 0x5584bb28e400 (start) method 2 payload 22
2020-10-28T21:57:58.152+0000 7fa4772d6700 10 mon.c@2(electing).auth v21 _assign_global_id inactive (num_mon 0 rank 0)
2020-10-28T21:57:58.152+0000 7fa4772d6700  1 mon.c@2(electing) e6 handle_auth_request failed to assign global_id
2020-10-28T21:57:58.152+0000 7fa4772d6700  1 --2- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] >>  conn(0x5584bb28e400 0x5584ba0db900 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2020-10-28T21:57:58.152+0000 7fa4772d6700  1 -- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] reap_dead start
2020-10-28T21:57:58.152+0000 7fa4782d8700 10 mon.c@2(electing) e6 ms_handle_reset 0x5584bb28e400 -

Actions

Copy link

Updated by Neha Ojha over 3 years ago

Based on f7099f72faccb09aea5054c0b428bf89be67141c, "failed to assign global_id" is expected when we are not quorum. The fact that this happens repeatedly seems to a symptom of the problem, not the cause.

mon/AuthMonitor: provide auth_lock-safe _assign_global_id()

Break assign_global_id() into two parts:

- _assign_global_id(), called under auth_lock, that does nothing but
  assign an id (or fail).  This makes use of the num_mon and num_rank
  values, which means it only works while we are in quorum.
- increase_max_global_id(), if it looks like we should.

Also, adjust _should_increase_max_global_id() to also work under auth_lock
since we are looking at last_allocated_id.  All last_allocated_id users
are not protected by auth_lock.

Fixes: http://tracker.ceph.com/issues/38372
Signed-off-by: Sage Weil <sage@redhat.com>

Actions

Copy link

Updated by Neha Ojha over 3 years ago

When mon.c calls for election

2020-10-28T21:57:58.394+0000 7fa4782d8700  1 -- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] send_to--> mon [v2:172.21.15.109:3300/0,v1:172.21.15.109:6789/0] -- election(b71fc7b4-6851-40a0-ae18-b429e5c51ccd propose rel 16 e29) v9 -- ?+0 0x5584b9e078c0
2020-10-28T21:57:58.394+0000 7fa4782d8700  1 -- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] --> [v2:172.21.15.109:3300/0,v1:172.21.15.109:6789/0] -- election(b71fc7b4-6851-40a0-ae18-b429e5c51ccd propose rel 16 e29) v9 -- 0x5584b9e078c0 con 0x5584b9e0a800
2020-10-28T21:57:58.394+0000 7fa4782d8700  1 -- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] send_to--> mon [v2:172.21.15.109:3301/0,v1:172.21.15.109:6790/0] -- election(b71fc7b4-6851-40a0-ae18-b429e5c51ccd propose rel 16 e29) v9 -- ?+0 0x5584b9e72240
2020-10-28T21:57:58.394+0000 7fa4782d8700  1 -- [v2:172.21.15.109:3302/0,v1:172.21.15.109:6791/0] --> [v2:172.21.15.109:3301/0,v1:172.21.15.109:6790/0] -- election(b71fc7b4-6851-40a0-ae18-b429e5c51ccd propose rel 16 e29) v9 -- 0x5584b9e72240 con 0x5584b914bc00

mon.a and mon.b reject it because of "somehow got an Election message with different strategy"

2020-10-28T21:57:58.394+0000 7f7e3695f700  1 -- [v2:172.21.15.109:3300/0,v1:172.21.15.109:6789/0] <== mon.2 v2:172.21.15.109:3302/0 2090 ==== election(b71fc7b4-6851-40a0-ae18-b429e5c51ccd propose rel 16 e29) v9 ==== 1037+0+0 (secure 0 0 0) 0x562bd86bed80 con 0x562bd61adc00
2020-10-28T21:57:58.394+0000 7f7e3695f700  5 mon.a@0(electing).elector(29) dispatch somehow got an Election message with different strategy ^C from local 1; dropping for now to let race resolve

2020-10-28T21:57:58.394+0000 7f05a69e7700  1 -- [v2:172.21.15.109:3301/0,v1:172.21.15.109:6790/0] <== mon.2 v2:172.21.15.109:3302/0 671 ==== election(b71fc7b4-6851-40a0-ae18-b429e5c51ccd propose rel 16 e29) v9 ==== 1037+0+0 (secure 0 0 0) 0x55707a99c480 con 0x557079277c00
2020-10-28T21:57:58.394+0000 7f05a69e7700  5 mon.b@1(electing).elector(29) dispatch somehow got an Election message with different strategy ^C from local 1; dropping for now to let race resolve

This keeps happening when mon.c tries to join quorum. It appears that mon.c has no idea about election strategy.

Actions

Copy link

#10

Updated by Greg Farnum over 3 years ago

Hmm, I'm confused about the "^C" output as the incoming message strategy. Going over things, as best I can tell those are always properly initialized by the sender on construction. It can default-initialize to 0 (rather than the valid 1/2/3) but that doesn't seem to be used; if the message is too old to know about such things it's just decoded as 1 (classic).

I do see the /a/nojha-2020-10-28_21:12:45-rados:singleton-bluestore-master-distro-basic-smithi/5569512/orig.config.yaml sets "mon election default strategy: 3", yet the monitors say they are using strategy 1. This would lead me to think the different monitors are going through different setup paths and some don't respect that (or get to see it) — but that doesn't appear to be the case.

On first inspection I am quite puzzled.

Actions

Copy link

#11

Updated by Greg Farnum over 3 years ago

Oh the bug does occur while executing commands to change the strategy. But this all still looks fine to me and certainly not like anything that would result in unprintable values!

Actions

Copy link

#12

Updated by Neha Ojha over 3 years ago

Subject changed from qa/workunits/cephtool/test.sh: test_mon_pg fails to start due to MON_DOWN to test_mon_pg: mon fails to join quorum to due election strategy mismatch

Actions

Copy link

#13

Updated by Neha Ojha over 3 years ago

Status changed from New to Triaged

Greg Farnum wrote:

Oh the bug does occur while executing commands to change the strategy. But this all still looks fine to me and certainly not like anything that would result in unprintable values!

I can get around the unprintable value by casting em->strategy using ElectionLogic::election_strategy.
The real issue seems to be that any election strategy change is committed on mons in quorum. So here mon.c was not part of quorum, so it was never communicated the election strategy change to classic. Therefore, it continues to call for election using the old strategy 3.

Logs that reflect the above: /a/nojha-2020-11-03_21:51:09-rados:singleton-bluestore-wip-47654-distro-basic-smithi/5588297/

Actions

Copy link

#14

Updated by Greg Farnum over 3 years ago

Neha Ojha wrote:

Greg Farnum wrote:

Oh the bug does occur while executing commands to change the strategy. But this all still looks fine to me and certainly not like anything that would result in unprintable values!

I can get around the unprintable value by casting em->strategy using ElectionLogic::election_strategy.
The real issue seems to be that any election strategy change is committed on mons in quorum. So here mon.c was not part of quorum, so it was never communicated the election strategy change to classic. Therefore, it continues to call for election using the old strategy 3.

Logs that reflect the above: /a/nojha-2020-11-03_21:51:09-rados:singleton-bluestore-wip-47654-distro-basic-smithi/5588297/

Hmm, yes, something's gone wrong here. The election strategy is committed into the monmap and generates a new one, and the out-of-quorum monitor should pick up the change there. But the mon.a log clearly shows mon.c sending messages with the wrong strategy at timestamps 2020-11-03T22:19:28.707 and 2020-11-03T22:19:33.698, despite mon.c upgrading from 7 to 8 at 2020-11-03T22:19:28.695.

Will investigate!

Actions

Copy link

#15