Bug #52316

qa/tasks/mon_thrash.py: _do_thrash AssertionError len(s['quorum']) == len(mons)

Added by Aishwarya Mathuria over 2 years ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 100%
Source:
Tags: backport_processed
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-08-17T03:12:45.055 INFO:tasks.workunit.client.0.smithi135.stderr:2021-08-17T03:12:45.052+0000 7f27d941a700 1 -- 172.21.15.135:0/2440531935 >> v1:172.21.15.13:6790/0 conn(0x7f27b000a460 legacy=0x7f27b000a840 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v1:172.21.15.13:6790/0
2021-08-17T03:12:45.055 INFO:tasks.workunit.client.0.smithi135.stderr:2021-08-17T03:12:45.052+0000 7f27d8c19700 1 -- 172.21.15.135:0/2440531935 >> v1:172.21.15.135:6789/0 conn(0x7f27b000db60 legacy=0x7f27b000f1a0 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v1:172.21.15.135:6789/0
2021-08-17T03:12:45.055 INFO:tasks.workunit.client.0.smithi135.stderr:2021-08-17T03:12:45.052+0000 7f27d8c19700 1 -- 172.21.15.135:0/2440531935 >> v1:172.21.15.13:6789/0 conn(0x7f27b0012880 legacy=0x7f27b000e6b0 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v1:172.21.15.13:6789/0
2021-08-17T03:12:52.741 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-08-17T03:12:52.743 INFO:tasks.workunit:Stopping ['mon/test_mon_osdmap_prune.sh'] on client.0...
2021-08-17T03:12:52.743 DEBUG:teuthology.orchestra.run.smithi135:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2021-08-17T03:12:52.992 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_a00b573203353a8db1f3e59f48e827ed27479b62/qa/tasks/workunit.py", line 134, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_a00b573203353a8db1f3e59f48e827ed27479b62/qa/tasks/workunit.py", line 426, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_321319b12ea4ff9b63c7655015a3156de2c3b279/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test mon/test_mon_osdmap_prune.sh) on smithi135 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=a00b573203353a8db1f3e59f48e827ed27479b62 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/mon/test_mon_osdmap_prune.sh'

Teuthology logs: https://pulpito.ceph.com/yuriw-2021-08-16_21:15:00-rados-wip-yuri-testing-master-8.16.21-distro-basic-smithi/6341993/


Related issues

Copied to RADOS - Backport #61450: pacific: qa/tasks/mon_thrash.py: _do_thrash AssertionError len(s['quorum']) == len(mons) (Resolved)
Copied to RADOS - Backport #61451: quincy: qa/tasks/mon_thrash.py: _do_thrash AssertionError len(s['quorum']) == len(mons) (Resolved)
Copied to RADOS - Backport #61452: reef: qa/tasks/mon_thrash.py: _do_thrash AssertionError len(s['quorum']) == len(mons) (Resolved)

History

#1 Updated by Neha Ojha over 2 years ago

  • Subject changed from mon/test_mon_osdmap_prune.sh fails to qa/tasks/mon_thrash.py: _do_thrash AssertionError len(s['quorum']) == len(mons)
2021-08-17T00:22:05.475 ERROR:tasks.mon_thrash.mon_thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_a00b573203353a8db1f3e59f48e827ed27479b62/qa/tasks/mon_thrash.py", line 232, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_a00b573203353a8db1f3e59f48e827ed27479b62/qa/tasks/mon_thrash.py", line 266, in _do_thrash
    assert len(s['quorum']) == len(mons)
AssertionError

Before this:

2021-08-17T00:21:52.681 INFO:tasks.mon_thrash.ceph_manager:quorum is size 3
2021-08-17T00:21:52.681 INFO:tasks.mon_thrash.mon_thrasher:making sure all monitors are in the quorum
2021-08-17T00:21:52.682 DEBUG:teuthology.orchestra.run.smithi013:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell mon.a mon_status
...
2021-08-17T00:21:53.071 INFO:teuthology.orchestra.run.smithi013.stdout:{
2021-08-17T00:21:53.071 INFO:teuthology.orchestra.run.smithi013.stdout:    "name": "a",
2021-08-17T00:21:53.071 INFO:teuthology.orchestra.run.smithi013.stdout:    "rank": 0,
2021-08-17T00:21:53.072 INFO:teuthology.orchestra.run.smithi013.stdout:    "state": "leader",
2021-08-17T00:21:53.072 INFO:teuthology.orchestra.run.smithi013.stdout:    "election_epoch": 42,
2021-08-17T00:21:53.072 INFO:teuthology.orchestra.run.smithi013.stdout:    "quorum": [
2021-08-17T00:21:53.072 INFO:teuthology.orchestra.run.smithi013.stdout:        0,
2021-08-17T00:21:53.072 INFO:teuthology.orchestra.run.smithi013.stdout:        1,
2021-08-17T00:21:53.073 INFO:teuthology.orchestra.run.smithi013.stdout:        2
2021-08-17T00:21:53.073 INFO:teuthology.orchestra.run.smithi013.stdout:    ],
...
2021-08-17T00:21:53.179 DEBUG:teuthology.orchestra.run.smithi013:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell mon.c mon_status
...
2021-08-17T00:22:05.419 INFO:teuthology.orchestra.run.smithi013.stdout:{
2021-08-17T00:22:05.419 INFO:teuthology.orchestra.run.smithi013.stdout:    "name": "c",
2021-08-17T00:22:05.419 INFO:teuthology.orchestra.run.smithi013.stdout:    "rank": 2,
2021-08-17T00:22:05.420 INFO:teuthology.orchestra.run.smithi013.stdout:    "state": "peon",
2021-08-17T00:22:05.420 INFO:teuthology.orchestra.run.smithi013.stdout:    "election_epoch": 46,
2021-08-17T00:22:05.420 INFO:teuthology.orchestra.run.smithi013.stdout:    "quorum": [
2021-08-17T00:22:05.425 INFO:teuthology.orchestra.run.smithi013.stdout:        0,
2021-08-17T00:22:05.425 INFO:teuthology.orchestra.run.smithi013.stdout:        2
2021-08-17T00:22:05.426 INFO:teuthology.orchestra.run.smithi013.stdout:    ],

The assertion error is seen because peon mon.c reports 2 mons (not 3) in quorum.
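
To make the failing condition concrete, here is a minimal sketch that replays the comparison from the assert against the two mon_status replies logged above. The helper name quorum_is_full and the monitor name "b" for rank 1 are assumptions made for illustration; only the replies for mon.a and mon.c appear in the log.

```python
# Trimmed mon_status replies, copied from the log excerpt above.
mon_a_status = {"name": "a", "rank": 0, "state": "leader",
                "election_epoch": 42, "quorum": [0, 1, 2]}
mon_c_status = {"name": "c", "rank": 2, "state": "peon",
                "election_epoch": 46, "quorum": [0, 2]}

# Assumption: the third monitor (rank 1) is named "b"; the log only
# shows that the quorum had size 3.
mons = ["a", "b", "c"]


def quorum_is_full(status, mons):
    """Hypothetical helper: same comparison as the failing assert."""
    return len(status["quorum"]) == len(mons)


results = {s["name"]: quorum_is_full(s, mons)
           for s in (mon_a_status, mon_c_status)}
print(results)  # {'a': True, 'c': False}
```

mon.a passes the check at election epoch 42, while mon.c fails it at election epoch 46, so the two replies describe different quorums.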

#2 Updated by Laura Flores almost 2 years ago

  • Backport set to octopus

/a/yuriw-2022-05-13_14:13:55-rados-wip-yuri3-testing-2022-05-12-1609-octopus-distro-default-smithi/6832711

mon.c reports 3 mons in quorum:

2022-05-14T10:26:57.539 INFO:teuthology.orchestra.run.smithi003.stdout:{
2022-05-14T10:26:57.540 INFO:teuthology.orchestra.run.smithi003.stdout:    "name": "c",
2022-05-14T10:26:57.540 INFO:teuthology.orchestra.run.smithi003.stdout:    "rank": 2,
2022-05-14T10:26:57.540 INFO:teuthology.orchestra.run.smithi003.stdout:    "state": "peon",
2022-05-14T10:26:57.540 INFO:teuthology.orchestra.run.smithi003.stdout:    "election_epoch": 90,
2022-05-14T10:26:57.541 INFO:teuthology.orchestra.run.smithi003.stdout:    "quorum": [
2022-05-14T10:26:57.541 INFO:teuthology.orchestra.run.smithi003.stdout:        0,
2022-05-14T10:26:57.541 INFO:teuthology.orchestra.run.smithi003.stdout:        1,
2022-05-14T10:26:57.541 INFO:teuthology.orchestra.run.smithi003.stdout:        2
2022-05-14T10:26:57.542 INFO:teuthology.orchestra.run.smithi003.stdout:    ],

... but mon.b only reports 2:

2022-05-14T10:27:10.038 INFO:teuthology.orchestra.run.smithi003.stdout:{
2022-05-14T10:27:10.038 INFO:teuthology.orchestra.run.smithi003.stdout:    "name": "b",
2022-05-14T10:27:10.038 INFO:teuthology.orchestra.run.smithi003.stdout:    "rank": 1,
2022-05-14T10:27:10.039 INFO:teuthology.orchestra.run.smithi003.stdout:    "state": "peon",
2022-05-14T10:27:10.039 INFO:teuthology.orchestra.run.smithi003.stdout:    "election_epoch": 96,
2022-05-14T10:27:10.039 INFO:teuthology.orchestra.run.smithi003.stdout:    "quorum": [
2022-05-14T10:27:10.039 INFO:teuthology.orchestra.run.smithi003.stdout:        0,
2022-05-14T10:27:10.040 INFO:teuthology.orchestra.run.smithi003.stdout:        1
2022-05-14T10:27:10.040 INFO:teuthology.orchestra.run.smithi003.stdout:    ],

...

Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_aa4a6ef62a4724eafd160ebcc111d3b5d7550b84/qa/tasks/mon_thrash.py", line 232, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_aa4a6ef62a4724eafd160ebcc111d3b5d7550b84/qa/tasks/mon_thrash.py", line 266, in _do_thrash
    assert len(s['quorum']) == len(mons)
AssertionError
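
One way to read the two replies above is to group them by election_epoch: mon.c answered at epoch 90 and mon.b at epoch 96, which suggests the quorum changed between the two queries. The sketch below only illustrates that comparison; the helper names are hypothetical and are not part of qa/tasks/mon_thrash.py.

```python
# Trimmed mon_status replies, copied from the octopus run above.
snapshots = [
    {"name": "c", "rank": 2, "state": "peon",
     "election_epoch": 90, "quorum": [0, 1, 2]},
    {"name": "b", "rank": 1, "state": "peon",
     "election_epoch": 96, "quorum": [0, 1]},
]


def group_by_election_epoch(statuses):
    """Hypothetical helper: map each election epoch to the monitors
    whose reply was produced at that epoch."""
    groups = {}
    for status in statuses:
        groups.setdefault(status["election_epoch"], []).append(status["name"])
    return groups


def same_epoch(statuses):
    """True only if every reply comes from the same election epoch."""
    return len({s["election_epoch"] for s in statuses}) == 1


print(group_by_election_epoch(snapshots))  # {90: ['c'], 96: ['b']}
print(same_epoch(snapshots))               # False
```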

#3 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-06-13_16:36:31-rados-wip-yuri7-testing-2022-06-13-0706-distro-default-smithi/6876523

Description: rados/monthrash/{ceph clusters/9-mons mon_election/connectivity msgr-failures/few msgr/async objectstore/bluestore-low-osd-mem-target rados supported-random-distro$/{centos_8} thrashers/sync-many workloads/rados_api_tests}

#4 Updated by Laura Flores about 1 year ago

  • Tags set to test-failure

/a/lflores-2023-02-09_16:38:16-rados-wip-lflores-testing-2023-02-06-1529-distro-default-smithi/7164042

#5 Updated by Laura Flores 12 months ago

/a/yuriw-2023-03-08_23:00:31-rados-wip-yuri11-testing-2023-03-08-1220-distro-default-smithi/7198899

#6 Updated by Nitzan Mordechai 9 months ago

  • Status changed from New to In Progress
  • Assignee set to Nitzan Mordechai

Since we are failing at the first assert of the quorum check, after only a few thrashing iterations, it looks like we are not delaying long enough between thrashes.
The quorum_age values show that not every monitor's tick has updated the quorum by the time of the check.

I will increase thrash_delay from 1 to 5, the default value of mon_tick_interval.
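
The timing problem described above can also be illustrated with a polling check that retries until every monitor reports the full quorum. This is a sketch of an alternative technique, not the change proposed in this comment (which only raises thrash_delay). The function wait_for_full_quorum, its parameters, and the get_status callback are all hypothetical; the 5-second default interval simply mirrors the mon_tick_interval default mentioned above.

```python
import time


def wait_for_full_quorum(get_status, mons, timeout=30.0, interval=5.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Hypothetical helper: poll until all monitors report a full quorum.

    get_status(mon) is assumed to return a dict shaped like the
    mon_status output, with at least the "state" and "quorum" keys.
    Returns True once every monitor is leader or peon and lists
    len(mons) members in its quorum, or False when the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        statuses = [get_status(m) for m in mons]
        if all(s["state"] in ("leader", "peon")
               and len(s["quorum"]) == len(mons)
               for s in statuses):
            return True
        if clock() >= deadline:
            return False
        # Give the monitors time to tick and refresh their quorum view.
        sleep(interval)
```

The sleep and clock parameters are injectable so the helper can be exercised with fake time instead of waiting in real time.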

#7 Updated by Nitzan Mordechai 9 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 51570

#8 Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from octopus to pacific,quincy,reef

#9 Updated by Backport Bot 9 months ago

  • Copied to Backport #61450: pacific: qa/tasks/mon_thrash.py: _do_thrash AssertionError len(s['quorum']) == len(mons) added

#10 Updated by Backport Bot 9 months ago

  • Copied to Backport #61451: quincy: qa/tasks/mon_thrash.py: _do_thrash AssertionError len(s['quorum']) == len(mons) added

#11 Updated by Backport Bot 9 months ago

  • Copied to Backport #61452: reef: qa/tasks/mon_thrash.py: _do_thrash AssertionError len(s['quorum']) == len(mons) added

#12 Updated by Backport Bot 9 months ago

  • Tags set to backport_processed

#13 Updated by Konstantin Shalygin 3 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
