Bug #46318


mon_recovery: quorum_status times out

Added by Neha Ojha almost 4 years ago. Updated 2 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-06-30T22:11:16.279 INFO:tasks.mon_recovery.ceph_manager:quorum is size 3
2020-06-30T22:11:16.279 INFO:tasks.mon_recovery:stopping all monitors
2020-06-30T22:11:16.280 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2020-06-30T22:11:16.280 INFO:teuthology.orchestra.run:waiting for 300
2020-06-30T22:11:16.290 INFO:tasks.ceph.mon.b:Stopped
2020-06-30T22:11:16.291 DEBUG:tasks.ceph.mon.c:waiting for process to exit
2020-06-30T22:11:16.291 INFO:teuthology.orchestra.run:waiting for 300
2020-06-30T22:11:16.302 INFO:tasks.ceph.mon.c:Stopped
2020-06-30T22:11:16.302 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2020-06-30T22:11:16.303 INFO:teuthology.orchestra.run:waiting for 300
2020-06-30T22:11:16.331 INFO:teuthology.orchestra.run.smithi188.stdout:ERROR: (22) Invalid argument
2020-06-30T22:11:16.331 INFO:teuthology.orchestra.run.smithi188.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2020-06-30T22:11:16.341 INFO:tasks.ceph.mon.a:Stopped
2020-06-30T22:11:16.342 INFO:tasks.mon_recovery:forming a minimal quorum for ['b', 'c', 'a'], then adding monitors
2020-06-30T22:11:16.342 INFO:tasks.ceph.mon.b:Restarting daemon
2020-06-30T22:11:16.342 INFO:teuthology.orchestra.run.smithi074:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i b
2020-06-30T22:11:16.344 INFO:teuthology.orchestra.run.smithi188:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_ops_in_flight
2020-06-30T22:11:16.346 INFO:tasks.ceph.mon.b:Started
2020-06-30T22:11:16.346 INFO:tasks.ceph.mon.c:Restarting daemon
2020-06-30T22:11:16.347 INFO:teuthology.orchestra.run.smithi188:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i c
2020-06-30T22:11:16.385 INFO:tasks.ceph.osd.3.smithi188.stderr:2020-06-30T22:11:16.383+0000 7fd37fcec700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 3  (PID: 26486) UID: 0
2020-06-30T22:11:16.386 INFO:tasks.ceph.mon.c:Started
2020-06-30T22:11:16.387 INFO:tasks.mon_recovery.ceph_manager:waiting for quorum size 2
2020-06-30T22:11:16.387 INFO:teuthology.orchestra.run.smithi188:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early quorum_status
.
.
2020-06-30T22:13:16.552 DEBUG:teuthology.orchestra.run:got remote process result: 124
2020-06-30T22:13:16.553 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 90, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 69, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2020-06-30-1711-octopus/qa/tasks/mon_recovery.py", line 64, in task
    manager.wait_for_mon_quorum_size(num)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2020-06-30-1711-octopus/qa/tasks/ceph_manager.py", line 2896, in wait_for_mon_quorum_size
    while not len(self.get_mon_quorum()) == size:
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2020-06-30-1711-octopus/qa/tasks/ceph_manager.py", line 2885, in get_mon_quorum
    out = self.raw_cluster_cmd('quorum_status')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2020-06-30-1711-octopus/qa/tasks/ceph_manager.py", line 1363, in raw_cluster_cmd
    stdout=BytesIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 204, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 446, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 160, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 182, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi188 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early quorum_status'

/a/yuriw-2020-06-30_21:52:40-rados-wip-yuri-testing-2020-06-30-1711-octopus-distro-basic-smithi/5191923
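For context, the traceback above boils down to a polling loop in qa/tasks/ceph_manager.py: wait_for_mon_quorum_size() repeatedly calls get_mon_quorum(), which shells out to `ceph quorum_status` via raw_cluster_cmd(). A minimal sketch of that loop, based only on what the traceback shows (the timeout/interval parameters and error handling here are illustrative assumptions, not the actual implementation):

import json
import time

def wait_for_mon_quorum_size(manager, size, timeout=300, interval=3):
    """Poll `ceph quorum_status` until the quorum has `size` members."""
    start = time.time()
    while True:
        # raw_cluster_cmd() runs `timeout 120 ceph ... quorum_status` on the
        # remote node; in this failure that remote command exits 124 (timed
        # out) before returning any JSON, which raises CommandFailedError.
        out = manager.raw_cluster_cmd('quorum_status')
        if len(json.loads(out)['quorum']) == size:
            return
        if time.time() - start > timeout:
            raise RuntimeError('quorum of size %d not reached' % size)
        time.sleep(interval)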


Related issues 2 (1 open, 1 closed)

Related to CephFS - Bug #43902: qa: mon_thrash: timeout "ceph quorum_status" (Triaged)

Related to RADOS - Bug #43887: ceph_test_rados_delete_pools_parallel failure (Resolved, assigned to Nitzan Mordechai)
Actions #1

Updated by Patrick Donnelly almost 4 years ago

  • Related to Bug #43902: qa: mon_thrash: timeout "ceph quorum_status" added
Actions #2

Updated by Joao Eduardo Luis over 3 years ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Joao Eduardo Luis
Actions #3

Updated by Joao Eduardo Luis over 3 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by Deepika Upadhyay over 3 years ago

/a/yuriw-2020-08-26_18:16:40-rados-wip-yuri-testing-2020-08-26-1631-octopus-distro-basic-smithi/5378436

2020-08-27T01:12:07.707 INFO:tasks.mon_recovery.ceph_manager:quorum is size 2
2020-08-27T01:12:07.707 INFO:tasks.mon_recovery:causing some monitor log activity
2020-08-27T01:12:07.707 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '1 of 30'
2020-08-27T01:12:08.485 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '2 of 30'
2020-08-27T01:12:09.489 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '3 of 30'
2020-08-27T01:12:10.495 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '4 of 30'
2020-08-27T01:12:11.495 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '5 of 30'
2020-08-27T01:12:12.499 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '6 of 30'
2020-08-27T01:12:13.214 INFO:teuthology.orchestra.run.smithi125.stderr:2020-08-27T01:12:13.213+0000 7f637b7fe700  0 --2- 172.21.15.125:0/1708960995 >> v2:172.21.15.43:3300/0 conn(0x7f637c10b260 0x7f637c10b640 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rev1=1 rx=0 tx=0).send_auth_request get_initial_auth_request returned -2
2020-08-27T01:12:13.499 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '7 of 30'
2020-08-27T01:12:14.505 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '8 of 30'
2020-08-27T01:12:15.507 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '9 of 30'
2020-08-27T01:12:16.508 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '10 of 30'
2020-08-27T01:12:17.511 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '11 of 30'
2020-08-27T01:12:21.664 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '12 of 30'
2020-08-27T01:12:22.668 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '13 of 30'
2020-08-27T01:12:23.667 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '14 of 30'
2020-08-27T01:12:24.671 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '15 of 30'
2020-08-27T01:12:25.672 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '16 of 30'
2020-08-27T01:12:26.674 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '17 of 30'
2020-08-27T01:12:27.677 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '18 of 30'
2020-08-27T01:12:28.679 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '19 of 30'
2020-08-27T01:12:29.681 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '20 of 30'
2020-08-27T01:12:30.684 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '21 of 30'
2020-08-27T01:12:31.687 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '22 of 30'
2020-08-27T01:12:32.689 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '23 of 30'
2020-08-27T01:12:33.691 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '24 of 30'
2020-08-27T01:12:34.693 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '25 of 30'
2020-08-27T01:12:35.696 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '26 of 30'
2020-08-27T01:12:36.703 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '27 of 30'
2020-08-27T01:12:37.704 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '28 of 30'
2020-08-27T01:12:38.711 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log '29 of 30'
2020-08-27T01:12:39.707 INFO:tasks.mon_recovery:adding mon c back in
2020-08-27T01:12:39.708 INFO:tasks.ceph.mon.c:Restarting daemon
2020-08-27T01:12:39.708 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i c
2020-08-27T01:12:39.711 INFO:tasks.ceph.mon.c:Started
2020-08-27T01:12:39.712 INFO:tasks.mon_recovery.ceph_manager:waiting for quorum size 3
2020-08-27T01:12:39.712 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early quorum_status
2020-08-27T01:12:40.033 INFO:teuthology.orchestra.run.smithi125.stdout:{"election_epoch":66,"quorum":[0,1],"quorum_names":["b","a"],"quorum_leader_name":"b","quorum_age":32,"features":{"quorum_con":"4540138292836696063","quorum_mon":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus"]},"monmap":{"epoch":1,"fsid":"af484dec-9dda-4a14-8664-c564084077c7","modified":"2020-08-27T01:09:05.934223Z","created":"2020-08-27T01:09:05.934223Z","min_mon_release":15,"min_mon_release_name":"octopus","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus"],"optional":[]},"mons":[{"rank":0,"name":"b","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.43:3300","nonce":0},{"type":"v1","addr":"172.21.15.43:6789","nonce":0}]},"addr":"172.21.15.43:6789/0","public_addr":"172.21.15.43:6789/0","priority":0,"weight":0},{"rank":1,"name":"a","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.125:3300","nonce":0},{"type":"v1","addr":"172.21.15.125:6789","nonce":0}]},"addr":"172.21.15.125:6789/0","public_addr":"172.21.15.125:6789/0","priority":0,"weight":0},{"rank":2,"name":"c","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.125:3301","nonce":0},{"type":"v1","addr":"172.21.15.125:6790","nonce":0}]},"addr":"172.21.15.125:6790/0","public_addr":"172.21.15.125:6790/0","priority":0,"weight":0}]}}
2020-08-27T01:12:40.045 INFO:tasks.mon_recovery.ceph_manager:quorum_status is {"election_epoch":66,"quorum":[0,1],"quorum_names":["b","a"],"quorum_leader_name":"b","quorum_age":32,"features":{"quorum_con":"4540138292836696063","quorum_mon":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus"]},"monmap":{"epoch":1,"fsid":"af484dec-9dda-4a14-8664-c564084077c7","modified":"2020-08-27T01:09:05.934223Z","created":"2020-08-27T01:09:05.934223Z","min_mon_release":15,"min_mon_release_name":"octopus","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus"],"optional":[]},"mons":[{"rank":0,"name":"b","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.43:3300","nonce":0},{"type":"v1","addr":"172.21.15.43:6789","nonce":0}]},"addr":"172.21.15.43:6789/0","public_addr":"172.21.15.43:6789/0","priority":0,"weight":0},{"rank":1,"name":"a","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.125:3300","nonce":0},{"type":"v1","addr":"172.21.15.125:6789","nonce":0}]},"addr":"172.21.15.125:6789/0","public_addr":"172.21.15.125:6789/0","priority":0,"weight":0},{"rank":2,"name":"c","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.125:3301","nonce":0},{"type":"v1","addr":"172.21.15.125:6790","nonce":0}]},"addr":"172.21.15.125:6790/0","public_addr":"172.21.15.125:6790/0","priority":0,"weight":0}]}}

2020-08-27T01:12:43.053 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early quorum_status
2020-08-27T01:12:48.101 INFO:teuthology.orchestra.run.smithi125.stdout:{"election_epoch":70,"quorum":[0,1],"quorum_names":["b","a"],"quorum_leader_name":"b","quorum_age":0,"features":{"quorum_con":"4540138292836696063","quorum_mon":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus"]},"monmap":{"epoch":1,"fsid":"af484dec-9dda-4a14-8664-c564084077c7","modified":"2020-08-27T01:09:05.934223Z","created":"2020-08-27T01:09:05.934223Z","min_mon_release":15,"min_mon_release_name":"octopus","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus"],"optional":[]},"mons":[{"rank":0,"name":"b","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.43:3300","nonce":0},{"type":"v1","addr":"172.21.15.43:6789","nonce":0}]},"addr":"172.21.15.43:6789/0","public_addr":"172.21.15.43:6789/0","priority":0,"weight":0},{"rank":1,"name":"a","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.125:3300","nonce":0},{"type":"v1","addr":"172.21.15.125:6789","nonce":0}]},"addr":"172.21.15.125:6789/0","public_addr":"172.21.15.125:6789/0","priority":0,"weight":0},{"rank":2,"name":"c","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.125:3301","nonce":0},{"type":"v1","addr":"172.21.15.125:6790","nonce":0}]},"addr":"172.21.15.125:6790/0","public_addr":"172.21.15.125:6790/0","priority":0,"weight":0}]}}
2020-08-27T01:12:48.112 INFO:tasks.mon_recovery.ceph_manager:quorum_status is {"election_epoch":70,"quorum":[0,1],"quorum_names":["b","a"],"quorum_leader_name":"b","quorum_age":0,"features":{"quorum_con":"4540138292836696063","quorum_mon":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus"]},"monmap":{"epoch":1,"fsid":"af484dec-9dda-4a14-8664-c564084077c7","modified":"2020-08-27T01:09:05.934223Z","created":"2020-08-27T01:09:05.934223Z","min_mon_release":15,"min_mon_release_name":"octopus","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus"],"optional":[]},"mons":[{"rank":0,"name":"b","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.43:3300","nonce":0},{"type":"v1","addr":"172.21.15.43:6789","nonce":0}]},"addr":"172.21.15.43:6789/0","public_addr":"172.21.15.43:6789/0","priority":0,"weight":0},{"rank":1,"name":"a","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.125:3300","nonce":0},{"type":"v1","addr":"172.21.15.125:6789","nonce":0}]},"addr":"172.21.15.125:6789/0","public_addr":"172.21.15.125:6789/0","priority":0,"weight":0},{"rank":2,"name":"c","public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.125:3301","nonce":0},{"type":"v1","addr":"172.21.15.125:6790","nonce":0}]},"addr":"172.21.15.125:6790/0","public_addr":"172.21.15.125:6790/0","priority":0,"weight":0}]}}

2020-08-27T01:12:51.114 INFO:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early quorum_status
2020-08-27T01:14:51.158 DEBUG:teuthology.orchestra.run:got remote process result: 124
2020-08-27T01:14:51.161 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 90, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 69, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2020-08-26-1631-octopus/qa/tasks/mon_recovery.py", line 80, in task
    manager.wait_for_mon_quorum_size(len(mons))
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2020-08-26-1631-octopus/qa/tasks/ceph_manager.py", line 2896, in wait_for_mon_quorum_size
    while not len(self.get_mon_quorum()) == size:
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2020-08-26-1631-octopus/qa/tasks/ceph_manager.py", line 2885, in get_mon_quorum
    out = self.raw_cluster_cmd('quorum_status')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2020-08-26-1631-octopus/qa/tasks/ceph_manager.py", line 1363, in raw_cluster_cmd
    stdout=BytesIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 204, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 446, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 160, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 182, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi125 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early quorum_status'
2020-08-27T01:14:51.209 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/sepia/teuthology/?q=d715fff7af674694aeb0e6b26a0898f3

Actions #5

Updated by Neha Ojha over 3 years ago

Joao, are you working on a fix for this?

Actions #6

Updated by Neha Ojha over 3 years ago

rados/monthrash/{ceph clusters/3-mons mon_election/connectivity msgr-failures/few msgr/async-v1only objectstore/bluestore-stupid rados supported-random-distro$/{ubuntu_latest} thrashers/one workloads/rados_mon_workunits}

2020-10-21T03:43:37.094 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 2
2020-10-21T03:43:37.094 INFO:teuthology.orchestra.run.smithi114:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status
2020-10-21T03:43:41.583 DEBUG:teuthology.orchestra.run:got remote process result: 124

/a/yuriw-2020-10-20_21:33:24-rados-wip-yuri8-testing-master-distro-basic-smithi/5543217

Actions #7

Updated by Kefu Chai over 3 years ago

/a/kchai-2020-11-08_14:53:34-rados-wip-kefu-testing-2020-11-07-2116-distro-basic-smithi/5602229/

Actions #8

Updated by Neha Ojha over 3 years ago

/a/teuthology-2020-12-06_07:01:02-rados-master-distro-basic-smithi/5685125

Actions #9

Updated by Neha Ojha over 3 years ago

We are still seeing these.

/a/teuthology-2021-01-18_07:01:01-rados-master-distro-basic-smithi/5798278

Actions #10

Updated by Sage Weil about 3 years ago

  • Assignee deleted (Joao Eduardo Luis)
Actions #11

Updated by Sage Weil about 3 years ago

  • Status changed from Triaged to In Progress
  • Assignee set to Sage Weil

Neha Ojha wrote:

We are still seeing these.

/a/teuthology-2021-01-18_07:01:01-rados-master-distro-basic-smithi/5798278

The quorum_status command that timed out never reached the mon. Either the client can't connect, or the mon is rejecting the connection for some reason. The mon quorum formed fine.

Trying to reproduce with client-side logging.
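For reference, client-side logging for this command can be captured by passing debug options through to the ceph CLI invocation. A hedged sketch, assuming raw_cluster_cmd() forwards extra arguments to `ceph` (the chosen debug levels and log path are illustrative, not necessarily what the reproducer uses):

# Illustrative only: add client-side debug output to the timing-out command.
out = manager.raw_cluster_cmd(
    '--debug-ms', '1',       # messenger-level tracing (connect/auth traffic)
    '--debug-monc', '20',    # MonClient hunting/session state
    '--log-file', '/var/log/ceph/client.quorum_status.log',  # hypothetical path
    'quorum_status',
)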

Actions #12

Updated by Sage Weil about 3 years ago

  • Status changed from In Progress to Need More Info

Having trouble reproducing (after about 150 jobs). Adding increased debugging to master with https://github.com/ceph/ceph/pull/39696

Actions #13

Updated by Sage Weil about 3 years ago

Same symptom... the CLI command fails to contact the mon.

/a/sage-2021-02-28_18:35:15-rados-wip-sage-testing-2021-02-28-1217-distro-basic-smithi/5921243

2021-02-28T19:40:42.545 DEBUG:teuthology.orchestra.run.smithi014:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd last-stat-seq osd.2
2021-02-28T19:42:42.575 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-02-28T19:42:42.577 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):

The command message never reaches the mon; there is no client-side debug log.
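If this reproduces again, one complementary check is the mon-side view: dumping the monitor's sessions over its admin socket shows whether the stuck client ever opened a session. A hedged sketch using the test harness (the helper signature here is an assumption; the shell equivalent is `ceph daemon mon.a sessions`):

# Illustrative only: list client sessions known to mon.a via its admin socket.
# A CLI whose command message never reached the mon will not appear here.
sessions = manager.admin_socket('mon', 'a', ['sessions'])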
Actions #14

Updated by Neha Ojha over 2 years ago

  • Priority changed from Urgent to Normal

Haven't seen this in recent rados runs.

Actions #15

Updated by Laura Flores 5 months ago

/a/yuriw-2023-11-27_22:36:50-rados-wip-yuri-testing-2023-11-27-1028-pacific-distro-default-smithi/7469028

Actions #16

Updated by Nitzan Mordechai 5 months ago

Laura Flores wrote:

/a/yuriw-2023-11-27_22:36:50-rados-wip-yuri-testing-2023-11-27-1028-pacific-distro-default-smithi/7469028

Laura, it looks like this is related to tracker https://tracker.ceph.com/issues/43887.
The test started at 06:14:55 and was killed at 18:16:01 (probably a 12-hour timeout).
ceph_test_rados_delete_pools_parallel didn't complete; I didn't backport it to pacific.

Actions #17

Updated by Radoslaw Zarzynski 5 months ago

  • Related to Bug #43887: ceph_test_rados_delete_pools_parallel failure added
Actions #18

Updated by Laura Flores 2 months ago

/a/yuriw-2024-02-01_21:25:50-rados-wip-yuri2-testing-2024-02-01-0939-pacific-distro-default-smithi/7542501
