Bug #45761: mon_thrasher: "Error ENXIO: mon unavailable" during sync_force command leads to "failed to reach quorum size 9 before timeout expired" - RADOS - Ceph

Actions

Copy link

Bug #45761

open

mon_thrasher: "Error ENXIO: mon unavailable" during sync_force command leads to "failed to reach quorum size 9 before timeout expired"

Added by Brad Hubbard almost 4 years ago. Updated almost 3 years ago.

Status:

Need More Info

Priority:

Normal

Assignee:

Deepika Upadhyay

Category:

Target version:

% Done:

Source:

Q/A

Tags:

Backport:

octopus

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v16.0.0

ceph-qa-suite:

Component(RADOS):

Pull request ID:

37904

Crash signature (v1):

Crash signature (v2):

Description

/a/yuriw-2020-05-28_02:23:45-rados-wip-yuri-master_5.27.20-distro-basic-smithi/5097794

2020-05-28T06:13:04.288 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.g
2020-05-28T06:13:04.289 INFO:tasks.mon_thrash.mon_thrasher:killing mon.g
2020-05-28T06:13:04.289 DEBUG:tasks.ceph.mon.g:waiting for process to exit
2020-05-28T06:13:04.289 INFO:teuthology.orchestra.run:waiting for 300
2020-05-28T06:13:04.349 INFO:tasks.ceph.mon.g:Stopped
2020-05-28T06:13:04.349 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.f
2020-05-28T06:13:04.350 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.f store
2020-05-28T06:13:04.350 INFO:teuthology.orchestra.run.smithi078:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early tell mon.f sync_force --yes-i-really-mean-it
2020-05-28T06:13:20.481 INFO:teuthology.orchestra.run.smithi078.stderr:Error ENXIO: mon unavailable
2020-05-28T06:13:20.485 DEBUG:teuthology.orchestra.run:got remote process result: 6
2020-05-28T06:13:20.486 ERROR:tasks.mon_thrash.mon_thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.27.20/qa/tasks/mon_thrash.py", line 232, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.27.20/qa/tasks/mon_thrash.py", line 285, in _do_thrash
    self.thrash_store(mon)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.27.20/qa/tasks/mon_thrash.py", line 175, in thrash_store
    '--yes-i-really-mean-it')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.27.20/qa/tasks/ceph_manager.py", line 1363, in raw_cluster_cmd
    stdout=BytesIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 206, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 473, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi078 with status 6: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early tell mon.f sync_force --yes-i-really-mean-it'
2020-05-28T06:13:24.555 INFO:tasks.workunit.client.0.smithi139.stderr:noup is unset
2020-05-28T06:13:24.568 INFO:tasks.workunit.client.0.smithi139.stderr:+ (( ++i ))
2020-05-28T06:13:24.568 INFO:tasks.workunit.client.0.smithi139.stderr:+ (( i < num ))
2020-05-28T06:13:24.568 INFO:tasks.workunit.client.0.smithi139.stderr:+ ceph osd set noup
2020-05-28T06:13:25.825 INFO:tasks.daemonwatchdog.daemon_watchdog:MonitorThrasher failed
2020-05-28T06:13:25.825 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
2020-05-28T06:13:25.825 ERROR:tasks.daemonwatchdog.daemon_watchdog:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.27.20/qa/tasks/daemonwatchdog.py", line 37, in _run
    self.watch()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.27.20/qa/tasks/daemonwatchdog.py", line 113, in watch
    self.bark()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.27.20/qa/tasks/daemonwatchdog.py", line 53, in bark
    for mount in self.ctx.mounts.values():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/config.py", line 254, in __getattr__
    raise AttributeError(name)
AttributeError: mounts

Several of the MONs are never restarted so we end with.

2020-05-28T06:47:14.331 INFO:teuthology.run:Summary data:
description: rados/monthrash/{ceph clusters/9-mons msgr-failures/mon-delay msgr/async-v1only
  objectstore/bluestore-low-osd-mem-target rados supported-random-distro$/{ubuntu_latest}
  thrashers/force-sync-many workloads/rados_mon_osdmap_prune}
duration: 2537.953857898712
failure_reason: failed to reach quorum size 9 before timeout expired

/a/yuriw-2020-05-22_19:55:53-rados-wip-yuri-master_5.22.20-distro-basic-smithi/5083290

This looks like the same issue but with a different error.

2020-05-22T23:38:28.429 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['d', 'a', 'h']
2020-05-22T23:38:28.429 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
2020-05-22T23:38:28.430 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.d
2020-05-22T23:38:28.430 INFO:tasks.mon_thrash.mon_thrasher:killing mon.d
2020-05-22T23:38:28.430 DEBUG:tasks.ceph.mon.d:waiting for process to exit
2020-05-22T23:38:28.430 INFO:teuthology.orchestra.run:waiting for 300
2020-05-22T23:38:28.440 INFO:tasks.ceph.mon.d:Stopped
2020-05-22T23:38:28.441 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.a
2020-05-22T23:38:28.441 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.a store
2020-05-22T23:38:28.441 INFO:teuthology.orchestra.run.smithi204:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early tell mon.a sync_force --yes-i-really-mean-it
2020-05-22T23:38:45.598 INFO:teuthology.orchestra.run.smithi204.stderr:Error ENXIO: problem getting command descriptions from mon.a
2020-05-22T23:38:45.610 DEBUG:teuthology.orchestra.run:got remote process result: 6
2020-05-22T23:38:45.610 ERROR:tasks.mon_thrash.mon_thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.22.20/qa/tasks/mon_thrash.py", line 232, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.22.20/qa/tasks/mon_thrash.py", line 285, in _do_thrash
    self.thrash_store(mon)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.22.20/qa/tasks/mon_thrash.py", line 175, in thrash_store
    '--yes-i-really-mean-it')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-master_5.22.20/qa/tasks/ceph_manager.py", line 1363, in raw_cluster_cmd
    stdout=BytesIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 206, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 473, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi204 with status 6: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early tell mon.a sync_force --yes-i-really-mean-it'

Actions

Copy link

Updated by Brad Hubbard almost 4 years ago

Description updated (diff)

Actions

Copy link

Updated by Brad Hubbard almost 4 years ago

/a/yuriw-2020-05-30_02:18:17-rados-wip-yuri-master_5.29.20-distro-basic-smithi/5104057

Actions

Copy link

Updated by Brad Hubbard almost 4 years ago

Project changed from Orchestrator to RADOS

Moving this since it appears to be a problem with the mon_thrasher (or the MONs or monclients).

2020-05-28T06:13:04.349 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.f
2020-05-28T06:13:04.350 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.f store
2020-05-28T06:13:04.350 INFO:teuthology.orchestra.run.smithi078:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early tell mon.f sync_force --yes-i-really-mean-it
2020-05-28T06:13:20.481 INFO:teuthology.orchestra.run.smithi078.stderr:Error ENXIO: mon unavailable
2020-05-28T06:13:20.485 DEBUG:teuthology.orchestra.run:got remote process result: 6

In each case so far analysed when the ENXIO is seen in teuthology.log we see the following disconnection in the log of the monitor involved at the same time.

2020-05-28T06:13:18.723+0000 7fb04fba4700 20 Putting signature in client message(seq # 96): sig = 16967977594287793094
2020-05-28T06:13:19.363+0000 7fb04dba0700 11 mon.f@1(electing) e1 tick
2020-05-28T06:13:19.363+0000 7fb04dba0700 20 mon.f@1(electing) e1 sync_trim_providers
2020-05-28T06:13:19.731+0000 7fb04fba4700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6792/0 conn(0x557846841800 legacy=0x557847568000 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6792/0
2020-05-28T06:13:19.731+0000 7fb04b39b700 10 mon.f@1(electing) e1 ms_handle_refused 0x557846841800 v1:172.21.15.139:6792/0
2020-05-28T06:13:20.479+0000 7fb04fba4700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.78:0/1301653674 conn(0x557847dc7c00 legacy=0x5578481e6800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 38
2020-05-28T06:13:20.479+0000 7fb04fba4700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.78:0/1301653674 conn(0x557847dc7c00 legacy=0x5578481e6800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-05-28T06:13:20.479+0000 7fb04fba4700  1 --1- v1:172.21.15.139:6789/0 >> v1:172.21.15.78:0/1301653674 conn(0x557847dc7c00 0x5578481e6800 :6789 s=OPENED pgs=12 cs=1 l=1).handle_message read tag failed
2020-05-28T06:13:20.479+0000 7fb04fba4700  1 --1- v1:172.21.15.139:6789/0 >> v1:172.21.15.78:0/1301653674 conn(0x557847dc7c00 0x5578481e6800 :6789 s=OPENED pgs=12 cs=1 l=1).fault on lossy channel, failing
2020-05-28T06:13:20.479+0000 7fb04b39b700 10 mon.f@1(electing) e1 ms_handle_reset 0x557847dc7c00 v1:172.21.15.78:0/1301653674
2020-05-28T06:13:20.479+0000 7fb04b39b700 10 mon.f@1(electing) e1 reset/close on session client.4972 v1:172.21.15.78:0/1301653674
2020-05-28T06:13:20.479+0000 7fb04b39b700 10 mon.f@1(electing) e1 remove_session 0x557847612fc0 client.4972 v1:172.21.15.78:0/1301653674 features 0x3f01cfb8ffbdffff

I'm wondering if maybe the issue could be in src/pybind/ceph_argparse.py: send_command_retry ? Maybe we need to accommodate the 'mon unavailable' message in the list of non-fatal errors? I'll see if I can reproduce this and test that theory.

Actions

Copy link

Updated by Neha Ojha almost 4 years ago

Backport set to octopus

/a/yuriw-2020-06-02_15:07:59-rados-wip-yuri7-testing-2020-06-01-2256-octopus-distro-basic-smithi/5113082 - octopus

Actions

Copy link

Updated by Neha Ojha almost 4 years ago

/a/dis-2020-06-28_18:43:20-rados-wip-msgr21-fix-reuse-rebuildci-distro-basic-smithi/5186759

Actions

Copy link

Updated by Neha Ojha almost 4 years ago

Priority changed from Normal to High

/a/yuriw-2020-07-06_19:37:47-rados-wip-yuri7-testing-2020-07-06-1754-octopus-distro-basic-smithi/5204398

Actions

Copy link

Updated by Brad Hubbard over 3 years ago

/a/yuriw-2020-07-13_23:00:15-rados-wip-yuri8-testing-2020-07-13-1946-octopus-distro-basic-smithi/5224050

Actions

Copy link

Updated by Brad Hubbard over 3 years ago

/a/yuriw-2020-08-06_00:31:28-rados-wip-yuri8-testing-octopus-distro-basic-smithi/5291043

Actions

Copy link

Updated by Deepika Upadhyay over 3 years ago

Assignee set to Deepika Upadhyay

Actions

Copy link

#10

Updated by Deepika Upadhyay over 3 years ago

http://qa-proxy.ceph.com/teuthology/yuriw-2020-09-16_23:57:37-rados-wip-yuri8-testing-2020-09-16-2220-octopus-distro-basic-smithi/5441427/teuthology.log

Error ENXIO: problem getting command descriptions from mon.f

http://qa-proxy.ceph.com/teuthology/yuriw-2020-10-01_17:46:11-rados-wip-yuri5-testing-2020-10-01-0834-octopus-distro-basic-smithi/5486785/teuthology.log

Actions

Copy link

#11

Updated by Deepika Upadhyay over 3 years ago

seems like leader has been down since long:

detail": [
                {
                    "message": "mon.f (rank 0) addr v1:172.21.15.78:6789/0 is down (out of quoru
                },
                {
                    "message": "mon.d (rank 7) addr v1:172.21.15.192:6792/0 is down (out of quor
                },
                {
                    "message": "mon.e (rank 8) addr v1:172.21.15.192:6793/0 is down (out of quor

right before it osdmap is found inactive:

paxosservice(osdmap 1..16) _active - not active

at root of which I also found the same connection failure Brad saw

Actions

Copy link

#12

Updated by Deepika Upadhyay over 3 years ago

http://qa-proxy.ceph.com/teuthology/yuriw-2020-10-20_19:54:27-rados-wip-yuri-testing-2020-10-20-0934-octopus-distro-basic-smithi/5542878/teuthology.log

Actions

Copy link

#13

Updated by Deepika Upadhyay over 3 years ago

to reproduce, we just need to change, `s/mon client directed command retry: 5/mon client directed command retry: 2 running interactive on error

giving enough time for mon to recover by using mon client directed command retry: 7 resolves this issue, although we might still need to investigate why mon is needing more time to recover back

Actions

Copy link

#14

Updated by Deepika Upadhyay over 3 years ago

observed failing test case in interactive on error mode, with this config for yaml:

base config: /a/yuriw-2020-10-01_17:46:11-rados-wip-yuri5-testing-2020-10-01-0834-octopus-distro-basic-smithi/5486785/orig.config.yaml

modified to command retry reduced to 2:

 /home/ideepika/tracker-45761.yaml 
failure log: /home/ideepika/fail_45761.log

amended it for passing configuration with command retry increased to 7:

yaml config: https://gist.github.com/ideepika/fa2820b7250c714d25b0c8cb206c9e13
success log on teuthology: /home/ideepika/success_tracker-45761.log

Actions

Copy link

#15

Updated by Kefu Chai over 3 years ago

Status changed from New to Fix Under Review
Pull request ID set to 37904

Actions

Copy link

#16

Updated by Deepika Upadhyay over 3 years ago

for reproducing the issue: failed run: https://pulpito.ceph.com/ideepika-2020-12-10_12:48:41-rados-wip-yuri4-testing-2020-12-07-1147-octopus-distro-basic-smithi/

Actions

Copy link

#17

Updated by Neha Ojha over 3 years ago

Deepika Upadhyay wrote:

for reproducing the issue: failed run: https://pulpito.ceph.com/ideepika-2020-12-10_12:48:41-rados-wip-yuri4-testing-2020-12-07-1147-octopus-distro-basic-smithi/

Deepika, is this reproducible on master as well?

Actions

Copy link

#18

Updated by Deepika Upadhyay over 3 years ago

Neha Ojha wrote:

Deepika Upadhyay wrote:

for reproducing the issue: failed run: https://pulpito.ceph.com/ideepika-2020-12-10_12:48:41-rados-wip-yuri4-testing-2020-12-07-1147-octopus-distro-basic-smithi/

Deepika, is this reproducible on master as well?

not checked yet, will update soon with quick test

Actions

Copy link

#19

Updated by Deepika Upadhyay over 3 years ago

master with 5 default: https://pulpito.ceph.com/ideepika-2020-12-16_07:01:58-rados:monthrash-master-distro-basic-smithi/

@Neha ., yes, master also seems to have this issue

Actions

Copy link

#20

Updated by Neha Ojha over 3 years ago

Deepika Upadhyay wrote:

master with 5 default: https://pulpito.ceph.com/ideepika-2020-12-16_07:01:58-rados:monthrash-master-distro-basic-smithi/

@Neha ., yes, master also seems to have this issue

Ok good, the next step is to look at the mon logs and see what's preventing the mons from forming quorum.

Actions

Copy link

#21

Updated by Deepika Upadhyay about 3 years ago

-/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-1020-nautilus-distro-basic-gibba/5885183/teuthology.log

2021-02-16 02:43:17.470 7f2cbf86e700  1 -- v1:172.21.2.133:6800/33014 >> v1:172.21.2.123:0/3420527435 conn(0x5569b364f000 legacy=0x5569b3701000 unknown :6800 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 43
2021-02-16 02:43:17.470 7f2cbf86e700  1 -- v1:172.21.2.133:6800/33014 >> v1:172.21.2.123:0/3420527435 conn(0x5569b364f000 legacy=0x5569b3701000 unknown :6800 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-02-16 02:43:17.470 7f2cbf86e700  1 --1- v1:172.21.2.133:6800/33014 >> v1:172.21.2.123:0/3420527435 conn(0x5569b364f000 0x5569b3701000 :6800 s=OPENED pgs=4 cs=1 l=1).handle_message read tag failed
2021-02-16 02:43:17.470 7f2cbf86e700  1 --1- v1:172.21.2.133:6800/33014 >> v1:172.21.2.123:0/3420527435 conn(0x5569b364f000 0x5569b3701000 :6800 s=OPENED pgs=4 cs=1 l=1).fault on lossy channel, failing
2021-02-16 02:43:17.470 7f2cbf86e700  1 -- v1:172.21.2.133:6800/33014 reap_dead start

/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-1020-nautilus-distro-basic-gibba/5885183/remote/gibba033/log/ceph-mgr.x.log.gz


2021-02-16 02:43:17.465 7f3708aa3700 10 mon.a@0(leader).paxosservice(logm 1..19)  setting proposal_timer 0x55b300983500 with delay of 0.870552
2021-02-16 02:43:17.493 7f3704a9b700  1 -- v1:172.21.2.123:0/33083 mark_down 0x55b300b16d00 -- pipe dne
2021-02-16 02:43:17.493 7f3708aa3700 10 mon.a@0(leader) e1 ms_handle_reset 0x55b300a189c0 v1:172.21.2.133:0/2161352487
2021-02-16 02:43:17.493 7f3708aa3700 10 mon.a@0(leader) e1 reset/close on session client.? v1:172.21.2.133:0/2161352487
2021-02-16 02:43:17.493 7f3708aa3700 10 mon.a@0(leader) e1 remove_session 0x55b300a81500 client.? v1:172.21.2.133:0/2161352487 features 0x3ffddff8ffecffff
2021-02-16 02:43:17.493 7f3704a9b700  1 -- v1:172.21.2.123:0/33083 --> v1:172.21.2.133:6800/33014 -- mgropen(mon.a) v3 -- ?+0 0x55b30088c2c0 con 0x55b30087b860
2021-02-16 02:43:17.493 7f372000b700  0 -- v1:172.21.2.123:0/33083 >> v1:172.21.2.133:6800/33014 pipe(0x55b3003d2000 sd=34 :0 s=1 pgs=0 cs=0 l=0 c=0x55b30087b860).fault
2021-02-16 02:43:17.493 7f3704a9b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b30087b860 v1:172.21.2.133:6800/33014
2021-02-16 02:43:17.493 7f3704a9b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b30087b860 v1:172.21.2.133:6800/33014

/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-
1020-nautilus-distro-basic-gibba/5885183/remote/gibba023/log/ceph-mon.a.log.gz-

EDIT: should be generic connection failures, not related to this tracker and hence not nautilus, marking irrelevant until seen again

Actions

Copy link

#22

Updated by Brad Hubbard about 3 years ago

Deepika Upadhyay wrote:

/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-1020-nautilus-distro-basic-gibba/5885183/teuthology.log

[...]
/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-1020-nautilus-distro-basic-gibba/5885183/remote/gibba033/log/ceph-mgr.x.log.gz

[...]

/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-
1020-nautilus-distro-basic-gibba/5885183/remote/gibba023/log/ceph-mon.a.log.gz

~~This looks like it could be the same sort of general networking issue seen in https://tracker.ceph.com/issues/49259~~

Actions

Copy link

#23

Updated by Neha Ojha about 3 years ago

Deepika Upadhyay wrote:

/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-1020-nautilus-distro-basic-gibba/5885183/teuthology.log

rados/multimon/{clusters/9 msgr-failures/few msgr/random no_pools objectstore/bluestore-bitmap rados supported-random-distro$/{centos_latest} tasks/mon_clock_no_skews}

This test has msgr failure injection, which can result in such messenger logging. You can verify if this was related to failure injection by looking for "injecting socket failure" in the logs.

[...]
/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-1020-nautilus-distro-basic-gibba/5885183/remote/gibba033/log/ceph-mgr.x.log.gz

[...]

/ceph/teuthology-archive/yuriw-2021-02-15_20:25:26-rados-wip-yuri3-testing-2021-02-15-
1020-nautilus-distro-basic-gibba/5885183/remote/gibba023/log/ceph-mon.a.log.gz

Actions

Copy link

#24

Updated by Brad Hubbard about 3 years ago

Argh! So it does, my bad. Please ignore comment 22 for now.

Actions

Copy link

#25

Updated by Sage Weil about 3 years ago

I hit another instance of this here: /a/sage-2021-02-28_18:35:15-rados-wip-sage-testing-2021-02-28-1217-distro-basic-smithi/5921279

Teuthology tried to tell mon.c something, but the command timed out because the quorum lease timed out.

mon.a was the leader, but it failed to get a response from mon.e (mon.8), because something went wrong with the messenger.

mon.e (filtering by connection pointer):

2021-02-28T19:52:15.957+0000 7f34daa1f700  1 -- v1:172.21.15.139:6793/0 --> v1:172.21.15.139:6789/0 -- paxos(lease_ack lc 838 fc 252 pn 0 opn 0) v4 -- 0x5615ce8f9400 con 0x5615cc927c00
2021-02-28T19:52:15.957+0000 7f34daa1f700  1 -- v1:172.21.15.139:6793/0 <== mon.0 v1:172.21.15.139:6789/0 117 ==== route(log(last 4) v1 tid 4) v3 ==== 175+0+0 (unknown 2237411492 0 0) 0x5615cc85da20 con 0x5615cc927c00
2021-02-28T19:52:15.957+0000 7f34daa1f700 20 is_capable service=mon command= read addr v1:172.21.15.139:6789/0 on cap allow *
2021-02-28T19:52:15.957+0000 7f34daa1f700 20 is_capable service=mon command= exec addr v1:172.21.15.139:6789/0 on cap allow *
2021-02-28T19:52:15.957+0000 7f34daa1f700  1 -- v1:172.21.15.139:6793/0 <== mon.0 v1:172.21.15.139:6789/0 118 ==== route(log(last 5) v1 tid 5) v3 ==== 175+0+0 (unknown 589811606 0 0) 0x5615cc85dce0 con 0x5615cc927c00
2021-02-28T19:52:15.957+0000 7f34daa1f700 20 is_capable service=mon command= read addr v1:172.21.15.139:6789/0 on cap allow *
2021-02-28T19:52:15.957+0000 7f34daa1f700 20 is_capable service=mon command= exec addr v1:172.21.15.139:6789/0 on cap allow *
2021-02-28T19:52:16.469+0000 7f34dd224700  1 -- v1:172.21.15.139:6793/0 send_to--> mon v1:172.21.15.139:6789/0 -- mon_ping(ping stamp 2021-02-28T19:52:16.470812+0000) v1 -- ?+0 0x5615ce2cfe00
2021-02-28T19:52:16.469+0000 7f34dd224700  1 -- v1:172.21.15.139:6793/0 --> v1:172.21.15.139:6789/0 -- mon_ping(ping stamp 2021-02-28T19:52:16.470812+0000) v1 -- 0x5615ce2cfe00 con 0x5615cc927c00
2021-02-28T19:52:16.470+0000 7f34daa1f700  1 -- v1:172.21.15.139:6793/0 <== mon.0 v1:172.21.15.139:6789/0 119 ==== mon_ping(ping stamp 2021-02-28T19:52:16.470812+0000) v1 ==== 1615+0+0 (unknown 4230031922 0 0) 0x5615ce2cf380 con 0x5615cc927c00
2021-02-28T19:52:16.470+0000 7f34daa1f700 20 is_capable service=mon command= read addr v1:172.21.15.139:6789/0 on cap allow *
2021-02-28T19:52:16.470+0000 7f34daa1f700  1 -- v1:172.21.15.139:6793/0 --> v1:172.21.15.139:6789/0 -- mon_ping(ping_reply stamp 2021-02-28T19:52:16.470812+0000) v1 -- 0x5615ce9c5980 con 0x5615cc927c00
2021-02-28T19:52:16.470+0000 7f34daa1f700  1 -- v1:172.21.15.139:6793/0 <== mon.0 v1:172.21.15.139:6789/0 120 ==== mon_ping(ping_reply stamp 2021-02-28T19:52:16.470812+0000) v1 ==== 1615+0+0 (unknown 23643458 0 0) 0x5615ce2cfc80 con 0x5615cc927c00
2021-02-28T19:52:16.470+0000 7f34daa1f700 20 is_capable service=mon command= read addr v1:172.21.15.139:6789/0 on cap allow *

mon.a:

2021-02-28T19:52:15.955+0000 7f5c80238700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- paxos(lease lc 838 fc 252 pn 0 opn 0) v4 -- 0x55b7fc329000 con 0x55b7f97e3c00
2021-02-28T19:52:15.955+0000 7f5c80238700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- route(log(last 4) v1 tid 4) v3 -- 0x55b7faf7cb00 con 0x55b7f97e3c00
2021-02-28T19:52:15.955+0000 7f5c80238700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- route(log(last 5) v1 tid 5) v3 -- 0x55b7faf7de40 con 0x55b7f97e3c00
2021-02-28T19:52:15.957+0000 7f5c81a3b700  1 -- v1:172.21.15.139:6789/0 <== mon.8 v1:172.21.15.139:6793/0 95 ==== paxos(lease_ack lc 838 fc 252 pn 0 opn 0) v4 ==== 118+0+0 (unknown 2888524482 0 0) 0x55b7fc329000 con 0x55b7f97e3c00
2021-02-28T19:52:16.469+0000 7f5c84240700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- mon_ping(ping stamp 2021-02-28T19:52:16.470812+0000) v1 -- 0x55b7fc0fa900 con 0x55b7f97e3c00
2021-02-28T19:52:16.470+0000 7f5c81a3b700  1 -- v1:172.21.15.139:6789/0 <== mon.8 v1:172.21.15.139:6793/0 96 ==== mon_ping(ping stamp 2021-02-28T19:52:16.470812+0000) v1 ==== 1615+0+0 (unknown 2242411163 0 0) 0x55b7fc142780 con 0x55b7f97e3c00
2021-02-28T19:52:16.470+0000 7f5c81a3b700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- mon_ping(ping_reply stamp 2021-02-28T19:52:16.470812+0000) v1 -- 0x55b7fc2bd200 con 0x55b7f97e3c00
2021-02-28T19:52:16.470+0000 7f5c81a3b700  1 -- v1:172.21.15.139:6789/0 <== mon.8 v1:172.21.15.139:6793/0 97 ==== mon_ping(ping_reply stamp 2021-02-28T19:52:16.470812+0000) v1 ==== 1615+0+0 (unknown 4235948061 0 0) 0x55b7fc0fa900 con 0x55b7f97e3c00
2021-02-28T19:52:16.871+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 42
2021-02-28T19:52:16.871+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2021-02-28T19:52:16.871+0000 7f5c80a39700  1 --1- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 0x55b7f8b20800 :-1 s=OPENED pgs=185 cs=5 l=0).handle_message read tag failed
2021-02-28T19:52:16.953+0000 7f5c84240700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- paxos(begin lc 838 fc 0 pn 1700 opn 0) v4 -- 0x55b7fc2c5c00 con 0x55b7f97e3c00
2021-02-28T19:52:16.953+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:16.953+0000 7f5c81a3b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
2021-02-28T19:52:17.154+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:17.154+0000 7f5c81a3b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
2021-02-28T19:52:17.470+0000 7f5c84240700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- mon_ping(ping stamp 2021-02-28T19:52:17.471005+0000) v1 -- 0x55b7fc22ea80 con 0x55b7f97e3c00
2021-02-28T19:52:17.555+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:17.555+0000 7f5c81a3b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
2021-02-28T19:52:18.356+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:18.356+0000 7f5c81a3b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
2021-02-28T19:52:19.957+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:19.957+0000 7f5c81a3b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
2021-02-28T19:52:23.158+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:23.158+0000 7f5c81a3b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
2021-02-28T19:52:25.958+0000 7f5c81a3b700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- election(88ac1a94-00fd-4ac3-8de1-135b9119f045 propose rel 17 e59) v9 -- 0x55b7fc385b00 con 0x55b7f97e3c00
2021-02-28T19:52:29.559+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:29.559+0000 7f5c81a3b700 10 mon.a@0(electing) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
2021-02-28T19:52:30.966+0000 7f5c81a3b700  1 -- v1:172.21.15.139:6789/0 --> v1:172.21.15.139:6793/0 -- election(88ac1a94-00fd-4ac3-8de1-135b9119f045 propose rel 17 e61) v9 -- 0x55b7f8a4efc0 con 0x55b7f97e3c00
2021-02-28T19:52:42.360+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:42.360+0000 7f5c81a3b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
2021-02-28T19:52:57.361+0000 7f5c80a39700  1 -- v1:172.21.15.139:6789/0 >> v1:172.21.15.139:6793/0 conn(0x55b7f97e3c00 legacy=0x55b7f8b20800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.139:6793/0
2021-02-28T19:52:57.361+0000 7f5c81a3b700 10 mon.a@0(leader) e1 ms_handle_refused 0x55b7f97e3c00 v1:172.21.15.139:6793/0
...

i think we need debug_ms = 20 to see where that 'reconnected failed to ...' is coming from.

Actions

Copy link

#26

Updated by Deepika Upadhyay almost 3 years ago

Status changed from Fix Under Review to Need More Info

Stopped Reproducing, please reopen if you hit another instance

Actions

Copy link

#27

Updated by Neha Ojha almost 3 years ago

Priority changed from High to Normal

reducing priority based on https://tracker.ceph.com/issues/45761#note-26

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » RADOS

Custom queries

Bug #45761

mon_thrasher: "Error ENXIO: mon unavailable" during sync_force command leads to "failed to reach quorum size 9 before timeout expired"

Updated by Brad Hubbard almost 4 years ago

Updated by Brad Hubbard almost 4 years ago

Updated by Brad Hubbard almost 4 years ago

Updated by Neha Ojha almost 4 years ago

Updated by Neha Ojha almost 4 years ago

Updated by Neha Ojha almost 4 years ago

Updated by Brad Hubbard over 3 years ago

Updated by Brad Hubbard over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Kefu Chai over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Neha Ojha over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Deepika Upadhyay over 3 years ago

Updated by Neha Ojha over 3 years ago

Updated by Deepika Upadhyay about 3 years ago

Updated by Brad Hubbard about 3 years ago

Updated by Neha Ojha about 3 years ago

Updated by Brad Hubbard about 3 years ago

Updated by Sage Weil about 3 years ago

Updated by Deepika Upadhyay almost 3 years ago

Updated by Neha Ojha almost 3 years ago