Bug #43885

failed to reach quorum size 9 before timeout expired

Added by Sage Weil about 4 years ago. Updated about 4 years ago.

Status:
Can't reproduce
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This pops up occasionally. Here is a recent one:

2020-01-29T01:28:50.875 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 5
2020-01-29T01:28:50.875 INFO:teuthology.orchestra.run.smithi097:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status
2020-01-29T01:29:05.877 INFO:tasks.ceph.mon.h.smithi072.stderr:2020-01-29T01:29:05.875+0000 7f7a098d0700 -1 mon.h@4(electing) e1 get_health_metrics reporting 2 slow ops, oldest is log(1 entries from seq 24 at 2020-01-29T01:28:35.447431+0000)
2020-01-29T01:29:06.037 INFO:tasks.ceph.osd.4.smithi072.stderr:2020-01-29T01:29:06.035+0000 7fbbc4a71700 -1 osd.4 167 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.44653.0:1 164.6s0 164.7fc1f406 (undecoded) ondisk+write+known_if_redirected e167)
...
2020-01-29T01:30:50.913 DEBUG:teuthology.orchestra.run:got remote process result: 124
2020-01-29T01:30:50.913 ERROR:tasks.mon_thrash.mon_thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-28-1413/qa/tasks/mon_thrash.py", line 232, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-28-1413/qa/tasks/mon_thrash.py", line 299, in _do_thrash
    self.manager.wait_for_mon_quorum_size(len(mons)-len(mons_to_kill))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-28-1413/qa/tasks/ceph_manager.py", line 2861, in wait_for_mon_quorum_size
    while not len(self.get_mon_quorum()) == size:
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-28-1413/qa/tasks/ceph_manager.py", line 2850, in get_mon_quorum
    out = self.raw_cluster_cmd('quorum_status')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-28-1413/qa/tasks/ceph_manager.py", line 1353, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 198, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 433, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 158, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 180, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi097 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status'

/a/sage-2020-01-28_23:42:06-rados-wip-sage-testing-2020-01-28-1413-distro-basic-smithi/4715297
description: rados/monthrash/{ceph.yaml clusters/9-mons.yaml msgr-failures/mon-delay.yaml
msgr/async-v2only.yaml objectstore/bluestore-avl.yaml rados.yaml supported-random-distro$/{rhel_8.yaml}
thrashers/one.yaml workloads/rados_api_tests.yaml}
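The timeout in the traceback comes from a polling loop: `wait_for_mon_quorum_size` repeatedly runs `ceph quorum_status` (itself wrapped in `timeout 120`) until the quorum reaches the expected size. A minimal sketch of that behaviour, with hypothetical names (`wait_for_quorum_size`, `get_quorum` are illustrative, not the actual teuthology API):

```python
import time

def wait_for_quorum_size(get_quorum, size, timeout=300, interval=3):
    """Poll get_quorum() until it reports `size` members or `timeout` expires.

    Hypothetical model of the wait in the traceback: each poll corresponds to
    one `ceph quorum_status` invocation, and a single hung invocation (the
    `timeout 120` wrapper returning status 124) also fails the wait.
    """
    deadline = time.monotonic() + timeout
    while len(get_quorum()) != size:
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'failed to reach quorum size %d before timeout expired' % size)
        time.sleep(interval)
    return get_quorum()
```

In the failed run the loop never saw 5 members because one poll's `quorum_status` command itself hung and was killed by the 120-second wrapper.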

History

#1 Updated by Neha Ojha about 4 years ago

I wonder if this is somehow related to the election issue we saw in https://tracker.ceph.com/issues/42977. Seems to be failing in monthrash tests too.

#2 Updated by Sage Weil about 4 years ago

/a/sage-2020-01-29_20:14:58-rados-wip-sage-testing-2020-01-29-1034-distro-basic-smithi/4718154

description: rados/monthrash/{ceph.yaml clusters/9-mons.yaml msgr-failures/few.yaml
msgr/async-v2only.yaml objectstore/filestore-xfs.yaml rados.yaml supported-random-distro$/{centos_8.yaml}
thrashers/force-sync-many.yaml workloads/snaps-few-objects.yaml}

#3 Updated by Neha Ojha about 4 years ago

  • Assignee set to Sridhar Seshasayee

#4 Updated by Brad Hubbard about 4 years ago

There does not appear to be a crash in this case, but there is an election that seems to take a long time, followed by a lot of slow requests on the mons and osds (check the logs around the 2020-01-29T01:29:06 timestamp), so I strongly suspect this is related to #42977, as Neha has already suggested. Sridhar is going to have a go at reproducing this and, if he can, will try to reproduce it with Greg's patch for #42977.

#5 Updated by Sridhar Seshasayee about 4 years ago

Update: I tried running the test a few times but haven't been able to reproduce it. I will continue my attempts; in the meantime, I am also looking at the existing logs and code for a better understanding.

#6 Updated by Sridhar Seshasayee about 4 years ago

Since I could not reproduce the issue, I analyzed logs from the original run:
/a/sage-2020-01-29_20:14:58-rados-wip-sage-testing-2020-01-29-1034-distro-basic-smithi/4718154

As already mentioned, the election process took a very long time (~10 mins),
and slow ops were reported after mons were brought down as part of the test.

Of the 5 remaining mons, one (mon.8, i.e. ceph-mon.e) was stuck in the "probing" state
for around 10 mins after the other 4 mons were taken down as part of the test. The 4
surviving mons other than ceph-mon.e were in the "electing" state during that time.
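The numbers line up with Paxos majority rules: with a 9-mon monmap, a quorum needs a strict majority, which is why the logs below say "need 5" even though only 5 mons are alive. A one-line sketch of that arithmetic:

```python
def quorum_needed(num_mons):
    # A Paxos quorum is a strict majority of the monmap,
    # regardless of how many mons are currently alive.
    return num_mons // 2 + 1

# With 9 mons in the map and 4 killed, all 5 survivors must participate.
print(quorum_needed(9))
```

So losing contact with even one of the 5 survivors (here, ceph-mon.e failing to reach ceph-mon.i) is enough to stall quorum formation entirely.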

ceph-mon.e entering probing state:

2020-01-29T01:28:45.319+0000 7f7026ba1700 1 mon.e@8(peon).paxos(paxos active c 1..475) lease_timeout -- calling new election
...
... {a=[v2:172.21.15.97:3300/0,v1:172.21.15.97:6789/0],b=[v2:172.21.15.97:3301/0,v1:172.21.15.97:6790/0],c=[v2:172.21.15.97:3302/0,v1:172.21.15.97:6791/0],d=[v2:172.21.15.97:3303/0,v1:172.21.15.97:6792/0],e=[v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0],f=[v2:172.21.15.72:3300/0,v1:172.21.15.72:6789/0],g=[v2:172.21.15.72:3301/0,v1:172.21.15.72:6790/0],h=[v2:172.21.15.72:3302/0,v1:172.21.15.72:6791/0],i=[v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0]}
2020-01-29T01:28:45.319+0000 7f7026ba1700 10 mon.e@8(probing) e1 _reset

ceph-mon.e exiting probing state:

2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 monmap is e1: 9 mons at {a=[v2:172.21.15.97:3300/0,v1:172.21.15.97:6789/0],b=[v2:172.21.15.97:3301/0,v1:172.21.15.97:6790/0],c=[v2:172.21.15.97:3302/0,v1:172.21.15.97:6791/0],d=[v2:172.21.15.97:3303/0,v1:172.21.15.97:6792/0],e=[v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0],f=[v2:172.21.15.72:3300/0,v1:172.21.15.72:6789/0],g=[v2:172.21.15.72:3301/0,v1:172.21.15.72:6790/0],h=[v2:172.21.15.72:3302/0,v1:172.21.15.72:6791/0],i=[v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0]}
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 peer name is i
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 mon.i is outside the quorum
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 outside_quorum now d,e,g,h,i, need 5
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 that's enough to form a new quorum, calling election
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 start_election

The analysis below focuses on the events between the two timestamps above, since during that window ceph-mon.e could not communicate with ceph-mon.i.
On ceph-mon.i, the following logs show the connection being flagged as stale and reset continuously during the probe.
This went on for quite a while (~7 mins), as shown below, and was a major factor in the slow election process. Therefore,
in my view, this should be the focus of any further analysis:

2020-01-29T01:30:17.918+0000 7f9400b21700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x565175249600 0x5651751d2100 secure :-1 s=SESSION_ACCEPTING pgs=239 cs=0 l=0 rx=0x5651732af0e0 tx=0x5651751c4080).handle_existing_connection this is a stale connection, peer_global_seq=239 existing->peer_global_seq=802, stopping this connection.
2020-01-29T01:30:17.918+0000 7f9400b21700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x565175249600 0x5651751d2100 secure :-1 s=SESSION_ACCEPTING pgs=239 cs=0 l=0 rx=0x5651732af0e0 tx=0x5651751c4080).stop
2020-01-29T01:30:17.918+0000 7f93fcb19700 10 mon.i@6(electing) e1 ms_handle_reset 0x565175249600 v2:172.21.15.97:3304/0

The above continued until:

2020-01-29T01:37:48.019+0000 7f9400b21700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x56517603bb00 0x565176047b80 secure :-1 s=SESSION_ACCEPTING pgs=789 cs=0 l=0 rx=0x56517608f830 tx=0x56517600f300).handle_existing_connection this is a stale connection, peer_global_seq=789 existing->peer_global_seq=802, stopping this connection.
2020-01-29T01:37:48.019+0000 7f9400b21700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x56517603bb00 0x565176047b80 secure :-1 s=SESSION_ACCEPTING pgs=789 cs=0 l=0 rx=0x56517608f830 tx=0x56517600f300).stop
2020-01-29T01:37:48.019+0000 7f93fcb19700 10 mon.i@6(electing) e1 ms_handle_reset 0x56517603bb00 v2:172.21.15.97:3304/0
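The stale-connection decision visible in these log lines compares the incoming connection's `peer_global_seq` against the one on the already-established connection. A simplified, hypothetical model of that check (the real msgr2 logic in ProtocolV2 is considerably more involved):

```python
def handle_existing_connection(incoming_pgs, existing_pgs):
    """Toy model of msgr2's stale-connection check seen in the logs.

    An incoming connection whose peer_global_seq is lower than the
    existing connection's is judged stale and stopped, matching
    "peer_global_seq=239 existing->peer_global_seq=802, stopping
    this connection."
    """
    if incoming_pgs < existing_pgs:
        return 'stop_incoming'      # reject the (re)connect attempt
    return 'replace_existing'       # newer attempt wins
```

Under this model, ceph-mon.e's reconnect attempts kept arriving with a `peer_global_seq` below the 802 recorded on ceph-mon.i's side, so every attempt was rejected until the sequence caught up (pgs=809 in the successful connection below).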

Then, as part of handling the reset from ceph-mon.e, it appears that a connection to ceph-mon.i was finally established
after a number of retries:

2020-01-29T01:38:03.026+0000 7f93fcb19700 10 mon.i@6(electing) e1 ms_handle_reset 0x56517615d200 v2:172.21.15.97:3304/0
2020-01-29T01:38:03.028+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 unknown :-1 s=READY pgs=809 cs=0 l=0 rx=0x5651761276b0 tx=0x565175a1eb80).ready entity=mon.8 client_cookie=23668d2a35e1f3c0 server_cookie=d4410579a09c2bd4 in_seq=0 out_seq=0
2020-01-29T01:38:03.028+0000 7f93fcb19700 10 mon.i@6(electing) e1 ms_handle_accept con 0x5651709adb00 session 0x565170a426c0 already on list
2020-01-29T01:38:03.028+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 unknown :-1 s=THROTTLE_DONE pgs=809 cs=0 l=0 rx=0x5651761276b0 tx=0x565175a1eb80).handle_read_frame_epilogue_main read frame epilogue bytes=32
...
...
2020-01-29T01:38:03.030+0000 7f93fbb17700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> conn(0x56517615db00 0x565170ce7b80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
...
...
2020-01-29T01:38:03.030+0000 7f93fbb17700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> conn(0x56517615db00 0x565170ce7b80 secure :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rx=0x56517617d4d0 tx=0x565175f4ce00).handle_read_frame_epilogue_main read frame epilogue bytes=32
...
...
2020-01-29T01:38:03.031+0000 7f93fbb17700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> conn(0x56517615db00 0x565170ce7b80 secure :-1 s=SESSION_ACCEPTING pgs=0 cs=0 l=0 rx=0x56517617d4d0 tx=0x565175f4ce00).handle_read_frame_epilogue_main read frame epilogue bytes=32
...
...
2020-01-29T01:38:03.031+0000 7f9401322700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 msgr2=0x565170a11080 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk reading from fd=16 : Unknown error 104
2020-01-29T01:38:03.031+0000 7f9401322700 1 -
[v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 msgr2=0x565170a11080 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-01-29T01:38:03.031+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 unknown :-1 s=READY pgs=809 cs=0 l=0 rx=0x5651761276b0 tx=0x565175a1eb80).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-01-29T01:38:03.031+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 unknown :-1 s=READY pgs=809 cs=0 l=0 rx=0x5651761276b0 tx=0x565175a1eb80)._fault with nothing to send, going to standby
2020-01-29T01:38:03.031+0000 7f93fcb19700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] <== mon.8 v2:172.21.15.97:3304/0 1 ==== mon_probe(probe b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name e mon_release octopus) v7 ==== 59+0+0 (unknown 0 0 0) 0x565170a0b8c0 con 0x5651709adb00

ceph-mon.i then started accepting probes from ceph-mon.e:

2020-01-29T01:38:03.031+0000 7f93fcb19700 5 mon.i@6(electing).elector(329) handle_propose from mon.8
2020-01-29T01:38:03.031+0000 7f93fcb19700 10 mon.i@6(electing).elector(329) handle_propose required features 2449958747315978244 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]), peer features 4540138292836696063 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus])
2020-01-29T01:38:03.031+0000 7f93fcb19700 5 paxos.6).electionLogic(329) ignoring old propose
2020-01-29T01:38:03.031+0000 7f93fcb19700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] <== mon.8 v2:172.21.15.97:3304/0 5 ==== mon_probe(probe b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name e mon_release octopus) v7 ==== 59+0+0 (unknown 0 0 0) 0x565171da98c0 con 0x5651709adb00
2020-01-29T01:38:03.031+0000 7f93fcb19700 20 mon.i@6(electing) e1 _ms_dispatch existing session 0x565170a426c0 for mon.8
2020-01-29T01:38:03.031+0000 7f93fcb19700 20 mon.i@6(electing) e1 entity caps allow *
2020-01-29T01:38:03.031+0000 7f93fcb19700 20 is_capable service=mon command= read addr v2:172.21.15.97:3304/0 on cap allow *
2020-01-29T01:38:03.031+0000 7f93fcb19700 20 allow so far , doing grant allow *
2020-01-29T01:38:03.031+0000 7f93fcb19700 20 allow all
2020-01-29T01:38:03.031+0000 7f93fcb19700 10 mon.i@6(electing) e1 handle_probe mon_probe(probe b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name e mon_release octopus) v7
2020-01-29T01:38:03.031+0000 7f93fcb19700 10 mon.i@6(electing) e1 handle_probe_probe mon.8 v2:172.21.15.97:3304/0mon_probe(probe b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name e mon_release octopus) v7 features 4540138292836696063
...
...
2020-01-29T01:38:03.033+0000 7f93fbb17700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x56517615db00 0x565170ce7b80 secure :-1 s=SESSION_ACCEPTING pgs=811 cs=0 l=0 rx=0x56517617d4d0 tx=0x565175f4ce00).handle_reconnect reconnect race detected, this connection loses to existing=0x5651709adb00
...
...
2020-01-29T01:38:03.036+0000 7f93fcb19700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] --> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] -- mon_probe(reply b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name i paxos( fc 1 lc 475 ) mon_release octopus) v7 -- 0x56517629f340 con 0x5651709adb00
2020-01-29T01:38:03.036+0000 7f93fbb17700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x56517615db00 msgr2=0x565170ce7b80 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 32
2020-01-29T01:38:03.036+0000 7f93fbb17700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x56517615db00 msgr2=0x565170ce7b80 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-01-29T01:38:03.036+0000 7f93fcb19700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] <== mon.8 v2:172.21.15.97:3304/0 156 ==== mon_probe(probe b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name e mon_release octopus) v7 ==== 59+0+0 (secure 0 0 0) 0x5651761d5b80 con 0x5651709adb00
...
...
2020-01-29T01:38:03.036+0000 7f93fbb17700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x56517615db00 0x565170ce7b80 secure :-1 s=SESSION_ACCEPTING pgs=811 cs=0 l=0 rx=0x56517617d4d0 tx=0x565175f4ce00).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)

One of the reconnects failed because of the socket error injection shown below. The injected socket errors too probably played a role in the slow election process:

2020-01-29T01:38:03.041+0000 7f9401322700 0 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 msgr2=0x565170a11080 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send injecting socket failure
2020-01-29T01:38:03.041+0000 7f9401322700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 msgr2=0x565170a11080 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2020-01-29T01:38:03.041+0000 7f93fcb19700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] --> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] -- mon_probe(reply b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name i paxos( fc 1 lc 475 ) mon_release octopus) v7 -- 0x565176468580 con 0x5651709adb00
2020-01-29T01:38:03.041+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 secure :-1 s=READY pgs=809 cs=1 l=0 rx=0x56517617da10 tx=0x565175317d00).write_message error sending 0x5651764682c0, (32) Broken pipe
2020-01-29T01:38:03.041+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 secure :-1 s=READY pgs=809 cs=1 l=0 rx=0x56517617da10 tx=0x565175317d00).write_event send msg failed
2020-01-29T01:38:03.041+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 secure :-1 s=READY pgs=809 cs=1 l=0 rx=0x56517617da10 tx=0x565175317d00).write_event send msg failed
2020-01-29T01:38:03.041+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 secure :-1 s=READY pgs=809 cs=1 l=0 rx=0x56517617da10 tx=0x565175317d00)._fault initiating reconnect
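Each injected failure triggers a `_fault initiating reconnect`, after which the probe is resent. A hedged sketch of that retry behaviour (illustrative only; `send_with_retries` and its parameters are hypothetical, not Ceph code):

```python
def send_with_retries(send, payload, max_retries=8):
    """Keep retrying a send that may hit an injected socket failure.

    Each ConnectionError here stands in for one "_try_send injecting
    socket failure" / "_fault initiating reconnect" cycle in the log;
    a probe only gets through once a send attempt succeeds.
    """
    for _ in range(max_retries):
        try:
            return send(payload)
        except ConnectionError:
            continue  # reconnect and try again
    raise ConnectionError('gave up after %d retries' % max_retries)
```

With `ms_inject_socket_failures` active, every mon-to-mon exchange can take several such cycles, which compounds the delay once the stale-connection rejections finally clear.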

The probe eventually succeeded after a lot of retries:

2020-01-29T01:38:03.041+0000 7f93fcb19700 10 mon.i@6(electing) e1 handle_probe mon_probe(probe b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name e mon_release octopus) v7
2020-01-29T01:38:03.041+0000 7f93fcb19700 10 mon.i@6(electing) e1 handle_probe_probe mon.8 v2:172.21.15.97:3304/0mon_probe(probe b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name e mon_release octopus) v7 features 4540138292836696063
2020-01-29T01:38:03.041+0000 7f93fcb19700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] --> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] -- mon_probe(reply b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name i paxos( fc 1 lc 475 ) mon_release octopus) v7 -- 0x565176468840 con 0x5651709adb00
2020-01-29T01:38:03.042+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 unknown :-1 s=BANNER_CONNECTING pgs=809 cs=2 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2020-01-29T01:38:03.042+0000 7f9401322700 10 mon.i@6(electing) e1 get_authorizer for mon
2020-01-29T01:38:03.042+0000 7f9401322700 10 cephx: build_service_ticket service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon. ticket.global_id 0
2020-01-29T01:38:03.042+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 secure :-1 s=AUTH_CONNECTING_SIGN pgs=809 cs=2 l=0 rx=0x56517617d5c0 tx=0x565171135000).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-01-29T01:38:03.048+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 secure :-1 s=SESSION_RECONNECTING pgs=809 cs=2 l=0 rx=0x56517617d5c0 tx=0x565171135000).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-01-29T01:38:03.049+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 secure :-1 s=READY pgs=809 cs=2 l=0 rx=0x56517617d5c0 tx=0x565171135000).ready entity=mon.8 client_cookie=23668d2a35e1f3c0 server_cookie=d4410579a09c2bd4 in_seq=283 out_seq=279
2020-01-29T01:38:03.049+0000 7f9401322700 1 --2- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] >> [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] conn(0x5651709adb00 0x565170a11080 secure :-1 s=THROTTLE_DONE pgs=809 cs=2 l=0 rx=0x56517617d5c0 tx=0x565171135000).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-01-29T01:38:03.049+0000 7f93fcb19700 1 -- [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] <== mon.8 v2:172.21.15.97:3304/0 284 ==== election(b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 propose rel 15 e107) v8 ==== 1058+0+0 (secure 0 0 0) 0x56517609d200 con 0x5651709adb00
...
...
2020-01-29T01:38:03.049+0000 7f93fcb19700 5 mon.i@6(electing).elector(329) handle_propose from mon.8
2020-01-29T01:38:03.049+0000 7f93fcb19700 10 mon.i@6(electing).elector(329) handle_propose required features 2449958747315978244 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]), peer features 4540138292836696063 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus])
2020-01-29T01:38:03.049+0000 7f93fcb19700 5 paxos.6).electionLogic(329) ignoring old propose
...
...
2020-01-29T01:38:06.113+0000 7f93fcb19700 5 mon.i@6(electing).elector(329) handle_propose from mon.2
2020-01-29T01:38:06.113+0000 7f93fcb19700 10 mon.i@6(electing).elector(329) handle_propose required features 2449958747315978244 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]), peer features 4540138292836696063 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus])
2020-01-29T01:38:06.113+0000 7f93fcb19700 10 paxos.6).electionLogic(329) bump_epoch329 to 331

On ceph-mon.e, at this point, the probe finally got a reply and a quorum could be formed:

2020-01-29T01:38:03.041+0000 7f702439c700 1 -- [v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0] <== mon.6 v2:172.21.15.72:3303/0 1 ==== mon_probe(reply b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name i paxos( fc 1 lc 475 ) mon_release octopus) v7 ==== 1038+0+0 (unknown 0 0 0) 0x55af97a49080 con 0x55af94e41200
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 _ms_dispatch new session 0x55af979f6480 MonSession(mon.6 [v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0] is open , features 0x3f01cfb8ffadffff (luminous)) features 0x3f01cfb8ffadfff
...
...
2020-01-29T01:38:03.041+0000 7f702439c700 5 mon.e@8(probing) e1 _ms_dispatch setting monitor caps on this connection
2020-01-29T01:38:03.041+0000 7f702439c700 20 mon.e@8(probing) e1 entity caps allow *
2020-01-29T01:38:03.041+0000 7f702439c700 20 is_capable service=mon command= read addr v2:172.21.15.72:3303/0 on cap allow *
2020-01-29T01:38:03.041+0000 7f702439c700 20 allow so far , doing grant allow *
2020-01-29T01:38:03.041+0000 7f702439c700 20 allow all
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 handle_probe mon_probe(reply b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name i paxos( fc 1 lc 475 ) mon_release octopus) v7
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 handle_probe_reply mon.6 v2:172.21.15.72:3303/0 mon_probe(reply b82a6b99-a08b-464e-ac3d-fcb1aa6bfeb5 name i paxos( fc 1 lc 475 ) mon_release octopus) v7
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 monmap is e1: 9 mons at {a=[v2:172.21.15.97:3300/0,v1:172.21.15.97:6789/0],b=[v2:172.21.15.97:3301/0,v1:172.21.15.97:6790/0],c=[v2:172.21.15.97:3302/0,v1:172.21.15.97:6791/0],d=[v2:172.21.15.97:3303/0,v1:172.21.15.97:6792/0],e=[v2:172.21.15.97:3304/0,v1:172.21.15.97:6793/0],f=[v2:172.21.15.72:3300/0,v1:172.21.15.72:6789/0],g=[v2:172.21.15.72:3301/0,v1:172.21.15.72:6790/0],h=[v2:172.21.15.72:3302/0,v1:172.21.15.72:6791/0],i=[v2:172.21.15.72:3303/0,v1:172.21.15.72:6792/0]}
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 peer name is i
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 mon.i is outside the quorum
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 outside_quorum now d,e,g,h,i, need 5
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 that's enough to form a new quorum, calling election
2020-01-29T01:38:03.041+0000 7f702439c700 10 mon.e@8(probing) e1 start_election

Eventually ceph-mon.e receives a propose from ceph-mon.g, after which
ceph-mon.g wins the election and becomes the leader:

2020-01-29T01:38:11.114+0000 7ff33656b700 10 paxos.2).electionLogic(331) bump_epoch331 to 332
2020-01-29T01:38:11.114+0000 7ff33656b700 10 mon.g@2(electing) e1 join_election
...
...
2020-01-29T01:38:11.114+0000 7ff33656b700 10 mon.g@2(electing) e1 win_election epoch 332 quorum 2,4,6,7,8 features 4540138292836696063 mon_features mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]) min_mon_release octopus
2020-01-29T01:38:11.114+0000 7ff33656b700 0 log_channel(cluster) log [INF] : mon.g is new leader, mons g,h,i,d,e in quorum (ranks 2,4,6,7,8)
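The winner here is consistent with the classic election strategy, where the lowest-ranked monitor among the proposers wins. A minimal sketch, assuming the lowest-rank-wins rule:

```python
def classic_election_winner(quorum_ranks):
    # Under Ceph's classic elector, the lowest-ranked participating
    # monitor wins the election. In the log, ranks 2,4,6,7,8 form the
    # quorum and rank 2 (mon.g) becomes the leader.
    return min(quorum_ranks)

print(classic_election_winner({2, 4, 6, 7, 8}))
```

This matches the `win_election epoch 332 quorum 2,4,6,7,8` line above, with mon.g at rank 2 declared the new leader.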

Further analysis should, in my view, focus on why the connection between mon.e and mon.i was continuously flagged as stale and reset for around 7 mins.

#7 Updated by Neha Ojha about 4 years ago

  • Status changed from New to Can't reproduce
  • Priority changed from High to Normal

This hasn't shown up in master for a while and Sridhar has also not been able to reproduce this, hence reducing priority.
