Project

General

Profile

Bug #43216

MDSMonitor: removes MDS coming out of quorum election

Added by Patrick Donnelly about 3 years ago. Updated 7 months ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Event sequence:

- 2019-12-07T12:26:26.854 mon_thrash kills mon.a(leader)
- 2019-12-07T12:27:07.843 mon_thrash revives mon.a
- 2019-12-07T12:27:12.478 socket fault causes mds.a to reconnect to revived mon.a(probing)
- 2019-12-07T12:27:32.074 mon.b replaces mds.a
- 2019-12-07T12:27:33.494 mon.a back in quorum, drops beacons from evicted mds.a

2019-12-07T11:07:02.936 INFO:teuthology.orchestra.run.smithi146.stderr:dumped fsmap epoch 6
2019-12-07T11:07:02.936 INFO:teuthology.orchestra.run.smithi146.stdout:
2019-12-07T11:07:02.936 INFO:teuthology.orchestra.run.smithi146.stdout:{"epoch":6,"default_fscid":1,"compat":{"compat":{},"ro_compat":{},"incompat":{"feature_1":"base v0.20","feature_2":"client writeable ranges","feature_3":"default file layouts on dirs","feature_4":"dir inode in separate object","feature_5":"mds uses versioned encoding","feature_6":"dirfrag is stored in omap","feature_8":"no anchor table","feature_9":"file layout v2","feature_10":"snaprealm v2"}},"feature_flags":{"enable_multiple":false,"ever_enabled_multiple":false},"standbys":[{"gid":4350,"name":"b","rank":-1,"incarnation":0,"state":"up:standby","state_seq":1,"addr":"172.21.15.194:6833/1676386680","addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.194:6832","nonce":1676386680},{"type":"v1","addr":"172.21.15.194:6833","nonce":1676386680}]},"export_targets":[],"features":4540138292836696063,"flags":0,"epoch":4}],"filesystems":[{"mdsmap":{"epoch":6,"flags":18,"ever_allowed_features":0,"explicitly_allowed_features":0,"created":"2019-12-07T11:06:38.797089+0000","modified":"2019-12-07T11:06:44.297113+0000","tableserver":0,"root":0,"session_timeout":60,"session_autoclose":300,"min_compat_client":"0 (unknown)","max_file_size":1099511627776,"last_failure":0,"last_failure_osd_epoch":0,"compat":{"compat":{},"ro_compat":{},"incompat":{"feature_1":"base v0.20","feature_2":"client writeable ranges","feature_3":"default file layouts on dirs","feature_4":"dir inode in separate object","feature_5":"mds uses versioned encoding","feature_6":"dirfrag is stored in omap","feature_8":"no anchor table","feature_9":"file layout v2","feature_10":"snaprealm v2"}},"max_mds":1,"in":[0],"up":{"mds_0":4341},"failed":[],"damaged":[],"stopped":[],"info":{"gid_4341":{"gid":4341,"name":"a","rank":0,"incarnation":5,"state":"up:active","state_seq":2,"addr":"172.21.15.146:6835/4036894910","addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.146:6834","nonce":4036894910},{"type":"v1","addr":"172.21.15.146:6835","nonce":4036894910}]},"export_targets":[],"features":4540138292836696063,"flags":0}},"data_pools":[3],"metadata_pool":2,"enabled":true,"fs_name":"cephfs","balancer":"","standby_count_wanted":0},"id":1}]}
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f46f8023630 msgr2=0x7f46f8025ac0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 --2- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f46f8023630 0x7f46f8025ac0 crc :-1 s=READY pgs=96 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f471410ac20 msgr2=0x7f47141452c0 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 --2- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f471410ac20 0x7f47141452c0 secure :-1 s=READY pgs=193 cs=0 l=1 rx=0x7f470c0103b0 tx=0x7f470c010750).stop
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 shutdown_connections
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 --2- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f46f8023630 0x7f46f8025ac0 unknown :-1 s=CLOSED pgs=96 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 --2- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f471410ac20 0x7f47141452c0 unknown :-1 s=CLOSED pgs=193 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T11:07:02.942 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 shutdown_connections
2019-12-07T11:07:02.942 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 wait complete.
2019-12-07T11:07:02.942 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 >> 172.21.15.146:0/2284702334 conn(0x7f471411ab50 msgr2=0x7f4714111310 unknown :-1 s=STATE_NONE l=0).mark_down
2019-12-07T11:07:02.951 INFO:tasks.mon_thrash.mon_thrasher:start thrashing
2019-12-07T11:07:02.951 INFO:tasks.mon_thrash.mon_thrasher:seed: 1575716820, revive delay: 20.0, thrash delay: 10.0 thrash many: False, maintain quorum: True store thrash: False, probability: 50 freeze mon: prob 10.0 duration 15.0
...
2019-12-07T12:26:26.854 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['a']
2019-12-07T12:26:26.854 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
2019-12-07T12:26:26.854 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.a
2019-12-07T12:26:26.854 INFO:tasks.mon_thrash.mon_thrasher:killing mon.a
2019-12-07T12:26:26.855 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2019-12-07T12:26:26.855 INFO:teuthology.orchestra.run:waiting for 300
2019-12-07T12:26:26.873 INFO:tasks.ceph.mon.a:Stopped
2019-12-07T12:26:26.874 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 2
...
2019-12-07T12:27:07.843 INFO:tasks.mon_thrash.mon_thrasher:reviving mon.a
2019-12-07T12:27:07.843 INFO:tasks.ceph.mon.a:Restarting daemon
2019-12-07T12:27:07.844 INFO:teuthology.orchestra.run.smithi146:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i a
...
2019-12-07T13:06:08.201 INFO:teuthology.orchestra.run.smithi146.stderr:dumped fsmap epoch 56
2019-12-07T13:06:08.201 INFO:teuthology.orchestra.run.smithi146.stdout:
2019-12-07T13:06:08.202 INFO:teuthology.orchestra.run.smithi146.stdout:{"epoch":56,"default_fscid":1,"compat":{"compat":{},"ro_compat":{},"incompat":{"feature_1":"base v0.20","feature_2":"client writeable ranges","feature_3":"default file layouts on dirs","feature_4":"dir inode in separate object","feature_5":"mds uses versioned encoding","feature_6":"dirfrag is stored in omap","feature_8":"no anchor table","feature_9":"file layout v2","feature_10":"snaprealm v2"}},"feature_flags":{"enable_multiple":false,"ever_enabled_multiple":false},"standbys":[{"gid":774256,"name":"a","rank":-1,"incarnation":0,"state":"up:standby","state_seq":1,"addr":"172.21.15.146:6835/2775721496","addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.146:6834","nonce":2775721496},{"type":"v1","addr":"172.21.15.146:6835","nonce":2775721496}]},"export_targets":[],"features":4540138292836696063,"flags":0,"epoch":52}],"filesystems":[{"mdsmap":{"epoch":56,"flags":18,"ever_allowed_features":2,"explicitly_allowed_features":2,"created":"2019-12-07T11:06:38.797089+0000","modified":"2019-12-07T12:27:43.614231+0000","tableserver":0,"root":0,"session_timeout":60,"session_autoclose":300,"min_compat_client":"0 (unknown)","max_file_size":1099511627776,"last_failure":0,"last_failure_osd_epoch":87,"compat":{"compat":{},"ro_compat":{},"incompat":{"feature_1":"base v0.20","feature_2":"client writeable ranges","feature_3":"default file layouts on dirs","feature_4":"dir inode in separate object","feature_5":"mds uses versioned encoding","feature_6":"dirfrag is stored in omap","feature_8":"no anchor table","feature_9":"file layout v2","feature_10":"snaprealm v2"}},"max_mds":1,"in":[0],"up":{"mds_0":4350},"failed":[],"damaged":[],"stopped":[],"info":{"gid_4350":{"gid":4350,"name":"b","rank":0,"incarnation":51,"state":"up:active","state_seq":1219,"addr":"172.21.15.194:6833/1676386680","addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.194:6832","nonce":1676386680},{"type":"v1","addr":"172.21.15.194:6833","nonce":1676386680}]},"export_targets":[],"features":4540138292836696063,"flags":0}},"data_pools":[3],"metadata_pool":2,"enabled":true,"fs_name":"cephfs","balancer":"","standby_count_wanted":0},"id":1}]}
2019-12-07T13:06:08.204 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f798c023630 msgr2=0x7f798c025ae0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f798c023630 0x7f798c025ae0 crc :-1 s=READY pgs=2216 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f79a811c840 msgr2=0x7f79a811b870 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f79a811c840 0x7f79a811b870 secure :-1 s=READY pgs=114 cs=0 l=1 rx=0x7f79a00107d0 tx=0x7f79a0010600).stop
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 shutdown_connections
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f798c023630 0x7f798c025ae0 unknown :-1 s=CLOSED pgs=2216 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f79a811c840 0x7f79a811b870 unknown :-1 s=CLOSED pgs=114 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T13:06:08.206 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.194:3301/0,v1:172.21.15.194:6790/0] conn(0x7f79a8123a20 0x7f79a811bdb0 unknown :-1 s=CLOSED pgs=0 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T13:06:08.206 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 shutdown_connections
2019-12-07T13:06:08.206 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 wait complete.
2019-12-07T13:06:08.206 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 >> 172.21.15.146:0/1562519687 conn(0x7f79a81062d0 msgr2=0x7f79a810c6f0 unknown :-1 s=STATE_NONE l=0).mark_down
2019-12-07T13:06:08.216 ERROR:tasks.mon_thrash.mon_thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20191205.235703/qa/tasks/mon_thrash.py", line 232, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20191205.235703/qa/tasks/mon_thrash.py", line 345, in _do_thrash
    'MDS Failover'
AssertionError: MDS Failover
2019-12-07T13:06:08.227 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 3

From failed mon's log:

2019-12-07T12:27:28.947+0000 7fd04893a700  1 -- [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] <== mds.0 v2:172.21.15.146:6834/4036894910 3 ==== mdsbeacon(4341/a up:active seq 1213 v50) v7 ==== 416+0+0 (crc 0 0 0) 0x55df3e537500 con 0x55df3e7f5800
2019-12-07T12:27:28.947+0000 7fd04893a700 20 mon.a@0(probing) e1 _ms_dispatch existing session 0x55df3e7dfdc0 for mds.0
2019-12-07T12:27:28.947+0000 7fd04893a700 20 mon.a@0(probing) e1  entity mds.a caps allow *
2019-12-07T12:27:28.947+0000 7fd04893a700 10 mon.a@0(probing).paxosservice(mdsmap 1..50) dispatch 0x55df3e537500 mdsbeacon(4341/a up:active seq 1213 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f5800
2019-12-07T12:27:28.947+0000 7fd04893a700  5 mon.a@0(probing).paxos(paxos recovering c 3013..3743) is_readable = 0 - now=2019-12-07T12:27:28.948423+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 3743
2019-12-07T12:27:28.947+0000 7fd04893a700 10 mon.a@0(probing).paxosservice(mdsmap 1..50)  waiting for paxos -> readable (v50)
2019-12-07T12:27:28.947+0000 7fd04893a700  5 mon.a@0(probing).paxos(paxos recovering c 3013..3743) is_readable = 0 - now=2019-12-07T12:27:28.948443+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 3743
...
2019-12-07T12:27:32.477+0000 7fd04893a700  1 -- [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] <== mds.0 v2:172.21.15.146:6834/4036894910 4 ==== mdsbeacon(4341/a up:active seq 1214 v50) v7 ==== 416+0+0 (crc 0 0 0) 0x55df3e537800 con 0x55df3e7f5800
2019-12-07T12:27:32.477+0000 7fd04893a700 20 mon.a@0(probing) e1 _ms_dispatch existing session 0x55df3e7dfdc0 for mds.0
2019-12-07T12:27:32.477+0000 7fd04893a700 20 mon.a@0(probing) e1  entity mds.a caps allow *
2019-12-07T12:27:32.477+0000 7fd04893a700 10 mon.a@0(probing).paxosservice(mdsmap 1..50) dispatch 0x55df3e537800 mdsbeacon(4341/a up:active seq 1214 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f5800
2019-12-07T12:27:32.477+0000 7fd04893a700  5 mon.a@0(probing).paxos(paxos recovering c 3013..3743) is_readable = 0 - now=2019-12-07T12:27:32.478790+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 3743
2019-12-07T12:27:32.477+0000 7fd04893a700 10 mon.a@0(probing).paxosservice(mdsmap 1..50)  waiting for paxos -> readable (v50)
2019-12-07T12:27:32.477+0000 7fd04893a700  5 mon.a@0(probing).paxos(paxos recovering c 3013..3743) is_readable = 0 - now=2019-12-07T12:27:32.478797+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 3743
...
2019-12-07T12:27:33.494+0000 7fd04893a700 20 mon.a@0(leader).paxos(paxos updating c 3264..3799) finish_round waiting_for_readable
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e536900 mdsbeacon(4341/a up:active seq 1210 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e669c00
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1210 v50) v7
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e536600 mdsbeacon(4341/a up:active seq 1211 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f4400
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1211 v50) v7
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e537200 mdsbeacon(4341/a up:active seq 1212 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e850400
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1212 v50) v7
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e537500 mdsbeacon(4341/a up:active seq 1213 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f5800
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1213 v50) v7
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e537800 mdsbeacon(4341/a up:active seq 1214 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f5800
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1214 v50) v7

From: /ceph/teuthology-archive/pdonnell-2019-12-06_03:07:34-fs-wip-pdonnell-testing-20191205.235703-distro-basic-smithi/4573014/remote/smithi146/log/ceph-mon.a.log.gz

2019-12-07T12:27:12.479+0000 7fe907327700  1 -- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] <== mds.0 v2:172.21.15.146:6834/4036894910 14 ==== mdsbeacon(4341/a up:active seq 1209 v50) v7 ==== 416+0+0 (crc 0 0 0) 0x5570be8d1500 con 0x5570c09d2400
2019-12-07T12:27:12.479+0000 7fe907327700 20 mon.b@1(leader) e1 _ms_dispatch existing session 0x5570c0986c40 for mds.0
2019-12-07T12:27:12.479+0000 7fe907327700 20 mon.b@1(leader) e1  entity mds.a caps allow *
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader).paxosservice(mdsmap 1..50) dispatch 0x5570be8d1500 mdsbeacon(4341/a up:active seq 1209 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x5570c09d2400
2019-12-07T12:27:12.479+0000 7fe907327700  5 mon.b@1(leader).paxos(paxos active c 3264..3774) is_readable = 1 - now=2019-12-07T12:27:12.480273+0000 lease_expire=2019-12-07T12:27:17.106786+0000 has v0 lc 3774
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader).mds e50 preprocess_query mdsbeacon(4341/a up:active seq 1209 v50) v7 from mds.0 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910]
2019-12-07T12:27:12.479+0000 7fe907327700 20 is_capable service=mds command= exec addr v2:172.21.15.146:6834/4036894910 on cap allow *
2019-12-07T12:27:12.479+0000 7fe907327700 20  allow so far , doing grant allow *
2019-12-07T12:27:12.479+0000 7fe907327700 20  allow all
2019-12-07T12:27:12.479+0000 7fe907327700  5 mon.b@1(leader).mds e50 preprocess_beacon mdsbeacon(4341/a up:active seq 1209 v50) v7 from mds.0 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader).mds e50 preprocess_beacon: GID exists in map: 4341
2019-12-07T12:27:12.479+0000 7fe907327700  5 mon.b@1(leader).mds e50 _note_beacon mdsbeacon(4341/a up:active seq 1209 v50) v7 noting time
2019-12-07T12:27:12.479+0000 7fe907327700  2 mon.b@1(leader) e1 send_reply 0x5570c0d8b5f0 0x5570be8d1200 mdsbeacon(4341/a up:active seq 1209 v50) v7
2019-12-07T12:27:12.479+0000 7fe907327700  1 -- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] --> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] -- mdsbeacon(4341/a up:active seq 1209 v50) v7 -- 0x5570be8d1200 con 0x5570c09d2400
2019-12-07T12:27:12.479+0000 7fe906325700  1 -- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] >> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] conn(0x5570c09d2400 msgr2=0x5570be8b3180 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 41
2019-12-07T12:27:12.479+0000 7fe906325700  1 -- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] >> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] conn(0x5570c09d2400 msgr2=0x5570be8b3180 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-12-07T12:27:12.479+0000 7fe906325700  1 --2- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] >> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] conn(0x5570c09d2400 0x5570be8b3180 crc :-1 s=READY pgs=679 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2019-12-07T12:27:12.479+0000 7fe906325700  1 --2- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] >> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] conn(0x5570c09d2400 0x5570be8b3180 crc :-1 s=READY pgs=679 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader) e1 ms_handle_reset 0x5570c09d2400 v2:172.21.15.146:6834/4036894910
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader) e1 reset/close on session mds.0 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910]
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader) e1 remove_session 0x5570c0986c40 mds.0 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] features 0x3f01cfb8ffadffff
...
2019-12-07T12:27:32.074+0000 7fe909b2c700 20 mon.b@1(leader).mds e50 maybe_resize_cluster in 1 max 1
2019-12-07T12:27:32.074+0000 7fe909b2c700  1 mon.b@1(leader).mds e50 no beacon from mds.0.5 (gid: 4341 addr: [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] state: up:active) since 19.5945s
2019-12-07T12:27:32.074+0000 7fe909b2c700  1 mon.b@1(leader).mds e50  replacing 4341 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] mds.0.5 up:active with 4350/b [v2:172.21.15.194:6832/1676386680,v1:172.21.15.194:6833/1676386680]
2019-12-07T12:27:32.074+0000 7fe909b2c700  0 log_channel(cluster) log [WRN] : daemon mds.a is not responding, replacing it as rank 0 with standby daemon mds.b

From new leader: /ceph/teuthology-archive/pdonnell-2019-12-06_03:07:34-fs-wip-pdonnell-testing-20191205.235703-distro-basic-smithi/4573014/remote/smithi194/log/ceph-mon.b.log.gz

From MDS:

2019-12-07T12:27:12.478+0000 7f3ccadb9700  1 -- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] <== mon.1 v2:172.21.15.194:3300/0 15 ==== mdsbeacon(4341/a up:active seq 1209 v50) v7 ==== 126+0+0 (crc 0 0 0) 0x55b6ac770900 con 0x55b6abb97000
2019-12-07T12:27:12.478+0000 7f3ccadb9700  5 mds.beacon.a received beacon reply up:active seq 1209 rtt 0.000999984
2019-12-07T12:27:12.478+0000 7f3ccadb9700  0 -- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] >> [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] conn(0x55b6abb97000 msgr2=0x55b6a6efb600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
2019-12-07T12:27:12.478+0000 7f3ccadb9700  1 -- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] >> [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] conn(0x55b6abb97000 msgr2=0x55b6a6efb600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 37
2019-12-07T12:27:12.478+0000 7f3ccadb9700  1 -- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] >> [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] conn(0x55b6abb97000 msgr2=0x55b6a6efb600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-12-07T12:27:12.478+0000 7f3ccadb9700  1 --2- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] >> [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] conn(0x55b6abb97000 0x55b6a6efb600 crc :-1 s=READY pgs=49 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)

Injected socket fault caused MDS to connected to revived mon.a which was in synchronizing state. This caused beacons to be left on the floor.


Related issues

Copied to CephFS - Backport #52636: pacific: MDSMonitor: removes MDS coming out of quorum election Resolved

History

#1 Updated by Patrick Donnelly about 3 years ago

  • Description updated (diff)

#2 Updated by Patrick Donnelly about 3 years ago

  • Status changed from In Progress to New
  • Assignee deleted (Patrick Donnelly)

#3 Updated by Patrick Donnelly about 3 years ago

  • Target version deleted (v15.0.0)

#4 Updated by Patrick Donnelly over 1 year ago

/ceph/teuthology-archive/pdonnell-2021-06-03_03:40:33-fs-wip-pdonnell-testing-20210603.020013-distro-basic-smithi/6148290/teuthology.log

#5 Updated by Patrick Donnelly over 1 year ago

/ceph/teuthology-archive/pdonnell-2021-06-11_18:02:10-fs-wip-pdonnell-testing-20210611.162716-distro-basic-smithi/6166663/teuthology.log

#6 Updated by Patrick Donnelly over 1 year ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • Target version set to v17.0.0
  • Backport deleted (nautilus,mimic)

/ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175768/screen

#7 Updated by Patrick Donnelly over 1 year ago

  • Status changed from Triaged to Fix Under Review
  • Backport set to pacific
  • Pull request ID set to 42943

#8 Updated by Patrick Donnelly over 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#9 Updated by Backport Bot over 1 year ago

  • Copied to Backport #52636: pacific: MDSMonitor: removes MDS coming out of quorum election added

#10 Updated by Patrick Donnelly 7 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF