Project

General

Profile

Actions

Bug #43216

closed

MDSMonitor: removes MDS coming out of quorum election

Added by Patrick Donnelly over 4 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Event sequence:

- 2019-12-07T12:26:26.854 mon_thrash kills mon.a(leader)
- 2019-12-07T12:27:07.843 mon_thrash revives mon.a
- 2019-12-07T12:27:12.478 socket fault causes mds.a to reconnect to revived mon.a(probing)
- 2019-12-07T12:27:32.074 mon.b replaces mds.a
- 2019-12-07T12:27:33.494 mon.a back in quorum, drops beacons from evicted mds.a

2019-12-07T11:07:02.936 INFO:teuthology.orchestra.run.smithi146.stderr:dumped fsmap epoch 6
2019-12-07T11:07:02.936 INFO:teuthology.orchestra.run.smithi146.stdout:
2019-12-07T11:07:02.936 INFO:teuthology.orchestra.run.smithi146.stdout:{"epoch":6,"default_fscid":1,"compat":{"compat":{},"ro_compat":{},"incompat":{"feature_1":"base v0.20","feature_2":"client writeable ranges","feature_3":"default file layouts on dirs","feature_4":"dir inode in separate object","feature_5":"mds uses versioned encoding","feature_6":"dirfrag is stored in omap","feature_8":"no anchor table","feature_9":"file layout v2","feature_10":"snaprealm v2"}},"feature_flags":{"enable_multiple":false,"ever_enabled_multiple":false},"standbys":[{"gid":4350,"name":"b","rank":-1,"incarnation":0,"state":"up:standby","state_seq":1,"addr":"172.21.15.194:6833/1676386680","addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.194:6832","nonce":1676386680},{"type":"v1","addr":"172.21.15.194:6833","nonce":1676386680}]},"export_targets":[],"features":4540138292836696063,"flags":0,"epoch":4}],"filesystems":[{"mdsmap":{"epoch":6,"flags":18,"ever_allowed_features":0,"explicitly_allowed_features":0,"created":"2019-12-07T11:06:38.797089+0000","modified":"2019-12-07T11:06:44.297113+0000","tableserver":0,"root":0,"session_timeout":60,"session_autoclose":300,"min_compat_client":"0 (unknown)","max_file_size":1099511627776,"last_failure":0,"last_failure_osd_epoch":0,"compat":{"compat":{},"ro_compat":{},"incompat":{"feature_1":"base v0.20","feature_2":"client writeable ranges","feature_3":"default file layouts on dirs","feature_4":"dir inode in separate object","feature_5":"mds uses versioned encoding","feature_6":"dirfrag is stored in omap","feature_8":"no anchor table","feature_9":"file layout v2","feature_10":"snaprealm v2"}},"max_mds":1,"in":[0],"up":{"mds_0":4341},"failed":[],"damaged":[],"stopped":[],"info":{"gid_4341":{"gid":4341,"name":"a","rank":0,"incarnation":5,"state":"up:active","state_seq":2,"addr":"172.21.15.146:6835/4036894910","addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.146:6834","nonce":4036894910},{"type":"v1","addr":"172.21.15.146:6835","nonce":4036894910}]},"export_targets":[],"features":4540138292836696063,"flags":0}},"data_pools":[3],"metadata_pool":2,"enabled":true,"fs_name":"cephfs","balancer":"","standby_count_wanted":0},"id":1}]}
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f46f8023630 msgr2=0x7f46f8025ac0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 --2- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f46f8023630 0x7f46f8025ac0 crc :-1 s=READY pgs=96 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f471410ac20 msgr2=0x7f47141452c0 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 --2- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f471410ac20 0x7f47141452c0 secure :-1 s=READY pgs=193 cs=0 l=1 rx=0x7f470c0103b0 tx=0x7f470c010750).stop
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 shutdown_connections
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 --2- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f46f8023630 0x7f46f8025ac0 unknown :-1 s=CLOSED pgs=96 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T11:07:02.941 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 --2- 172.21.15.146:0/2284702334 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f471410ac20 0x7f47141452c0 unknown :-1 s=CLOSED pgs=193 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T11:07:02.942 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 shutdown_connections
2019-12-07T11:07:02.942 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 wait complete.
2019-12-07T11:07:02.942 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T11:07:02.941+0000 7f4718d81700  1 -- 172.21.15.146:0/2284702334 >> 172.21.15.146:0/2284702334 conn(0x7f471411ab50 msgr2=0x7f4714111310 unknown :-1 s=STATE_NONE l=0).mark_down
2019-12-07T11:07:02.951 INFO:tasks.mon_thrash.mon_thrasher:start thrashing
2019-12-07T11:07:02.951 INFO:tasks.mon_thrash.mon_thrasher:seed: 1575716820, revive delay: 20.0, thrash delay: 10.0 thrash many: False, maintain quorum: True store thrash: False, probability: 50 freeze mon: prob 10.0 duration 15.0
...
2019-12-07T12:26:26.854 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['a']
2019-12-07T12:26:26.854 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
2019-12-07T12:26:26.854 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.a
2019-12-07T12:26:26.854 INFO:tasks.mon_thrash.mon_thrasher:killing mon.a
2019-12-07T12:26:26.855 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2019-12-07T12:26:26.855 INFO:teuthology.orchestra.run:waiting for 300
2019-12-07T12:26:26.873 INFO:tasks.ceph.mon.a:Stopped
2019-12-07T12:26:26.874 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 2
...
2019-12-07T12:27:07.843 INFO:tasks.mon_thrash.mon_thrasher:reviving mon.a
2019-12-07T12:27:07.843 INFO:tasks.ceph.mon.a:Restarting daemon
2019-12-07T12:27:07.844 INFO:teuthology.orchestra.run.smithi146:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i a
...
2019-12-07T13:06:08.201 INFO:teuthology.orchestra.run.smithi146.stderr:dumped fsmap epoch 56
2019-12-07T13:06:08.201 INFO:teuthology.orchestra.run.smithi146.stdout:
2019-12-07T13:06:08.202 INFO:teuthology.orchestra.run.smithi146.stdout:{"epoch":56,"default_fscid":1,"compat":{"compat":{},"ro_compat":{},"incompat":{"feature_1":"base v0.20","feature_2":"client writeable ranges","feature_3":"default file layouts on dirs","feature_4":"dir inode in separate object","feature_5":"mds uses versioned encoding","feature_6":"dirfrag is stored in omap","feature_8":"no anchor table","feature_9":"file layout v2","feature_10":"snaprealm v2"}},"feature_flags":{"enable_multiple":false,"ever_enabled_multiple":false},"standbys":[{"gid":774256,"name":"a","rank":-1,"incarnation":0,"state":"up:standby","state_seq":1,"addr":"172.21.15.146:6835/2775721496","addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.146:6834","nonce":2775721496},{"type":"v1","addr":"172.21.15.146:6835","nonce":2775721496}]},"export_targets":[],"features":4540138292836696063,"flags":0,"epoch":52}],"filesystems":[{"mdsmap":{"epoch":56,"flags":18,"ever_allowed_features":2,"explicitly_allowed_features":2,"created":"2019-12-07T11:06:38.797089+0000","modified":"2019-12-07T12:27:43.614231+0000","tableserver":0,"root":0,"session_timeout":60,"session_autoclose":300,"min_compat_client":"0 (unknown)","max_file_size":1099511627776,"last_failure":0,"last_failure_osd_epoch":87,"compat":{"compat":{},"ro_compat":{},"incompat":{"feature_1":"base v0.20","feature_2":"client writeable ranges","feature_3":"default file layouts on dirs","feature_4":"dir inode in separate object","feature_5":"mds uses versioned encoding","feature_6":"dirfrag is stored in omap","feature_8":"no anchor table","feature_9":"file layout v2","feature_10":"snaprealm v2"}},"max_mds":1,"in":[0],"up":{"mds_0":4350},"failed":[],"damaged":[],"stopped":[],"info":{"gid_4350":{"gid":4350,"name":"b","rank":0,"incarnation":51,"state":"up:active","state_seq":1219,"addr":"172.21.15.194:6833/1676386680","addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.194:6832","nonce":1676386680},{"type":"v1","addr":"172.21.15.194:6833","nonce":1676386680}]},"export_targets":[],"features":4540138292836696063,"flags":0}},"data_pools":[3],"metadata_pool":2,"enabled":true,"fs_name":"cephfs","balancer":"","standby_count_wanted":0},"id":1}]}
2019-12-07T13:06:08.204 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f798c023630 msgr2=0x7f798c025ae0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f798c023630 0x7f798c025ae0 crc :-1 s=READY pgs=2216 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f79a811c840 msgr2=0x7f79a811b870 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f79a811c840 0x7f79a811b870 secure :-1 s=READY pgs=114 cs=0 l=1 rx=0x7f79a00107d0 tx=0x7f79a0010600).stop
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 shutdown_connections
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:6800/8811,v1:172.21.15.146:6801/8811] conn(0x7f798c023630 0x7f798c025ae0 unknown :-1 s=CLOSED pgs=2216 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T13:06:08.205 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f79a811c840 0x7f79a811b870 unknown :-1 s=CLOSED pgs=114 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T13:06:08.206 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 --2- 172.21.15.146:0/1562519687 >> [v2:172.21.15.194:3301/0,v1:172.21.15.194:6790/0] conn(0x7f79a8123a20 0x7f79a811bdb0 unknown :-1 s=CLOSED pgs=0 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T13:06:08.206 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 shutdown_connections
2019-12-07T13:06:08.206 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 wait complete.
2019-12-07T13:06:08.206 INFO:teuthology.orchestra.run.smithi146.stderr:2019-12-07T13:06:08.201+0000 7f79afe07700  1 -- 172.21.15.146:0/1562519687 >> 172.21.15.146:0/1562519687 conn(0x7f79a81062d0 msgr2=0x7f79a810c6f0 unknown :-1 s=STATE_NONE l=0).mark_down
2019-12-07T13:06:08.216 ERROR:tasks.mon_thrash.mon_thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20191205.235703/qa/tasks/mon_thrash.py", line 232, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20191205.235703/qa/tasks/mon_thrash.py", line 345, in _do_thrash
    'MDS Failover'
AssertionError: MDS Failover
2019-12-07T13:06:08.227 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 3

From failed mon's log:

2019-12-07T12:27:28.947+0000 7fd04893a700  1 -- [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] <== mds.0 v2:172.21.15.146:6834/4036894910 3 ==== mdsbeacon(4341/a up:active seq 1213 v50) v7 ==== 416+0+0 (crc 0 0 0) 0x55df3e537500 con 0x55df3e7f5800
2019-12-07T12:27:28.947+0000 7fd04893a700 20 mon.a@0(probing) e1 _ms_dispatch existing session 0x55df3e7dfdc0 for mds.0
2019-12-07T12:27:28.947+0000 7fd04893a700 20 mon.a@0(probing) e1  entity mds.a caps allow *
2019-12-07T12:27:28.947+0000 7fd04893a700 10 mon.a@0(probing).paxosservice(mdsmap 1..50) dispatch 0x55df3e537500 mdsbeacon(4341/a up:active seq 1213 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f5800
2019-12-07T12:27:28.947+0000 7fd04893a700  5 mon.a@0(probing).paxos(paxos recovering c 3013..3743) is_readable = 0 - now=2019-12-07T12:27:28.948423+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 3743
2019-12-07T12:27:28.947+0000 7fd04893a700 10 mon.a@0(probing).paxosservice(mdsmap 1..50)  waiting for paxos -> readable (v50)
2019-12-07T12:27:28.947+0000 7fd04893a700  5 mon.a@0(probing).paxos(paxos recovering c 3013..3743) is_readable = 0 - now=2019-12-07T12:27:28.948443+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 3743
...
2019-12-07T12:27:32.477+0000 7fd04893a700  1 -- [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] <== mds.0 v2:172.21.15.146:6834/4036894910 4 ==== mdsbeacon(4341/a up:active seq 1214 v50) v7 ==== 416+0+0 (crc 0 0 0) 0x55df3e537800 con 0x55df3e7f5800
2019-12-07T12:27:32.477+0000 7fd04893a700 20 mon.a@0(probing) e1 _ms_dispatch existing session 0x55df3e7dfdc0 for mds.0
2019-12-07T12:27:32.477+0000 7fd04893a700 20 mon.a@0(probing) e1  entity mds.a caps allow *
2019-12-07T12:27:32.477+0000 7fd04893a700 10 mon.a@0(probing).paxosservice(mdsmap 1..50) dispatch 0x55df3e537800 mdsbeacon(4341/a up:active seq 1214 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f5800
2019-12-07T12:27:32.477+0000 7fd04893a700  5 mon.a@0(probing).paxos(paxos recovering c 3013..3743) is_readable = 0 - now=2019-12-07T12:27:32.478790+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 3743
2019-12-07T12:27:32.477+0000 7fd04893a700 10 mon.a@0(probing).paxosservice(mdsmap 1..50)  waiting for paxos -> readable (v50)
2019-12-07T12:27:32.477+0000 7fd04893a700  5 mon.a@0(probing).paxos(paxos recovering c 3013..3743) is_readable = 0 - now=2019-12-07T12:27:32.478797+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 3743
...
2019-12-07T12:27:33.494+0000 7fd04893a700 20 mon.a@0(leader).paxos(paxos updating c 3264..3799) finish_round waiting_for_readable
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e536900 mdsbeacon(4341/a up:active seq 1210 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e669c00
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1210 v50) v7
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e536600 mdsbeacon(4341/a up:active seq 1211 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f4400
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1211 v50) v7
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e537200 mdsbeacon(4341/a up:active seq 1212 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e850400
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1212 v50) v7
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e537500 mdsbeacon(4341/a up:active seq 1213 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f5800
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1213 v50) v7
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51) dispatch 0x55df3e537800 mdsbeacon(4341/a up:active seq 1214 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x55df3e7f5800
2019-12-07T12:27:33.494+0000 7fd04893a700 10 mon.a@0(leader).paxosservice(mdsmap 1..51)  discarding message from disconnected client mds.0 v2:172.21.15.146:6834/4036894910 mdsbeacon(4341/a up:active seq 1214 v50) v7

From: /ceph/teuthology-archive/pdonnell-2019-12-06_03:07:34-fs-wip-pdonnell-testing-20191205.235703-distro-basic-smithi/4573014/remote/smithi146/log/ceph-mon.a.log.gz

2019-12-07T12:27:12.479+0000 7fe907327700  1 -- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] <== mds.0 v2:172.21.15.146:6834/4036894910 14 ==== mdsbeacon(4341/a up:active seq 1209 v50) v7 ==== 416+0+0 (crc 0 0 0) 0x5570be8d1500 con 0x5570c09d2400
2019-12-07T12:27:12.479+0000 7fe907327700 20 mon.b@1(leader) e1 _ms_dispatch existing session 0x5570c0986c40 for mds.0
2019-12-07T12:27:12.479+0000 7fe907327700 20 mon.b@1(leader) e1  entity mds.a caps allow *
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader).paxosservice(mdsmap 1..50) dispatch 0x5570be8d1500 mdsbeacon(4341/a up:active seq 1209 v50) v7 from mds.0 v2:172.21.15.146:6834/4036894910 con 0x5570c09d2400
2019-12-07T12:27:12.479+0000 7fe907327700  5 mon.b@1(leader).paxos(paxos active c 3264..3774) is_readable = 1 - now=2019-12-07T12:27:12.480273+0000 lease_expire=2019-12-07T12:27:17.106786+0000 has v0 lc 3774
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader).mds e50 preprocess_query mdsbeacon(4341/a up:active seq 1209 v50) v7 from mds.0 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910]
2019-12-07T12:27:12.479+0000 7fe907327700 20 is_capable service=mds command= exec addr v2:172.21.15.146:6834/4036894910 on cap allow *
2019-12-07T12:27:12.479+0000 7fe907327700 20  allow so far , doing grant allow *
2019-12-07T12:27:12.479+0000 7fe907327700 20  allow all
2019-12-07T12:27:12.479+0000 7fe907327700  5 mon.b@1(leader).mds e50 preprocess_beacon mdsbeacon(4341/a up:active seq 1209 v50) v7 from mds.0 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader).mds e50 preprocess_beacon: GID exists in map: 4341
2019-12-07T12:27:12.479+0000 7fe907327700  5 mon.b@1(leader).mds e50 _note_beacon mdsbeacon(4341/a up:active seq 1209 v50) v7 noting time
2019-12-07T12:27:12.479+0000 7fe907327700  2 mon.b@1(leader) e1 send_reply 0x5570c0d8b5f0 0x5570be8d1200 mdsbeacon(4341/a up:active seq 1209 v50) v7
2019-12-07T12:27:12.479+0000 7fe907327700  1 -- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] --> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] -- mdsbeacon(4341/a up:active seq 1209 v50) v7 -- 0x5570be8d1200 con 0x5570c09d2400
2019-12-07T12:27:12.479+0000 7fe906325700  1 -- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] >> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] conn(0x5570c09d2400 msgr2=0x5570be8b3180 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 41
2019-12-07T12:27:12.479+0000 7fe906325700  1 -- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] >> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] conn(0x5570c09d2400 msgr2=0x5570be8b3180 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-12-07T12:27:12.479+0000 7fe906325700  1 --2- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] >> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] conn(0x5570c09d2400 0x5570be8b3180 crc :-1 s=READY pgs=679 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2019-12-07T12:27:12.479+0000 7fe906325700  1 --2- [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] >> [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] conn(0x5570c09d2400 0x5570be8b3180 crc :-1 s=READY pgs=679 cs=0 l=1 rx=0 tx=0).stop
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader) e1 ms_handle_reset 0x5570c09d2400 v2:172.21.15.146:6834/4036894910
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader) e1 reset/close on session mds.0 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910]
2019-12-07T12:27:12.479+0000 7fe907327700 10 mon.b@1(leader) e1 remove_session 0x5570c0986c40 mds.0 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] features 0x3f01cfb8ffadffff
...
2019-12-07T12:27:32.074+0000 7fe909b2c700 20 mon.b@1(leader).mds e50 maybe_resize_cluster in 1 max 1
2019-12-07T12:27:32.074+0000 7fe909b2c700  1 mon.b@1(leader).mds e50 no beacon from mds.0.5 (gid: 4341 addr: [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] state: up:active) since 19.5945s
2019-12-07T12:27:32.074+0000 7fe909b2c700  1 mon.b@1(leader).mds e50  replacing 4341 [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] mds.0.5 up:active with 4350/b [v2:172.21.15.194:6832/1676386680,v1:172.21.15.194:6833/1676386680]
2019-12-07T12:27:32.074+0000 7fe909b2c700  0 log_channel(cluster) log [WRN] : daemon mds.a is not responding, replacing it as rank 0 with standby daemon mds.b

From new leader: /ceph/teuthology-archive/pdonnell-2019-12-06_03:07:34-fs-wip-pdonnell-testing-20191205.235703-distro-basic-smithi/4573014/remote/smithi194/log/ceph-mon.b.log.gz

From MDS:

2019-12-07T12:27:12.478+0000 7f3ccadb9700  1 -- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] <== mon.1 v2:172.21.15.194:3300/0 15 ==== mdsbeacon(4341/a up:active seq 1209 v50) v7 ==== 126+0+0 (crc 0 0 0) 0x55b6ac770900 con 0x55b6abb97000
2019-12-07T12:27:12.478+0000 7f3ccadb9700  5 mds.beacon.a received beacon reply up:active seq 1209 rtt 0.000999984
2019-12-07T12:27:12.478+0000 7f3ccadb9700  0 -- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] >> [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] conn(0x55b6abb97000 msgr2=0x55b6a6efb600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
2019-12-07T12:27:12.478+0000 7f3ccadb9700  1 -- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] >> [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] conn(0x55b6abb97000 msgr2=0x55b6a6efb600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 37
2019-12-07T12:27:12.478+0000 7f3ccadb9700  1 -- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] >> [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] conn(0x55b6abb97000 msgr2=0x55b6a6efb600 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-12-07T12:27:12.478+0000 7f3ccadb9700  1 --2- [v2:172.21.15.146:6834/4036894910,v1:172.21.15.146:6835/4036894910] >> [v2:172.21.15.194:3300/0,v1:172.21.15.194:6789/0] conn(0x55b6abb97000 0x55b6a6efb600 crc :-1 s=READY pgs=49 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)

Injected socket fault caused MDS to connected to revived mon.a which was in synchronizing state. This caused beacons to be left on the floor.


Related issues 1 (0 open1 closed)

Copied to CephFS - Backport #52636: pacific: MDSMonitor: removes MDS coming out of quorum electionResolvedCory SnyderActions
Actions #1

Updated by Patrick Donnelly over 4 years ago

  • Description updated (diff)
Actions #2

Updated by Patrick Donnelly over 4 years ago

  • Status changed from In Progress to New
  • Assignee deleted (Patrick Donnelly)
Actions #3

Updated by Patrick Donnelly over 4 years ago

  • Target version deleted (v15.0.0)
Actions #4

Updated by Patrick Donnelly almost 3 years ago

/ceph/teuthology-archive/pdonnell-2021-06-03_03:40:33-fs-wip-pdonnell-testing-20210603.020013-distro-basic-smithi/6148290/teuthology.log

Actions #5

Updated by Patrick Donnelly almost 3 years ago

/ceph/teuthology-archive/pdonnell-2021-06-11_18:02:10-fs-wip-pdonnell-testing-20210611.162716-distro-basic-smithi/6166663/teuthology.log

Actions #6

Updated by Patrick Donnelly almost 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • Target version set to v17.0.0
  • Backport deleted (nautilus,mimic)

/ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175768/screen

Actions #7

Updated by Patrick Donnelly over 2 years ago

  • Status changed from Triaged to Fix Under Review
  • Backport set to pacific
  • Pull request ID set to 42943
Actions #8

Updated by Patrick Donnelly over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #9

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52636: pacific: MDSMonitor: removes MDS coming out of quorum election added
Actions #10

Updated by Patrick Donnelly almost 2 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF