Actions
Bug #42920
openmds: removed from map due to dropped (?) beacons
Status:
New
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Development
Tags:
Backport:
nautilus,mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2019-11-19 03:07:21.589 7fb9936d7700 1 mds.li1190-231 asok_command: get subtrees (starting...) 2019-11-19 03:07:21.616 7fb9936d7700 1 mds.li1190-231 asok_command: get subtrees (complete) 2019-11-19 03:07:23.751 7fb9936d7700 1 mds.li1190-231 asok_command: session ls (starting...) 2019-11-19 03:08:09.758 7fb99244b700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15 2019-11-19 03:08:09.758 7fb99244b700 1 mds.beacon.li1190-231 MDS connection to Monitors appears to be laggy; 49.1691s since last acked beacon 2019-11-19 03:08:09.760 7fb9936d7700 1 mds.li1190-231 asok_command: session ls (complete) 2019-11-19 03:08:09.760 7fb98fc46700 1 mds.0.266 skipping upkeep work because connection to Monitors appears laggy 2019-11-19 03:08:09.760 7fb9936d7700 1 mds.li1190-231 asok_command: ops (starting...) 2019-11-19 03:08:10.062 7fb995edc700 0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.192.228:0/3922646583 conn(0x5612a9fb2400 0x5612a91d8800 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby 2019-11-19 03:08:10.555 7fb995edc700 0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.178.144:0/1772090923 conn(0x5612a9ede400 0x5612a9fc8000 :6801 s=READ_MESSAGE_FRONT pgs=2 cs=1 l=0).fault server, going to standby 2019-11-19 03:08:10.570 7fb995edc700 0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.215.5:0/485684352 conn(0x5612a9edf400 0x5612a9fca000 :6801 s=READ_MESSAGE_FRONT pgs=2 cs=1 l=0).fault server, going to standby 2019-11-19 03:08:10.570 7fb9936d7700 1 mds.li1190-231 asok_command: ops (complete) 2019-11-19 03:08:10.574 7fb9936d7700 1 mds.li1190-231 asok_command: status (starting...) 2019-11-19 03:08:10.577 7fb9936d7700 1 mds.li1190-231 asok_command: status (complete) 2019-11-19 03:08:10.578 7fb9936d7700 1 mds.li1190-231 asok_command: get subtrees (starting...) 2019-11-19 03:08:10.583 7fb9936d7700 1 mds.li1190-231 asok_command: get subtrees (complete) 2019-11-19 03:08:10.631 7fb99244b700 1 mds.li1190-231 Updating MDS map to version 315 from mon.0 2019-11-19 03:08:10.632 7fb995edc700 0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.177.71:0/1318272242 conn(0x5612aa206c00 0x5612aa268800 :6801 s=OPENED pgs=3 cs=1 l=0).fault server, going to standby 2019-11-19 03:08:10.633 7fb99244b700 1 mds.li1190-231 Map removed me (mds.-1 gid:33266) from cluster due to lost contact; respawning 2019-11-19 03:08:10.633 7fb99244b700 1 mds.li1190-231 respawn! ... -7149> 2019-11-19 03:07:23.099 7fb991449700 10 monclient: tick -7148> 2019-11-19 03:07:23.099 7fb991449700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-19 03:06:53.099623) -7147> 2019-11-19 03:07:23.751 7fb9936d7700 1 mds.li1190-231 asok_command: session ls (starting...) -7146> 2019-11-19 03:07:24.590 7fb98f445700 5 mds.beacon.li1190-231 Sending beacon up:active seq 436 -7145> 2019-11-19 03:07:24.590 7fb98f445700 10 monclient: _send_mon_message to mon.li218-159 at v2:192.168.190.176:3300/0 -7144> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: tick -7143> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-19 03:07:03.099825) -7142> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: tick -7141> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-19 03:07:13.100098) -7140> 2019-11-19 03:07:26.692 7fb991449700 1 monclient: no keepalive since 2019-11-19 03:07:13.099250 (30.000862 seconds), reconnecting -7139> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: _reopen_session rank -1 -7138> 2019-11-19 03:08:09.640 7fb98b43d700 5 mds.0.log _submit_thread 158724719726~2193 : EUpdate mkdir [metablob 0x10001149da3, 3 dirs] -7137> 2019-11-19 03:08:09.758 7fb99244b700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15 -7136> 2019-11-19 03:08:09.758 7fb99244b700 1 mds.beacon.li1190-231 MDS connection to Monitors appears to be laggy; 49.1691s since last acked beacon -7135> 2019-11-19 03:08:09.758 7fb98b43d700 5 mds.0.log _submit_thread 158724721939~1890 : EUpdate openc [metablob 0x1000115f6cb, 2 dirs] -7134> 2019-11-19 03:08:09.760 7fb9936d7700 1 mds.li1190-231 asok_command: session ls (complete) -7133> 2019-11-19 03:08:09.760 7fb9936d7700 5 asok(0x5612a91f2000) AdminSocket: request 'session ls' '' to 0x5612a917c5a0 returned 12917 bytes -7132> 2019-11-19 03:08:09.760 7fb98fc46700 1 mds.0.266 skipping upkeep work because connection to Monitors appears laggy ... -3366> 2019-11-19 03:08:10.354 7fb98f445700 5 mds.beacon.li1190-231 Sending beacon up:active seq 437 ... -3> 2019-11-19 03:08:10.631 7fb99244b700 1 mds.li1190-231 Updating MDS map to version 315 from mon.0 -2> 2019-11-19 03:08:10.632 7fb995edc700 0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.177.71:0/1318272242 conn(0x5612aa206c00 0x5612aa268800 :6801 s=OPENED pgs=3 cs=1 l=0).fault server, going to standby -1> 2019-11-19 03:08:10.633 7fb99244b700 1 mds.li1190-231 Map removed me (mds.-1 gid:33266) from cluster due to lost contact; respawning 0> 2019-11-19 03:08:10.633 7fb99244b700 1 mds.li1190-231 respawn!
Log attached. Like #42919, this was from an experiment with 128 jobs running git-clone/rm on 32 nodes. Except this ticket is with two active MDS.
MDS says it's sending beacons but the monitors are not receiving them. Might be related to #36540. Will need logs with the MDS at debug_ms=1 to confirm the messages went out but no reason not to think so.
Files
Actions