Bug #42920


mds: removed from map due to dropped (?) beacons

Added by Patrick Donnelly over 4 years ago. Updated over 4 years ago.

Status: New
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-11-19 03:07:21.589 7fb9936d7700  1 mds.li1190-231 asok_command: get subtrees (starting...)
2019-11-19 03:07:21.616 7fb9936d7700  1 mds.li1190-231 asok_command: get subtrees (complete)
2019-11-19 03:07:23.751 7fb9936d7700  1 mds.li1190-231 asok_command: session ls (starting...)
2019-11-19 03:08:09.758 7fb99244b700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2019-11-19 03:08:09.758 7fb99244b700  1 mds.beacon.li1190-231 MDS connection to Monitors appears to be laggy; 49.1691s since last acked beacon
2019-11-19 03:08:09.760 7fb9936d7700  1 mds.li1190-231 asok_command: session ls (complete)
2019-11-19 03:08:09.760 7fb98fc46700  1 mds.0.266 skipping upkeep work because connection to Monitors appears laggy
2019-11-19 03:08:09.760 7fb9936d7700  1 mds.li1190-231 asok_command: ops (starting...)
2019-11-19 03:08:10.062 7fb995edc700  0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.192.228:0/3922646583 conn(0x5612a9fb2400 0x5612a91d8800 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-19 03:08:10.555 7fb995edc700  0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.178.144:0/1772090923 conn(0x5612a9ede400 0x5612a9fc8000 :6801 s=READ_MESSAGE_FRONT pgs=2 cs=1 l=0).fault server, going to standby
2019-11-19 03:08:10.570 7fb995edc700  0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.215.5:0/485684352 conn(0x5612a9edf400 0x5612a9fca000 :6801 s=READ_MESSAGE_FRONT pgs=2 cs=1 l=0).fault server, going to standby
2019-11-19 03:08:10.570 7fb9936d7700  1 mds.li1190-231 asok_command: ops (complete)
2019-11-19 03:08:10.574 7fb9936d7700  1 mds.li1190-231 asok_command: status (starting...)
2019-11-19 03:08:10.577 7fb9936d7700  1 mds.li1190-231 asok_command: status (complete)
2019-11-19 03:08:10.578 7fb9936d7700  1 mds.li1190-231 asok_command: get subtrees (starting...)
2019-11-19 03:08:10.583 7fb9936d7700  1 mds.li1190-231 asok_command: get subtrees (complete)
2019-11-19 03:08:10.631 7fb99244b700  1 mds.li1190-231 Updating MDS map to version 315 from mon.0
2019-11-19 03:08:10.632 7fb995edc700  0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.177.71:0/1318272242 conn(0x5612aa206c00 0x5612aa268800 :6801 s=OPENED pgs=3 cs=1 l=0).fault server, going to standby
2019-11-19 03:08:10.633 7fb99244b700  1 mds.li1190-231 Map removed me (mds.-1 gid:33266) from cluster due to lost contact; respawning
2019-11-19 03:08:10.633 7fb99244b700  1 mds.li1190-231 respawn!
...
 -7149> 2019-11-19 03:07:23.099 7fb991449700 10 monclient: tick
 -7148> 2019-11-19 03:07:23.099 7fb991449700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-19 03:06:53.099623)
 -7147> 2019-11-19 03:07:23.751 7fb9936d7700  1 mds.li1190-231 asok_command: session ls (starting...)
 -7146> 2019-11-19 03:07:24.590 7fb98f445700  5 mds.beacon.li1190-231 Sending beacon up:active seq 436
 -7145> 2019-11-19 03:07:24.590 7fb98f445700 10 monclient: _send_mon_message to mon.li218-159 at v2:192.168.190.176:3300/0
 -7144> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: tick
 -7143> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-19 03:07:03.099825)
 -7142> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: tick
 -7141> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-19 03:07:13.100098)
 -7140> 2019-11-19 03:07:26.692 7fb991449700  1 monclient: no keepalive since 2019-11-19 03:07:13.099250 (30.000862 seconds), reconnecting
 -7139> 2019-11-19 03:07:26.692 7fb991449700 10 monclient: _reopen_session rank -1
 -7138> 2019-11-19 03:08:09.640 7fb98b43d700  5 mds.0.log _submit_thread 158724719726~2193 : EUpdate mkdir [metablob 0x10001149da3, 3 dirs]
 -7137> 2019-11-19 03:08:09.758 7fb99244b700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
 -7136> 2019-11-19 03:08:09.758 7fb99244b700  1 mds.beacon.li1190-231 MDS connection to Monitors appears to be laggy; 49.1691s since last acked beacon
 -7135> 2019-11-19 03:08:09.758 7fb98b43d700  5 mds.0.log _submit_thread 158724721939~1890 : EUpdate openc [metablob 0x1000115f6cb, 2 dirs]
 -7134> 2019-11-19 03:08:09.760 7fb9936d7700  1 mds.li1190-231 asok_command: session ls (complete)
 -7133> 2019-11-19 03:08:09.760 7fb9936d7700  5 asok(0x5612a91f2000) AdminSocket: request 'session ls' '' to 0x5612a917c5a0 returned 12917 bytes
 -7132> 2019-11-19 03:08:09.760 7fb98fc46700  1 mds.0.266 skipping upkeep work because connection to Monitors appears laggy
...
 -3366> 2019-11-19 03:08:10.354 7fb98f445700  5 mds.beacon.li1190-231 Sending beacon up:active seq 437
...
    -3> 2019-11-19 03:08:10.631 7fb99244b700  1 mds.li1190-231 Updating MDS map to version 315 from mon.0
    -2> 2019-11-19 03:08:10.632 7fb995edc700  0 --1- [v2:192.168.140.68:6800/3353277732,v1:192.168.140.68:6801/3353277732] >> v1:192.168.177.71:0/1318272242 conn(0x5612aa206c00 0x5612aa268800 :6801 s=OPENED pgs=3 cs=1 l=0).fault server, going to standby
    -1> 2019-11-19 03:08:10.633 7fb99244b700  1 mds.li1190-231 Map removed me (mds.-1 gid:33266) from cluster due to lost contact; respawning
     0> 2019-11-19 03:08:10.633 7fb99244b700  1 mds.li1190-231 respawn!

Log attached. Like #42919, this came from an experiment with 128 jobs running git-clone/rm across 32 nodes, except this ticket was hit with two active MDS.

The MDS says it is sending beacons but the monitors are not receiving them. This might be related to #36540. We will need logs with the MDS at debug_ms=1 to confirm the messages actually went out, though there is no reason to think they did not.
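For context on the "appears to be laggy; 49.1691s since last acked beacon" line above, here is a minimal Python sketch of the check the MDS beacon logic performs. This is an illustration, not Ceph's actual C++ implementation; the 15-second grace value is an assumption mirroring the `'MDSRank' had timed out after 15` heartbeat line and the default `mds_beacon_grace`.

```python
# Sketch of the MDS beacon "laggy" check (assumed logic, not Ceph source).
# The MDS tracks when the monitors last acked a beacon; if that ack is
# older than the grace period, it considers its monitor connection laggy
# and skips work such as "upkeep" (see the log lines in this ticket).

MDS_BEACON_GRACE = 15.0  # seconds; assumed default of mds_beacon_grace


def beacon_laggy(last_acked: float, now: float,
                 grace: float = MDS_BEACON_GRACE) -> tuple[bool, float]:
    """Return (laggy, seconds since the last acked beacon)."""
    since = now - last_acked
    return since > grace, since


# The ticket's log reports 49.1691 s since the last acked beacon,
# well past the 15 s grace period, so the MDS declares itself laggy.
laggy, since = beacon_laggy(last_acked=0.0, now=49.1691)
print(f"laggy={laggy}, {since:.4f}s since last acked beacon")
```

In the log above, the monitors then remove the MDS from the map (version 315) for lost contact before it recovers, which is exactly the failure mode this ticket describes.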


Files

ceph-mon.li456-243.log.1.gz (72.4 KB) - Patrick Donnelly, 11/20/2019 10:34 PM
ceph-mds.li1190-231.log.1.gz (312 KB) - Patrick Donnelly, 11/20/2019 10:34 PM

Related issues (2 open, 0 closed)

Related to Messengers - Bug #36540: msg: messages are queued but not sent (New)
Related to CephFS - Bug #42919: mds: heartbeat timeout during large scale git-clone/rm workload (New)

#1

Updated by Patrick Donnelly over 4 years ago

  • Related to Bug #36540: msg: messages are queued but not sent added
#2

Updated by Patrick Donnelly over 4 years ago

  • Related to Bug #42919: mds: heartbeat timeout during large scale git-clone/rm workload added
#3

Updated by Patrick Donnelly over 4 years ago

  • Target version deleted (v15.0.0)