Bug #42919
mds: heartbeat timeout during large scale git-clone/rm workload
Status: New
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2019-11-18 22:45:29.093 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:29.093 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 19.932s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:29.642 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:29.642 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 20.481s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:30.090 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:30.164 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 20.929s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:30.642 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:30.642 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 21.481s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:31.130 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:31.130 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 21.969s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:31.643 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:31.665 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 22.482s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:32.069 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:32.069 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 22.908s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:32.650 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:32.650 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 23.489s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.140 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.140 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 23.979s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.643 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.666 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.482s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:38.426 7fa2f9e12700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2019-11-18 22:45:38.426 7fa2f9e12700 1 mds.beacon.li1190-231 MDS connection to Monitors appears to be laggy; 29.2651s since last acked beacon
2019-11-18 22:45:38.426 7fa2f760d700 1 mds.0.165 skipping upkeep work because connection to Monitors appears laggy
2019-11-18 22:45:38.428 7fa2fb09e700 1 mds.li1190-231 asok_command: session ls (complete)
2019-11-18 22:45:38.429 7fa2fb09e700 1 mds.li1190-231 asok_command: ops (starting...)
2019-11-18 22:45:38.533 7fa2fd8a3700 0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.194.37:0/3498553825 conn(0x556116586400 0x556116569800 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.563 7fa2fd0a2700 0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.222.196:0/1260798504 conn(0x55611655c400 0x556116551000 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.573 7fa2fd0a2700 0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.142.218:0/2648402227 conn(0x556116530800 0x556115773800 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.575 7fa2fd0a2700 0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.178.144:0/3718666179 conn(0x5561165a0c00 0x5561165a6800 :6801 s=READ_MESSAGE_FRONT pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.579 7fa2fb09e700 1 mds.li1190-231 asok_command: ops (complete)
2019-11-18 22:45:38.584 7fa2fb09e700 1 mds.li1190-231 asok_command: status (starting...)
2019-11-18 22:45:38.584 7fa2fb09e700 1 mds.li1190-231 asok_command: status (complete)
2019-11-18 22:45:38.584 7fa2fb09e700 1 mds.li1190-231 asok_command: get subtrees (starting...)
2019-11-18 22:45:38.588 7fa2fb09e700 1 mds.li1190-231 asok_command: get subtrees (complete)
2019-11-18 22:45:38.598 7fa2fd0a2700 0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.131.157:0/3670314862 conn(0x556116586800 0x55611658c000 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.649 7fa2fd0a2700 0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.215.5:0/2690195664 conn(0x55611657bc00 0x556116568800 :6801 s=READ_MESSAGE_FRONT pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.660 7fa2fd0a2700 0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.177.71:0/3773098837 conn(0x556116531400 0x55611654f000 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.701 7fa2f3e06700 -1 mds.0.journaler.mdlog(rw) _prezeroed got (108) Cannot send after transport endpoint shutdown
2019-11-18 22:45:38.701 7fa2f3e06700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
2019-11-18 22:45:38.703 7fa2f3e06700 -1 MDSIOContextBase: blacklisted! Restarting...
This is from a workload with 32 clients and 128 parallel git-clone processes (4 per node). There was a single active MDS, and the workload was in its "create" stage, where git is checking out the branch (although clients were at different points in the git-clone process).
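For anyone trying to reproduce this, the per-node part of the workload could be sketched roughly as below. This is not the script used for the report: the repository URL, the CephFS mount path, and the helper names `clone_commands` and `run_parallel` are all placeholders, and the rm phase is not covered.

```python
import subprocess
from pathlib import Path


def clone_commands(repo_url, dest_root, procs_per_node=4):
    """Build one `git clone` argv per process, each with its own target directory."""
    return [
        ["git", "clone", repo_url, str(Path(dest_root) / f"clone-{i}")]
        for i in range(procs_per_node)
    ]


def run_parallel(commands):
    """Start every command at once, wait for all of them, and return the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


# Placeholder repository and mount point (assumptions, not from the report).
cmds = clone_commands("https://example.com/repo.git", "/mnt/cephfs/node-0")
# run_parallel(cmds)  # run on each of the 32 client nodes: 32 x 4 = 128 clones
```

Running the same driver on every client node at roughly the same time gives the 4-per-node layout described above.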
Log attached.
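To see how the beacon lag grows over the attached log, a small parser along these lines may help. It is only a sketch matched against the "Skipping beacon heartbeat" message format in the excerpt above; the helper names `beacon_lag` and `summarize` are made up for illustration, and the sample text is two entries copied from this report.

```python
import re

# Matches the "Skipping beacon heartbeat" entries as they appear in this report.
BEACON_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ -?\d+ "
    r"mds\.beacon\.\S+ Skipping beacon heartbeat to monitors "
    r"\(last acked ([\d.]+)s ago\)"
)


def beacon_lag(text):
    """Return (timestamp, seconds since last acked beacon) pairs in log order."""
    return [(m.group(1), float(m.group(2))) for m in BEACON_RE.finditer(text)]


def summarize(text):
    """Report how many beacons were skipped and the largest lag seen."""
    lags = beacon_lag(text)
    if not lags:
        return None
    return {
        "skipped": len(lags),
        "first": lags[0],
        "worst": max(lags, key=lambda pair: pair[1]),
    }


sample = (
    "2019-11-18 22:45:29.093 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon "
    "heartbeat to monitors (last acked 19.932s ago); MDS internal heartbeat is not healthy!\n"
    "2019-11-18 22:45:33.867 7fa2f6e0c700 0 mds.beacon.li1190-231 Skipping beacon "
    "heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!\n"
)
print(summarize(sample))
```

Because the pattern is applied with `finditer` over the whole text, it works whether the log entries are one per line or run together.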
Updated by Patrick Donnelly over 4 years ago
- Related to Bug #42920: mds: removed from map due to dropped (?) beacons added