Bug #42919

open

mds: heartbeat timeout during large scale git-clone/rm workload

Added by Patrick Donnelly over 4 years ago. Updated over 4 years ago.

Status: New
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-11-18 22:45:29.093 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:29.093 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 19.932s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:29.642 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:29.642 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 20.481s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:30.090 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:30.164 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 20.929s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:30.642 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:30.642 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 21.481s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:31.130 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:31.130 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 21.969s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:31.643 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:31.665 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 22.482s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:32.069 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:32.069 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 22.908s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:32.650 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:32.650 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 23.489s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.140 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.140 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 23.979s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.643 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.666 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.482s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:33.867 7fa2f6e0c700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!
2019-11-18 22:45:38.426 7fa2f9e12700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2019-11-18 22:45:38.426 7fa2f9e12700  1 mds.beacon.li1190-231 MDS connection to Monitors appears to be laggy; 29.2651s since last acked beacon
2019-11-18 22:45:38.426 7fa2f760d700  1 mds.0.165 skipping upkeep work because connection to Monitors appears laggy
2019-11-18 22:45:38.428 7fa2fb09e700  1 mds.li1190-231 asok_command: session ls (complete)
2019-11-18 22:45:38.429 7fa2fb09e700  1 mds.li1190-231 asok_command: ops (starting...)
2019-11-18 22:45:38.533 7fa2fd8a3700  0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.194.37:0/3498553825 conn(0x556116586400 0x556116569800 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.563 7fa2fd0a2700  0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.222.196:0/1260798504 conn(0x55611655c400 0x556116551000 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.573 7fa2fd0a2700  0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.142.218:0/2648402227 conn(0x556116530800 0x556115773800 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.575 7fa2fd0a2700  0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.178.144:0/3718666179 conn(0x5561165a0c00 0x5561165a6800 :6801 s=READ_MESSAGE_FRONT pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.579 7fa2fb09e700  1 mds.li1190-231 asok_command: ops (complete)
2019-11-18 22:45:38.584 7fa2fb09e700  1 mds.li1190-231 asok_command: status (starting...)
2019-11-18 22:45:38.584 7fa2fb09e700  1 mds.li1190-231 asok_command: status (complete)
2019-11-18 22:45:38.584 7fa2fb09e700  1 mds.li1190-231 asok_command: get subtrees (starting...)
2019-11-18 22:45:38.588 7fa2fb09e700  1 mds.li1190-231 asok_command: get subtrees (complete)
2019-11-18 22:45:38.598 7fa2fd0a2700  0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.131.157:0/3670314862 conn(0x556116586800 0x55611658c000 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.649 7fa2fd0a2700  0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.215.5:0/2690195664 conn(0x55611657bc00 0x556116568800 :6801 s=READ_MESSAGE_FRONT pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.660 7fa2fd0a2700  0 --1- [v2:192.168.140.68:6800/3233224676,v1:192.168.140.68:6801/3233224676] >> v1:192.168.177.71:0/3773098837 conn(0x556116531400 0x55611654f000 :6801 s=OPENED pgs=2 cs=1 l=0).fault server, going to standby
2019-11-18 22:45:38.701 7fa2f3e06700 -1 mds.0.journaler.mdlog(rw) _prezeroed got (108) Cannot send after transport endpoint shutdown
2019-11-18 22:45:38.701 7fa2f3e06700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
2019-11-18 22:45:38.703 7fa2f3e06700 -1 MDSIOContextBase: blacklisted!  Restarting...

This is from a workload with 32 clients running 128 parallel git-clone processes (4 per node). The timeout hit the single active MDS during the "create" stage of the workload, where git is checking out the branch, although clients were at different points in the git-clone process.

Log attached.
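
For anyone triaging the attached log, a small script can show how the time since the last acknowledged beacon grows before the MDS is blacklisted. The following is a minimal sketch, not part of the original report: the function names `beacon_gaps` and `first_blacklist` are hypothetical, and the patterns assume the message wording seen in the excerpt above.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Leading timestamp as it appears in the excerpt, e.g. "2019-11-18 22:45:29.093".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")

# Assumed wording, copied from the excerpt:
#   "Skipping beacon heartbeat to monitors (last acked 19.932s ago); ..."
SKIP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"Skipping beacon heartbeat to monitors \(last acked (?P<gap>[\d.]+)s ago\)"
)


def beacon_gaps(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (timestamp, seconds since last acked beacon) per skipped beacon."""
    gaps = []
    for line in lines:
        m = SKIP_RE.search(line)
        if m:
            gaps.append((m.group("ts"), float(m.group("gap"))))
    return gaps


def first_blacklist(lines: Iterable[str]) -> Optional[str]:
    """Return the timestamp of the first 'blacklisted!' line, or None."""
    for line in lines:
        if "blacklisted!" in line:
            m = TS_RE.match(line)
            if m:
                return m.group(1)
    return None


# Three lines taken verbatim from the excerpt above.
SAMPLE = [
    "2019-11-18 22:45:29.093 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 19.932s ago); MDS internal heartbeat is not healthy!",
    "2019-11-18 22:45:33.867 7fa2f6e0c700  0 mds.beacon.li1190-231 Skipping beacon heartbeat to monitors (last acked 24.706s ago); MDS internal heartbeat is not healthy!",
    "2019-11-18 22:45:38.703 7fa2f3e06700 -1 MDSIOContextBase: blacklisted!  Restarting...",
]

if __name__ == "__main__":
    for ts, gap in beacon_gaps(SAMPLE):
        print(f"{ts}  beacon skipped, last acked {gap:.3f}s ago")
    print("blacklisted at:", first_blacklist(SAMPLE))
```

To run this against the attachment rather than the sample, the lines could be read with `gzip.open(path, "rt")` from the Python standard library and passed to the same functions as a list.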


Files

ceph-mds.li1190-231.log.1.gz (347 KB) Patrick Donnelly, 11/20/2019 10:30 PM

Related issues 1 (1 open, 0 closed)

Related to CephFS - Bug #42920: mds: removed from map due to dropped (?) beacons (New)

#1

Updated by Patrick Donnelly over 4 years ago

  • Related to Bug #42920: mds: removed from map due to dropped (?) beacons added

#2

Updated by Patrick Donnelly over 4 years ago

  • Target version deleted (v15.0.0)