Bug #46042

mds: EMetablob replay too long will cause mds restart

Added by Yanhu Cao 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature:

Related issues

Copied to CephFS - Backport #46188: octopus: mds: EMetablob replay too long will cause mds restart (Resolved)
Copied to CephFS - Backport #46189: nautilus: mds: EMetablob replay too long will cause mds restart (Resolved)

History

#1 Updated by Yanhu Cao 5 months ago

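We encountered the warning 'mdlog behind on trimming', and the MDS crashed. The standby MDS then started recovering the journal and other metadata, but the journal was so large that replay took too long. During replay the heartbeat map was reported as not healthy, so the MDS skipped sending beacons. The MDS was eventually removed from the map due to lost contact and respawned.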

13:30:35.526832 7f1043717700  1 mds.3.133170 replay_start
13:30:35.526838 7f1043717700  1 mds.3.133170  recovery set is 0,1,2,4,5,6,7,8,9,10
13:30:35.526851 7f1043717700  1 mds.3.133170  waiting for osdmap 13607 (which blacklists prior instance)
13:30:35.545321 7f103cf0a700  0 mds.3.cache creating system inode with ino:0x103
13:30:35.545474 7f103cf0a700  0 mds.3.cache creating system inode with ino:0x1
13:30:54.777085 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:30:54.777099 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:30:55.773679 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:30:58.777168 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:30:58.777189 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:00.773772 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:02.777222 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:02.777246 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:05.773850 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:06.777283 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:06.777297 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:10.773928 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:10.777342 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:10.777351 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:14.777407 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:14.777421 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:15.774005 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:18.777487 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:18.777507 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:20.774098 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:22.777538 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:22.777553 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:25.774178 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:26.777612 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:26.777635 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:30.774254 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:30.777662 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:30.777671 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:34.777725 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:34.777740 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:35.774331 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:38.777787 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:38.777803 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:40.774413 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:42.777850 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:42.777864 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:45.774491 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:46.777912 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:46.777926 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:50.774536 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:50.777972 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:50.777981 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:51.664922 7f1040f12700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
13:31:51.667667 7f1043717700  1 mds.CEPH-168-0-3 map removed me (mds.-1 gid:19981704) from cluster due to lost contact; respawning
13:31:51.667672 7f1043717700  1 mds.CEPH-168-0-3 respawn
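
As a rough check on the timeline in the log above, the following Python sketch measures the time between replay_start and respawn and counts the skipped beacons. It is only an illustration: the helper names parse_time and summarize are hypothetical, the sample contains just a few lines copied from the excerpt, and the parser assumes all timestamps fall on the same day.

```python
from datetime import datetime

# A few lines copied from the log excerpt (not the full log).
LOG = """\
13:30:35.526832 7f1043717700  1 mds.3.133170 replay_start
13:30:54.777085 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:30:54.777099 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:50.777981 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:51.667672 7f1043717700  1 mds.CEPH-168-0-3 respawn
"""


def parse_time(line):
    """Parse the leading HH:MM:SS.ffffff timestamp of a log line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S.%f")


def summarize(log_text):
    """Return (seconds from replay_start to respawn, skipped-beacon count)."""
    start = end = None
    skipped = 0
    for line in log_text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.endswith("replay_start"):
            start = parse_time(line)
        elif "_send skipping beacon" in line:
            skipped += 1
        elif line.endswith(" respawn"):
            # Matches the final "respawn" line, not "...; respawning".
            end = parse_time(line)
    if start is None or end is None:
        raise ValueError("log does not contain both replay_start and respawn")
    return (end - start).total_seconds(), skipped


elapsed, skipped = summarize(LOG)
print(f"replay_start -> respawn: {elapsed:.1f} s, beacons skipped: {skipped}")
```

For the timestamps in this report, the gap between replay_start and respawn comes out at roughly 76 seconds.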

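The failure mode described in this report can be illustrated with a toy model: a watchdog that is healthy only if it was reset within a grace period, and a long replay loop that either never resets it or resets it every few events. This is a minimal sketch, not Ceph code. The Heartbeat class, the replay function, and all parameters are hypothetical; only the 15-second grace period is taken from the "timed out after 15" messages in the log. The clock is simulated, so nothing actually sleeps.

```python
class Heartbeat:
    """Toy watchdog: healthy only if reset within `grace` seconds."""

    def __init__(self, grace, now=0.0):
        self.grace = grace
        self.last_reset = now

    def reset(self, now):
        self.last_reset = now

    def is_healthy(self, now):
        return now - self.last_reset <= self.grace


def replay(num_events, secs_per_event, grace=15.0, reset_every=None):
    """Simulate replaying journal events on a fake clock.

    Returns how many events finished with the watchdog unhealthy.
    reset_every=None models a replay loop that never resets the heartbeat.
    """
    hb = Heartbeat(grace)
    now = 0.0
    unhealthy = 0
    for i in range(1, num_events + 1):
        now += secs_per_event  # simulated time spent replaying one event
        if reset_every and i % reset_every == 0:
            hb.reset(now)
        if not hb.is_healthy(now):
            unhealthy += 1
    return unhealthy


# 100 events at 1 s each: without resets the watchdog expires after 15 s.
print(replay(100, 1.0))
# Resetting every 10 events keeps the gap between resets under the grace period.
print(replay(100, 1.0, reset_every=10))
```

In this model a beacon would be skipped whenever the watchdog is unhealthy, which is the pattern visible in the log. The model is not meant to represent the change made to resolve this issue.
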
#2 Updated by Patrick Donnelly 5 months ago

  • Subject changed from EMetablob replay too long will cause mds restart to mds: EMetablob replay too long will cause mds restart
  • Status changed from New to Fix Under Review
  • Assignee set to Yanhu Cao
  • Target version set to v16.0.0
  • Source set to Community (dev)
  • Backport set to octopus,nautilus
  • Pull request ID set to 35582
  • Component(FS) MDS added

#3 Updated by Patrick Donnelly 5 months ago

  • Status changed from Fix Under Review to Pending Backport

#4 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #46188: octopus: mds: EMetablob replay too long will cause mds restart added

#5 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #46189: nautilus: mds: EMetablob replay too long will cause mds restart added

#6 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
