Bug #46042: mds: EMetablob replay too long will cause mds restart - CephFS - Ceph

Bug #46042

closed

mds: EMetablob replay too long will cause mds restart

Added by Yanhu Cao almost 4 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Yanhu Cao

Category:

Target version:

Ceph - v16.0.0

% Done:

Source:

Community (dev)

Tags:

Backport:

octopus,nautilus

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

Pull request ID:

35582

Crash signature (v1):

Crash signature (v2):

Related issues 2 (0 open — 2 closed)

Updated by Yanhu Cao almost 4 years ago

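We hit the 'mdlog behind on trimming' warning and the MDS crashed. A standby MDS then took over and began recovering the journal and other metadata, but the journal was so large that replay took too long. During replay the heartbeat map became unhealthy ('MDSRank' timed out after 15), so the MDS skipped sending beacons; it was then removed from the map due to lost contact and respawned.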

13:30:35.526832 7f1043717700  1 mds.3.133170 replay_start
13:30:35.526838 7f1043717700  1 mds.3.133170  recovery set is 0,1,2,4,5,6,7,8,9,10
13:30:35.526851 7f1043717700  1 mds.3.133170  waiting for osdmap 13607 (which blacklists prior instance)
13:30:35.545321 7f103cf0a700  0 mds.3.cache creating system inode with ino:0x103
13:30:35.545474 7f103cf0a700  0 mds.3.cache creating system inode with ino:0x1
13:30:54.777085 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:30:54.777099 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:30:55.773679 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:30:58.777168 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:30:58.777189 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:00.773772 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:02.777222 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:02.777246 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:05.773850 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:06.777283 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:06.777297 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:10.773928 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:10.777342 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:10.777351 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:14.777407 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:14.777421 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:15.774005 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:18.777487 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:18.777507 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:20.774098 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:22.777538 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:22.777553 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:25.774178 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:26.777612 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:26.777635 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:30.774254 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:30.777662 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:30.777671 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:34.777725 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:34.777740 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:35.774331 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:38.777787 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:38.777803 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:40.774413 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:42.777850 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:42.777864 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:45.774491 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:46.777912 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:46.777926 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:50.774536 7f1044719700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:50.777972 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:31:50.777981 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:51.664922 7f1040f12700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
13:31:51.667667 7f1043717700  1 mds.CEPH-168-0-3 map removed me (mds.-1 gid:19981704) from cluster due to lost contact; respawning
13:31:51.667672 7f1043717700  1 mds.CEPH-168-0-3 respawn
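
To quantify the log above, the following is a minimal Python sketch that measures the time between `replay_start` and `respawn` and counts the skipped beacons. It is not part of Ceph: the function name `summarize`, the dictionary keys, and the abbreviated `SAMPLE` excerpt are assumptions made for illustration, and the parser assumes the timestamp layout shown in this report with all lines falling on the same day.

```python
from datetime import datetime

# Abbreviated excerpt of the log in this report (illustrative subset only).
SAMPLE = """\
13:30:35.526832 7f1043717700  1 mds.3.133170 replay_start
13:30:54.777085 7f1040711700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
13:30:54.777099 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:50.777981 7f1040711700  1 mds.beacon.CEPH-168-0-3 _send skipping beacon, heartbeat map not healthy
13:31:51.667667 7f1043717700  1 mds.CEPH-168-0-3 map removed me (mds.-1 gid:19981704) from cluster due to lost contact; respawning
13:31:51.667672 7f1043717700  1 mds.CEPH-168-0-3 respawn
"""


def summarize(log_text):
    """Return replay-to-respawn duration and skipped-beacon count.

    Hypothetical helper: assumes each line is
    '<HH:MM:SS.ffffff> <thread> <level> <message>' and that the log
    does not cross midnight.
    """
    replay_start = None
    respawn = None
    skipped = 0
    for line in log_text.splitlines():
        parts = line.split(None, 3)  # timestamp, thread id, level, message
        if len(parts) < 4:
            continue
        ts = datetime.strptime(parts[0], "%H:%M:%S.%f")
        msg = parts[3]
        if msg.endswith("replay_start"):
            replay_start = ts
        elif "_send skipping beacon" in msg:
            skipped += 1
        elif msg.endswith(" respawn"):
            respawn = ts
    seconds = None
    if replay_start is not None and respawn is not None:
        seconds = (respawn - replay_start).total_seconds()
    return {"seconds_to_respawn": seconds, "skipped_beacons": skipped}


print(summarize(SAMPLE))
```

Run against the full log rather than the excerpt, the same function would report every skipped beacon between the first heartbeat timeout and the respawn.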

Updated by Patrick Donnelly almost 4 years ago

Subject changed from EMetablob replay too long will cause mds restart to mds: EMetablob replay too long will cause mds restart
Status changed from New to Fix Under Review
Assignee set to Yanhu Cao
Target version set to v16.0.0
Source set to Community (dev)
Backport set to octopus,nautilus
Pull request ID set to 35582
Component(FS) MDS added

Updated by Patrick Donnelly almost 4 years ago

Status changed from Fix Under Review to Pending Backport

Updated by Nathan Cutler almost 4 years ago

Copied to Backport #46188: octopus: mds: EMetablob replay too long will cause mds restart added

Updated by Nathan Cutler almost 4 years ago

Copied to Backport #46189: nautilus: mds: EMetablob replay too long will cause mds restart added

Updated by Nathan Cutler over 3 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
