Bug #62663

closed

MDS: inode nlink value is -1 causing MDS to continuously crash

Added by Austin Axworthy 9 months ago. Updated 7 months ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All MDS daemons are crashing continuously. The logs report an inode whose nlink value is -1. Details of the filesystem workflow are included below.

Workflow:
This filesystem has a heavy workload that uses hardlinks. Data within the filesystem can be processed by 10 to 20 processes at a time. Each process generates a hardlink, so up to 20 hardlinks can exist at a time. Once processing is complete, the hardlinks are removed and cleaned up.
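To illustrate the link-count bookkeeping this workflow exercises, here is a minimal Python sketch that runs against an ordinary local temporary directory, not CephFS. The function name nlink_trace and the worker-N file names are hypothetical. It only shows that each hardlink raises st_nlink by one and each unlink lowers it by one, so a healthy inode should not drop below 1 while a name still refers to it.

```python
import os
import tempfile


def nlink_trace(workers: int = 20) -> list[int]:
    """Return st_nlink of one file: initially, after linking, after unlinking."""
    trace = []
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "data")
        with open(src, "w") as f:
            f.write("payload")
        trace.append(os.stat(src).st_nlink)  # only the original name

        # Each (hypothetical) worker process takes its own hardlink.
        links = []
        for i in range(workers):
            path = os.path.join(d, f"worker-{i}")
            os.link(src, path)
            links.append(path)
        trace.append(os.stat(src).st_nlink)  # original name + one per worker

        # Processing done: the hardlinks are cleaned up again.
        for path in links:
            os.unlink(path)
        trace.append(os.stat(src).st_nlink)  # back to the original name only
    return trace


if __name__ == "__main__":
    print(nlink_trace(20))
```

With 20 workers the count goes from 1 to 21 and back to 1; the MDS assertion in this report fires because the stored count went below that floor.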
Leading up to the crash, MDS performance was severely degraded, which led to a restart of the active MDS. The secondary daemon experienced similar issues, and the MDS was eventually failed back over to the original daemon. The original MDS then entered a continuous crash loop, taking the filesystem offline. The logs showed inode 0x10005f79654 with nl=-1, as well as FAILED ceph_assert(stray_in->get_inode()->nlink >= 1).

4982059 2023-07-12T13:16:33.413-0400 7f337eec5700 10 mds.0.cache.strays  inode is [inode 0x10005f79654 [...10,head] ~mds0/stray1/10005f79654 auth v22224244 snaprealm=0x55b0a8b17600 DIRTYPARENT s=13123258 nl=-1 n(v0 rc2023-07-03T14:19:43.854341-0400 b13123258 1=1+0) (iversion lock) | openingsnapparents=0 dirtyparent=1 dirty=0 0x55b0aa54ec00]
4982060 2023-07-12T13:16:33.413-0400 7f337eec5700 20 mds.0.cache.strays _eval_stray_remote [dentry #0x100/stray1/10005f79654 [10,head] auth (dversion lock) v=22224244 ino=0x10005f79654 state=1342177296 | inodepin=1 dirty=0 0x55b0a9306f00]
4982061 2023-07-12T13:16:33.414-0400 7f337eec5700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/mds/StrayManager.cc: In function 'void StrayManager::_eval_stray_remote(CDentry*, CDentry*)' thread 7f337eec5700 time 2023-07-12T13:16:33.414966-0400
4982062 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/mds/StrayManager.cc: 622: FAILED ceph_assert(stray_in->get_inode()->nlink >= 1)
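When triaging a large MDS log, lines like the cache.strays entry above can be filtered for impossible link counts. The following is a rough sketch, assuming the debug line format shown in this report. The regular expression and the helper name find_bad_nlink are illustrative and not part of Ceph, and the second sample line is made up to show a healthy inode.

```python
import re

# Matches "[inode 0x<hex> ... nl=<int>" as printed in the excerpt above.
INODE_NL = re.compile(r"\[inode (0x[0-9a-fA-F]+) .*?\bnl=(-?\d+)")


def find_bad_nlink(lines):
    """Return (inode, nlink) pairs for logged inodes whose nlink is below 1."""
    bad = []
    for line in lines:
        m = INODE_NL.search(line)
        if m and int(m.group(2)) < 1:
            bad.append((m.group(1), int(m.group(2))))
    return bad


sample = [
    # Abbreviated from the log excerpt in this report.
    "2023-07-12T13:16:33.413-0400 7f337eec5700 10 mds.0.cache.strays  inode is "
    "[inode 0x10005f79654 [...10,head] ~mds0/stray1/10005f79654 auth v22224244 "
    "snaprealm=0x55b0a8b17600 DIRTYPARENT s=13123258 nl=-1 ...",
    # Hypothetical healthy line, for contrast only.
    "[inode 0x1 [...2,head] /some/path auth v1 s=0 nl=2 ...",
]

if __name__ == "__main__":
    for ino, nl in find_bad_nlink(sample):
        print(f"inode {ino} has nlink {nl}")
```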

A "cephfs-data-scan scan_links" was run after removing the omap key of the object reporting the issue. The output of scan_links is as follows.

2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609ae80 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609ae85 from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609ae85 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609ae8f from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609ae8f expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609aec5 from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609aec5 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609aeda from 0x1000609aed9/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609aede expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609aeed expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609aefb expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af05 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af0e expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609af1b from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af1b expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609af3a from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af3a expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609af44 from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af44 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609af6c from 0x1000609af6b/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609b32f from 0x1000609b32c/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609c012 from 0x1000609c011/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609c968 from 0x1000609c966/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609cb02 from 0x1000609cb00/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609d3c2 from 0x1000609d3c1/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609d523 from 0x1000609d522/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609d7fc from 0x1000609d7fa/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609dbec from 0x1000609dbea/filename
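For output of this size it can help to condense the scan_links messages into a per-inode and per-directory summary. The sketch below assumes only the two message shapes visible in this report ("Bad nlink on ..." and "Remove duplicated ino ..."). The helper name summarize and both regular expressions are illustrative and not part of cephfs-data-scan.

```python
import re
from collections import Counter

BAD_NLINK = re.compile(r"Bad nlink on (0x[0-9a-f]+) expected (\d+) has (-?\d+)")
# The tool prints the duplicated inode with a doubled "0x" prefix in this report.
DUP_INO = re.compile(r"Remove duplicated ino (?:0x)+([0-9a-f]+) from (0x[0-9a-f]+)/")


def summarize(lines):
    """Return ({inode: (expected, actual)}, Counter of duplicates per parent dir)."""
    bad = {}
    dups = Counter()
    for line in lines:
        m = BAD_NLINK.search(line)
        if m:
            bad[m.group(1)] = (int(m.group(2)), int(m.group(3)))
            continue
        m = DUP_INO.search(line)
        if m:
            dups[m.group(2)] += 1
    return bad, dups


# Four lines copied from the output above.
sample = [
    "2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609ae80 expected 1 has 0",
    "2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609ae85 from 0x10005fa3bd4/filename",
    "2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609ae85 expected 1 has 0",
    "2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609aeda from 0x1000609aed9/filename",
]

if __name__ == "__main__":
    bad, dups = summarize(sample)
    print(len(bad), "inodes with bad nlink")
    for parent, count in dups.most_common():
        print(parent, count)
```

Grouping duplicates by parent directory makes it easier to see that many of the removed entries in this report come from the same directory, 0x10005fa3bd4.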

It was determined that the specific inode reporting the error had 3 hardlinks, one of which was in the process of being deleted when the issues first presented.

The filesystem has a single rank with a standby daemon. This cluster has had ENOSPC issues in the past due to the number of stray files generated by the deletion of hardlinks.
