Bug #62663
MDS: inode nlink value is -1 causing MDS to continuously crash
Status: Closed
Description
All MDS daemons are continuously crashing. The logs report an inode nlink value of -1. Details of the filesystem workflow are included below.
Workflow:
This filesystem has a heavy hardlink workload. Data within the filesystem can be processed by up to 10-20 processes at a time, and each process creates a hardlink, so there can be up to 20 hardlinks to a file at once. Once processing completes, the hardlinks are removed and cleaned up.
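The workflow above can be sketched with plain POSIX hardlink calls (the paths, payload, and worker count here are illustrative, not taken from the affected cluster):

```python
import os
import tempfile

# Simulate the workflow: one source file, N workers each creating a
# hardlink, then removing it when processing completes.
NUM_WORKERS = 20  # up to 20 concurrent processes per the report

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.bin")
with open(src, "wb") as f:
    f.write(b"payload")

# Each "worker" links the source file; st_nlink counts every link.
links = []
for i in range(NUM_WORKERS):
    link = os.path.join(workdir, f"worker-{i}.lnk")
    os.link(src, link)
    links.append(link)

peak_nlink = os.stat(src).st_nlink   # 1 original name + 20 hardlinks

# Cleanup phase: workers remove their links; nlink falls back to 1.
for link in links:
    os.unlink(link)

final_nlink = os.stat(src).st_nlink  # only the original name remains
```

A healthy filesystem should never report a link count below 1 for an inode that still exists; the crash below is the MDS asserting exactly that invariant.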
Leading up to the crash, MDS performance was severely degraded, which led to a restart of the active MDS. The standby daemon experienced similar issues, and the MDS was eventually failed back to the original daemon. The original MDS then entered a continuous crash loop, taking the filesystem offline. Investigating the logs turned up the error inode 0x10005f79654 nl=-1 as well as FAILED ceph_assert(stray_in->get_inode()->nlink >= 1).
4982059 2023-07-12T13:16:33.413-0400 7f337eec5700 10 mds.0.cache.strays inode is [inode 0x10005f79654 [...10,head] ~mds0/stray1/10005f79654 auth v22224244 snaprealm=0x55b0a8b17600 DIRTYPARENT s=13123258 nl=-1 n(v0 rc2023-07-03T14:19:43.854341-0400 b13123258 1=1+0) (iversion lock) | openingsnapparents=0 dirtyparent=1 dirty=0 0x55b0aa54ec00]
4982060 2023-07-12T13:16:33.413-0400 7f337eec5700 20 mds.0.cache.strays _eval_stray_remote [dentry #0x100/stray1/10005f79654 [10,head] auth (dversion lock) v=22224244 ino=0x10005f79654 state=1342177296 | inodepin=1 dirty=0 0x55b0a9306f00]
4982061 2023-07-12T13:16:33.414-0400 7f337eec5700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/mds/StrayManager.cc: In function 'void StrayManager::_eval_stray_remote(CDentry*, CDentry*)' thread 7f337eec5700 time 2023-07-12T13:16:33.414966-0400
4982062 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/mds/StrayManager.cc: 622: FAILED ceph_assert(stray_in->get_inode()->nlink >= 1)
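The assertion enforces that a stray inode still holds at least one link when it is evaluated for removal. A minimal Python model of the failure mode (illustrative only — this is not Ceph code, and the double-decrement of the link count is an assumption about how nlink could reach -1):

```python
class Inode:
    """Toy stand-in for an inode's link count."""
    def __init__(self, nlink):
        self.nlink = nlink

def unlink(inode):
    # Each unlink drops the link count by one; nothing in this toy
    # model prevents the same final link being decremented twice.
    inode.nlink -= 1

def eval_stray(inode):
    # Mirrors the spirit of ceph_assert(nlink >= 1): a stray inode
    # being evaluated must still hold at least one link.
    if inode.nlink < 1:
        raise AssertionError(f"nl={inode.nlink}: nlink must be >= 1")

ino = Inode(nlink=1)   # last remaining hardlink
unlink(ino)            # legitimate final unlink: nlink == 0
unlink(ino)            # erroneous extra decrement: nlink == -1

try:
    eval_stray(ino)
    crashed = False
except AssertionError:
    crashed = True     # analogous to the MDS aborting on the assert
```

In the real MDS the assert aborts the daemon, and because the bad nlink is persisted, every restart replays the same evaluation and crashes again — matching the continuous crash loop described above.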
A "cephfs-data-scan scan_links" was run after removing the omap key of the object reporting the issue. The scan_links output is as follows:
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609ae80 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609ae85 from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609ae85 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609ae8f from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609ae8f expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609aec5 from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609aec5 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609aeda from 0x1000609aed9/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609aede expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609aeed expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609aefb expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af05 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af0e expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609af1b from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af1b expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609af3a from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af3a expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609af44 from 0x10005fa3bd4/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Bad nlink on 0x1000609af44 expected 1 has 0
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609af6c from 0x1000609af6b/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609b32f from 0x1000609b32c/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609c012 from 0x1000609c011/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609c968 from 0x1000609c966/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609cb02 from 0x1000609cb00/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609d3c2 from 0x1000609d3c1/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609d523 from 0x1000609d522/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609d7fc from 0x1000609d7fa/filename
2023-07-24T21:23:19.830-0400 7f00195a4740 -1 datascan.scan_links: Remove duplicated ino 0x0x1000609dbec from 0x1000609dbea/filename
It was determined that the specific inode reporting the error had three hardlinks, one of which was in the process of being deleted when the issues first appeared.
The filesystem has a single rank with a standby daemon. This cluster has had ENOSPC issues in the past due to the number of stray files generated by hardlink deletions.