Bug #49132

mds crashed "assert_condition": "state == LOCK_XLOCK || state == LOCK_XLOCKDONE",

Added by Kenneth Waegeman almost 2 years ago. Updated 22 days ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific, quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[root@mds06 ~]# ceph crash ls-new
ID                                                               ENTITY    NEW 
2021-02-02_13:37:07.487784Z_d50629e2-49f0-47ca-b305-40937ad11dbf mds.mds06  *  
[root@mds06 ~]# ceph crash info 2021-02-02_13:37:07.487784Z_d50629e2-49f0-47ca-b305-40937ad11dbf
{
    "os_version_id": "7", 
    "assert_condition": "state == LOCK_XLOCK || state == LOCK_XLOCKDONE", 
    "utsname_release": "3.10.0-1160.6.1.el7.x86_64", 
    "os_name": "CentOS Linux", 
    "entity_name": "mds.mds06", 
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/mds/ScatterLock.h", 
    "timestamp": "2021-02-02 13:37:07.487784Z", 
    "process_name": "ceph-mds", 
    "utsname_machine": "x86_64", 
    "assert_line": 85, 
    "utsname_sysname": "Linux", 
    "os_version": "7 (Core)", 
    "os_id": "centos", 
    "assert_thread_name": "fn_anonymous", 
    "utsname_version": "#1 SMP Thu Nov 19 12:50:20 CET 2020", 
    "backtrace": [
        "(()+0xf630) [0x7fba63e1a630]", 
        "(gsignal()+0x37) [0x7fba629f7387]", 
        "(abort()+0x148) [0x7fba629f8a78]", 
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7fba65f65b76]", 
        "(()+0x25ccef) [0x7fba65f65cef]", 
        "(MDCache::truncate_inode(CInode*, LogSegment*)+0x44e) [0x55ea2196d49e]", 
        "(C_MDS_inode_update_finish::finish(int)+0x133) [0x55ea21947b83]", 
        "(MDSContext::complete(int)+0x74) [0x55ea21b0db94]", 
        "(MDSIOContextBase::complete(int)+0x16f) [0x55ea21b0ddef]", 
        "(MDSLogContextBase::complete(int)+0x40) [0x55ea21b0e000]", 
        "(Finisher::finisher_thread_entry()+0x16f) [0x7fba65fef75f]", 
        "(()+0x7ea5) [0x7fba63e12ea5]", 
        "(clone()+0x6d) [0x7fba62abf96d]" 
    ], 
    "utsname_hostname": "mds06.gigalith.os", 
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7fba5626e700 time 2021-02-02 14:37:07.486072\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/mds/ScatterLock.h: 85: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)\n", 
    "crash_id": "2021-02-02_13:37:07.487784Z_d50629e2-49f0-47ca-b305-40937ad11dbf", 
    "assert_func": "void ScatterLock::set_xlock_snap_sync(MDSContext*)", 
    "ceph_version": "14.2.16" 
}
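When several of these crash reports accumulate, the per-crash JSON lends itself to scripted triage. A minimal sketch (field names taken from the report above; the two sample entries are abbreviated copies of reports in this ticket — in practice each dict would come from `ceph crash info <id>`, which emits JSON):

```python
from collections import Counter

# Minimal triage sketch: group crash reports by failing assertion.
# The entries below are abbreviated copies of the reports in this
# ticket; real input would be parsed from `ceph crash info <id>`.
reports = [
    {"assert_func": "void ScatterLock::set_xlock_snap_sync(MDSContext*)",
     "assert_condition": "state == LOCK_XLOCK || state == LOCK_XLOCKDONE",
     "entity_name": "mds.mds06", "ceph_version": "14.2.16"},
    {"assert_func": "void ScatterLock::set_xlock_snap_sync(MDSContext*)",
     "assert_condition": "state == LOCK_XLOCK || state == LOCK_XLOCKDONE",
     "entity_name": "mds.popeye-mds-2-09", "ceph_version": "14.2.20"},
]

# Count how many crashes share the same failing function.
by_assert = Counter(r["assert_func"] for r in reports)
for func, n in by_assert.items():
    print(f"{n}x {func}")
```

Grouping by `assert_func` (or the full `assert_msg`) makes it easy to confirm that separate incidents, like the ones reported below, are the same signature.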

mds-logs.gz - MDS logs for 6/15/2021 crash (150 KB) Andras Pataki, 06/16/2021 12:17 AM

History

#1 Updated by Sage Weil over 1 year ago

  • Project changed from Ceph to CephFS

#2 Updated by Patrick Donnelly over 1 year ago

  • Description updated (diff)

#3 Updated by Andras Pataki over 1 year ago

We also had a crash that resembles this quite closely - I'm attaching the MDS logs. Are there any leads regarding potential causes for this issue?
The cluster is on CentOS 8, Ceph Nautilus 14.2.20.

{
    "crash_id": "2021-06-15_04:23:24.432205Z_167f4609-298f-45e8-b0db-fc04b5bbacd0",
    "timestamp": "2021-06-15 04:23:24.432205Z",
    "process_name": "ceph-mds",
    "entity_name": "mds.popeye-mds-2-09",
    "ceph_version": "14.2.20",
    "utsname_hostname": "popeye-mds-2-09",
    "utsname_sysname": "Linux",
    "utsname_release": "5.10.32.2.fi",
    "utsname_version": "#1 SMP Fri Apr 23 03:57:43 EDT 2021",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8",
    "assert_condition": "state == LOCK_XLOCK || state == LOCK_XLOCKDONE",
    "assert_func": "void ScatterLock::set_xlock_snap_sync(MDSContext*)",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.20/rpm/el8/BUILD/ceph-14.2.20/src/mds/ScatterLock.h",
    "assert_line": 85,
    "assert_thread_name": "fn_anonymous",
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.20/rpm/el8/BUILD/ceph-14.2.20/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7fffde9cd700 time 2021-06-14 21:23:24.430600\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.20/rpm/el8/BUILD/ceph-14.2.20/src/mds/ScatterLock.h: 85: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)\n",
    "backtrace": [
        "(()+0x12b20) [0x7fffecd8bb20]",
        "(gsignal()+0x10f) [0x7fffeb7dc7ff]",
        "(abort()+0x127) [0x7fffeb7c6c35]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7fffeef93a8b]",
        "(()+0x275c54) [0x7fffeef93c54]",
        "(ScatterLock::set_xlock_snap_sync(MDSContext*)+0x1eb) [0x5555556b670b]",
        "(MDCache::truncate_inode(CInode*, LogSegment*)+0x267) [0x555555643b37]",
        "(C_MDS_inode_update_finish::finish(int)+0x133) [0x55555561bbe3]",
        "(MDSContext::complete(int)+0x7f) [0x5555558067df]",
        "(MDSIOContextBase::complete(int)+0x17f) [0x555555806a8f]",
        "(MDSLogContextBase::complete(int)+0x44) [0x555555806d54]",
        "(Finisher::finisher_thread_entry()+0x18d) [0x7fffef02462d]",
        "(()+0x814a) [0x7fffecd8114a]",
        "(clone()+0x43) [0x7fffeb8a1f23]" 
    ]
}

    -1> 2021-06-14 21:23:24.430 7fffde9cd700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.20/rpm/el8/BUILD/ceph-14.2.20/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7fffde9cd700 time 2021-06-14 21:23:24.430600
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.20/rpm/el8/BUILD/ceph-14.2.20/src/mds/ScatterLock.h: 85: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)

 ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7fffeef93a3a]
 2: (()+0x275c54) [0x7fffeef93c54]
 3: (ScatterLock::set_xlock_snap_sync(MDSContext*)+0x1eb) [0x5555556b670b]
 4: (MDCache::truncate_inode(CInode*, LogSegment*)+0x267) [0x555555643b37]
 5: (C_MDS_inode_update_finish::finish(int)+0x133) [0x55555561bbe3]
 6: (MDSContext::complete(int)+0x7f) [0x5555558067df]
 7: (MDSIOContextBase::complete(int)+0x17f) [0x555555806a8f]
 8: (MDSLogContextBase::complete(int)+0x44) [0x555555806d54]
 9: (Finisher::finisher_thread_entry()+0x18d) [0x7fffef02462d]
 10: (()+0x814a) [0x7fffecd8114a]
 11: (clone()+0x43) [0x7fffeb8a1f23]

     0> 2021-06-14 21:23:24.430 7fffde9cd700 -1 *** Caught signal (Aborted) **
 in thread 7fffde9cd700 thread_name:fn_anonymous

 ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)
 1: (()+0x12b20) [0x7fffecd8bb20]
 2: (gsignal()+0x10f) [0x7fffeb7dc7ff]
 3: (abort()+0x127) [0x7fffeb7c6c35]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7fffeef93a8b]
 5: (()+0x275c54) [0x7fffeef93c54]
 6: (ScatterLock::set_xlock_snap_sync(MDSContext*)+0x1eb) [0x5555556b670b]
 7: (MDCache::truncate_inode(CInode*, LogSegment*)+0x267) [0x555555643b37]
 8: (C_MDS_inode_update_finish::finish(int)+0x133) [0x55555561bbe3]
 9: (MDSContext::complete(int)+0x7f) [0x5555558067df]
 10: (MDSIOContextBase::complete(int)+0x17f) [0x555555806a8f]
 11: (MDSLogContextBase::complete(int)+0x44) [0x555555806d54]
 12: (Finisher::finisher_thread_entry()+0x18d) [0x7fffef02462d]
 13: (()+0x814a) [0x7fffecd8114a]
 14: (clone()+0x43) [0x7fffeb8a1f23]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#4 Updated by Andras Pataki over 1 year ago

Any ideas about the cause of these crashes? Lately, on one of our large clusters, we've been getting more and more of them - three MDS crashes of this flavor just today. Here is the JSON crash report for one of them:

    {
        "crash_id": "2021-08-02_15:37:30.810116Z_44963d0c-37da-4bf8-aa06-c602452b0423",
        "timestamp": "2021-08-02 15:37:30.810116Z",
        "process_name": "ceph-mds",
        "entity_name": "mds.popeye-mds-0-08",
        "ceph_version": "14.2.20",
        "utsname_hostname": "popeye-mds-0-08",
        "utsname_sysname": "Linux",
        "utsname_release": "5.10.32.2.fi",
        "utsname_version": "#1 SMP Fri Apr 23 03:57:43 EDT 2021",
        "utsname_machine": "x86_64",
        "os_name": "CentOS Linux",
        "os_id": "centos",
        "os_version_id": "8",
        "os_version": "8",
        "assert_condition": "state == LOCK_XLOCK || state == LOCK_XLOCKDONE",
        "assert_func": "void ScatterLock::set_xlock_snap_sync(MDSContext*)",
        "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.20/rpm/el8/BUILD/ceph-14.2.20/src/mds/ScatterLock.h",
        "assert_line": 85,
        "assert_thread_name": "fn_anonymous",
        "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.20/rpm/el8/BUILD/ceph-14.2.20/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7fffde9cd700 time 2021-08-02 08:37:30.808586\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/14.2.20/rpm/el8/BUILD/ceph-14.2.20/src/mds/ScatterLock.h: 85: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)\n",
        "backtrace": [
            "(()+0x12b20) [0x7fffecd8bb20]",
            "(gsignal()+0x10f) [0x7fffeb7dc7ff]",
            "(abort()+0x127) [0x7fffeb7c6c35]",
            "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7fffeef93a8b]",
            "(()+0x275c54) [0x7fffeef93c54]",
            "(ScatterLock::set_xlock_snap_sync(MDSContext*)+0x1eb) [0x5555556b670b]",
            "(MDCache::truncate_inode(CInode*, LogSegment*)+0x267) [0x555555643b37]",
            "(C_MDS_inode_update_finish::finish(int)+0x133) [0x55555561bbe3]",
            "(MDSContext::complete(int)+0x7f) [0x5555558067df]",
            "(MDSIOContextBase::complete(int)+0x17f) [0x555555806a8f]",
            "(MDSLogContextBase::complete(int)+0x44) [0x555555806d54]",
            "(Finisher::finisher_thread_entry()+0x18d) [0x7fffef02462d]",
            "(()+0x814a) [0x7fffecd8114a]",
            "(clone()+0x43) [0x7fffeb8a1f23]" 
        ]
    }

On our other large cluster we don't see such crashes. One major configuration difference is that on this cluster we take daily snapshots (a rolling window of the past 7 days). Could it be related to snapshot trimming?

I'm happy to help with the debugging if I can get some pointers on what would be helpful to do.

#5 Updated by Patrick Donnelly over 1 year ago

  • Status changed from New to Triaged
  • Assignee set to Xiubo Li
  • Target version set to v17.0.0
  • Backport set to pacific
  • Component(FS) MDS added
  • Labels (FS) crash added

#6 Updated by Xiubo Li over 1 year ago

Andras Pataki wrote:

[...]

On our other large cluster we don't see such crashes. One major configuration difference is that on this cluster we take daily snapshots (a rolling window of the past 7 days). Could it be related to snapshot trimming?

BTW, what is `max_mds` set to on your cluster? Are you using the kernel client or ceph-fuse?

Would it be possible to set the MDS debug log level to 25 and reproduce it again? I haven't been able to reproduce it locally.

#7 Updated by Andras Pataki over 1 year ago

max_mds is 1 - we are running a single active MDS only. The clients are all ceph-fuse.
I've tried setting the MDS log level to 25, but it slows the MDS down to unbearable levels (even a simple ls on a directory with a few dozen files takes a minute). So I can't leave that log level on for extended periods, and unfortunately the crashes happen at random times - I'm open to suggestions on how to better capture the crash state. Would something higher than 5 but lower than 25 help?

#8 Updated by Xiubo Li over 1 year ago

Andras Pataki wrote:

[...]

Yes, a log level of 25 will slow down the MDS. Could you try setting it to 15, while keeping the in-memory log level (the one dumped on crash) at 25? Something like:

debug mds = 15/25

And also please set:

debug ms = 1
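Combined, and assuming the usual ceph.conf syntax, that would look like the fragment below. The `15/25` form means: write level 15 to the log file, but keep a level-25 log in memory, which the daemon dumps on crash.

```ini
[mds]
debug mds = 15/25
debug ms = 1
```

The same settings can also be applied at runtime, e.g. with `ceph tell mds.* injectargs '--debug_mds 15/25 --debug_ms 1'`.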

From the attached logs I can see that many clients were running at the same time, manipulating shared dirs/files concurrently. The crash occurred while one client was deleting or truncating a file that was not open: opening such a file calls do_open_truncate(), which journals an "open truncate" mdlog event.

Currently the code sets the lock state to LOCK_XLOCKDONE in the early reply, before the "open truncate" mdlog event has even been flushed; while the mdlog is being flushed, another request can change the lock state. But I cannot get any further clues from your logs.
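A toy interleaving (plain Python, not Ceph code; state names are illustrative) shows why a mid-flush state change trips the assertion:

```python
# Toy model of the suspected race - NOT actual Ceph code. The real
# assertion lives in ScatterLock::set_xlock_snap_sync() and only
# tolerates LOCK_XLOCK / LOCK_XLOCKDONE.
LOCK_XLOCK, LOCK_XLOCKDONE, LOCK_SYNC = "xlock", "xlockdone", "sync"

class ToyScatterLock:
    def __init__(self):
        self.state = LOCK_XLOCK

    def set_xlock_snap_sync(self):
        # mirrors: ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)
        assert self.state in (LOCK_XLOCK, LOCK_XLOCKDONE), self.state

lock = ToyScatterLock()
lock.state = LOCK_XLOCKDONE   # early reply, before the mdlog flush completes
lock.state = LOCK_SYNC        # another request changes the state mid-flush

tripped = False
try:
    # mdlog-flush callback: MDCache::truncate_inode() -> set_xlock_snap_sync()
    lock.set_xlock_snap_sync()
except AssertionError as exc:
    tripped = True
    print("FAILED ceph_assert, state =", exc)
```

Under this model, any path that can mutate the lock between the early reply and the log-flush callback produces exactly the backtraces seen above.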

Locally I am also trying to reproduce it with the following script, which emulates your use case; I was using the kclient:

while true; do
  date
  for d in A B C; do
    (
      for i in {1..3}; do
        ./bin/mount.ceph :/ /mnt/kcephfs.$d -o noshare
        rm -rf /mnt/kcephfs.$d/file$i.txt
        rmdir /mnt/kcephfs.$d/.snap/snap$i
        dd if=/dev/zero of=/mnt/kcephfs.$d/file$i.txt bs=1M count=8
        mkdir -p /mnt/kcephfs.$d/.snap/snap$i
        umount -fl /mnt/kcephfs.$d
      done
    ) &
  done
  wait
  date
done

I have not reproduced it yet.

I will also try the fuse clients to see whether they can trigger it.

#9 Updated by Xiubo Li over 1 year ago

  • Pull request ID set to 42861

#10 Updated by Xiubo Li about 1 year ago

  • Status changed from Triaged to Fix Under Review

#11 Updated by Andras Pataki about 1 year ago

I haven't had luck keeping the MDS running well with higher log levels unfortunately. However, I do have one more data point to add to this issue. On another cluster, we just enabled snapshots - and the crashes started happening pretty much immediately:

2021-10-28_15:03:26.632350Z_4c1931a6-2948-478d-8880-2d2beb7bc336 mds.cephmds00
2021-10-29_04:07:15.848755Z_9d765612-d148-452b-9d14-0d73f56368e2 mds.cephmds01
2021-11-01_01:59:55.895573Z_25ab7513-5560-4506-be49-23cc38be36fd mds.cephmds03
2021-11-01_02:02:54.279955Z_6eac3458-3cc1-4289-809b-50b9ba351e79 mds.cephmds04
2021-11-01_23:55:11.440345Z_f0823d54-c4b3-4e37-b290-564258cf8e86 mds.cephmds02 *
2021-11-03_08:12:12.400879Z_e67143ef-2bc4-4f8c-a5f9-00d4c7ddbf2a mds.cephmds01 *
2021-11-03_08:13:03.079998Z_2b94d760-e2ed-43f5-aab6-d910f9547164 mds.cephmds03 *

Prior to enabling snapshots we had no such MDS crashes on this cluster, so the issue seems to correlate with CephFS snapshots. However, the crash times do not coincide with snapshots being created or removed.

Also, I've seen very high MDS CPU utilization with snapshots enabled, seemingly during file removals - I'm not sure if this could be related. Here is a stack trace of the 100% busy thread (called fn_anonymous):

#0 0x0000555555787d4c in CInode::get_oldest_parent_dn (this=0x555843e14700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.h:780
#1 CInode::is_ancestor_of (this=0x55561ebbee00, other=other@entry=0x555843e14700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:917
#2 0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557c3a500, child=0x5555c1ad7900) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#3 0x00005555557a1a5f in CInode::open_snaprealm (this=0x55561ebbee00, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#4 0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55561ebbee00, next_snaprealm=0x555571b24000, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#5 0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55561ebbee00) at /usr/include/c++/8/bits/stl_list.h:1122
#6 0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555557055680, mdr=..., dn=0x5558e50845a0, straydn=0x55582c1981e0, dnpv=513) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025
#7 0x00005555558067df in Context::complete (r=0, this=0x555715b043c0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/include/Context.h:77
#8 MDSContext::complete (this=0x555715b043c0, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:29
#9 0x0000555555806a8f in MDSIOContextBase::complete (this=0x555715b043c0, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:114
#10 0x0000555555806d54 in MDSLogContextBase::complete (this=<optimized out>, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:123
#11 0x00007fffef02062d in Finisher::finisher_thread_entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#12 0x00007fffecd7d14a in start_thread () from target:/lib64/libpthread.so.0
#13 0x00007fffeb896dc3 in clone () from target:/lib64/libc.so.6

Not sure if this is related - but it is certainly a bit unusual.
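On the CPU-utilization point, a rough model (plain Python with a hypothetical structure, not Ceph code) of the pattern visible in the trace - split_at() classifying inodes by walking each one's parent chain via is_ancestor_of() - suggests why this can pin a thread: the work scales as O(inodes x tree depth):

```python
# Toy model - NOT Ceph code - of the hot path in the trace above:
# SnapRealm::split_at() decides, per inode, whether it belongs under a
# new child realm by walking the inode's parent chain
# (CInode::is_ancestor_of / get_oldest_parent_dn).

class Inode:
    def __init__(self, parent=None):
        self.parent = parent

def is_ancestor_of(ancestor, other):
    # walk up the parent chain, one dentry at a time
    while other is not None:
        if other is ancestor:
            return True
        other = other.parent
    return False

# build a chain root -> a -> b -> ... of depth 1000
root = Inode()
node, chain = root, []
for _ in range(1000):
    node = Inode(parent=node)
    chain.append(node)

# every check walks all the way back up to the root: O(N * depth) total
checks = sum(1 for ino in chain if is_ancestor_of(root, ino))
print(checks)
```

With many inodes under a realm and deep directory trees, this kind of repeated upward walk would show up exactly as a single fn_anonymous thread at 100% CPU.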

#12 Updated by Xiubo Li about 1 year ago

Andras Pataki wrote:

[...]
Not sure if this is related - but it is certainly a bit unusual.

I don't think they are the same issue - please open a new tracker for it.

#13 Updated by Igor Fedotov 22 days ago

  • Backport changed from pacific to pacific, quincy

#14 Updated by Igor Fedotov 22 days ago

An alternative fix is available at https://github.com/ceph/ceph/pull/48743