
Bug #53192

High cephfs MDS latency and CPU load with snapshots and unlink operations

Added by Andras Pataki about 1 year ago. Updated 3 months ago.

Status:
Fix Under Review
Priority:
Normal
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific,octopus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
snapshots
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have recently enabled snapshots on our large Nautilus cluster (running 14.2.20), and our fairly smooth-running cephfs installation became quite unhappy, with latencies increasing to such an extent that listing directories on clients takes many seconds, or even minutes. We need some help figuring out where the problem lies and whether there are any triage steps we could take to bring our cluster back to a more usable performance level.

It looks like the performance problems relate to removing lots of files, for example running rm -rf on a subtree or on several subtrees in parallel. Before enabling snapshots, rm -rf of a tree typically ran at about 400 unlinks/second from a single client (single-threaded remove), and the MDS load was relatively light.

After snapshots were enabled, we created one snapshot per day in the root of cephfs, but kept only one active snapshot (as an initial step), so each day we created one new snapshot and removed the previous day's snapshot. What we noticed was that 'rm -rf' dropped to about 20-30 unlinks/second, and while these removes were running, the latency on the MDS for all operations (even lookups and getattrs) increased by several orders of magnitude (100 to 1000 times). As soon as the rm processes were killed, latency came back to normal within a few seconds.
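
For reference, the daily rotation was just creating and removing directories under the hidden .snap directory at the cephfs root - roughly like the sketch below (the mount point and naming scheme here are only examples):

    # Rough sketch of the daily snapshot rotation (illustrative only; the
    # mount point and naming scheme are examples).  In cephfs, creating a
    # directory under .snap takes a snapshot and removing it deletes one.
    import os
    from datetime import date, timedelta

    SNAP_DIR = "/mnt/cephfs/.snap"
    today = date.today()
    yesterday = today - timedelta(days=1)

    # Take today's snapshot of the whole file system.
    os.mkdir(os.path.join(SNAP_DIR, f"daily-{today.isoformat()}"))

    # Drop yesterday's snapshot so only one stays active.
    old = os.path.join(SNAP_DIR, f"daily-{yesterday.isoformat()}")
    if os.path.isdir(old):
        os.rmdir(old)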

Stracing the rm process shows that most removes proceed quite fast (1-2ms), but some get stuck for several hundred milliseconds. An example strace:

    09:23:54.854668 unlinkat(7, "ebdyc.p", 0) = 0 <0.001106>
    09:23:54.855864 unlinkat(7, "Dyy.p", 0) = 0 <0.352603>
    09:23:55.208547 unlinkat(7, "u.p", 0)   = 0 <0.002474>
    09:23:55.211098 unlinkat(7, "Dxy.p", 0) = 0 <0.001317>
    09:23:55.212491 unlinkat(7, "v.p", 0)   = 0 <0.001437>
    09:23:55.214005 unlinkat(7, "Dxx.p", 0) = 0 <0.001506>
    09:23:55.215622 unlinkat(6, "timestep_35050", AT_REMOVEDIR) = 0 <0.001874>
    09:23:55.221117 unlinkat(7, "ebdyc.p", 0) = 0 <0.335875>
    09:23:55.557076 unlinkat(7, "Dyy.p", 0) = 0 <0.001657>
    09:23:55.558809 unlinkat(7, "u.p", 0)   = 0 <0.001450>
    09:23:55.560349 unlinkat(7, "Dxy.p", 0) = 0 <0.001507>
    09:23:55.561940 unlinkat(7, "v.p", 0)   = 0 <0.001349>
    09:23:55.563365 unlinkat(7, "Dxx.p", 0) = 0 <0.001182>
    09:23:55.564668 unlinkat(6, "timestep_45250", AT_REMOVEDIR) = 0 <0.002633>
    09:23:55.885885 unlinkat(7, "ebdyc.p", 0) = 0 <0.005142>
    09:23:55.891111 unlinkat(7, "Dyy.p", 0) = 0 <0.001656>
    09:23:55.892844 unlinkat(7, "u.p", 0)   = 0 <0.001685>
    09:23:55.894589 unlinkat(7, "Dxy.p", 0) = 0 <0.001407>
    09:23:55.896056 unlinkat(7, "v.p", 0)   = 0 <0.001520>
    09:23:55.897669 unlinkat(7, "Dxx.p", 0) = 0 <0.001601>

On the MDS side, there is a thread called 'fn_anonymous' that goes to 100% CPU utilization when these removes are running. Attaching a debugger shows stack traces that look like this:

#0  0x0000555555787d4c in CInode::get_oldest_parent_dn (this=0x555843e14700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.h:780
#1  CInode::is_ancestor_of (this=0x55561ebbee00, other=other@entry=0x555843e14700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:917
#2  0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557c3a500, child=0x5555c1ad7900) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#3  0x00005555557a1a5f in CInode::open_snaprealm (this=0x55561ebbee00, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#4  0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55561ebbee00, next_snaprealm=0x555571b24000, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#5  0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55561ebbee00) at /usr/include/c++/8/bits/stl_list.h:1122
#6  0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555557055680, mdr=..., dn=0x5558e50845a0, straydn=0x55582c1981e0, dnpv=513) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025
#7  0x00005555558067df in Context::complete (r=0, this=0x555715b043c0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/include/Context.h:77
#8  MDSContext::complete (this=0x555715b043c0, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:29
#9  0x0000555555806a8f in MDSIOContextBase::complete (this=0x555715b043c0, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:114
#10 0x0000555555806d54 in MDSLogContextBase::complete (this=<optimized out>, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:123
#11 0x00007fffef02062d in Finisher::finisher_thread_entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#12 0x00007fffecd7d14a in start_thread () from target:/lib64/libpthread.so.0
#13 0x00007fffeb896dc3 in clone () from target:/lib64/libc.so.6

#0  0x0000555555787d37 in CInode::is_ancestor_of (this=0x55580f021800, other=0x5557c1b1c300, other@entry=0x5557c9146700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CDentry.h:150
#1  0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557ea0280, child=0x55584e1c1b80) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#2  0x00005555557a1a5f in CInode::open_snaprealm (this=0x55580f021800, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#3  0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55580f021800, next_snaprealm=0x55556425f140, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#4  0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55580f021800) at /usr/include/c++/8/bits/stl_list.h:1122
#5  0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555556438500, mdr=..., dn=0x5558541e8f00, straydn=0x555946731860, dnpv=77222) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025
#6  0x00005555558067df in Context::complete (r=0, this=0x555943b55f90) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/include/Context.h:77
#7  MDSContext::complete (this=0x555943b55f90, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:29
#8  0x0000555555806a8f in MDSIOContextBase::complete (this=0x555943b55f90, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:114
#9  0x0000555555806d54 in MDSLogContextBase::complete (this=<optimized out>, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:123
#10 0x00007fffef02062d in Finisher::finisher_thread_entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#11 0x00007fffecd7d14a in start_thread () from target:/lib64/libpthread.so.0
#12 0x00007fffeb896dc3 in clone () from target:/lib64/libc.so.6

The total CPU utilization of the MDS around these times is perhaps 120%, of which 100% is this fn_anonymous thread; the rest of the threads are quite idle - presumably waiting for this thread to do its work.

I'm measuring the latency of the MDS using the perf counters - taking a snapshot of the counters every 5 seconds and looking at the latency counters like:

        "reply_latency": {
            "avgcount": 862377798,
            "sum": 11712188.751232780,
            "avgtime": 0.013581273
        },

and computing (sum2 - sum1) / (avgcount2 - avgcount1) over each interval to find the average reply latency for that interval.
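
A minimal sketch of that calculation, assuming two perf counter dumps have been saved as JSON (the file names are examples, and it assumes the counter lives under the "mds" section of the dump):

    # Sketch: interval reply latency from two saved MDS perf counter dumps.
    # File names are examples; assumes the counter sits under the "mds"
    # section of the dump.
    import json

    def interval_latency(path1, path2, counter="reply_latency"):
        with open(path1) as f1, open(path2) as f2:
            c1 = json.load(f1)["mds"][counter]
            c2 = json.load(f2)["mds"][counter]
        dcount = c2["avgcount"] - c1["avgcount"]
        dsum = c2["sum"] - c1["sum"]
        return dsum / dcount if dcount else 0.0

    print(interval_latency("perf_t1.json", "perf_t2.json"))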

When not doing removes, typical reply latencies are under 1ms - even when processing over a thousand lookups per second. The MDS CPU utilization is also below 50%.

MDS:
    req/s:  3101.7,  reply/s:  3101.7,  lat:  0.0001
Latencies:
    req_create      :  req/s:   12.2,  lat:  0.0008
    req_getattr     :  req/s:   83.4,  lat:  0.0000
    req_lookup      :  req/s: 1102.1,  lat:  0.0000
    req_open        :  req/s:    8.3,  lat:  0.0105
    req_readdir     :  req/s:    1.0,  lat:  0.0010
    req_rename      :  req/s:    1.8,  lat:  0.0001
    req_setfilelock :  req/s: 1890.4,  lat:  0.0000
    req_unlink      :  req/s:    2.6,  lat:  0.0001

When the removes run, both the overall latency and all the per-operation latencies increase by 2 to 3 orders of magnitude - and this fn_anonymous thread gets really busy. Since lookup latencies are in the tenths of a second, cephfs is pretty much unusable (listing directories sometimes takes a minute or more). Also, unlinks run at only about 20-30 per second:

MDS:
    req/s:  3916.6,  reply/s:  3917.0,  lat:  0.3086
Latencies:
    req_create      :  req/s:  426.4,  lat:  0.5452
    req_getattr     :  req/s:   58.3,  lat:  0.0269
    req_lookup      :  req/s: 3340.6,  lat:  0.2881
    req_mkdir       :  req/s:    0.6,  lat:  0.0417
    req_open        :  req/s:   51.1,  lat:  0.1990
    req_readdir     :  req/s:    3.1,  lat:  0.0338
    req_rename      :  req/s:    2.2,  lat:  0.3481
    req_rmdir       :  req/s:    3.3,  lat:  0.0545
    req_rmxattr     :  req/s:    0.2,  lat:  0.3395
    req_setattr     :  req/s:    0.4,  lat:  0.1154
    req_setfilelock :  req/s:    8.8,  lat:  0.0672
    req_unlink      :  req/s:   21.5,  lat:  0.0113

Since this problem started, I've disabled snapshot creation and removed all snapshots. Unfortunately this has not helped. Removes of trees created before snapshots were stopped are still painfully slow, while removes of trees created after snapshots were stopped go fast without any issues.

Also, all other parts of the cluster are healthy. The metadata pool is stored on separate nodes on NVMe drives, and those OSDs are quite lightly loaded - so the entire performance issue seems to be within the MDS.
I'm very happy to provide any additional details/logs or try potential solutions. Any help would be much appreciated.

History

#1 Updated by Andras Pataki about 1 year ago

I've done some further debugging to understand the MDS performance problem that has been impacting us. The finisher thread (fn_anonymous) uses 100% CPU during remove operations - remove-directory operations, it turns out. Stack traces look like this:

#0  0x0000555555787d37 in CInode::is_ancestor_of (this=0x55580f021800, other=0x5557c1b1c300, other@entry=0x5557c9146700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CDentry.h:150
#1  0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557ea0280, child=0x55584e1c1b80) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#2  0x00005555557a1a5f in CInode::open_snaprealm (this=0x55580f021800, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#3  0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55580f021800, next_snaprealm=0x55556425f140, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#4  0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55580f021800) at /usr/include/c++/8/bits/stl_list.h:1122
#5  0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555556438500, mdr=..., dn=0x5558541e8f00, straydn=0x555946731860, dnpv=77222) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025

In SnapRealm::split_at, there are two loops that seem to be the root cause of the MDS performance problem:

  dout(10) << " open_children are " << open_children << dendl;
  for (set<SnapRealm*>::iterator p = open_children.begin();
       p != open_children.end(); ) {
   ...

and

  // split inodes_with_caps
  for (elist<CInode*>::iterator p = inodes_with_caps.begin(member_offset(CInode, item_caps));
       !p.end(); ) {
    CInode *in = *p;
    ++p;
    // does inode fall within the child realm?
    if (child->inode->is_ancestor_of(in)) {
      dout(20) << " child gets " << *in << dendl;
      in->move_to_realm(child);
    } else {
      dout(20) << "    keeping " << *in << dendl;
    }
  }

The second loop goes through hundreds of thousands of inodes and does nothing (it ends in the else branch since the inodes aren't related to the directory being removed). For the trace cases I looked at, the first loop over open_children also iterates over more than ten thousand entries. It looks like these two loops are responsible for tens of milliseconds of time spent.

This is done in the finisher thread, of which there is only one, and it looks like many MDS operations go through it to be committed. As a consequence, high latency on directory removes makes the whole MDS quite unresponsive.

Questions I still don't understand:
  • Do we have to call open_snaprealm/split_at for each directory that is removed? If not, when is it called? For our use case it seems to be called a lot (especially given how heavy the loops it runs are).
  • Do these calls have to hold up all other operations in the MDS?
  • How would this work with a relatively large MDS cache with hundreds of clients?

And most importantly - what are practical ways we could mitigate/triage this problem?

#2 Updated by Patrick Donnelly about 1 year ago

Thanks for opening a ticket and doing some digging, Andras.

Andras Pataki wrote:

I've done some further debugging to understand the MDS performance problem that has been impacting us. The finisher thread (fn_anonymous) uses 100% CPU during remove operations - remove-directory operations, it turns out. Stack traces look like this:

[...]

In SnapRealm::split_at, there are two loops that seem to be the root cause of the MDS performance problem:

[...]

and

[...]

The second loop goes through hundreds of thousands of inodes and does nothing (it ends in the else branch since the inodes aren't related to the directory being removed). For the trace cases I looked at, the first loop over open_children also iterates over more than ten thousand entries. It looks like these two loops are responsible for tens of milliseconds of time spent.

Yes, it seems so.

This is done in the finisher thread, of which there is only one, and it looks like many MDS operations go through it to be committed. As a consequence, high latency on directory removes makes the whole MDS quite unresponsive.

Questions I still don't understand:
  • Do we have to call open_snaprealm/split_at for each directory that is removed? If not, when is it called? For our use case it seems to be called a lot (especially given how heavy the loops it runs are).

I'm not sure, but it seems we could at least optimize the trivial case of an empty directory.

  • Do these calls have to hold up all other operations in the MDS?

The fair_mutex I mentioned in the mailing list thread would help but you'd still have a long queue of operations in the finisher thread which would slow down other things.

  • How would this work with a relatively large MDS cache with hundreds of clients?

A large cache adds to the work because, for a snapshot that covers the whole file system, the MDS will look at every inode in cache.

And most importantly - what are practical ways we could mitigate/triage this problem?

Well, you can ask users to move the subtrees to a global trash directory rather than `rm -rf` (yuck!). Then write a script that gently deletes it.
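
Something along these lines, roughly (the trash path and the unlink rate below are just placeholders):

    # Rough sketch of a "gentle delete": walk the trash directory bottom-up
    # and throttle unlinks so the MDS isn't flooded.  The path and rate are
    # placeholders.
    import os
    import time

    TRASH = "/mnt/cephfs/.trash"
    UNLINKS_PER_SEC = 50.0

    for root, dirs, files in os.walk(TRASH, topdown=False):
        for name in files:
            os.unlink(os.path.join(root, name))
            time.sleep(1.0 / UNLINKS_PER_SEC)
        for name in dirs:
            os.rmdir(os.path.join(root, name))
            time.sleep(1.0 / UNLINKS_PER_SEC)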

I will try to reproduce this and create a fix.

Thanks Andras.

#3 Updated by Patrick Donnelly about 1 year ago

Just want to share an update that I was able to reproduce the problem. I'll test out a fix soon...

#4 Updated by Andras Pataki about 1 year ago

That's great news! I'm also trying to set up a test environment to reproduce the issue outside our main cluster (with an eye to testing a potential fix there). Thanks Patrick.

#5 Updated by Patrick Donnelly about 1 year ago

  • Status changed from New to Fix Under Review
  • Assignee set to Patrick Donnelly
  • Target version set to v17.0.0
  • Source set to Community (user)
  • Backport set to pacific,octopus
  • Pull request ID set to 43911
  • Component(FS) MDS added
  • Labels (FS) snapshots added

#6 Updated by Patrick Donnelly about 1 year ago

Andras, I think there are other inefficiencies I've not yet identified but this fix addresses your specific problem.

#7 Updated by Andras Pataki about 1 year ago

Thanks Patrick! How safe do you feel this patch is? Does it need a lot of testing or is it safe to deploy?

#8 Updated by Patrick Donnelly about 1 year ago

Andras Pataki wrote:

Thanks Patrick! How safe do you feel this patch is? Does it need a lot of testing or is it safe to deploy?

It is safe to deploy in my opinion.

#9 Updated by Greg Farnum 7 months ago

Hmm, the stuff in split_at looks like we can and should just swap the logic -- instead of iterating over all inodes in the parent snaprealm, we should look at all the open inodes underneath the newly-created snaprealm's inode. That logic is slightly more involved (we have to descend the hierarchy instead of iterating through a list) but should not be a big deal.
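
As a toy illustration of the shape of that change (stand-in types only, not the actual MDS classes): rather than scanning every caps-holding inode in the parent realm and asking is_ancestor_of, descend from the new realm's inode and collect caps-holding descendants directly.

    # Toy model only -- these are stand-in types, not Ceph's MDS classes.
    class Inode:
        def __init__(self, parent=None, has_caps=False):
            self.parent = parent
            self.children = []
            self.has_caps = has_caps
            if parent:
                parent.children.append(self)

    # Current shape: one ancestry walk per caps-holding inode in the realm.
    def split_scan_all(inodes_with_caps, child_root):
        def is_ancestor_of(anc, node):
            while node is not None:
                if node is anc:
                    return True
                node = node.parent
            return False
        return [i for i in inodes_with_caps if is_ancestor_of(child_root, i)]

    # Suggested shape: only touch the subtree under the new realm's inode.
    def split_descend(child_root):
        moved, stack = [], [child_root]
        while stack:
            node = stack.pop()
            if node.has_caps:
                moved.append(node)
            stack.extend(node.children)
        return moved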

There are some other uses of the inodes_with_caps list in MDCache which may not all be so easy, but I didn't look at them in as much detail. We're doing notifies to the clients of all inodes in do_realm_invalidate_and_update_notify, for instance, but it looks like that only applies to children of things we're renaming et al -- which is both necessary and also generally a smaller set.

#10 Updated by Laura Flores 3 months ago

This topic was discussed during the User + Dev meeting on Aug. 8th, 2022. One revelation that came out of the meeting (as per Greg) was that unmounting clients at night could help.

Greg also recommended that Andras try out the methods mentioned in note#9, and update the Tracker with any additional findings.

I also looked into Patrick's PR, and I see it's currently in draft form. The last comment there was:

From Patrick:

So this doesn't work because directories may be "empty" because there are no HEAD dentries but some snapshotted dentries exist, i.e. CDir::get_num_snap_items() > 0. So there will be some inodes with caps, potentially, that refer to the snapshotted inodes associated with those dentries.

The solution, I believe, is to create a hierarchical data structure part of CDir which allows tracking child inodes with caps or something that allows us to optimize the common cases (small directories with few or 0 snapshotted dentries).

So it looks like the final solution may just need some more reviews/refactoring.
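
A toy sketch of the kind of tracking structure being proposed (purely illustrative, not the actual CDir API): each directory keeps a recursive count of descendant inodes holding caps, maintained up the ancestor chain, so a realm split can skip any subtree whose count is zero.

    # Purely illustrative -- not Ceph's CDir.  Each directory tracks how many
    # inodes at or below it currently hold client caps, so subtrees with a
    # zero count can be pruned during a realm split.
    class Dir:
        def __init__(self, parent=None):
            self.parent = parent
            self.child_dirs = []
            self.caps_inodes = []      # inodes directly in this dir with caps
            self.caps_in_subtree = 0   # recursive count, maintained below
            if parent:
                parent.child_dirs.append(self)

    def adjust_caps_count(d, delta):
        # Call with +1/-1 when an inode directly in d gains or drops caps.
        while d is not None:
            d.caps_in_subtree += delta
            d = d.parent

    def collect_caps_holders(d, out):
        # Visit only subtrees that actually contain caps holders.
        if d.caps_in_subtree == 0:
            return
        out.extend(d.caps_inodes)
        for c in d.child_dirs:
            collect_caps_holders(c, out)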
