<p><strong>Ceph CephFS - Bug #53192: High cephfs MDS latency and CPU load with snapshots and unlink operations</strong><br />
<a class="external" href="https://tracker.ceph.com/issues/53192">https://tracker.ceph.com/issues/53192</a></p>
<p><strong>Update by Andras Pataki (apataki@simonsfoundation.org), 2021-11-09T16:35:04Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=205738">journal</a>)</p>
<p>I've done some further debugging to understand the MDS performance problem that has been increasingly impacting us. The finisher thread (fn_anonymous) uses 100% CPU during remove operations - directory removes, it turns out. Stack traces look like this:</p>
<pre>
#0 0x0000555555787d37 in CInode::is_ancestor_of (this=0x55580f021800, other=0x5557c1b1c300, other@entry=0x5557c9146700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CDentry.h:150
#1 0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557ea0280, child=0x55584e1c1b80) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#2 0x00005555557a1a5f in CInode::open_snaprealm (this=0x55580f021800, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#3 0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55580f021800, next_snaprealm=0x55556425f140, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#4 0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55580f021800) at /usr/include/c++/8/bits/stl_list.h:1122
#5 0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555556438500, mdr=..., dn=0x5558541e8f00, straydn=0x555946731860, dnpv=77222) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025
</pre>
<p>In SnapRealm::split_at, there are two loops that seem to be the root cause of the MDS performance problem:</p>
<pre>
dout(10) << " open_children are " << open_children << dendl;
for (set<SnapRealm*>::iterator p = open_children.begin();
     p != open_children.end(); ) {
...
</pre>
<p>and</p>
<pre>
// split inodes_with_caps
for (elist<CInode*>::iterator p = inodes_with_caps.begin(member_offset(CInode, item_caps));
     !p.end(); ) {
CInode *in = *p;
++p;
// does inode fall within the child realm?
if (child->inode->is_ancestor_of(in)) {
dout(20) << " child gets " << *in << dendl;
in->move_to_realm(child);
} else {
dout(20) << " keeping " << *in << dendl;
}
}
</pre>
<p>The second loop goes through hundreds of thousands of inodes and does nothing (it ends in the else branch of the if, since the inodes aren't related to the removed directory). For the trace cases I looked at, the first loop over open_children also iterates over more than ten thousand entries. Together these two loops account for tens of milliseconds per operation.</p>
<p>This runs in the finisher thread, of which there is only one, and many MDS operations go through it to be committed. As a consequence, high latency on directory removes makes the whole MDS quite unresponsive.</p>
<p>Questions I still don't understand:</p>
<ul>
<li>Do we have to call open_snaprealm/split_at for each directory that is removed? If not, when is it called? For our use case it seems to be called a lot (especially given how heavy a loop it has to run).</li>
<li>Do these calls have to hold up all other operations in the MDS?</li>
<li>How would this work with a relatively large MDS cache with hundreds of clients?</li>
</ul>
<p>And most importantly - what are practical ways we could mitigate or triage this problem?</p>
<p><strong>Update by Patrick Donnelly (pdonnell@redhat.com), 2021-11-11T19:25:35Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=205876">journal</a>)</p>
<p>Thanks for opening a ticket and doing some digging, Andras.</p>
<p>Andras Pataki wrote:</p>
<blockquote>
<p>I've done some further debugging to understand the MDS performance problem that has been increasingly impacting us. The finisher thread (fn_anonymous) uses 100% CPU during remove operations - directory removes, it turns out. Stack traces look like this:</p>
<p>[...]</p>
<p>In SnapRealm::split_at, there are two loops that seem to be the root cause of the MDS performance problem:</p>
<p>[...]</p>
<p>and</p>
<p>[...]</p>
<p>The second loop goes through hundreds of thousands of inodes and does nothing (it ends in the else branch of the if, since the inodes aren't related to the removed directory). For the trace cases I looked at, the first loop over open_children also iterates over more than ten thousand entries. Together these two loops account for tens of milliseconds per operation.</p>
</blockquote>
<p>Yes, it seems so.</p>
<blockquote>
<p>This runs in the finisher thread, of which there is only one, and many MDS operations go through it to be committed. As a consequence, high latency on directory removes makes the whole MDS quite unresponsive.</p>
<p>Questions I still don't understand:</p>
<ul>
<li>Do we have to call open_snaprealm/split_at for each directory that is removed? If not, when is it called? For our use case it seems to be called a lot (especially given how heavy a loop it has to run).</li>
</ul>
</blockquote>
<p>I'm not sure, but it seems we could at least optimize the trivial case of an empty directory.</p>
<blockquote>
<ul>
<li>Do these calls have to hold up all other operations in the MDS?</li>
</ul>
</blockquote>
<p>The fair_mutex I mentioned in the mailing list thread would help but you'd still have a long queue of operations in the finisher thread which would slow down other things.</p>
<blockquote>
<ul>
<li>How would this work with a relatively large MDS cache with hundreds of clients?</li>
</ul>
</blockquote>
<p>A large cache adds to the work because, for a snapshot that covers the whole file system, the MDS will look at every inode in cache.</p>
<blockquote>
<p>And most importantly - what are practical ways we could mitigate/triage this problem?</p>
</blockquote>
<p>Well, you can ask users to move the subtrees to a global trash directory rather than running <code>rm -rf</code> themselves (yuck!), and then write a script that gently deletes the trash contents.</p>
<p>I will try to reproduce this and create a fix.</p>
<p>Thanks Andras.</p>
<p><strong>Update by Patrick Donnelly (pdonnell@redhat.com), 2021-11-12T00:19:26Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=205887">journal</a>)</p>
<p>Just want to share an update that I was able to reproduce the problem. I'll test out a fix soon...</p>
<p><strong>Update by Andras Pataki (apataki@simonsfoundation.org), 2021-11-12T10:28:55Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=205910">journal</a>)</p>
<p>That's great news! I'm also trying to set up a test environment to reproduce the issue outside our main cluster (with an eye to testing a potential fix there). Thanks Patrick.</p>
<p><strong>Update by Patrick Donnelly (pdonnell@redhat.com), 2021-11-12T18:09:59Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=205923">journal</a>)</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Fix Under Review</i></li><li><strong>Assignee</strong> set to <i>Patrick Donnelly</i></li><li><strong>Target version</strong> set to <i>v17.0.0</i></li><li><strong>Source</strong> set to <i>Community (user)</i></li><li><strong>Backport</strong> set to <i>pacific,octopus</i></li><li><strong>Pull request ID</strong> set to <i>43911</i></li><li><strong>Component(FS)</strong> <i>MDS</i> added</li><li><strong>Labels (FS)</strong> <i>snapshots</i> added</li></ul>
<p><strong>Update by Patrick Donnelly (pdonnell@redhat.com), 2021-11-12T18:25:20Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=205932">journal</a>)</p>
<p>Andras, I think there are other inefficiencies I've not yet identified, but this fix addresses your specific problem.</p>
<p><strong>Update by Andras Pataki (apataki@simonsfoundation.org), 2021-11-15T17:42:53Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=206017">journal</a>)</p>
<p>Thanks Patrick! How safe do you feel this patch is? Does it need a lot of testing or is it safe to deploy?</p>
<p><strong>Update by Patrick Donnelly (pdonnell@redhat.com), 2021-11-17T01:04:16Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=206166">journal</a>)</p>
<p>Andras Pataki wrote:</p>
<blockquote>
<p>Thanks Patrick! How safe do you feel this patch is? Does it need a lot of testing or is it safe to deploy?</p>
</blockquote>
<p>It is safe to deploy, in my opinion.</p>
<p><strong>Update by Greg Farnum (gfarnum@redhat.com), 2022-04-22T17:30:13Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=215069">journal</a>)</p>
<p>Hmm, the stuff in split_at looks like we can and should just swap the logic -- instead of iterating over all inodes in the parent snaprealm, we should look at all the open inodes underneath the newly-created snaprealm's inode. That logic is slightly more involved (we have to descend the hierarchy instead of iterating through a list) but should not be a big deal.</p>
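<p>The inversion described above can be sketched roughly as follows. This is a hypothetical illustration, not the MDS implementation: <code>Node</code> stands in for the in-cache CInode/CDir hierarchy, and the point is that the cost becomes proportional to the subtree under the new snaprealm rather than to the parent realm's entire inodes_with_caps list:</p>

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for an in-cache inode: a caps flag plus the in-cache
// child inodes reachable through its dentries.
struct Node {
    bool has_caps = false;
    std::vector<Node*> children;
};

// Descend the hierarchy below the new snaprealm's root and collect the capped
// inodes found there. Work done is bounded by the subtree size, not by the
// number of capped inodes in the whole parent realm.
static void collect_capped(Node* n, std::vector<Node*>& out) {
    if (n->has_caps)
        out.push_back(n);
    for (Node* c : n->children)
        collect_capped(c, out);
}
```

<p>The trade-off Greg notes is visible here: a recursive descent over the cached hierarchy replaces a flat list walk, but for a small removed directory the subtree is tiny while the realm-wide list can hold hundreds of thousands of entries.</p>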
<p>There are some other uses of the inodes_with_caps list in MDCache which may not all be so easy, but I didn't look at them in as much detail. We're doing notifies to the clients of all inodes in do_realm_invalidate_and_update_notify, for instance, but it looks like that only applies to children of things we're renaming et al. -- which is both necessary and generally a smaller set.</p>
<p><strong>Update by Laura Flores, 2022-08-18T16:24:44Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=223761">journal</a>)</p>
<p>This topic was discussed during the User + Dev meeting on Aug. 8th, 2022. One revelation that came out of the meeting (as per Greg) was that unmounting clients at night could help.</p>
<p>Greg also recommended that Andras try out the methods mentioned in note #9 and update the tracker with any additional findings.</p>
<p>I also looked into Patrick's PR, and I see it's currently in draft form. The last comment there was:</p>
<p>From Patrick:</p>
<blockquote>
<p>So this doesn't work, because directories may be "empty" in the sense that there are no HEAD dentries while some snapshotted dentries still exist, i.e. CDir::get_num_snap_items() > 0. So there will potentially be some inodes with caps that refer to the snapshotted inodes associated with those dentries.</p>
<p>The solution, I believe, is to create a hierarchical data structure as part of CDir which allows tracking child inodes with caps, or something that allows us to optimize the common cases (small directories with few or zero snapshotted dentries).</p>
</blockquote>
<p>So it looks like the final solution may just need some more review and refactoring.</p>
<p><strong>Update by Florian Pritz, 2023-04-26T16:24:19Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=235525">journal</a>)</p>
<p>Has there been any progress on this in the last year? We believe we are hitting the same issue on a production cluster (all NVMe OSDs, v16.2.11), but so far we are unable to reproduce it with an artificial test (see below) on the same cluster.</p>
<p>The symptoms we see are very bad <code>rm -rf</code> performance (a handful of ops/s) and also bad access performance for other clients on the same cluster. At the same time we see 100% CPU usage on the active MDS node. Prometheus metrics show that <code>rmdir</code> operations that usually complete in 1-2ms instead take 20ms, 60ms, or in one case even 350ms (sustained for many minutes without change).</p>
<p>One single example from an MDS log dump:<br /><pre><code class="text syntaxhl"><span class="CodeRay">2023-04-18T16:45:18.130+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997200 readdir #0x1000344befc 2023-04-18T16:45:18.129945+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:18.214+0200 7f8d6732d700 5 mds.0.log _submit_thread 29859233665684~2795 : EUpdate scatter_writebehind [metablob 0x1000344befc, 2 dirs]
2023-04-18T16:45:18.290+0200 7f8d6732d700 5 mds.0.log _submit_thread 29859233668499~5378 : EUpdate unlink_local [metablob 0x1000344befc, 4 dirs]
2023-04-18T16:45:18.294+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997207 rmdir #0x1000344befc/818 2023-04-18T16:45:18.289945+0200 caller_uid=0, caller_gid=0{}) v2
</span></code></pre></p>
<p>For the artificial test we create 400k directories (4000 at the first level, 15 at the next level, then 3 more levels with 1 directory each, then a file with some random data in each leaf directory) and then run <code>rm -rf</code> on the top-level directory. However, even if I create and remove a snapshot at the root level of the filesystem and run the <code>rm -rf</code> after that, I cannot seem to trigger the bug.</p>
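<p>For reference, the tree-creation step above can be sketched like this, scaled by parameters rather than fixed at 400k directories. This is an illustrative sketch using std::filesystem (the directory names and file contents are assumptions); pointed at a CephFS mount, it would be followed by <code>fs::remove_all</code> or a shell <code>rm -rf</code>:</p>

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Build 'top' first-level directories, each with 'second' subdirectories,
// then three more single-directory levels ("a/b/c"), and a small data file
// in each leaf. Returns the number of leaf directories created.
static std::size_t make_tree(const fs::path& root,
                             std::size_t top, std::size_t second) {
    std::size_t leaves = 0;
    for (std::size_t i = 0; i < top; ++i) {
        for (std::size_t j = 0; j < second; ++j) {
            fs::path leaf = root / std::to_string(i) / std::to_string(j)
                                 / "a" / "b" / "c";
            fs::create_directories(leaf);
            std::ofstream(leaf / "data") << "some random data\n";
            ++leaves;
        }
    }
    return leaves;
}
```

<p>With <code>make_tree(mountpoint, 4000, 15)</code> this matches the 4000 x 15 layout described above (plus the three single-directory levels per branch).</p>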
<p>A failover to a standby MDS seems to temporarily fix the problem, but I assume that's mostly because the cache starts out empty after the failover, since the problem eventually comes back.</p>
<p>Does anyone have a reliable way to reproduce this (Patrick mentioned he can repro) or some insight into which parameters affect the likelihood of this happening? Or even better: has any of the mentioned refactoring happened in other places, and would this bug be fixed in newer versions? We are currently running Ceph 16.2.11, but it appears that an upgrade to the latest 17.2.x would not yet fix this, correct?</p>
<p><strong>Update by Florian Pritz, 2023-04-26T16:29:55Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=235526">journal</a>)</p>
<p>The log I copied does not actually show the rmdir latency, since I filtered incorrectly - sorry about that. This one is just a sample from a grep for <code>client.188008615</code>, showing consecutive operations from a single client:</p>
<pre><code class="text syntaxhl"><span class="CodeRay">2023-04-18T16:45:20.922+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997269 rmdir #0x10003671e40/d9e 2023-04-18T16:45:20.917957+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.014+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997270 rmdir #0x10003671e3f/daa 2023-04-18T16:45:21.013957+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.106+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997271 rmdir #0x10003671e3e/3f4 2023-04-18T16:45:21.101958+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.194+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997272 rmdir #0x1000353cc7f/0e3 2023-04-18T16:45:21.193958+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.282+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997273 rmdir #0x1000343178d/371 2023-04-18T16:45:21.281959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.370+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997274 readdir #0x1000401b907 2023-04-18T16:45:21.369959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.374+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997275 readdir #0x1000401b908 2023-04-18T16:45:21.369959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.374+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997276 readdir #0x1000401b909 2023-04-18T16:45:21.373959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.378+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997277 readdir #0x1000401b90a 2023-04-18T16:45:21.373959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.378+0200 7f8d6e33b700 4 mds.0.server handle_client_request client_request(client.188008615:5997278 readdir #0x1000401b90b 2023-04-18T16:45:21.377959+0200 caller_uid=0, caller_gid=0{}) v2
</span></code></pre>
<p><strong>Update by Konstantin Shalygin (k0ste@k0ste.ru), 2023-12-19T08:39:39Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=251934">journal</a>)</p>
<ul><li><strong>Target version</strong> changed from <i>v17.0.0</i> to <i>v19.0.0</i></li><li><strong>Backport</strong> changed from <i>pacific,octopus</i> to <i>reef quincy pacific</i></li></ul>
<p><strong>Update by Venky Shankar (vshankar@redhat.com), 2024-02-26T05:24:00Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=255323">journal</a>)</p>
<p>Hi Andras,</p>
<p>Nice to hear from you in the User/Dev monthly meet-up. You had a question about what exactly the SnapRealm::split_at() function does (it is the source of latency when snapshots are taken in cephfs) - I replied to it in the PR itself: <a class="external" href="https://github.com/ceph/ceph/pull/43911#pullrequestreview-883184649">https://github.com/ceph/ceph/pull/43911#pullrequestreview-883184649</a></p>
<p>Copying here for brevity:</p>
<blockquote>
<p>OK, so the "parent" snaprealm tracks all the inodes under its tree which have caps, and the root directory always has a snaprealm, which implies that every inode under it used by a client is in the inodes_with_caps list for the root directory. And then when a child snapshot is created, split_at iterates through the entire list - probably millions of entries - and "moves" those inodes which now fall under the child snap realm.</p>
</blockquote>
<p>As you can see, even when no snapshots exist, the root snaprealm tracks <code>all inodes</code> which have (client) caps (<code>SnapRealm::inodes_with_caps</code>), and when a snapshot is taken anywhere in the tree, this (potentially huge) list is iterated to <code>move</code> inodes which now fall under the child snaprealm.</p>
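<p>For contrast with the realm-wide iteration: if the capped inodes were already grouped by the subtree they live under, the move at split time could be a constant-time list splice rather than a scan, with the reverse splice serving as the merge on realm removal. A rough, hypothetical sketch (plain ints stand in for CInode pointers; this is not the MDS's actual data structure):</p>

```cpp
#include <cassert>
#include <list>

// Illustrative realm holding only its list of capped inodes.
struct Realm {
    std::list<int> inodes_with_caps;
};

// Split: hand a subtree's locally tracked capped inodes to the child realm.
// std::list::splice relinks nodes, so this is O(1) regardless of list size.
static void split_realm(Realm& child, std::list<int>& subtree_capped) {
    child.inodes_with_caps.splice(child.inodes_with_caps.end(), subtree_capped);
}

// Gather: merge a removed child realm's capped inodes back into the parent,
// again O(1).
static void gather_realm(Realm& parent, Realm& child) {
    parent.inodes_with_caps.splice(parent.inodes_with_caps.end(),
                                   child.inodes_with_caps);
}
```

<p>The hard part, as noted below, is maintaining those per-subtree lists in the first place and merging them back correctly; the splice only illustrates what the split itself could cost once that bookkeeping exists.</p>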
<p>If these inodes (which have caps) were instead tracked at the level of the directory under which they exist, the iteration when splitting a snaprealm could be avoided; however, there is also the case where these scattered lists would need to be merged back (a gather).</p>
<p><strong>Update by Patrick Donnelly (pdonnell@redhat.com), 2024-02-27T04:18:35Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=255424">journal</a>)</p>
<ul><li><strong>Backport</strong> changed from <i>reef quincy pacific</i> to <i>squid,reef,quincy</i></li><li><strong>Pull request ID</strong> changed from <i>43911</i> to <i>55768</i></li></ul>
<p><strong>Update by Abhishek Lekshmanan (abhishek.lekshmanan@gmail.com), 2024-03-05T13:28:42Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=256048">journal</a>)</p>
<p>Hi Venky, Patrick</p>
<p>Further to our talk: we saw the MDS accumulating a lot of log segments and crashing in an up:replay/resolve loop. I reported what we saw as a new issue; let us know if it is somehow related and what we can do to move things along:<br /><a class="external" href="https://tracker.ceph.com/issues/64717">https://tracker.ceph.com/issues/64717</a></p>
<p><strong>Update by Venky Shankar (vshankar@redhat.com), 2024-03-11T10:33:18Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=256535">journal</a>)</p>
<p>Hi Andras,</p>
<p>Patrick has a proposed fix that optimizes the iteration - <a class="external" href="https://github.com/ceph/ceph/pull/55768">https://github.com/ceph/ceph/pull/55768</a></p>
<p>I ran it through the fs suite and the qa run looks good. Patrick did a sample benchmark with this change and the results show an improvement in the split path. See the commit details of a0ccb79fa0806792c7ee666c667328a8aeb09e97 in the above PR.</p>
<p>I'll be merging the change sometime this week once we get some of the unrelated qa failures sorted out. If you are up for it, feel free to cherry-pick the changes and run them against the workload (or something similar) that initially prompted the creation of this tracker :)</p>
<p><strong>Update by Venky Shankar (vshankar@redhat.com), 2024-03-11T11:16:42Z</strong> (<a class="external" href="https://tracker.ceph.com/issues/53192?journal_id=256538">journal</a>)</p>
<p>Abhishek Lekshmanan wrote:</p>
<blockquote>
<p>Hi Venky, Patrick</p>
<p>further to our talk, we saw the MDS growing with a lot of log segments and crashing in the up:replay/resolve loop. I reported what we saw as a new issue, let us know if this is somehow related and what we can do to move things along <br /><a class="external" href="https://tracker.ceph.com/issues/64717">https://tracker.ceph.com/issues/64717</a></p>
</blockquote>
<p>Thanks for the report, Abhishek. I've left some comments on the above tracker. Let's continue the discussion there.</p>