Bug #41140

mds: trim cache more regularly

Added by Patrick Donnelly 14 days ago. Updated 7 days ago.

Status:
Need Review
Priority:
Urgent
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
nautilus,mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

Under create workloads that result in the acquisition of a lot of capabilities, the MDS can't trim the cache fast enough. The cache trimming throttle gets hit at ~64K dentries removed but because the upkeep trimming only occurs every 5 seconds, the cache is trimmed too slowly.

It's undesirable to just increase the throttle limit from 64K to 512K as then the MDS spends a long time trimming the cache every 5 seconds. A better approach would be to drive cache trimming more regularly.
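A back-of-envelope sketch of the bound described above (the 64K-per-5-seconds throttle comes from the description; the insert rate is an illustrative assumption):

```python
# Why a fixed trim batch every 5 seconds falls behind a busy workload.
# TRIM_BATCH and UPKEEP_INTERVAL are taken from the description above;
# insert_rate is an illustrative assumption.

TRIM_BATCH = 64 * 1024       # dentries trimmed per upkeep pass (throttle)
UPKEEP_INTERVAL = 5.0        # seconds between upkeep passes

max_trim_rate = TRIM_BATCH / UPKEEP_INTERVAL   # ~13107 dentries/s

# A create-heavy client inserting 20K dentries/s outruns the trimmer,
# so the cache grows without bound until the memory limit is hit.
insert_rate = 20_000
growth_per_second = insert_rate - max_trim_rate

print(f"max trim rate:    {max_trim_rate:.0f} dentries/s")
print(f"net cache growth: {growth_per_second:.0f} dentries/s")
```

Raising the batch to 512K would raise the trim rate but concentrate all of that work into one long pass every 5 seconds; trimming smaller amounts more often gives the same aggregate rate with bounded per-pass latency.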

See also thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/U7XNK3RTV3YZHJFUZ3QJUXHH2WYAT4DN/

History

#1 Updated by Dan van der Ster 14 days ago

FWIW I can trigger the same problems here without creates. `ls -lR` of a large tree is enough, and increasing the throttle limits helps, but not in every case.

#2 Updated by Patrick Donnelly 14 days ago

Dan van der Ster wrote:

FWIW I can trigger the same problems here without creates. `ls -lR` of a large tree is enough, and increasing the throttle limits helps, but not in every case.

Yes, that makes sense of course. The problem is not specific to create workloads.

#3 Updated by Patrick Donnelly 14 days ago

  • Description updated (diff)

#4 Updated by Janek Bevendorff 13 days ago

This may be obvious, but to put the whole thing into context: this cache trimming issue can make a CephFS permanently unusable if the MDS hits the physical memory limit. After being kicked by the MON, a standby will continuously try to reload capabilities, fail, and be kicked as well.

#5 Updated by Dan van der Ster 13 days ago

Janek Bevendorff wrote:

This may be obvious, but to put the whole thing into context: this cache trimming issue can make a CephFS permanently unusable if the MDS hits the physical memory limit. After being kicked by the MON, a standby will continuously try to reload capabilities, fail, and be kicked as well.

Yes, we also hit that in prod. MDSs were flapping and running out of memory trying to reload the caps.
A workaround in our case was to unmount CephFS from the relevant clients, so they wouldn't reconnect and their caps wouldn't need to be reloaded the next time the standby MDS tried to start.

Maybe helpful here would be a more conservative cache reservation and an earlier warning on going over the memory limit:

mds cache reservation = 0.2  # default 0.05. reserve more LRU space to handle peak client loads
mds health cache threshold = 1.2 # default 1.5. Early warning if the mds cache is not being trimmed

I also planned to try out the `mds_cap_revoke_eviction_timeout` to see if these clients can be evicted early before breaking the cluster.

#6 Updated by Mark Nelson 13 days ago

We did much the same thing in the OSD. Previously we trimmed in a single thread at regular intervals, but now we trim on add, in whatever thread issues the add. We also reduced lock contention (the buffer and onode caches shared the same lock), which helped as well. That work proved very beneficial, yielding a 25-30% increase in 4K random write IOPS in BlueStore. The red "wip-bs-cache-evict" numbers are what we gained just from the trim behavior changes:

https://github.com/ceph/ceph/pull/28597
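The trim-on-add pattern described above can be sketched as follows. This is a hypothetical illustration, not the actual BlueStore code (which is C++ with sharded caches and finer-grained locking); `TrimOnAddLRU` and `trim_chunk` are names invented for this sketch:

```python
# Sketch of "trim on add": rather than a periodic thread draining the
# whole backlog at once, each insert does a small, bounded amount of
# eviction inline, so the cache never drifts far above capacity and no
# single caller stalls on a huge trim pass.
from collections import OrderedDict

class TrimOnAddLRU:
    def __init__(self, capacity, trim_chunk=4):
        self.capacity = capacity
        self.trim_chunk = trim_chunk   # bounded eviction work per insert
        self.items = OrderedDict()     # insertion/access order = LRU order

    def add(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)    # mark as most recently used
        # Trim inline, at most trim_chunk entries per call.
        trimmed = 0
        while len(self.items) > self.capacity and trimmed < self.trim_chunk:
            self.items.popitem(last=False)   # evict the LRU entry
            trimmed += 1

cache = TrimOnAddLRU(capacity=3)
for i in range(10):
    cache.add(i, i)
print(sorted(cache.items))   # → [7, 8, 9]: only the newest entries remain
```

The design trade-off is the same one discussed for the MDS: many small trims spread across callers keep memory bounded continuously, instead of alternating between unbounded growth and one long stop-the-world trim.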

#7 Updated by Patrick Donnelly 13 days ago

  • Status changed from New to Need Review
  • Pull request ID set to 29542

#8 Updated by Janek Bevendorff 12 days ago

I have the following settings now, which seem to work okay-ish:

mds  advanced  mds_beacon_grace            120.000000
mds  basic     mds_cache_memory_limit      26843545600
mds  advanced  mds_cache_reservation       0.100000
mds  advanced  mds_cache_trim_threshold    524288
mds  advanced  mds_health_cache_threshold  1.200000
mds  advanced  mds_recall_max_caps         15360
mds  advanced  mds_recall_max_decay_rate   1.000000

I have three active MDSs and each one is handling about 300-2500 requests/s with a total throughput of 100-300MB/s. The latter is not much, and in theory I should be getting up to 2x10Gbps, but considering that this is a latency-dominated workload with millions of small files, I guess it's alright.
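Rough arithmetic on the figures above (taking the upper bounds of the reported ranges) suggests why the link is nowhere near saturated:

```python
# Back-of-envelope from the numbers reported above: small average
# payloads mean the workload is latency-bound, not bandwidth-bound.
total_reqs_per_s = 3 * 2500          # three MDSs at the upper bound
throughput_mb_s = 300                # aggregate upper bound, MB/s

avg_kb_per_req = throughput_mb_s * 1024 / total_reqs_per_s   # ~41 KB
link_mb_s = 2 * 10_000 / 8           # 2x10Gbps ≈ 2500 MB/s
utilization = throughput_mb_s / link_mb_s                    # ~12%

print(f"~{avg_kb_per_req:.0f} KB per request, link at {utilization:.0%}")
```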

#9 Updated by Janek Bevendorff 7 days ago

I believe this problem may be particularly severe when the main data pool is an EC pool. I am trying the same workload with a replicated pool now and am seeing far fewer issues.
