Bug #41140: mds: trim cache more regularly
Status: Closed
Description
Under create workloads that acquire a lot of capabilities, the MDS can't trim its cache fast enough. The cache trimming throttle is hit at ~64K dentries removed, but because upkeep trimming only runs every 5 seconds, the cache is trimmed too slowly.
It's undesirable to simply raise the throttle limit from 64K to 512K, as the MDS would then spend a long time trimming the cache every 5 seconds. A better approach is to drive cache trimming more regularly.
See also thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/U7XNK3RTV3YZHJFUZ3QJUXHH2WYAT4DN/
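For context, the throttle described above is tunable at runtime. A minimal sketch, assuming a Nautilus-or-later cluster (option names exist there; the values below are illustrative, not recommendations):

```shell
# Illustrative only: raise the per-tick trim throttle and adjust its decay.
# mds_cache_trim_threshold defaults to 64Ki dentries; mds_cache_trim_decay_rate to 1.0.
ceph config set mds mds_cache_trim_threshold 262144
ceph config set mds mds_cache_trim_decay_rate 1.0
```

As the description notes, raising the threshold alone just makes each 5-second upkeep pass longer; the fix tracked here drives trimming more often instead.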
Updated by Dan van der Ster over 4 years ago
FWIW I can trigger the same problem here without creates: `ls -lR` of a large tree is enough, and increasing the throttle limits helps in some cases but not all.
Updated by Patrick Donnelly over 4 years ago
Dan van der Ster wrote:
FWIW I can trigger the same problem here without creates: `ls -lR` of a large tree is enough, and increasing the throttle limits helps in some cases but not all.
Yes, that makes sense of course. The problem is not specific to create workloads.
Updated by Janek Bevendorff over 4 years ago
This may be obvious, but to put the whole thing into context: this cache trimming issue can make a CephFS permanently unusable if the MDS hits the physical memory limit. After being kicked by the MON, a standby will continuously try to reload capabilities, fail and be kicked as well.
Updated by Dan van der Ster over 4 years ago
Janek Bevendorff wrote:
This may be obvious, but to put the whole thing into context: this cache trimming issue can make a CephFS permanently unusable if the MDS hits the physical memory limit. After being kicked by the MON, a standby will continuously try to reload capabilities, fail and be kicked as well.
Yes, we also hit that in prod: MDSs were flapping and running out of memory trying to reload the caps.
The workaround in our case was to unmount CephFS from the relevant clients, so they wouldn't reconnect and their caps wouldn't need to be reloaded the next time the standby MDS tried to start.
A more conservative cache reservation and an earlier warning when the MDS goes over its memory limit might also help here:
mds cache reservation = 0.2       # default 0.05; reserve more LRU space to handle peak client loads
mds health cache threshold = 1.2  # default 1.5; warn earlier if the mds cache is not being trimmed
I also planned to try out the `mds_cap_revoke_eviction_timeout` to see if these clients can be evicted early before breaking the cluster.
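The settings above can also be applied cluster-wide at runtime rather than via ceph.conf. A sketch, assuming Mimic/Nautilus `ceph config` syntax (the eviction timeout value is an example; the option is disabled when 0):

```shell
# Apply the suggested conservative settings to all MDS daemons.
ceph config set mds mds_cache_reservation 0.2        # default 0.05
ceph config set mds mds_health_cache_threshold 1.2   # default 1.5
# Optionally evict clients that fail to return revoked caps in time:
ceph config set mds mds_cap_revoke_eviction_timeout 300
```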
Updated by Mark Nelson over 4 years ago
We did much the same thing in the OSD. Previously we trimmed in a single thread at regular intervals, but now we trim on add, in whatever thread issues the add. We also reduced lock contention (the buffer and onode caches shared the same lock), which also helped. That work proved very beneficial, with a 25-30% increase in 4K random write IOPS in BlueStore. The red "wip-bs-cache-evict" numbers (chart not reproduced here) show what we gained just from the trim behavior changes.
Updated by Patrick Donnelly over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 29542
Updated by Janek Bevendorff over 4 years ago
I have the following settings now, which seem to work okay-ish:
mds  advanced  mds_beacon_grace             120.000000
mds  basic     mds_cache_memory_limit       26843545600
mds  advanced  mds_cache_reservation        0.100000
mds  advanced  mds_cache_trim_threshold     524288
mds  advanced  mds_health_cache_threshold   1.200000
mds  advanced  mds_recall_max_caps          15360
mds  advanced  mds_recall_max_decay_rate    1.000000
I have three active MDSs and each one is handling about 300-2500 requests/s with a total throughput of 100-300MB/s. The latter is not much and in theory I should be getting up to 2x10Gbps, but considering that this is a latency-dominated task with millions of small files, I guess it's alright.
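One way to verify that settings like these are actually keeping trimming ahead of growth is to watch cache usage on each MDS over time. A sketch using the admin socket on the MDS host ("mds.a" is a placeholder daemon name):

```shell
# Report current cache usage versus the configured limit.
ceph daemon mds.a cache status
# Dump the mds_mem perf counters (inode/dentry counts) to track growth over time.
ceph daemon mds.a perf dump mds_mem
```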
Updated by Janek Bevendorff over 4 years ago
I believe this problem may be particularly severe when the main data pool is an EC pool. I am trying the same workload with a replicated pool now and am seeing far fewer issues.
Updated by Patrick Donnelly over 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved
- Backport deleted (nautilus,mimic)
Since #41141 is fixed by the same PR, we'll handle the backports there.
Updated by Jan Fajerski over 4 years ago
As this won't be backported to Luminous, and many of the mentioned MDS options don't exist in Luminous, is there a way to address, or at least somewhat improve, this in Luminous?
Updated by Patrick Donnelly over 4 years ago
Jan Fajerski wrote:
As this won't be backported to Luminous, and many of the mentioned MDS options don't exist in Luminous, is there a way to address, or at least somewhat improve, this in Luminous?
Luminous is EOL. We'd recommend you upgrade to Mimic or Nautilus.
The other options I mentioned on the mailing list should help mitigate the problem. You can also try setting the MDS upkeep interval, `mds_tick_interval`, to 1 (the default is 5 seconds).
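On Luminous, which predates the `ceph config` store, the tick interval can be changed on running daemons with injectargs. A sketch ("mds.a" is a placeholder daemon name):

```shell
# Drive the MDS upkeep tick every second instead of every 5 seconds,
# so cache trimming runs more often. Not persistent across restarts;
# add mds_tick_interval = 1 to ceph.conf to make it permanent.
ceph tell mds.a injectargs '--mds_tick_interval 1'
```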