Bug #41140 (closed): mds: trim cache more regularly

Added by Patrick Donnelly over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Category: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Under create workloads that acquire a lot of capabilities, the MDS can't trim its cache fast enough. The cache-trimming throttle is hit at ~64K dentries removed, but because the upkeep trimming only runs every 5 seconds, the cache is trimmed too slowly.

It's undesirable to simply raise the throttle limit from 64K to 512K, because the MDS then spends a long time trimming the cache every 5 seconds. A better approach would be to drive cache trimming more regularly.

See also thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/U7XNK3RTV3YZHJFUZ3QJUXHH2WYAT4DN/
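For context, the throttle described above maps to the `mds_cache_trim_threshold` option, and the stopgap of raising it can be applied at runtime. A rough sketch, assuming Mimic or newer where `ceph config set` exists, with the 512K figure taken from the paragraph above:

    # Raise the per-pass trim throttle from the ~64K default to 512K dentries.
    # As noted above, each 5-second upkeep pass then runs correspondingly longer,
    # so this is a workaround rather than the fix this ticket asks for.
    ceph config set mds mds_cache_trim_threshold 524288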

#1

Updated by Dan van der Ster over 4 years ago

FWIW, I can trigger the same problems here without creates. `ls -lR` of a large tree is enough, and increasing the throttle limits can help, but not in every case.

#2

Updated by Patrick Donnelly over 4 years ago

Dan van der Ster wrote:

FWIW, I can trigger the same problems here without creates. `ls -lR` of a large tree is enough, and increasing the throttle limits can help, but not in every case.

Yes, that makes sense of course. The problem is not specific to create workloads.

#3

Updated by Patrick Donnelly over 4 years ago

  • Description updated (diff)
#4

Updated by Janek Bevendorff over 4 years ago

This may be obvious, but to put the whole thing into context: this cache trimming issue can make a CephFS permanently unusable if the MDS hits the physical memory limit. After being kicked by the MON, a standby will continuously try to reload capabilities, fail and be kicked as well.

#5

Updated by Dan van der Ster over 4 years ago

Janek Bevendorff wrote:

This may be obvious, but to put the whole thing into context: this cache trimming issue can make a CephFS permanently unusable if the MDS hits the physical memory limit. After being kicked by the MON, a standby will continuously try to reload capabilities, fail and be kicked as well.

Yes, we also hit that in prod: the MDSs were flapping and running out of memory while trying to reload the caps.
Our workaround was to unmount CephFS on the affected clients so they wouldn't reconnect and their caps wouldn't need to be reloaded the next time the standby MDS tried to start.

What might also help here is a more conservative cache reservation and an earlier warning when the cache goes over the memory limit:

mds cache reservation = 0.2  # default 0.05. reserve more LRU space to handle peak client loads
mds health cache threshold = 1.2 # default 1.5. Early warning if the mds cache is not being trimmed

I'm also planning to try `mds_cap_revoke_eviction_timeout` to see whether such clients can be evicted early, before they break the cluster.
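For completeness, a sketch of how those settings can be applied at runtime, assuming Mimic or newer where the centralized `ceph config set` command exists (the eviction timeout value is purely illustrative; that option is disabled by default):

    ceph config set mds mds_cache_reservation 0.2
    ceph config set mds mds_health_cache_threshold 1.2
    # Evict clients that fail to release caps within N seconds; 300 is illustrative.
    ceph config set mds mds_cap_revoke_eviction_timeout 300

On releases without `ceph config set`, the same options can go into ceph.conf under [mds] or be injected with `ceph tell mds.<name> injectargs`.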

#6

Updated by Mark Nelson over 4 years ago

We did much the same thing in the OSD. Previously we trimmed in a single thread at regular intervals; now we trim on add, in whatever thread issues the add. We also reduced lock contention (the buffer and onode caches used to share the same lock), which helped as well. That work proved very beneficial, with a 25-30% increase in 4K random write IOPS in BlueStore. The red "wip-bs-cache-evict" numbers are what we gained just from the trim behavior changes:

https://github.com/ceph/ceph/pull/28597

#7

Updated by Patrick Donnelly over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 29542
#8

Updated by Janek Bevendorff over 4 years ago

I have the following settings now, which seem to work okay-ish:

mds         advanced mds_beacon_grace                   120.000000                               
mds         basic    mds_cache_memory_limit             26843545600                              
mds         advanced mds_cache_reservation              0.100000                                 
mds         advanced mds_cache_trim_threshold           524288                                   
mds         advanced mds_health_cache_threshold         1.200000                                 
mds         advanced mds_recall_max_caps                15360                                    
mds         advanced mds_recall_max_decay_rate          1.000000

I have three active MDSs, and each one is handling about 300-2500 requests/s with a total throughput of 100-300 MB/s. The latter is not much, and in theory I should be getting up to 2x10 Gbps, but considering this is a latency-dominated workload with millions of small files, I guess it's alright.
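As a sketch of how to check whether settings like these actually keep the cache in bounds, the MDS admin socket can be polled on the MDS host (`mds.<name>` is a placeholder for the local daemon name):

    # Cache size (items and bytes) versus the configured limit
    ceph daemon mds.<name> cache status

    # Perf counters; the mds_mem section shows dentry/inode counts and memory usage
    ceph daemon mds.<name> perf dump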

#9

Updated by Janek Bevendorff over 4 years ago

I believe this problem may be particularly severe when the main data pool is an EC pool. I am trying the same thing with a replicated pool now and am running into far fewer issues.

#10

Updated by Patrick Donnelly over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
#13

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved
  • Backport deleted (nautilus,mimic)

Since #41141 is fixed by the same PR, we'll handle the backports there.

#14

Updated by Jan Fajerski over 4 years ago

As this won't be backported to Luminous and many of the MDS options mentioned here don't exist in Luminous, is there a way to address, or at least somewhat improve, this on Luminous?

#15

Updated by Patrick Donnelly over 4 years ago

Jan Fajerski wrote:

As this won't be backported to Luminous and many of the MDS options mentioned here don't exist in Luminous, is there a way to address, or at least somewhat improve, this on Luminous?

Luminous is EOL. We'd recommend you upgrade to Mimic or Nautilus.

The other options I mentioned on the mailing list should help mitigate the problem. You can also try setting the MDS upkeep/tick interval, mds_tick_interval, to 1 second.
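On Luminous itself, where `ceph config set` is not available, a rough sketch of applying that via injectargs and ceph.conf (`mds.<name>` is a placeholder for the daemon name):

    # Apply at runtime to a running MDS
    ceph tell mds.<name> injectargs '--mds_tick_interval=1'

    # Persist across restarts in ceph.conf on the MDS hosts
    [mds]
        mds tick interval = 1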
