Documentation #48585

mds_cache_trim_decay_rate misnamed?

Added by Jan Fajerski about 1 month ago. Updated 21 days ago.

Status: Resolved
Priority: Normal
Category: -
% Done: 0%

Description

I'm unsure about all this, so input is appreciated.

I recently played around with this and essentially broke a cluster by misinterpreting the option name mds_cache_trim_decay_rate. I wanted to increase the speed with which the MDS trimmed its cache. Looking at https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-cache-trimming I figured, ok there is a threshold and a rate. I'll increase the rate to get faster cache trimming.
To my surprise, this seemed to have the opposite effect (though I suppose it is not necessarily causal): the cache grew out of bounds.

Looking closer at the formula -ln(0.5)/rate*threshold and glancing at the code, it seems like this option would be more suitably named half-life or mean lifetime (to stay in the realm of particle physics).
This, imho, would suit the behaviour "increase option value -> slower cache trimming" better.
When I increase a decay rate, I typically expect the decay to go quicker, not slower.
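
To make the direction of the effect concrete, here is a minimal sketch that simply evaluates the formula from the docs for a few option values. The function name and the threshold value are illustrative assumptions of mine and are not taken from the Ceph code.

```python
import math


def steady_state_trim_rate(threshold, decay_rate):
    """Evaluate -ln(0.5)/rate*threshold, the steady-state items trimmed per second.

    `threshold` stands in for mds_cache_trim_threshold and `decay_rate`
    for mds_cache_trim_decay_rate. The function itself is hypothetical.
    """
    if decay_rate <= 0:
        raise ValueError("decay_rate must be positive")
    return -math.log(0.5) / decay_rate * threshold


if __name__ == "__main__":
    threshold = 65536  # illustrative value only
    for decay_rate in (0.5, 1.0, 2.0, 4.0):
        rate = steady_state_trim_rate(threshold, decay_rate)
        print(f"decay_rate={decay_rate:>4}: ~{rate:.0f} items/s")
```

If the formula holds, doubling the option value halves the trimming rate, which is the behaviour of a half-life and not of a rate.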

History

#1 Updated by Patrick Donnelly about 1 month ago

Jan Fajerski wrote:

I'm unsure about all this, so input is appreciated.

I recently played around with this and essentially broke a cluster by misinterpreting the option name mds_cache_trim_decay_rate. I wanted to increase the speed with which the MDS trimmed its cache. Looking at https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-cache-trimming I figured, ok there is a threshold and a rate. I'll increase the rate to get faster cache trimming.
To my surprise, this seemed to have the opposite effect (though I suppose it is not necessarily causal): the cache grew out of bounds.

Yes.

Looking closer at the formula -ln(0.5)/rate*threshold and glancing at the code,

That formula is accurate only for a "constantly refilled counter". In this case, that means the cache is continually filled with new metadata after trimming.
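
As a rough illustration of the "constantly refilled counter" case, the toy model below lets a counter decay with a given half-life and tops it back up to the threshold at every time step, counting the top-ups as trimmed items. This is a sketch under my own assumptions (the function name, time step, and refill policy are all hypothetical) and it is not how the MDS code is actually structured.

```python
import math


def simulate_trim_rate(threshold, half_life, duration=10.0, dt=0.001):
    """Toy model of a decay counter that is refilled to `threshold` every step.

    Returns the average number of items "trimmed" per second, where each
    item trimmed adds one to the counter. Hypothetical helper, not Ceph code.
    """
    decay_per_step = 0.5 ** (dt / half_life)  # exponential half-life decay
    counter = float(threshold)  # start saturated, i.e. already at steady state
    trimmed = 0.0
    for _ in range(int(round(duration / dt))):
        counter *= decay_per_step       # counter decays over one time step
        refill = threshold - counter    # room freed up for more trimming
        trimmed += refill
        counter = float(threshold)      # trimming fills the counter again
    return trimmed / duration


if __name__ == "__main__":
    threshold = 65536  # illustrative value only
    for half_life in (1.0, 2.0):
        simulated = simulate_trim_rate(threshold, half_life)
        formula = -math.log(0.5) / half_life * threshold
        print(f"half_life={half_life}: simulated={simulated:.0f}, formula={formula:.0f}")
```

With a small time step the simulated rate approaches the value given by the formula. If the cache is not being refilled, the counter is not kept at the threshold and the formula no longer describes the trimming rate.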

it seems like this option would be more suitably named half-life or mean lifetime (to stay in the realm of particle physics).
This, imho, would suit the behaviour "increase option value -> slower cache trimming" better.
When I increase a decay rate, I typically expect the decay to go quicker, not slower.

I think I just got that option name from the code (DecayRate) but, yes, the name is unfortunate. I don't think it's really feasible to change it at this point though.

#2 Updated by Jan Fajerski about 1 month ago

Patrick Donnelly wrote:

I think I just got that option name from the code (DecayRate) but, yes, the name is unfortunate. I don't think it's really feasible to change it at this point though.

Yeah, I understand, and it's probably equally infeasible to change the semantics to something like 1/value in the code in order to turn it into a rate.
Then at the very least we should document that explicitly. Happy to propose something.

#3 Updated by Patrick Donnelly about 1 month ago

Jan Fajerski wrote:

Patrick Donnelly wrote:

I think I just got that option name from the code (DecayRate) but, yes, the name is unfortunate. I don't think it's really feasible to change it at this point though.

Yeah, I understand, and it's probably equally infeasible to change the semantics to something like 1/value in the code in order to turn it into a rate.
Then at the very least we should document that explicitly. Happy to propose something.

It's documented now with the docs you linked in the issue description. Where else should we document this?

#4 Updated by Jan Fajerski about 1 month ago

  • Status changed from New to Fix Under Review
  • Backport set to octopus
  • Pull request ID set to 38587

No other places; just being more explicit would be helpful, I think.

#5 Updated by Patrick Donnelly 21 days ago

  • Tracker changed from Bug to Documentation
  • Status changed from Fix Under Review to Resolved
  • Assignee set to Jan Fajerski
  • Target version set to v16.0.0
  • Backport deleted (octopus)
