Feature #12671
open
Enforce cache limit during dirfrag load during open_ino (during rejoin)
0%
Description
When clients replay requests referring to inodes not found in cache, the inode numbers are stashed for loading later (in MDCache::cap_imports).
Later, in MDCache::process_imported_caps (i.e. during rejoin), MDCache calls open_ino for these.
open_ino (and subsequently open_ino_traverse_dir) loads the backtrace and traverses the parents, and every dirfrag it traverses is fetched in full if it is not already complete.
The result is that if you have many large dirfrags, and some imported caps during rejoin, then it is possible for the MDS to aggressively exceed the usual cache size limit (trim() is never called during rejoin).
We need to either do some trimming at some point during this phase, or we need to make the open_ino procedure not force directories to be completely opened (by improving the CDir::fetch path to allow selective loading of dentries).
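To make the two options above concrete, here is a toy model in Python. It is not Ceph code: ToyCache, open_inos, and the selective and trim_between flags are hypothetical names invented for illustration, and the "cache" is just a set of (dir, dentry) pairs. It only shows how peak cache size differs between loading whole dirfrags, trimming between loads, and loading only the needed dentries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCache:
    """Hypothetical stand-in for the MDS cache; not the real MDCache."""
    limit: int
    inodes: set = field(default_factory=set)
    peak: int = 0

    def insert(self, entries):
        self.inodes.update(entries)
        self.peak = max(self.peak, len(self.inodes))

    def trim(self, pinned):
        # Drop unpinned entries until the cache is at or below its limit.
        for entry in sorted(self.inodes - pinned):
            if len(self.inodes) <= self.limit:
                break
            self.inodes.discard(entry)


def open_inos(cache, dirfrags, wanted, selective=False, trim_between=False):
    """Load the wanted dentries and return the peak cache size.

    dirfrags: dir name -> list of all dentry names in that dirfrag
    wanted:   dir name -> set of dentry names referenced by imported caps
    """
    pinned = set()
    for d, needed in wanted.items():
        pinned |= {(d, n) for n in needed}
        # Full fetch pulls in every dentry; selective fetch only the needed ones.
        names = needed if selective else dirfrags[d]
        cache.insert((d, n) for n in names)
        if trim_between:
            cache.trim(pinned)
    return cache.peak
```

With 10 dirfrags of 1000 dentries each, one wanted dentry per dirfrag, and a limit of 2000, this model peaks at 10000 entries for full loads without trimming, 3000 with trimming between loads, and 10 with selective loading.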
Updated by John Spray over 8 years ago
- Category set to 47
- Priority changed from Normal to High
The source of this observation was https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22235.html
In this instance the user has 64k files in each directory, and directory fragmentation is not enabled (as is our current default).
However, we could readily see this scenario even if fragmentation were enabled. For example, if there are 100 clients working in 100 dirs, each just below the default fragmentation threshold (10k dentries), we would try to ram a million inodes into memory during rejoin.
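The arithmetic behind that estimate, as a small Python sketch. The function name and the cache_limit value are assumptions chosen for illustration; the limit is not taken from this ticket.

```python
def worst_case_inodes(active_dirs, dentries_per_dir):
    """Inodes pulled into cache if every touched dirfrag is loaded in full."""
    return active_dirs * dentries_per_dir


loaded = worst_case_inodes(active_dirs=100, dentries_per_dir=10_000)
cache_limit = 100_000  # hypothetical limit, for comparison only

print(loaded)                # 1000000
print(loaded / cache_limit)  # 10.0
```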
Updated by Greg Farnum almost 8 years ago
The naive solution to this seems pretty bad as well. If we load only the needed dentries, one at a time, we'll probably do far more disk accesses than necessary, and disk access is already the limiting factor in replay speed.
So we will want to be careful about batching disk IOs together.
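One way to picture the batching concern is to group the wanted dentries by dirfrag before issuing any reads, so that each dirfrag is touched once rather than once per dentry. The sketch below is a Python illustration only: plan_reads and the (dirfrag, dentry) request tuples are hypothetical and do not correspond to an existing CDir::fetch interface.

```python
from collections import defaultdict


def plan_reads(requests):
    """Group (dirfrag_id, dentry_name) requests into one batch per dirfrag.

    Returns a dict mapping each dirfrag to the sorted, de-duplicated list
    of dentries wanted from it.
    """
    batches = defaultdict(set)
    for frag, name in requests:
        batches[frag].add(name)
    return {frag: sorted(names) for frag, names in batches.items()}


requests = [("A", "x"), ("B", "y"), ("A", "z"), ("A", "x")]
plan = plan_reads(requests)
print(len(requests), "serial reads vs", len(plan), "batched reads")
```

Here four serial lookups collapse into two batched reads, one per dirfrag.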
Updated by Greg Farnum almost 8 years ago
- Category changed from 47 to Performance/Resource Usage
- Component(FS) MDS added
Updated by Greg Farnum almost 8 years ago
If we do #13688, we probably won't need this one or can put it off.