Fix #5268
closed
mds: fix/clean up file size/mtime recovery code
Added by Sage Weil almost 11 years ago.
Updated about 6 years ago.
Category:
Performance/Resource Usage
Component(FS):
Client, MDS, osdc
Description
From diagnosing #4832 (see the attached log), it looks like this code needs an overhaul:
- I don't think we should be triggering recovery when transitioning from stable states; it should be triggered explicitly at some earlier point.
- We should hold a wrlock while gathering, and avoid the maxsize/size force_wrlock flag at the end.
- We should have well-defined behavior for when a client goes stale, resumes, goes stale again, etc., and races with file size recovery.
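As a rough illustration of the second point (hold the lock for the whole gather instead of forcing it at the end), here is a minimal Python sketch. It is not Ceph code: `InodeState`, `recover_size_mtime`, and `client_write` are hypothetical names, and a plain `threading.Lock` stands in for the MDS file wrlock, which in the real MDS is distributed lock state rather than a mutex.

```python
import threading
from typing import Callable, Tuple

class InodeState:
    """Hypothetical stand-in for an inode whose size/mtime may need recovery."""

    def __init__(self) -> None:
        self.size = 0
        self.mtime = 0.0
        # Assumption: the MDS file wrlock is modeled here as a simple mutex.
        self.wrlock = threading.Lock()

def recover_size_mtime(inode: InodeState, gather: Callable[[], Tuple[int, float]]) -> None:
    # Take the lock *before* gathering and keep it until the result is applied,
    # so no write can slip in between the probe and the update, and no
    # "force" flag is needed at the end.
    with inode.wrlock:
        size, mtime = gather()
        inode.size = size
        inode.mtime = max(inode.mtime, mtime)

def client_write(inode: InodeState, end_offset: int, mtime: float) -> None:
    # Writers take the same lock, so they wait for an in-progress recovery.
    with inode.wrlock:
        inode.size = max(inode.size, end_offset)
        inode.mtime = max(inode.mtime, mtime)
```

The point of the sketch is ordering only: the gather runs entirely under the lock, and later writes are applied on top of the recovered values rather than racing with them.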
Related issues
1 (1 open — 0 closed)
From #10875:
A very sparse file of length slightly larger than 1GB had received a few scattered writes when the MDS restarted.
Recovery decided to scan all 512 objects from 0 to 2GB.
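The 512 figure is consistent with Ceph's default 4 MiB object size if "2GB" is read as 2 GiB: 2^31 / 2^22 = 2^9 = 512 objects. A quick Python check, assuming a simple layout in which object i covers bytes [i * object_size, (i + 1) * object_size); `objects_in_range` is a hypothetical helper, and the file size of 1 GiB + 1 byte is an assumed stand-in for "slightly larger than 1GB":

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumption: default 4 MiB objects, stripe_count = 1
GiB = 1024 ** 3

def objects_in_range(start: int, end: int, object_size: int = OBJECT_SIZE) -> int:
    """Number of objects that cover the byte range [start, end)."""
    if end <= start:
        return 0
    return (end - 1) // object_size - start // object_size + 1

total = objects_in_range(0, 2 * GiB)        # objects from 0 to 2 GiB
file_size = GiB + 1                         # hypothetical "slightly larger than 1GB"
last_real = (file_size - 1) // OBJECT_SIZE  # index of the object holding the last byte
wasted = (total - 1) - last_real            # empty objects statted above it, scanning down from the top
print(total, last_real, wasted)             # 512 256 255
```

If the object holding the last byte exists, a one-at-a-time backward scan stats about 255 nonexistent objects before reaching it; at a few seconds per stat that is on the order of ten minutes.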
This takes a very long time on my cluster. Each object stat takes a few seconds, presumably because of the ongoing migration of data to an EC pool.
The MDS logs nothing but the delayed attempts to obtain rdlocks when I access the file.
If we probed multiple objects in parallel, I think it would go much faster, but it stats only one object at a time, going backwards. Starting the search so far beyond the actual size surely doesn't help either.
Regardless, recovery might still take a long time in particularly pathological cases, so it would be nice if the MDS logged long-running probe aggregate operations, just as it logs delayed client requests. This would at least give users a clue about what is going on when accessing a file takes a very, very long time.
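A minimal Python sketch of the kind of reporting asked for here, in the spirit of the MDS's delayed-request warnings. `ProbeTracker`, its methods, the 30-second threshold, and the logger name are all hypothetical; none of them is an existing MDS option or log message. The clock is injectable so the behavior can be checked without waiting.

```python
import logging
import time
from typing import Callable, Dict, List

log = logging.getLogger("mds.recovery")  # hypothetical logger name

class ProbeTracker:
    """Tracks in-flight size/mtime recovery probes and warns about slow ones."""

    def __init__(self, warn_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.warn_after = warn_after
        self.clock = clock
        self.started: Dict[int, float] = {}  # inode number -> start time
        self.probed: Dict[int, int] = {}     # inode number -> objects statted so far

    def start(self, ino: int) -> None:
        self.started[ino] = self.clock()
        self.probed[ino] = 0

    def object_probed(self, ino: int) -> None:
        self.probed[ino] += 1

    def finish(self, ino: int) -> None:
        self.started.pop(ino, None)
        self.probed.pop(ino, None)

    def report_slow(self) -> List[int]:
        """Log a warning for every recovery older than warn_after; return their inodes."""
        now = self.clock()
        slow = []
        for ino, t0 in self.started.items():
            age = now - t0
            if age >= self.warn_after:
                slow.append(ino)
                log.warning("size/mtime recovery on inode %#x running for %.0fs, %d objects probed",
                            ino, age, self.probed[ino])
        return slow
```

Calling `report_slow()` from a periodic tick would give users a log line naming the inode and progress, instead of only the rdlock delays.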
So: parallel object checks, and more visibility into ongoing recovery operations. Unfortunately, going backwards from max_size is necessary, since we need to demonstrate that we know the last object. :(
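A backward search can still prove which object is last while probing several objects at once: stat a batch of indices below the current upper bound concurrently, and stop at the first batch that contains an existing object. Every object above that batch was already shown to be absent, so the highest existing index in the batch is the last object. This is a minimal Python sketch, not the MDS/Filer implementation: `probe_size_mtime`, `stat_object`, and the batch size are hypothetical, a thread pool stands in for issuing concurrent RADOS stat operations, the layout is assumed to map object i to bytes [i * object_size, (i + 1) * object_size), and mtime is assumed to be the latest mtime among the existing objects actually probed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

# Hypothetical stat callback: (object_length, mtime), or None if the object is absent.
StatFn = Callable[[int], Optional[Tuple[int, float]]]

def probe_size_mtime(stat_object: StatFn, max_size: int, object_size: int,
                     batch: int = 8) -> Tuple[int, float]:
    """Recover (size, mtime) by statting objects backwards from max_size, batch at a time."""
    if max_size <= 0:
        return 0, 0.0
    hi = (max_size - 1) // object_size  # highest object index that could exist
    with ThreadPoolExecutor(max_workers=batch) as pool:
        while hi >= 0:
            lo = max(0, hi - batch + 1)
            indices: List[int] = list(range(hi, lo - 1, -1))  # descending order
            # All stats in this batch are in flight at the same time.
            results = list(pool.map(stat_object, indices))
            existing = [(i, r) for i, r in zip(indices, results) if r is not None]
            if existing:
                # Everything above this batch is known to be absent, so the
                # first (highest) existing index here is the last object.
                last_index, (last_len, _) = existing[0]
                size = last_index * object_size + last_len
                mtime = max(m for _, (_, m) in existing)
                return size, mtime
            hi = lo - 1  # whole batch absent: move the upper bound down
    return 0, 0.0  # no objects at all
```

The number of objects statted is the same as in a one-at-a-time backward scan (plus at most `batch - 1` extra below the last object); what changes is that each round costs roughly one stat latency instead of `batch` of them.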
- Category changed from 47 to Performance/Resource Usage
- Component(FS) MDS added
- Status changed from 12 to New
- Assignee set to Zheng Yan
- Target version set to v13.0.0
- Component(FS) Client, osdc added
- Status changed from New to Closed
The current code does parallel object checks.