Fix #5268
mds: fix/clean up file size/mtime recovery code
Status:
Closed
Priority:
High
Assignee:
Category:
Performance/Resource Usage
Target version:
% Done:
0%
Source:
Development
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, MDS, osdc
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
From diagnosing #4832 (see the attached log), it looks like this code needs an overhaul:
- I don't think we should trigger recovery when transitioning from stable states; it should be triggered explicitly at some earlier point.
- We should hold a wrlock while gathering, and avoid the maxsize/size force_wrlock flag at the end.
- We should have well-defined behavior for when a client goes stale, resumes, goes stale again, etc., and races with file size recovery.
Updated by Greg Farnum about 9 years ago
From #10875:
A very sparse file of length slightly larger than 1GB had received a few scattered writes when the mds restarted. Recovery decided to scan all 512 objects from 0 to 2GB. This takes a very long time on my cluster. Each object stat takes a few seconds, presumably because of the ongoing migration of data to an EC pool. The only information the mds logs is the delayed attempts to obtain rdlocks when I access the file. If we probed multiple objects in parallel, I think it would go much faster, but it stats only one object at a time, going backwards. Starting the search so far away from the actual size surely doesn't help either. Regardless, recovery might still take a long time in particularly pathological cases, so it would be nice if the mds logged long-running probe aggregate operations, just as it logs delayed client requests. This would at least give users a clue about what is going on when accessing a file takes a very, very long time.
So: parallel object checks. More visibility about ongoing recovery operations. Unfortunately going backwards from max_size is necessary, since we need to demonstrate that we know the last object. :(
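To make the idea concrete, here is a minimal sketch of a batched, parallel backward probe. It is not the MDS implementation: `probe_size`, `stat_object`, and the `batch` parameter are hypothetical names, and the 4 MiB object size is an assumption inferred from "512 objects from 0 to 2GB". The probe still walks backwards from max_size, but issues each batch of stats concurrently and stops at the highest object that holds data.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed object size: 2 GiB / 512 objects = 4 MiB (not stated in the ticket).
OBJECT_SIZE = 4 * 1024 * 1024


def probe_size(stat_object, max_size, object_size=OBJECT_SIZE, batch=16):
    """Recover a file size by probing objects backwards from max_size.

    stat_object(index) is a hypothetical callback that returns the size in
    bytes of object `index`, or None if that object does not exist.
    """
    num_objects = -(-max_size // object_size)  # ceiling division
    with ThreadPoolExecutor(max_workers=batch) as pool:
        hi = num_objects
        while hi > 0:
            lo = max(0, hi - batch)
            # Stat one batch in parallel, highest object index first.
            indices = range(hi - 1, lo - 1, -1)
            sizes = list(pool.map(stat_object, indices))
            for index, size in zip(indices, sizes):
                if size:
                    # Highest object with data: it determines the file size.
                    return index * object_size + size
            hi = lo
    return 0  # no object holds any data


# Usage: a sparse file slightly larger than 1 GiB with a few scattered writes.
objects = {0: 4096, 3: 100, 257: 1000}
recovered = probe_size(objects.get, 2 * 1024**3)
```

With a batch of 16, the wall-clock cost of reaching the last object drops from one stat latency per object to roughly one per batch, at the price of a few extra stats below the true end of the file.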
Updated by Greg Farnum almost 8 years ago
- Category changed from 47 to Performance/Resource Usage
- Component(FS) MDS added
Updated by Greg Farnum almost 8 years ago
- Related to Feature #4485: Improve "needsrecover" handling added
Updated by Patrick Donnelly about 6 years ago
- Assignee set to Zheng Yan
- Target version set to v13.0.0
- Component(FS) Client, osdc added
Updated by Zheng Yan about 6 years ago
- Status changed from New to Closed
Current code does parallel object checks.