Fix #10874

MDS: file recovery overwrites explicit client timestamps

Added by Alexandre Oliva about 9 years ago. Updated over 7 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, Common/Protocol, MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I think I've had an open bug about this for a very long time, but I couldn't find it, and I think it had been regarded as another problem, already fixed.

The problem arises when rsyncing a tree within cephfs with -a, so that timestamps are preserved.

If mds recovery takes place, timestamps may go out of sync.

I think I finally have a working theory to explain this: it is the file size and mtime recovery.

We take the latest mtime among the objects the file is striped into, but recovery takes place after replaying the journal, so we override the mtime set by rsync.
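
For illustration only, here is a rough sketch of the recovery behaviour described above, with hypothetical names and types rather than the actual MDS code: after journal replay, the file's objects are probed, the newest object mtime wins, and the journaled, client-set mtime is lost.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of size/mtime recovery; not the real MDS code.
struct ObjectInfo { uint64_t size; double mtime; };
struct InodeStat  { uint64_t size; double mtime; };

void recover_size_mtime(InodeStat& inode, const std::vector<ObjectInfo>& probed) {
  double newest_mtime = 0;
  uint64_t max_size = 0;
  for (const auto& o : probed) {
    newest_mtime = std::max(newest_mtime, o.mtime);  // latest mtime among striped objects
    max_size     = std::max(max_size, o.size);       // (striping math omitted)
  }
  // This runs after journal replay, so it clobbers the mtime the client
  // (e.g. rsync -a) explicitly set and that had already been journaled.
  inode.mtime = newest_mtime;
  inode.size  = max_size;
}
```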

I'm seeing this with Giant; I haven't tried newer releases yet.

History

#1 Updated by Greg Farnum about 9 years ago

  • Project changed from Ceph to CephFS
  • Subject changed from timestamps modified upon mds restart to MDS: file recovery overwrites explicit client timestamps
  • Category set to 47

I'm not quite sure the best way to go about fixing this. I think we might have discussed just dropping the mtime recovery, on the theory that whatever the MDS has in-journal is less likely to be wildly inaccurate than OSD timestamps.
That would also make things like Hadoop or rsync, which are flushing explicit times whenever they expect to see a change, a bit happier. So that's my vote.

#2 Updated by Alexandre Oliva about 9 years ago

I'd be happy enough if the mds didn't mess with the timestamps of files belonging to clients that are still up, and if sync made sure that any files not modified afterwards keep their synced timestamps. I suppose this is already the case on umount, but I haven't checked. Meaning it's sort of ok if the mds has to go through file probing when the client that was modifying the file is gone before the metadata can make it to the new mds.

Now, it would be nice if data writes and metadata updates could somehow be made transactional/atomic, in both the both-or-none and the serializable-versions senses, so that an mds could tell whether the metadata updates related to or following data writes have already been committed. Then it wouldn't need to probe or, if it is probing, it could tell that the mtime update for the latest write it finds is already reflected in the metadata, and a write whose corresponding metadata update didn't reach stable storage before the mds died would not go unnoticed by the subsequent mds.

Since we don't have osd&mds transactions AFAIK, one thought that occurred to me was to use osd transactions to write the data and an xattr in a single transaction, and have the recovering mds look for the xattr to know about pending metadata updates. The client would remove the xattr once the mds confirmed the metadata update hit the mds journal; if the client is gone, the recovering mds would do that.
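
A minimal sketch of what such a single-transaction write could look like from the client side, using a librados compound write operation; the xattr name and its payload encoding here are made up purely for illustration.

```cpp
#include <rados/librados.hpp>
#include <cstdint>
#include <ctime>
#include <string>

// Sketch only: write the data and a "pending mtime" marker xattr in one
// atomic OSD transaction; the xattr name and payload format are hypothetical.
int write_with_pending_mtime(librados::IoCtx& ioctx, const std::string& oid,
                             uint64_t off, librados::bufferlist& data,
                             time_t mtime_sec) {
  librados::bufferlist xv;
  xv.append(std::to_string(mtime_sec));

  librados::ObjectWriteOperation op;
  op.write(off, data);                 // the actual file data
  op.setxattr("pending.mtime", xv);    // pending-metadata marker (hypothetical)
  return ioctx.operate(oid, &op);      // both apply, or neither does
}
```

Once the mds confirms that the corresponding metadata update hit its journal, the client (or, if the client is gone, the recovering mds) would clear the marker again, e.g. with rmxattr.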

#3 Updated by Greg Farnum about 9 years ago

Transactions like that would be much more expensive than e.g. adding backtraces to file data objects, which I seem to remember you complaining bitterly about. ;) We're not going to do that.

#4 Updated by Alexandre Oliva about 9 years ago

What if writes carried the timestamp that ctime and mtime in the object are about to be set to, the osd set the object's mtime to that timestamp, and mds recovery compared the timestamps of the osd objects with the ctime of the inode to decide whether there have been changes after the latest mds metadata commit for the file? I think this would preserve all the intended properties: writes after a commit will update mtime and ctime, whereas mtime warps after writes won't get clobbered, because their ctime would necessarily be newer than that of any earlier writes, even if those writes only make it to the osds at a later time.
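
For what it's worth, the comparison being proposed might look roughly like this (illustrative names only, not real code): a probed object mtime is adopted only if it is newer than the inode ctime from the last committed metadata update.

```cpp
// Illustrative only: adopt the probed object mtime during recovery only when
// it proves there were writes after the last committed metadata update.
void maybe_adopt_probed_mtime(double inode_ctime, double& inode_mtime,
                              double probed_object_mtime) {
  if (probed_object_mtime > inode_ctime) {
    // A write reached the osds after the last metadata commit, so the
    // recovered mtime is genuinely newer than what the journal has.
    inode_mtime = probed_object_mtime;
  }
  // Otherwise keep the journaled mtime (e.g. the one rsync -a set); the
  // probed value refers to a write the metadata already accounts for.
}
```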

#5 Updated by Greg Farnum about 9 years ago

  • Tracker changed from Bug to Fix
  • Source changed from other to Community (user)

Mmmm, logging on the local object might work. More likely we'd do it by setting a special xattr (I'm not sure the OSD has interfaces for setting mtimes, and it might interfere with other bits if they're not synced), but that should be a pretty cheap way of storing and optionally retrieving the data we want.
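
The retrieval side of that idea could be as simple as a getxattr probe during recovery; again, the xattr name below is just a placeholder.

```cpp
#include <rados/librados.hpp>
#include <string>

// Sketch: during recovery, look for the special xattr instead of relying on
// osd-side mtimes. Returns true if a pending-metadata record was found.
bool read_pending_mtime(librados::IoCtx& ioctx, const std::string& oid,
                        std::string& out) {
  librados::bufferlist bl;
  int r = ioctx.getxattr(oid, "pending.mtime", bl);  // hypothetical name
  if (r < 0)
    return false;                       // no record (or error): nothing pending
  out.assign(bl.c_str(), bl.length());  // decode however the writer encoded it
  return true;
}
```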

That's a little more complicated, so I'm moving this into the Fix tracker where it can be picked up and scheduled later.

#6 Updated by Alexandre Oliva about 9 years ago

Now, maybe it would be more appropriate for mtime and size updates to be sent to the mds only after the osds ack a write. I had a ceph.ko client failure part-way through an rsync of a large tree, and later found out that many small files had the right size, but none of the data.

I realize this wouldn't cover the case of the client crashing between the write and the metadata update, so it wouldn't be a complete solution. But if logging on the object can address that case, say by writing an mtime update to such a log before the write, then we might as well issue the mtime update to the mds when we request write caps on the file.
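
To make the ordering concrete, here is a rough sketch of "metadata only after the osd ack"; the report_to_mds callback stands in for whatever the real client does through its cap machinery and is purely hypothetical.

```cpp
#include <rados/librados.hpp>
#include <cstdint>
#include <functional>
#include <string>

// Sketch: only after the osd acknowledges the write do we report the new
// size/mtime towards the mds, so a crash can no longer leave a file with
// the right size but none of the data.
void write_then_update_metadata(librados::IoCtx& ioctx, const std::string& oid,
                                librados::bufferlist& data,
                                std::function<void(uint64_t size)> report_to_mds) {
  librados::AioCompletion* c = librados::Rados::aio_create_completion();
  ioctx.aio_write_full(oid, c, data);
  c->wait_for_complete();                 // wait for the osd ack
  if (c->get_return_value() == 0)
    report_to_mds(data.length());         // only now expose the new size/mtime
  c->release();
}
```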

#7 Updated by Greg Farnum over 7 years ago

  • Category changed from 47 to Correctness/Safety
  • Component(FS) Client, Common/Protocol, MDS added
