directories disappear across multiple rsyncs
Because of bug 1317, I upload about 1TB of files to my home cluster with one of the 3 osds disabled, so all 3-way replicated PGs are degraded. It has often happened that, after completing the upload, I enable the osd and run a final rsync into the mounted filesystem. I've observed that, although nothing changed in the source filesystem, some files are copied over in this final pass, sometimes small sub-trees, sometimes very large ones. Every time, it appears that entire sub-trees disappeared; I've never noticed it for a single file, but I can't rule out that possibility. One additional disturbing data point was that, when a very large directory tree disappeared and was uploaded again, the osd disk space used by it was not recovered: it appears that the subtree was disconnected from the root, but was not collected. After it happens, it doesn't help to stop the osd and restart the mds: the connection to the old directory was already lost/overwritten. Recently, I tried to remove the directory that overrode the other; in one case, I succeeded, and got a frag error note in the mon about discrepancies in sizes or somesuch right then; in another case, I could not remove the directory, for it was allegedly not empty.
#1 Updated by Sage Weil over 12 years ago
- Subject changed from directories disappear when mds restarted while propagating to new osd to directories disappear across multiple rsyncs
- Category set to 1
- Target version set to v0.33
Hmm, this sounds like an MDS issue, probably unrelated to the OSD degradation. Is this something you can reproduce?
If you can make it happen with mds logs enabled (debug mds = 20, debug ms = 1), along with the name of the directories that got recopied, it should be easy to identify the problem.
#3 Updated by Alexandre Oliva over 12 years ago
Some unintentional investigation (I'm still messing with my cluster big-time and couldn't turn much mds logging on) suggests it has nothing to do with OSDs still missing replicas: it happens in mds recovery. It looks like there's a window in which an lstat issued just as the old mds crashed gets a response from the taking-over mds that indicates failure to find an existing directory or file, in such a way that permits the file or directory to be created, overriding the pre-existing file or directory.
I've watched this with both the cfuse client and the in-kernel (3.0-libre) ceph.ko. However, the effects of the failure with the cfuse client appear to be permanent, whereas it appears to be possible to avoid the ill effects thanks to local caching in ceph.ko, if I notice the problem right away and restart the mds. The faulty directory/file overwriting appears to be discarded/merged during recovery, so that no loss occurs. But the faulty mds has to be restarted right away (within seconds), otherwise the effects become permanent.
Once I'm done with my messing around I'll try to duplicate the problem with mds logging enabled.
#7 Updated by Greg Farnum over 12 years ago
Looking at these symptoms again, I wonder if this could have been a result of the path_traverse changes we were making at that time. (Specifically commit:39d50c1362db1d86782a60a5714e088d9ef7deaa and those that followed on from it.) We've fixed several things in there since the beginning of July.
#8 Updated by Sage Weil about 12 years ago
added a workunit misc/multiple_rsyncs.sh to do a couple rsyncs and make sure no additional files are transferred. src is just /usr. will throw this into the qa suite and see if it ever pops up.
we may need a larger dataset to trigger... we'll see what happens with a smallish one first.
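For anyone who wants to run the same kind of check by hand, here is a minimal Python sketch of the idea (it is not the contents of misc/multiple_rsyncs.sh, which isn't shown here). It assumes the second rsync was run with --itemize-changes and its output captured as text; the helper name transferred_paths is made up for illustration. Any path it returns is a file the supposedly no-op pass touched.

```python
def transferred_paths(itemized_output):
    """Return the paths that an `rsync --itemize-changes` run reported as
    transferred or created (hypothetical helper, not part of the workunit)."""
    paths = []
    for line in itemized_output.splitlines():
        if not line.strip():
            continue
        code, _, path = line.partition(" ")
        # The first character of the itemized code is the update type:
        #   '<' or '>' : file content is being transferred
        #   'c'        : item is being created locally (e.g. a directory)
        #   'h'        : item is a hard link to another item
        #   '.'        : no content update (attributes may still change)
        #   '*'        : the rest of the field is a message, e.g. *deleting
        if code and code[0] in "<>ch":
            paths.append(path)
    return paths


if __name__ == "__main__":
    import sys
    # Example: rsync -a --itemize-changes /usr/ /mnt/ceph/usr/ | python check.py
    leftover = transferred_paths(sys.stdin.read())
    for p in leftover:
        print(p)
    sys.exit(1 if leftover else 0)
```

A non-zero exit status means the second pass found something to copy, which is the symptom reported in this bug.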
#17 Updated by Jonathan Dieter over 11 years ago
I've just run into this on 0.43 using the ceph kernel module in 3.2.7. My symptoms are that a repeated rsync from a non-ceph filesystem results in groups of files being updated, even though they haven't changed on the source.
I'm rsyncing our local Fedora mirror and have copied roughly 200GB so far. The second rsync started duplicating data at roughly 100GB and duplicated maybe 500MB (a very rough guess).
The only log message that came up on ceph -w is:
log 2012-03-06 20:14:07.195675 mds.0 10.10.1.51:6800/9700 2 : [ERR] loaded dup inode 100000105f7 [2,head] v128379 at /system_data/rpms/fedora/releases/13/Everything/x86_64/os/Packages/zhcon-0.2.6-15.fc13.x86_64.rpm, but inode 100000105f7.head v118536 already exists at /system_data/rpms/fedora/releases/13/Everything/x86_64/os/Packages/.zhcon-0.2.6-15.fc13.x86_64.rpm.bEC1Rz
There were far more files duplicated than just that one.
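In case it helps with triage, a small Python sketch for pulling the fields out of messages like the one above. The regular expression is derived only from this single log line, so treat the pattern as an assumption about the message format rather than a description of what the mds always emits; the names DUP_INODE_RE and parse_dup_inode are made up for illustration.

```python
import re

# Pattern inferred from the one "loaded dup inode" line quoted above
# (an assumption, not taken from the mds source).
DUP_INODE_RE = re.compile(
    r"loaded dup inode (?P<ino>[0-9a-f]+) \[[^\]]*\] v(?P<new_version>\d+) "
    r"at (?P<new_path>.+?), but inode (?P=ino)\.\S+ v(?P<old_version>\d+) "
    r"already exists at (?P<old_path>.+)$"
)


def parse_dup_inode(line):
    """Return a dict with ino, new_version, new_path, old_version and
    old_path, or None if the line is not a dup-inode message."""
    match = DUP_INODE_RE.search(line.rstrip("\n"))
    return match.groupdict() if match else None
```

In the line quoted above, the path that "already exists" is the dot-prefixed name with a random suffix, which looks like the temporary name rsync writes to before renaming a file into place, while the newly loaded path is the final name.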
#20 Updated by Alexandre Oliva over 9 years ago
I'm afraid this still occurs quite often with ceph 0.77 and ceph.ko 3.13.6-gnu. I have a slightly better understanding of the situation: files don't really disappear, they just get timestamps that don't match those of the source, even though rsync did set their mtime after writing out the file contents. That disk space remained in use after the second rsync was an artifact of the delay between moving the overwritten files to stray and actually removing their data.
I have observed this after writing trees with many small files without any mds restart, followed by a successful umount. Whatever it is, it's not a result of mds recovery problems. The failures I have often had during sessions after which I observed such incorrect timestamps are osd failures that result in the client retransmitting writes. I don't know whether this delayed write retry could be messing up the file metadata on the client, prevailing over the timestamp warp, but it's the theory that was least nonsensical to me, after some superficial review of the involved code paths. That said, stopping the rsync writer and running sync every now and again, so as to avoid the osd's suicide timeout, seems to have made the problem worse, not better! That made me wonder if it's some other interaction involving the timestamps in the lock object, because the timestamps often look right, even from another client, before umounting the writer, mounting it again, and running a new rsync that should be a no-op, but that finds many files with the wrong timestamps.
I haven't been smart enough to figure it out, but if someone can name some places to look, I'd be happy to instrument the mds or somesuch, so as to try to figure out what's causing files that had mtime set to long ago get current timestamps. Thanks,
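To narrow down which files are affected without running another rsync, something like the following Python sketch could be used to compare mtimes between the source tree and the ceph mount, for instance before and after the umount/mount cycle. The function name mtime_mismatches and the one-second default window are my own assumptions for illustration; it only reads metadata and does not touch file contents.

```python
import os


def mtime_mismatches(src_root, dst_root, window=1.0):
    """List files under src_root whose mtime differs from the same relative
    path under dst_root by more than `window` seconds (assumed default).

    Returns (relative_path, src_mtime, dst_mtime) tuples; dst_mtime is None
    when the file is missing on the destination side.
    """
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        for name in filenames:
            rel = os.path.normpath(os.path.join(rel_dir, name))
            src_mtime = os.lstat(os.path.join(src_root, rel)).st_mtime
            try:
                dst_mtime = os.lstat(os.path.join(dst_root, rel)).st_mtime
            except FileNotFoundError:
                mismatches.append((rel, src_mtime, None))
                continue
            if abs(src_mtime - dst_mtime) > window:
                mismatches.append((rel, src_mtime, dst_mtime))
    return mismatches
```

Running it once from the writing client before umount, once from another client, and once after remounting would show at which point the timestamps change.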
#21 Updated by Alexandre Oliva over 9 years ago
By “this” I meant files with different timestamps from what they were last set to, as in the first paragraph of comment #17. I don't know that this is quite the same problem as in the original bug report; it kind of fits, but it's hard to tell after so long...