Bug #1399
closed
Added by Sam Lang over 12 years ago.
Updated over 7 years ago.
Description
After running successfully with one active mds and two standbys, the active mds crashed, and it crashes again on every restart. Both standbys also crashed and continue to crash on restart:
Initial active mds crash: http://fpaste.org/cyGt/
Second active mds crash on restart: http://fpaste.org/eiNt/
Initial standby mds crash: http://fpaste.org/qa61/
Attached the core file generated by the second mds crash. I don't have the core file from the first one unfortunately.
Files
- Target version set to v0.34
Sam, do you still have this cluster? Can you restart the mds with debug mds = 20 and attach the resulting log? There are 2 bugs here, and the logs will help solve the second (not obvious from a code review).
As for the first one, what was your workload?
Thanks!
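For reference, the debug setting requested above could go into ceph.conf roughly as follows. This is only a sketch: placing it under the [mds] section (so it applies to every mds daemon) is an assumption about how this cluster is configured, and the mds has to be restarted for it to take effect.

```ini
; Sketch only: the [mds] section placement is an assumption,
; the option itself is the one requested in the comment above.
[mds]
    debug mds = 20
```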
I removed the assertion: assert(in->is_head());
That allowed the mds servers to restart and complete recovery, and from initial tests the filesystem seems fine. If I add the assertion back into the code and restart the mds, I don't hit that assertion any more, so I can't give you the debug output.
I did try to debug this a little bit. The is_head() function verifies that the inode isn't a snapshot, but in this case the value of last.val is 16 (not -1, -2, or -3).
Based on the warning about "bad client_range", it looks like the errors here are all related to client leases being invalid. In that case, it doesn't seem like that assertion is necessary (maybe just a warning). If the mds just needs to be able to recover, the client leases could all just be revoked, couldn't they?
As for the original error, it does seem reproducible by creating a snapshot of a directory using the mkdir system call (from a C program) with 0777 as the mask, and then moving another directory into the directory that had just been snapshotted.
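The reproduction steps above might be sketched in C roughly like this. It is not the original test program: the function names, the paths passed in, and the use of the `.snap` pseudo-directory (the usual CephFS snapshot directory name, which can be configured differently) are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: build "<dir>/.snap/<name>" into buf.
 * Assumes the snapshot directory is called ".snap".
 * Returns 0 on success, -1 if the result does not fit in buf. */
int build_snap_path(char *buf, size_t size, const char *dir, const char *name)
{
    assert(buf != NULL && dir != NULL && name != NULL);
    int n = snprintf(buf, size, "%s/.snap/%s", dir, name);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical reproduction: snapshot `dir` via mkdir() with mode 0777,
 * then move the directory `other` into `dir` under the name `new_name`.
 * Returns 0 if both calls succeed, -1 otherwise. */
int snapshot_then_move(const char *dir, const char *snap_name,
                       const char *other, const char *new_name)
{
    char snap[4096];
    char dest[4096];

    if (build_snap_path(snap, sizeof snap, dir, snap_name) != 0)
        return -1;

    int n = snprintf(dest, sizeof dest, "%s/%s", dir, new_name);
    if (n < 0 || (size_t)n >= sizeof dest)
        return -1;

    /* Step 1: create the snapshot with the mkdir system call. */
    if (mkdir(snap, 0777) != 0) {
        fprintf(stderr, "mkdir %s: %s\n", snap, strerror(errno));
        return -1;
    }

    /* Step 2: move another directory into the just-snapshotted one. */
    if (rename(other, dest) != 0) {
        fprintf(stderr, "rename %s -> %s: %s\n", other, dest, strerror(errno));
        return -1;
    }
    return 0;
}
```

Calling something like `snapshot_then_move("/mnt/ceph/a", "snap1", "/mnt/ceph/b", "b")` on a mounted filesystem would perform the two steps in order; the mount point and directory names here are placeholders, not the ones used on the affected cluster.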
Hmmm. If last.val isn't -1 then it's not a head inode.
With the reproduction steps (we love reproducible bugs!) and the backtrace, I bet it's because the moved directory is getting pulled into the middle of a snapshot-in-progress and that case isn't being handled properly.
- Status changed from New to In Progress
original crash is fixed by commit:e98669ea69059e26e0c4aa72c46e0be5bfc96386
- Status changed from In Progress to Resolved
I'm not sure I can reproduce the second (replay) crash. Sam, next time you see one of these, please capture a replay log before working around it. Thanks!
replay crash looks like the one fixed in commit:8c5e7dcf8cf7f3daa65eb9905, yay!
- Project changed from Ceph to CephFS
- Category deleted (1)
- Target version deleted (v0.34)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.