Bug #1399

mds crash

Added by Sam Lang over 12 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Description

After running successfully with one active mds and two standbys, the active mds crashed, and on restart it crashes again. Both standbys also crashed and continue to crash on restart:

Initial active mds crash: http://fpaste.org/cyGt/
Second active mds crash on restart: http://fpaste.org/eiNt/

Initial standby mds crash: http://fpaste.org/qa61/

I've attached the core file generated by the second mds crash. Unfortunately, I don't have the core file from the first one.


Files

core.gz (6.16 MB) - Sam Lang, 08/16/2011 03:11 PM
Actions #1

Updated by Sage Weil over 12 years ago

  • Target version set to v0.34
Actions #2

Updated by Sage Weil over 12 years ago

  • Category set to 1

Sam, do you still have this cluster? Can you restart the mds with debug mds = 20 and attach the resulting log? There are 2 bugs here, and the logs will help solve the second (not obvious from a code review).
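
For reference, a minimal ceph.conf sketch for raising that debug level (assuming the standard [mds] section; restart the mds afterward so the setting takes effect):

    [mds]
        debug mds = 20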

As for the first one, what was your workload?

Thanks!

Actions #3

Updated by Sam Lang over 12 years ago

I removed the assertion: assert(in->is_head());

That allowed the mds servers to restart and complete recovery, and from initial tests the filesystem seems fine. If I add the assertion back into the code and restart the mds, I don't hit that assertion any more, so I can't give you the debug output.

I did try to debug this a little bit. The is_head() function verifies that the inode isn't a snapshot, but in this case the value of last.val is 16 (not -1, -2, or -3).
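
For context, a minimal self-contained sketch of what this kind of head-inode check looks like. The struct, field, and constant names below are illustrative assumptions, not the actual Ceph definitions:

    #include <cassert>
    #include <cstdint>

    // Hypothetical stand-in for the mds inode's snapid bookkeeping.
    // A "head" (live, non-snapshot) inode carries a reserved sentinel
    // in its 'last' field; an ordinary snapid value (such as the 16
    // observed here) marks a snapshotted version of the inode.
    struct snapid_t {
      uint64_t val;
    };

    // Sentinel for "not a snapshot"; the real constants live in the
    // Ceph headers and may differ from this illustrative value.
    static const uint64_t SNAP_HEAD = (uint64_t)-1;

    struct CInodeSketch {
      snapid_t last;
      bool is_head() const { return last.val == SNAP_HEAD; }
    };

    int main() {
      CInodeSketch in;
      in.last.val = 16;        // the value seen in this crash
      assert(in.is_head());    // fires, just as the mds assertion did
      return 0;
    }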

Based on the warning about "bad client_range", it looks like the errors here are all related to client leases being invalid. In that case, the assertion doesn't seem necessary (maybe it should just be a warning). If the mds just needs to be able to recover, the client leases could all just be revoked, couldn't they?

Actions #4

Updated by Sam Lang over 12 years ago

As for the original error, it does seem reproducible: create a snapshot of a directory using the mkdir system call (from a C program) with 0777 as the mode, then move another directory into the directory that was just snapshotted.
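
A minimal sketch of those reproduction steps as a C++ program. The mount point and directory names are placeholders; in CephFS, a snapshot is created by making a directory under the special .snap directory:

    #include <cstdio>
    #include <sys/stat.h>

    int main() {
      // Snapshot /mnt/ceph/dir by creating a directory inside its
      // .snap directory, using 0777 as the mode.
      if (mkdir("/mnt/ceph/dir/.snap/snap1", 0777) != 0)
        std::perror("mkdir snapshot");

      // Then move another directory into the freshly snapshotted one.
      if (std::rename("/mnt/ceph/other", "/mnt/ceph/dir/other") != 0)
        std::perror("rename");
      return 0;
    }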

Actions #5

Updated by Greg Farnum over 12 years ago

Hmmm. If last.val isn't -1, then it's not a head inode.

With the reproduction steps (we love reproducible bugs!) and the backtrace, I bet it's because the moved directory is getting pulled into the middle of an in-progress snapshot and that case isn't handled properly.

Actions #6

Updated by Sage Weil over 12 years ago

  • Status changed from New to In Progress

The original crash is fixed by commit:e98669ea69059e26e0c4aa72c46e0be5bfc96386.

Actions #7

Updated by Sage Weil over 12 years ago

  • Status changed from In Progress to Resolved

I'm not sure I can reproduce the second (replay) crash. Sam, next time you see one of these, please capture a replay log before working around it. Thanks!

Actions #8

Updated by Sage Weil over 12 years ago

The replay crash looks like the one fixed in commit:8c5e7dcf8cf7f3daa65eb9905. Yay!

Actions #9

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.34)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
