Bug #1399

mds crash

Added by Sam Lang over 12 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Description

After running successfully with one active mds and two standbys, the active mds crashed, and on restart it crashes again. Both standbys also crashed and continue to crash on restart:

Initial active mds crash: http://fpaste.org/cyGt/
Second active mds crash on restart: http://fpaste.org/eiNt/

Initial standby mds crash: http://fpaste.org/qa61/

I've attached the core file generated by the second mds crash. Unfortunately, I don't have the core file from the first one.


Files

core.gz (6.16 MB) - Sam Lang, 08/16/2011 03:11 PM
Actions #1

Updated by Sage Weil over 12 years ago

  • Target version set to v0.34
Actions #2

Updated by Sage Weil over 12 years ago

  • Category set to 1

Sam, do you still have this cluster? Can you restart the mds with debug mds = 20 and attach the resulting log? There are 2 bugs here, and the logs will help solve the second (not obvious from a code review).
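
For reference, a minimal ceph.conf sketch for raising that debug level (assuming the standard [mds] section; restart the mds afterward so the setting takes effect):

    [mds]
        debug mds = 20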

As for the first one, what was your workload?

Thanks!

Actions #3

Updated by Sam Lang over 12 years ago

I removed the assertion: assert(in->is_head());

That allowed the mds servers to restart and complete recovery, and from initial tests the filesystem seems fine. If I add the assertion back into the code and restart the mds, I don't hit that assertion any more, so I can't give you the debug output.

I did try to debug this a little bit. The is_head() function verifies that the inode isn't a snapshot, but in this case the value of last.val is 16 (not -1, -2, or -3).
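
For context, a minimal self-contained sketch of what this kind of head-inode check looks like. The struct, field, and constant names below are illustrative assumptions, not the actual Ceph definitions:

    #include <cassert>
    #include <cstdint>

    // Hypothetical stand-in for the mds inode's snapid bookkeeping.
    // A "head" (live, non-snapshot) inode carries a reserved sentinel
    // in its 'last' field; an ordinary snapid value (such as the 16
    // observed here) marks a snapshotted version of the inode.
    struct snapid_t {
      uint64_t val;
    };

    // Sentinel for "not a snapshot"; the real constants live in the
    // Ceph headers and may differ from this illustrative value.
    static const uint64_t SNAP_HEAD = (uint64_t)-1;

    struct CInodeSketch {
      snapid_t last;
      bool is_head() const { return last.val == SNAP_HEAD; }
    };

    int main() {
      CInodeSketch in;
      in.last.val = 16;        // the value seen in this crash
      assert(in.is_head());    // fires, just as the mds assertion did
      return 0;
    }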

Based on the warning about "bad client_range", it looks like the errors here are all related to client leases being invalid. In that case, the assertion doesn't seem necessary (maybe it should just be a warning). If the mds just needs to be able to recover, the client leases could all just be revoked, couldn't they?

Actions #4

Updated by Sam Lang over 12 years ago

As for the original error, it does seem reproducible: create a snapshot of a directory using the mkdir system call (from a C program) with 0777 as the mode, then move another directory into the directory that was just snapshotted.
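
A minimal sketch of those reproduction steps as a C++ program. The mount point and directory names are placeholders; in CephFS, a snapshot is created by making a directory under the special .snap directory:

    #include <cstdio>
    #include <sys/stat.h>

    int main() {
      // Snapshot /mnt/ceph/dir by creating a directory inside its
      // .snap directory, using 0777 as the mode.
      if (mkdir("/mnt/ceph/dir/.snap/snap1", 0777) != 0)
        std::perror("mkdir snapshot");

      // Then move another directory into the freshly snapshotted one.
      if (std::rename("/mnt/ceph/other", "/mnt/ceph/dir/other") != 0)
        std::perror("rename");
      return 0;
    }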

Actions #5

Updated by Greg Farnum over 12 years ago

Hmmm. If last.val isn't -1, then it's not a head inode.

With the reproduction steps (we love reproducible bugs!) and the backtrace, I bet it's because the moved directory is getting pulled into the middle of an in-progress snapshot and that case isn't handled properly.

Actions #6

Updated by Sage Weil over 12 years ago

  • Status changed from New to In Progress

The original crash is fixed by commit:e98669ea69059e26e0c4aa72c46e0be5bfc96386.

Actions #7

Updated by Sage Weil over 12 years ago

  • Status changed from In Progress to Resolved

I'm not sure I can reproduce the second (replay) crash. Sam, next time you see one of these, please capture a replay log before working around it. Thanks!

Actions #8

Updated by Sage Weil over 12 years ago

The replay crash looks like the one fixed in commit:8c5e7dcf8cf7f3daa65eb9905. Yay!

Actions #9

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.34)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
