Bug #15920

mds/StrayManager.cc: 520: FAILED assert(dnl->is_primary())

Added by Daniel van Ham Colchete almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
Start date:
05/18/2016
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
jewel
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

I'm running Ceph 10.2.1 on Ubuntu 14.04.4 LTS with kernel 4.4, using CephFS in production, and I'm constantly hitting the following assertion failure:

mds/StrayManager.cc: 520: FAILED assert(dnl->is_primary())

ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ce4f357d3b]
2: (StrayManager::__eval_stray(CDentry*, bool)+0x15f) [0x55ce4f0dc7cf]
3: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x55ce4f0dd32e]
4: (Server::_rename_finish(std::shared_ptr<MDRequestImpl>&, CDentry*, CDentry*, CDentry*)+0x1ed) [0x55ce4f00358d]
5: (MDSInternalContextBase::complete(int)+0x1db) [0x55ce4f1ca54b]
6: (MDSInternalContextBase::complete(int)+0x1db) [0x55ce4f1ca54b]
7: (C_MDL_Flushed::finish(int)+0x13) [0x55ce4f1de9d3]
8: (MDSIOContextBase::complete(int)+0x91) [0x55ce4f1ca841]
9: (Finisher::finisher_thread_entry()+0x206) [0x55ce4f28e206]
10: (()+0x8182) [0x7f9a0b5d2182]
11: (clone()+0x6d) [0x7f9a09b2947d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues

Copied to fs - Backport #16041: jewel: mds/StrayManager.cc: 520: FAILED assert(dnl->is_primary()) Resolved

History

#1 Updated by Greg Farnum almost 3 years ago

  • Priority changed from Normal to High

Can you set "debug mds = 20" in your MDS, turn it on, and then upload the full log of the crash with ceph-post-file? That should get us the info we need to see how we're going down a bad path.
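For reference, one way to do this is to raise the logging level at runtime and then upload the resulting log; the daemon id `mds.a` and the log path below are examples, not values from this report:

```shell
# Raise MDS logging verbosity on a running daemon (daemon id is an example):
ceph tell mds.a injectargs '--debug-mds 20'

# Or persist it in ceph.conf and restart the MDS:
#   [mds]
#   debug mds = 20

# After reproducing the crash, upload the log for the developers:
ceph-post-file /var/log/ceph/ceph-mds.a.log
```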

#2 Updated by Daniel van Ham Colchete almost 3 years ago

Greg Farnum wrote:

Can you set "debug mds = 20" in your MDS, turn it on, and then upload the full log of the crash with ceph-post-file? That should get us the info we need to see how we're going down a bad path.

Greg,

the log got to 30GB, I just sent the gzipped file:
ceph-post-file: 317c676e-ed06-41cc-bdd5-f799fac91cb0

Let me know any other way I can help.

Best,
Daniel Colchete

#3 Updated by Greg Farnum almost 3 years ago

Hmm, the gzip of the ceph-mds log is corrupted on this end. If it's valid on your side, could you re-post it please? :/

#4 Updated by Greg Farnum almost 3 years ago

Had to do a manual transfer, but it's unzipped in 86238dec-a35e-49f6-91f3-9efa496d59b7 now.

#5 Updated by Greg Farnum almost 3 years ago

  • Status changed from New to In Progress

Looks like we're renaming a->b, b exists, and the inode at b has a "remote" parent. The StrayManager is asserting out because it's not supposed to get called on something which isn't primary. I'm not sure which piece of code should be behaving differently, though...I don't think we should even be creating a stray linkage for an inode which isn't linked in to the dying dentry.

#6 Updated by Zheng Yan almost 3 years ago

Where is the uploaded log file?

#7 Updated by Sage Weil almost 3 years ago

My guess is that b was just created and the check is doing linkage instead of projected linkage or something. e.g.,

touch a b
ln c a
mv b c

or similar?

#8 Updated by Daniel van Ham Colchete almost 3 years ago

From what I can see here, this happens when an e-mail arrives on Dovecot through LMTP. I was doing migrations for hours last night, reading from CephFS at max speed, and not a single crash.

Also, it doesn't happen every time an e-mail arrives: e-mails are arriving here at tens of thousands per hour, and the crash happened only 14 times in the last 6 hours on one of my clusters.

Sage Weil wrote:

My guess is that b was just created and the check is doing linkage instead of projected linkage or something. e.g.,

touch a b
ln c a
mv b c

or similar?

#9 Updated by Daniel van Ham Colchete almost 3 years ago

By "I was doing migrations" I mean that LMTP was offline, so no new e-mail would arrive and slow down the process.

Daniel van Ham Colchete wrote:

From what I can see here, this happens when an e-mail arrives on Dovecot through LMTP. I was doing migrations for hours last night, reading from CephFS at max speed, and not a single crash.

Also, it doesn't happen every time an e-mail arrives: e-mails are arriving here at tens of thousands per hour, and the crash happened only 14 times in the last 6 hours on one of my clusters.

#10 Updated by Daniel van Ham Colchete almost 3 years ago

Correction: it happened a little more than 700 times today.

#11 Updated by Zheng Yan almost 3 years ago

StrayManager::eval_stray() is called after Server::respond_to_request() drops locks, so it can race with StrayManager::reintegrate_stray().

https://github.com/ceph/ceph/pull/9260

#12 Updated by John Spray almost 3 years ago

  • Status changed from In Progress to Need Review

#13 Updated by John Spray almost 3 years ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel

#14 Updated by Nathan Cutler almost 3 years ago

  • Copied to Backport #16041: jewel: mds/StrayManager.cc: 520: FAILED assert(dnl->is_primary()) added

#15 Updated by Daniel van Ham Colchete almost 3 years ago

Good morning everyone!

Considering that a backport is done, though not merged yet, is there a way for me to get a gitbuilder build to test the fix? I would like to move back to CephFS quickly, as RBD+NFS has lower performance.

Best,
Daniel

#16 Updated by John Spray almost 3 years ago

I've pushed a jewel-15920 branch for you with the fix cherry-picked onto it. (I don't usually do this, but it's fairly severe and is probably missing the 10.2.2 release.) You should find it built on the gitbuilder of your choice soon/now.

#17 Updated by Daniel van Ham Colchete almost 3 years ago

John, thank you very much! Yeah, I saw that it was going to miss 10.2.2. Thank you for making this exception! I'll start testing today and will gradually increase the load over the following days. I'll report anything I find here.

#18 Updated by Greg Farnum over 2 years ago

  • Category changed from 47 to Correctness/Safety
  • Component(FS) MDS added

#19 Updated by Loic Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved
