Bug #15920

mds/StrayManager.cc: 520: FAILED assert(dnl->is_primary())

Added by Daniel van Ham Colchete almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
Start date:
05/18/2016
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
jewel
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

I'm running Ceph 10.2.1 on Ubuntu 14.04.4 LTS with kernel 4.4, using CephFS in production, and I'm constantly hitting the following assertion failure:

mds/StrayManager.cc: 520: FAILED assert(dnl->is_primary())

ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ce4f357d3b]
2: (StrayManager::__eval_stray(CDentry*, bool)+0x15f) [0x55ce4f0dc7cf]
3: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x55ce4f0dd32e]
4: (Server::_rename_finish(std::shared_ptr<MDRequestImpl>&, CDentry*, CDentry*, CDentry*)+0x1ed) [0x55ce4f00358d]
5: (MDSInternalContextBase::complete(int)+0x1db) [0x55ce4f1ca54b]
6: (MDSInternalContextBase::complete(int)+0x1db) [0x55ce4f1ca54b]
7: (C_MDL_Flushed::finish(int)+0x13) [0x55ce4f1de9d3]
8: (MDSIOContextBase::complete(int)+0x91) [0x55ce4f1ca841]
9: (Finisher::finisher_thread_entry()+0x206) [0x55ce4f28e206]
10: (()+0x8182) [0x7f9a0b5d2182]
11: (clone()+0x6d) [0x7f9a09b2947d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues

Copied to fs - Backport #16041: jewel: mds/StrayManager.cc: 520: FAILED assert(dnl->is_primary()) Resolved

History

#1 Updated by Greg Farnum almost 3 years ago

  • Priority changed from Normal to High

Can you set "debug mds = 20" in your MDS, turn it on, and then upload the full log of the crash with ceph-post-file? That should get us the info we need to see how we're going down a bad path.
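For reference, one way to do this is to raise the logging level at runtime and then upload the resulting log; the daemon id `mds.a` and the log path below are examples, not values from this report:

```shell
# Raise MDS logging verbosity on a running daemon (daemon id is an example):
ceph tell mds.a injectargs '--debug-mds 20'

# Or persist it in ceph.conf and restart the MDS:
#   [mds]
#   debug mds = 20

# After reproducing the crash, upload the log for the developers:
ceph-post-file /var/log/ceph/ceph-mds.a.log
```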

#2 Updated by Daniel van Ham Colchete almost 3 years ago

Greg Farnum wrote:

Can you set "debug mds = 20" in your MDS, turn it on, and then upload the full log of the crash with ceph-post-file? That should get us the info we need to see how we're going down a bad path.

Greg,

the log got to 30GB, I just sent the gzipped file:
ceph-post-file: 317c676e-ed06-41cc-bdd5-f799fac91cb0

Let me know any other way I can help.

Best,
Daniel Colchete

#3 Updated by Greg Farnum almost 3 years ago

Hmm, the gzip of the ceph-mds log is corrupted on this end. If it's valid on your side, could you re-post it please? :/

#4 Updated by Greg Farnum almost 3 years ago

Had to do a manual transfer, but it's unzipped in 86238dec-a35e-49f6-91f3-9efa496d59b7 now.

#5 Updated by Greg Farnum almost 3 years ago

  • Status changed from New to In Progress

Looks like we're renaming a->b, b exists, and the inode at b has a "remote" parent. The StrayManager is asserting out because it's not supposed to get called on something which isn't primary. I'm not sure which piece of code should be behaving differently, though...I don't think we should even be creating a stray linkage for an inode which isn't linked in to the dying dentry.

#6 Updated by Zheng Yan almost 3 years ago

Where is the uploaded log file?

#7 Updated by Sage Weil almost 3 years ago

My guess is that b was just created and the check is doing linkage instead of projected linkage or something. e.g.,

touch a b
ln c a
mv b c

or similar?

#8 Updated by Daniel van Ham Colchete almost 3 years ago

From what I can see here, this happens when an e-mail arrives on Dovecot through LMTP. I was doing migrations for hours last night, reading from CephFS at max speed, and not a single crash.

Also, it doesn't happen every time an e-mail arrives: e-mails are arriving here at tens of thousands per hour, and the crash happened only 14 times in the last 6 hours on one of my clusters.

Sage Weil wrote:

My guess is that b was just created and the check is doing linkage instead of projected linkage or something. e.g.,

touch a b
ln c a
mv b c

or similar?

#9 Updated by Daniel van Ham Colchete almost 3 years ago

By "I was doing migrations" I mean that LMTP was offline, so no new e-mail would arrive and slow down the process.

Daniel van Ham Colchete wrote:

From what I can see here, this happens when an e-mail arrives on Dovecot through LMTP. I was doing migrations for hours last night, reading from CephFS at max speed, and not a single crash.

Also, it doesn't happen every time an e-mail arrives: e-mails are arriving here at tens of thousands per hour, and the crash happened only 14 times in the last 6 hours on one of my clusters.

#10 Updated by Daniel van Ham Colchete almost 3 years ago

Correction: it happened a little more than 700 times today.

#11 Updated by Zheng Yan almost 3 years ago

StrayManager::eval_stray() is called after Server::respond_to_request() drops locks, so it can race with StrayManager::reintegrate_stray().

https://github.com/ceph/ceph/pull/9260

#12 Updated by John Spray almost 3 years ago

  • Status changed from In Progress to Need Review

#13 Updated by John Spray almost 3 years ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel

#14 Updated by Nathan Cutler almost 3 years ago

  • Copied to Backport #16041: jewel: mds/StrayManager.cc: 520: FAILED assert(dnl->is_primary()) added

#15 Updated by Daniel van Ham Colchete almost 3 years ago

Good morning everyone!

Considering that a backport is done, though not merged yet, is there a way for me to get a gitbuilder build to test the fix? I would like to move back to CephFS quickly, as RBD+NFS has lower performance.

Best,
Daniel

#16 Updated by John Spray almost 3 years ago

I've pushed a jewel-15920 branch for you with the fix cherry-picked onto it. (I don't usually do this, but it's fairly severe and is probably missing the 10.2.2 release.) You should find it built on the gitbuilder of your choice soon/now.

#17 Updated by Daniel van Ham Colchete almost 3 years ago

John, thank you very much! Yeah, I saw that it was going to miss 10.2.2. Thank you for making this exception! I'll start testing today and will gradually increase the load over the following days. I'll report anything I find here.

#18 Updated by Greg Farnum over 2 years ago

  • Category changed from 47 to Correctness/Safety
  • Component(FS) MDS added

#19 Updated by Loic Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved
