Bug #5031 (closed)

mds/MDCache.cc: 5221: FAILED assert(reconnected_snaprealms.empty())

Added by Sage Weil almost 11 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-05-11T07:25:10.100 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/MDCache.cc: In function 'void MDCache::open_snap_parents()' thread 7f0696c91700 time 2013-05-11 07:25:28.131413
2013-05-11T07:25:10.100 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/MDCache.cc: 5221: FAILED assert(reconnected_snaprealms.empty())
2013-05-11T07:25:10.113 INFO:teuthology.task.ceph.mds.b-s-a.err: ceph version 0.61-249-gb5e9b56 (b5e9b56fc93dd4896c802aff1096430b523ad84c)
2013-05-11T07:25:10.113 INFO:teuthology.task.ceph.mds.b-s-a.err: 1: (MDCache::open_snap_parents()+0x9bf) [0x5ca44f]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 2: (MDCache::rejoin_gather_finish()+0x1e8) [0x5cd698]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 3: (MDCache::rejoin_send_rejoins()+0x108c) [0x5d346c]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 4: (MDS::rejoin_joint_start()+0x13c) [0x4ca3bc]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 5: (MDS::handle_mds_map(MMDSMap*)+0x3ad6) [0x4de806]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 6: (MDS::handle_core_message(Message*)+0xb1b) [0x4dfe9b]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 7: (MDS::_dispatch(Message*)+0x2f) [0x4e002f]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 8: (MDS::ms_dispatch(Message*)+0x1d3) [0x4e1ab3]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 9: (DispatchQueue::entry()+0x3f1) [0x830551]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ac9cd]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 11: (()+0x7e9a) [0x7f069afede9a]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 12: (clone()+0x6d) [0x7f06997a3ccd]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
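
For context, the assertion at MDCache.cc:5221 sits at the end of MDCache::open_snap_parents(), which expects every snap realm reported by reconnecting clients to have been matched to an inode in cache by the time rejoin finishes. The standalone C++ sketch below uses modeled types and values (it is not the Ceph source) to illustrate how a realm whose inode never makes it into cache is left behind and trips exactly this kind of assert.

#include <cassert>
#include <cstdint>
#include <map>
#include <set>

int main() {
  // ino -> (client -> snap seq), as reported by clients during reconnect (modeled).
  std::map<uint64_t, std::map<int64_t, uint64_t>> reconnected_snaprealms;
  // Inodes that journal replay / rejoin actually brought into cache (modeled).
  std::set<uint64_t> inodes_in_cache;

  reconnected_snaprealms[0x100][4100] = 2;  // e.g. a realm rooted at another MDS's mdsdir
  inodes_in_cache.insert(0x1);              // the root made it into cache; 0x100 did not

  // open_snap_parents()-style pass: realms whose inode is in cache get resolved.
  for (auto it = reconnected_snaprealms.begin(); it != reconnected_snaprealms.end(); ) {
    if (inodes_in_cache.count(it->first))
      it = reconnected_snaprealms.erase(it);
    else
      ++it;  // this realm's inode never showed up in cache
  }

  // With a leftover entry, this is the check that fails in the ticket (aborts here).
  assert(reconnected_snaprealms.empty());
  return 0;
}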

job was
ubuntu@teuthology:/a/teuthology-2013-05-11_01:00:38-fs-next-testing-basic/11284$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: b5b09be30cf99f9c699e825629f02e3bce555d44
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        debug client: 10
      global:
        mds inject delay type: osd mds
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject socket failures: 2500
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: fd901056831586e8135e28c8f4ba9c2ec44dfcf6
  s3tests:
    branch: next
  workunit:
    sha1: fd901056831586e8135e28c8f4ba9c2ec44dfcf6
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
  - mds.b-s-a
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mds_thrash: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/fsstress.sh

full logs!

Files

ceph-mds.0.log.xz (13.8 KB) - First node that I started, then it crashed. Walter Huf, 05/23/2013 10:35 AM
ceph-mds.1.log.xz (13.4 MB) - Second node that I started, it took over the primary role I guess. Walter Huf, 05/23/2013 10:35 AM
#1

Updated by Sage Weil almost 11 years ago

logs copied to logs/ subdir

#2

Updated by Zheng Yan almost 11 years ago

The items left in reconnected_snaprealms should be other MDSes' mdsdirs. I commented out that line while running the test.

#3

Updated by Walter Huf almost 11 years ago

I have also encountered this. Under Bobtail, I had it running with two active nodes and a passive node. Now I can only start one node, and any others fail. ceph status shows this mds information:
mdsmap e1591: 2/2/1 up {0=0=up:resolve,1=2=up:rejoin(laggy or crashed)}

I have tried "ceph mds tell 0 injectargs '--max_mds 1'", but it doesn't seem to change anything. I can't run "ceph mds stop 1" because that node doesn't stay up long enough.

#4

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High

Argh... I don't have a log after all.

Yan, dropping the assert avoids the crash, but it seems like the real issue is that we had caps on a replicated inode in another MDS's stray dir. To reconnect those, we need to open them up during rejoin...

#5

Updated by Sage Weil almost 11 years ago

Walter: can you produce a log? Set 'debug mds = 20' and 'debug ms = 1', restart the MDS, and wait for it to crash.

I have a patch in master that comments out the assert for now, which will get your cluster back up and running: commit:70c9851a55808b7a3d081f84dedb43c5484176b1
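
In other words, the workaround is just to drop the hard check at the end of open_snap_parents(). A minimal sketch of the shape of that change (not the verbatim diff of that commit):

// Sketch of the workaround's shape only, not the verbatim content of
// commit 70c9851a55808b7a3d081f84dedb43c5484176b1: the assertion that aborts
// the MDS on leftover realms is disabled, so rejoin can complete even when
// reconnected_snaprealms is non-empty.
//assert(reconnected_snaprealms.empty());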

#6

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to Need More Info
#7

Updated by Zheng Yan almost 11 years ago

Sage Weil wrote:

Argh... I don't have a log after all.

Yan, dropping the assert avoids the crash, but it seems like the real issue is that we had caps on a replicated inode in another MDS's stray dir. To reconnect those, we need to open them up during rejoin...

Caps on replicated inodes are exported to their auth MDS, so why should we open them up? I think the real issue is that the client sends every snap realm it has to the recovering MDS. For a given snap realm, if journal replay doesn't bring its inode into the cache and the recovering MDS doesn't have an auth subtree in it either, it will be left in reconnected_snaprealms.
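
To make that concrete, the condition Zheng describes can be written as a small predicate. This is an illustrative sketch under those stated assumptions, not the actual Ceph code:

#include <iostream>

// A realm reported by a reconnecting client only gets resolved if the recovering
// MDS ends up with the realm's inode in cache (via journal replay) or holds an
// auth subtree inside that realm (illustrative model of the comment above).
static bool realm_resolved(bool inode_in_cache_after_replay, bool has_auth_subtree_in_realm) {
  return inode_in_cache_after_replay || has_auth_subtree_in_realm;
}

int main() {
  // A realm rooted at another MDS's mdsdir: neither condition holds, so its entry
  // stays in reconnected_snaprealms and the assert fires at the end of rejoin.
  std::cout << std::boolalpha
            << realm_resolved(/*inode_in_cache_after_replay=*/false,
                              /*has_auth_subtree_in_realm=*/false)  // prints "false"
            << "\n";
  return 0;
}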

#8

Updated by Walter Huf almost 11 years ago

I have attached the logs from two nodes of my MDS cluster.
I started mds.0 first. When I started mds.1, mds.0 crashed.

#9

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to Resolved
#10

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added