Bug #5031 (closed)

mds/MDCache.cc: 5221: FAILED assert(reconnected_snaprealms.empty())

Added by Sage Weil almost 11 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-05-11T07:25:10.100 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/MDCache.cc: In function 'void MDCache::open_snap_parents()' thread 7f0696c91700 time 2013-05-11 07:25:28.131413
2013-05-11T07:25:10.100 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/MDCache.cc: 5221: FAILED assert(reconnected_snaprealms.empty())
2013-05-11T07:25:10.113 INFO:teuthology.task.ceph.mds.b-s-a.err: ceph version 0.61-249-gb5e9b56 (b5e9b56fc93dd4896c802aff1096430b523ad84c)
2013-05-11T07:25:10.113 INFO:teuthology.task.ceph.mds.b-s-a.err: 1: (MDCache::open_snap_parents()+0x9bf) [0x5ca44f]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 2: (MDCache::rejoin_gather_finish()+0x1e8) [0x5cd698]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 3: (MDCache::rejoin_send_rejoins()+0x108c) [0x5d346c]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 4: (MDS::rejoin_joint_start()+0x13c) [0x4ca3bc]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 5: (MDS::handle_mds_map(MMDSMap*)+0x3ad6) [0x4de806]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 6: (MDS::handle_core_message(Message*)+0xb1b) [0x4dfe9b]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 7: (MDS::_dispatch(Message*)+0x2f) [0x4e002f]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 8: (MDS::ms_dispatch(Message*)+0x1d3) [0x4e1ab3]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 9: (DispatchQueue::entry()+0x3f1) [0x830551]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ac9cd]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 11: (()+0x7e9a) [0x7f069afede9a]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 12: (clone()+0x6d) [0x7f06997a3ccd]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
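
For context, the assertion at MDCache.cc:5221 sits at the end of MDCache::open_snap_parents(), which expects every snap realm reported by reconnecting clients to have been matched to an inode in cache by the time rejoin finishes. The standalone C++ sketch below uses modeled types and values (it is not the Ceph source) to illustrate how a realm whose inode never makes it into cache is left behind and trips exactly this kind of assert.

#include <cassert>
#include <cstdint>
#include <map>
#include <set>

int main() {
  // ino -> (client -> snap seq), as reported by clients during reconnect (modeled).
  std::map<uint64_t, std::map<int64_t, uint64_t>> reconnected_snaprealms;
  // Inodes that journal replay / rejoin actually brought into cache (modeled).
  std::set<uint64_t> inodes_in_cache;

  reconnected_snaprealms[0x100][4100] = 2;  // e.g. a realm rooted at another MDS's mdsdir
  inodes_in_cache.insert(0x1);              // the root made it into cache; 0x100 did not

  // open_snap_parents()-style pass: realms whose inode is in cache get resolved.
  for (auto it = reconnected_snaprealms.begin(); it != reconnected_snaprealms.end(); ) {
    if (inodes_in_cache.count(it->first))
      it = reconnected_snaprealms.erase(it);
    else
      ++it;  // this realm's inode never showed up in cache
  }

  // With a leftover entry, this is the check that fails in the ticket (aborts here).
  assert(reconnected_snaprealms.empty());
  return 0;
}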

job was
ubuntu@teuthology:/a/teuthology-2013-05-11_01:00:38-fs-next-testing-basic/11284$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: b5b09be30cf99f9c699e825629f02e3bce555d44
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        debug client: 10
      global:
        mds inject delay type: osd mds
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject socket failures: 2500
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: fd901056831586e8135e28c8f4ba9c2ec44dfcf6
  s3tests:
    branch: next
  workunit:
    sha1: fd901056831586e8135e28c8f4ba9c2ec44dfcf6
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
  - mds.b-s-a
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mds_thrash: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/fsstress.sh

full logs!

Files

ceph-mds.0.log.xz (13.8 KB) - First node that I started, then it crashed. Walter Huf, 05/23/2013 10:35 AM
ceph-mds.1.log.xz (13.4 MB) - Second node that I started, it took over the primary role I guess. Walter Huf, 05/23/2013 10:35 AM
#1

Updated by Sage Weil almost 11 years ago

logs copied to logs/ subdir

#2

Updated by Zheng Yan almost 11 years ago

The items left in reconnected_snaprealms should be other MDSes' mdsdirs. I commented out that line while running the test.

#3

Updated by Walter Huf almost 11 years ago

I have also encountered this. Under Bobtail, I had it running with two active nodes and a passive node. Now I can only start one node, and any others fail. ceph status shows this mds information:
mdsmap e1591: 2/2/1 up {0=0=up:resolve,1=2=up:rejoin(laggy or crashed)}

I have tried "ceph mds tell 0 injectargs '--max_mds 1'", but it doesn't seem to change anything. I can't run "ceph mds stop 1" because that node doesn't stay up long enough.

#4

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High

Argh... I don't have a log after all.

Yan, dropping the assert avoids the crash, but it seems like the real issue is that we had caps on a replicated inode in another MDS's stray dir. To reconnect those, we need to open them up during rejoin...

#5

Updated by Sage Weil almost 11 years ago

Walter: can you produce a log? Set 'debug mds = 20' and 'debug ms = 1', restart the MDS, and wait for it to crash.

I have a patch in master that comments out the assert for now, which will get your cluster back up and running: commit:70c9851a55808b7a3d081f84dedb43c5484176b1
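
In other words, the workaround is just to drop the hard check at the end of open_snap_parents(). A minimal sketch of the shape of that change (not the verbatim diff of that commit):

// Sketch of the workaround's shape only, not the verbatim content of
// commit 70c9851a55808b7a3d081f84dedb43c5484176b1: the assertion that aborts
// the MDS on leftover realms is disabled, so rejoin can complete even when
// reconnected_snaprealms is non-empty.
//assert(reconnected_snaprealms.empty());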

#6

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to Need More Info
#7

Updated by Zheng Yan almost 11 years ago

Sage Weil wrote:

Argh... I don't have a log after all.

Yan, dropping the assert avoids the crash, but it seems like the real issue is that we had caps on a replicated inode in another MDS's stray dir. To reconnect those, we need to open them up during rejoin...

Caps on replicated inodes are exported to their auth MDS, so why should we open them up? I think the real issue is that the client sends every snap realm it has to the recovering MDS. For a given snap realm, if journal replay doesn't bring its inode into the cache and the recovering MDS doesn't have an auth subtree in it either, it will be left in reconnected_snaprealms.
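
To make that concrete, the condition Zheng describes can be written as a small predicate. This is an illustrative sketch under those stated assumptions, not the actual Ceph code:

#include <iostream>

// A realm reported by a reconnecting client only gets resolved if the recovering
// MDS ends up with the realm's inode in cache (via journal replay) or holds an
// auth subtree inside that realm (illustrative model of the comment above).
static bool realm_resolved(bool inode_in_cache_after_replay, bool has_auth_subtree_in_realm) {
  return inode_in_cache_after_replay || has_auth_subtree_in_realm;
}

int main() {
  // A realm rooted at another MDS's mdsdir: neither condition holds, so its entry
  // stays in reconnected_snaprealms and the assert fires at the end of rejoin.
  std::cout << std::boolalpha
            << realm_resolved(/*inode_in_cache_after_replay=*/false,
                              /*has_auth_subtree_in_realm=*/false)  // prints "false"
            << "\n";
  return 0;
}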

#8

Updated by Walter Huf almost 11 years ago

I have attached the logs from two nodes of my MDS cluster.
I started mds.0 first. When I started mds.1, mds.0 crashed.

#9

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to Resolved
#10

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added