Bug #5031
Status: Closed
mds/MDCache.cc: 5221: FAILED assert(reconnected_snaprealms.empty())
Description
2013-05-11T07:25:10.100 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/MDCache.cc: In function 'void MDCache::open_snap_parents()' thread 7f0696c91700 time 2013-05-11 07:25:28.131413
2013-05-11T07:25:10.100 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/MDCache.cc: 5221: FAILED assert(reconnected_snaprealms.empty())
2013-05-11T07:25:10.113 INFO:teuthology.task.ceph.mds.b-s-a.err: ceph version 0.61-249-gb5e9b56 (b5e9b56fc93dd4896c802aff1096430b523ad84c)
2013-05-11T07:25:10.113 INFO:teuthology.task.ceph.mds.b-s-a.err: 1: (MDCache::open_snap_parents()+0x9bf) [0x5ca44f]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 2: (MDCache::rejoin_gather_finish()+0x1e8) [0x5cd698]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 3: (MDCache::rejoin_send_rejoins()+0x108c) [0x5d346c]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 4: (MDS::rejoin_joint_start()+0x13c) [0x4ca3bc]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 5: (MDS::handle_mds_map(MMDSMap*)+0x3ad6) [0x4de806]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 6: (MDS::handle_core_message(Message*)+0xb1b) [0x4dfe9b]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 7: (MDS::_dispatch(Message*)+0x2f) [0x4e002f]
2013-05-11T07:25:10.114 INFO:teuthology.task.ceph.mds.b-s-a.err: 8: (MDS::ms_dispatch(Message*)+0x1d3) [0x4e1ab3]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 9: (DispatchQueue::entry()+0x3f1) [0x830551]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ac9cd]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 11: (()+0x7e9a) [0x7f069afede9a]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: 12: (clone()+0x6d) [0x7f06997a3ccd]
2013-05-11T07:25:10.115 INFO:teuthology.task.ceph.mds.b-s-a.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The job was:
ubuntu@teuthology:/a/teuthology-2013-05-11_01:00:38-fs-next-testing-basic/11284$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: b5b09be30cf99f9c699e825629f02e3bce555d44
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        debug client: 10
      global:
        mds inject delay type: osd mds
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject socket failures: 2500
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: fd901056831586e8135e28c8f4ba9c2ec44dfcf6
  s3tests:
    branch: next
  workunit:
    sha1: fd901056831586e8135e28c8f4ba9c2ec44dfcf6
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
  - mds.b-s-a
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mds_thrash: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/fsstress.sh
full logs!
Files
Updated by Zheng Yan almost 11 years ago
The items left in reconnected_snaprealms should be other MDSes' mdsdirs. I commented out that line when running the test.
Updated by Walter Huf almost 11 years ago
I have also encountered this. Under Bobtail, I had it running with 2 active nodes and a passive node. Now I can only start one node, and any others fail. ceph status shows this mds information:
mdsmap e1591: 2/2/1 up {0=0=up:resolve,1=2=up:rejoin(laggy or crashed)}
I have tried "ceph mds tell 0 injectargs '--max_mds 1'", but it doesn't seem to change anything. I can't run "ceph mds stop 1" because that node doesn't stay up long enough.
Updated by Sage Weil almost 11 years ago
- Priority changed from Urgent to High
Argh... I don't have a log after all.
Yan, dropping the assert avoids the crash, but it seems like the real issue is that we had caps on a replicated inode in another MDS's stray dir. To reconnect those, we need to open them up during rejoin...
Updated by Sage Weil almost 11 years ago
Walter: can you produce a log? 'debug mds = 20', 'debug ms = 1', restart the mds and wait for it to crash.
I have a patch in master that comments out the assert for now, which will get your cluster back up and running: commit:70c9851a55808b7a3d081f84dedb43c5484176b1
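For reference, the debug settings Sage asks for are usually set in the [mds] section of ceph.conf before restarting the daemon; this is a sketch, and the exact restart command depends on your init setup:

```
[mds]
    debug mds = 20
    debug ms = 1
```

Then restart the mds daemon (e.g. via your distribution's ceph init script) and let it run until it crashes; the resulting /var/log/ceph/ceph-mds.*.log is what's wanted here.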
Updated by Sage Weil almost 11 years ago
- Status changed from 12 to Need More Info
Updated by Zheng Yan almost 11 years ago
Sage Weil wrote:
Argh... I don't have a log after all.
Yan, dropping the assert avoids the crash, but it seems like the real issue is that we had caps on a replicated inode in another MDS's stray dir. To reconnect those, we need to open them up during rejoin...
Caps on replicated inodes are exported to their auth MDS, so why should we open them up? I think the real issue is that the client sends all the snap realms it has to the recovering MDS. For a given snap realm, if journal replay doesn't bring its inode into the cache and the recovering MDS doesn't have an auth subtree in it either, it will be left in reconnected_snaprealms.
Updated by Walter Huf almost 11 years ago
- File ceph-mds.0.log.xz ceph-mds.0.log.xz added
- File ceph-mds.1.log.xz ceph-mds.1.log.xz added
I have attached the logs from two nodes of my MDS cluster.
I started mds.0 first. When I started mds.1, mds.0 crashed.
Updated by Sage Weil almost 11 years ago
- Status changed from Need More Info to Resolved