Bug #5236
closed
mds assert when starting file scan
0%
Description
2013-06-03T05:02:42.986 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff0fed1a700 time 2013-06-03 05:03:24.725204
2013-06-03T05:02:42.986 INFO:teuthology.task.ceph.mds.b-s-a.err:mds/Locker.cc: 4442: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)
2013-06-03T05:02:42.986 INFO:teuthology.task.ceph.mds.b-s-a.err: ceph version 0.63-322-gf7c1944 (f7c19440290d4b82ced0320d1dfc4676ad5083d2)
2013-06-03T05:02:42.986 INFO:teuthology.task.ceph.mds.b-s-a.err: 1: (Locker::file_recover(ScatterLock*)+0x1dc) [0x612fcc]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 2: (MDCache::start_files_to_recover(std::vector<CInode*, std::allocator<CInode*> >&, std::vector<CInode*, std::allocator<CInode*> >&)+0x86) [0x57a8e6]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 3: (MDCache::open_snap_parents()+0x9dc) [0x5cb97c]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 4: (MDCache::rejoin_gather_finish()+0x115) [0x5ceac5]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 5: (MDCache::rejoin_send_rejoins()+0x1329) [0x5d4b89]
2013-06-03T05:02:42.987 INFO:teuthology.task.ceph.mds.b-s-a.err: 6: (MDS::rejoin_joint_start()+0x130) [0x4cbf30]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 7: (MDS::handle_mds_map(MMDSMap*)+0x39cc) [0x4e03ec]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 8: (MDS::handle_core_message(Message*)+0xb1b) [0x4e1d0b]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 9: (MDS::_dispatch(Message*)+0x2f) [0x4e1e9f]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 10: (MDS::ms_dispatch(Message*)+0x1d3) [0x4e3923]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 11: (DispatchQueue::entry()+0x3f1) [0x839021]
2013-06-03T05:02:42.988 INFO:teuthology.task.ceph.mds.b-s-a.err: 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7b571d]
2013-06-03T05:02:42.989 INFO:teuthology.task.ceph.mds.b-s-a.err: 13: (()+0x7e9a) [0x7ff103076e9a]
2013-06-03T05:02:42.989 INFO:teuthology.task.ceph.mds.b-s-a.err: 14: (clone()+0x6d) [0x7ff10182cccd]
2013-06-03T05:02:42.989 INFO:teuthology.task.ceph.mds.b-s-a.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
job was
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-03_01:00:48-fs-master-testing-basic/30161$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        debug client: 10
      global:
        mds inject delay type: osd mds
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject socket failures: 2500
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: f7c19440290d4b82ced0320d1dfc4676ad5083d2
  install:
    ceph:
      sha1: f7c19440290d4b82ced0320d1dfc4676ad5083d2
  s3tests:
    branch: master
  workunit:
    sha1: f7c19440290d4b82ced0320d1dfc4676ad5083d2
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
  - mds.b-s-a
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mds_thrash: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/fsstress.sh
Files
Updated by Sage Weil almost 11 years ago
- Category set to 1
- Priority changed from Normal to Urgent
- Source changed from other to Q/A
Updated by Sage Weil almost 11 years ago
- Assignee set to Sage Weil
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-06-03_01:00:48-fs-master-testing-basic/30161
Updated by Sage Weil almost 11 years ago
- File ceph-mds.a.log added
- Status changed from New to 12
- Assignee changed from Sage Weil to Zheng Yan
Yan, I got as far as identifying the problem: rejoin_gather_finish() -> identify_files_to_recover() is getting called twice, once from rejoin_start() (via the completion check at the end of rejoin_start() -> process_imported_caps()) and again from rejoin_joint_start() -> rejoin_send_rejoins(). I think this broke with one of your recent changes... do you have any quick thoughts on the cleanest way to resolve it? Maybe a bool so that we only do rejoin_gather_finish() once? It could short-circuit the entire rejoin_send_rejoins() call in rejoin_joint_start(): set it to true in rejoin_start(), clear it in rejoin_gather_finish(), and add a guard...
Full mds log is attached.
Thanks!
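For illustration only, the run-once guard suggested above might look roughly like the sketch below. RejoinSketch, gather_finish_pending, maybe_gather_finish() and gather_finish_calls are hypothetical names invented for this example; this is not the real MDCache code and not necessarily how the bug was fixed.

```cpp
#include <cassert>

// Hypothetical stand-in for the MDS rejoin path; NOT the real MDS/MDCache classes.
class RejoinSketch {
public:
  int gather_finish_calls = 0;     // how many times the finish step actually ran

  void rejoin_start() {
    gather_finish_pending = true;  // arm the guard
    maybe_gather_finish();         // stands in for the completion check at the end of rejoin_start
  }

  void rejoin_joint_start() {
    maybe_gather_finish();         // second path that could otherwise repeat the finish step
  }

private:
  bool gather_finish_pending = false;  // explicitly initialized (assumed name)

  void maybe_gather_finish() {
    if (!gather_finish_pending)
      return;                      // guard: the finish step already ran, skip it
    gather_finish_pending = false; // clear the flag so the step runs at most once
    ++gather_finish_calls;
  }
};
```

With this shape, calling rejoin_start() followed by rejoin_joint_start() runs the finish step a single time, which is the property the assert in Locker::file_recover() depends on.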
Updated by Greg Farnum almost 11 years ago
- Project changed from Ceph to CephFS
- Category changed from 1 to 47
Updated by Zheng Yan almost 11 years ago
Looks like I forgot to initialize MDCache::rejoins_pending.
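As a general C++ note, a bool member that is never initialized holds an indeterminate value, so any code that branches on it can go either way from run to run. A minimal sketch of that kind of fix follows, assuming rejoins_pending is a plain bool that should start out false; both the type and the default are assumptions, and CacheSketch is a hypothetical stand-in rather than the real MDCache.

```cpp
#include <cassert>

// Hypothetical stand-in; only illustrates initializing a flag in the constructor.
struct CacheSketch {
  bool rejoins_pending;  // assumed type; without the initializer below its value is indeterminate

  CacheSketch()
    : rejoins_pending(false)  // assumed default: nothing pending at construction
  {}
};
```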
Updated by Sage Weil almost 11 years ago
commit:2d655bde8de9ad255d63718768558399cacd7068
thanks!