Bug #1708
mon/PGMonitor.cc: 218: FAILED assert(paxos->get_version() + 1 == pending_inc.version)
Description
Running ceph version from git: a3dd5bd67ba19aae51a51318138ef10213a91449
Slaves are all ubuntu 11.10, 3.0.0-12
Filesystem is ext4
I have a 3-slave cluster, with each slave running osd, mds, and mon. I had a qemu instance running rbd across the cluster and was testing failover, using /etc/init.d/ceph stop/start to stop and start individual nodes. It worked a few times, but at one point the mon process on one of the slaves crashed.
The mon.0.log is attached.
The assert and backtrace:
2011-11-10 16:34:46.211114 7ffc6d4d0700 mon.0@0(leader) e1 handle_command mon_command(health v 0) v1
2011-11-10 16:34:52.792849 7ffc6cccf700 log [INF] : mon.0@0 won leader election with quorum 0,1
2011-11-10 16:34:58.979151 7ffc6d4d0700 log [INF] : mds.? 192.168.122.74:6800/9301 up:boot
2011-11-10 16:34:58.983151 7ffc6d4d0700 mon.0@0(leader) e1 handle_command mon_command(health v 0) v1
mon/PGMonitor.cc: In function 'virtual void PGMonitor::encode_pending(ceph::bufferlist&)', in thread '7ffc6d4d0700'
mon/PGMonitor.cc: 218: FAILED assert(paxos->get_version() + 1 == pending_inc.version)
ceph version 0.37-364-ga3dd5bd (commit:a3dd5bd67ba19aae51a51318138ef10213a91449)
1: (PGMonitor::encode_pending(ceph::buffer::list&)+0x108) [0x4c9fd8]
2: (PaxosService::propose_pending()+0xd2) [0x48e532]
3: (PGMonitor::check_osd_map(unsigned int)+0xca0) [0x4d0520]
4: (Context::complete(int)+0xa) [0x478ffa]
5: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xca) [0x47a52a]
6: (Paxos::handle_accept(MMonPaxos*)+0x5d8) [0x488948]
7: (Paxos::dispatch(PaxosServiceMessage*)+0x23b) [0x48b66b]
8: (Monitor::_ms_dispatch(Message*)+0xb99) [0x478409]
9: (Monitor::ms_dispatch(Message*)+0x35) [0x4832b5]
10: (SimpleMessenger::dispatch_entry()+0x84b) [0x57612b]
11: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x462e9c]
12: (()+0x7efc) [0x7ffc70d95efc]
13: (clone()+0x6d) [0x7ffc6f7cf89d]
Updated by Sage Weil over 12 years ago
- Category set to Monitor
- Status changed from New to In Progress
- Assignee set to Sage Weil
- Priority changed from Normal to High
- Target version set to v0.39
Updated by Sage Weil over 12 years ago
- Status changed from In Progress to Can't reproduce
I fixed a number of bugs in this area, and there was a big refactor. Can you retest the latest and see if you run into problems?
Updated by Josh Pieper over 12 years ago
- File mon.0.log.bz2 added
Yes, I still get the problem with an updated master 6bc9a544b62bb21f6ee7ef51bfbe9111f7add9cb
I had monitor debugging on this time; the full log will be attached. A snippet is inlined:
{{{
2011-11-17 19:50:33.188458 7f60ccb7b700 mon.0@0(leader).pg v2758 no change in pool 2 rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 128 pgp_num 128 lpg_num 2 lpgp_num 2 last_change 1 owner 0
2011-11-17 19:50:33.188517 7f60ccb7b700 mon.0@0(leader).pg v2758 register_new_pgs registered 0 new pgs, removed 0 uncreated pgs
2011-11-17 19:50:33.188524 7f60ccb7b700 mon.0@0(leader).paxosservice(pgmap) propose_pending
2011-11-17 19:50:33.188528 7f60ccb7b700 mon.0@0(leader).pg v2758 encode_pending v 2759
mon/PGMonitor.cc: In function 'virtual void PGMonitor::encode_pending(ceph::bufferlist&)', in thread '7f60ccb7b700'
mon/PGMonitor.cc: 216: FAILED assert(paxos->get_version() + 1 == pending_inc.version)
ceph version 0.38-190-g6bc9a54 (6bc9a544b62bb21f6ee7ef51bfbe9111f7add9cb)
1: (PGMonitor::encode_pending(ceph::buffer::list&)+0x108) [0x4cdf98]
2: (PaxosService::propose_pending()+0xd2) [0x4923a2]
3: (PGMonitor::check_osd_map(unsigned int)+0xcb0) [0x4d42a0]
4: (Context::complete(int)+0xa) [0x47c10a]
5: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xca) [0x47da6a]
6: (Paxos::handle_accept(MMonPaxos*)+0x5d8) [0x48dd38]
7: (Paxos::dispatch(PaxosServiceMessage*)+0x23b) [0x48f4db]
8: (Monitor::_ms_dispatch(Message*)+0xcbf) [0x47b64f]
9: (Monitor::ms_dispatch(Message*)+0x35) [0x486405]
10: (SimpleMessenger::dispatch_entry()+0x84b) [0x583f0b]
11: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x46610c]
12: (()+0x7efc) [0x7f60d0645efc]
13: (clone()+0x6d) [0x7f60cee7a89d]
}}}
Updated by Sage Weil over 12 years ago
- Status changed from Can't reproduce to Resolved
This latest variation should be fixed by 66c628acc8be71a92e801179431e4b938b857b3d. Thanks for the log!