Bug #1708

closed

mon/PGMonitor.cc: 218: FAILED assert(paxos->get_version() + 1 == pending_inc.version)

Added by Josh Pieper over 12 years ago. Updated over 12 years ago.

Status: Resolved
Priority: High
Category: Monitor
% Done: 0%

Description

Running ceph version from git: a3dd5bd67ba19aae51a51318138ef10213a91449
Slaves are all ubuntu 11.10, 3.0.0-12
Filesystem is ext4

I have a 3-slave cluster, with each slave running an osd, an mds, and a mon. I had a qemu instance running on rbd across the cluster and was testing failover, using /etc/init.d/ceph stop/start to stop and start individual nodes. It worked a few times, but then at one point the mon process on one of the slaves crashed.

The mon.0.log is attached.

The assert and backtrace:

2011-11-10 16:34:46.211114 7ffc6d4d0700 mon.0@0(leader) e1 handle_command mon_command(health v 0) v1
2011-11-10 16:34:52.792849 7ffc6cccf700 log [INF] : mon.0@0 won leader election with quorum 0,1
2011-11-10 16:34:58.979151 7ffc6d4d0700 log [INF] : mds.? 192.168.122.74:6800/9301 up:boot
2011-11-10 16:34:58.983151 7ffc6d4d0700 mon.0@0(leader) e1 handle_command mon_command(health v 0) v1
mon/PGMonitor.cc: In function 'virtual void PGMonitor::encode_pending(ceph::bufferlist&)', in thread '7ffc6d4d0700'
mon/PGMonitor.cc: 218: FAILED assert(paxos->get_version() + 1 == pending_inc.version)
 ceph version 0.37-364-ga3dd5bd (commit:a3dd5bd67ba19aae51a51318138ef10213a91449)
 1: (PGMonitor::encode_pending(ceph::buffer::list&)+0x108) [0x4c9fd8]
 2: (PaxosService::propose_pending()+0xd2) [0x48e532]
 3: (PGMonitor::check_osd_map(unsigned int)+0xca0) [0x4d0520]
 4: (Context::complete(int)+0xa) [0x478ffa]
 5: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xca) [0x47a52a]
 6: (Paxos::handle_accept(MMonPaxos*)+0x5d8) [0x488948]
 7: (Paxos::dispatch(PaxosServiceMessage*)+0x23b) [0x48b66b]
 8: (Monitor::_ms_dispatch(Message*)+0xb99) [0x478409]
 9: (Monitor::ms_dispatch(Message*)+0x35) [0x4832b5]
 10: (SimpleMessenger::dispatch_entry()+0x84b) [0x57612b]
 11: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x462e9c]
 12: (()+0x7efc) [0x7ffc70d95efc]
 13: (clone()+0x6d) [0x7ffc6f7cf89d]
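
For readers who have not seen this assert before, the condition it checks can be read as "the pending increment must be exactly one version ahead of what paxos has committed". The following is only a minimal sketch of that invariant. `CommittedLog`, `PendingIncrement`, `pending_is_next`, and `create_pending` are hypothetical stand-ins, not the actual Ceph classes or functions, and the sketch does not attempt to explain how the monitor got into the failing state.

```cpp
#include <cstdint>

// Hypothetical stand-in for the committed paxos state (not a Ceph class).
struct CommittedLog {
    std::uint64_t version;  // last committed version
    std::uint64_t get_version() const { return version; }
};

// Hypothetical stand-in for a pending incremental update (not a Ceph class).
struct PendingIncrement {
    std::uint64_t version;  // version this increment would have once committed
};

// The condition from the failed assert: the pending increment must be the
// immediate successor of the committed version.
bool pending_is_next(const CommittedLog& log, const PendingIncrement& inc) {
    return log.get_version() + 1 == inc.version;
}

// Build a fresh pending increment from the current committed version,
// so the invariant holds by construction.
PendingIncrement create_pending(const CommittedLog& log) {
    PendingIncrement inc;
    inc.version = log.get_version() + 1;
    return inc;
}
```

In this sketch the check only fails if the committed version moves after the pending increment was built, or if the increment was built from a different version. Whether either of those matches what happened in the monitor here is not something the backtrace alone shows.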


Files

ceph.conf (598 Bytes), Josh Pieper, 11/10/2011 01:45 PM
mon.0.log (14.6 KB), Josh Pieper, 11/10/2011 01:45 PM
mon.0.log.bz2 (599 KB), Josh Pieper, 11/17/2011 05:18 PM
Actions #1

Updated by Sage Weil over 12 years ago

  • Category set to Monitor
  • Status changed from New to In Progress
  • Assignee set to Sage Weil
  • Priority changed from Normal to High
  • Target version set to v0.39
Actions #2

Updated by Sage Weil over 12 years ago

  • Status changed from In Progress to Can't reproduce

I fixed a number of bugs in this area, and there was a big refactor. Can you retest the latest and see if you run into problems?

Actions #3

Updated by Josh Pieper over 12 years ago

Yes, I still get the problem with an updated master, 6bc9a544b62bb21f6ee7ef51bfbe9111f7add9cb.

I had monitor debugging on this time; the full log will be attached. A snippet is inlined:

{{{
2011-11-17 19:50:33.188458 7f60ccb7b700 mon.0@0(leader).pg v2758 no change in pool 2 rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 128 pgp_num 128 lpg_num 2 lpgp_num 2 last_change 1 owner 0
2011-11-17 19:50:33.188517 7f60ccb7b700 mon.0@0(leader).pg v2758 register_new_pgs registered 0 new pgs, removed 0 uncreated pgs
2011-11-17 19:50:33.188524 7f60ccb7b700 mon.0@0(leader).paxosservice(pgmap) propose_pending
2011-11-17 19:50:33.188528 7f60ccb7b700 mon.0@0(leader).pg v2758 encode_pending v 2759
mon/PGMonitor.cc: In function 'virtual void PGMonitor::encode_pending(ceph::bufferlist&)', in thread '7f60ccb7b700'
mon/PGMonitor.cc: 216: FAILED assert(paxos->get_version() + 1 == pending_inc.version)
ceph version 0.38-190-g6bc9a54 (6bc9a544b62bb21f6ee7ef51bfbe9111f7add9cb)
1: (PGMonitor::encode_pending(ceph::buffer::list&)+0x108) [0x4cdf98]
2: (PaxosService::propose_pending()+0xd2) [0x4923a2]
3: (PGMonitor::check_osd_map(unsigned int)+0xcb0) [0x4d42a0]
4: (Context::complete(int)+0xa) [0x47c10a]
5: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xca) [0x47da6a]
6: (Paxos::handle_accept(MMonPaxos*)+0x5d8) [0x48dd38]
7: (Paxos::dispatch(PaxosServiceMessage*)+0x23b) [0x48f4db]
8: (Monitor::_ms_dispatch(Message*)+0xcbf) [0x47b64f]
9: (Monitor::ms_dispatch(Message*)+0x35) [0x486405]
10: (SimpleMessenger::dispatch_entry()+0x84b) [0x583f0b]
11: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x46610c]
12: (()+0x7efc) [0x7f60d0645efc]
13: (clone()+0x6d) [0x7f60cee7a89d]
}}}

Actions #4

Updated by Sage Weil over 12 years ago

  • Status changed from Can't reproduce to Resolved

This latest variation should be fixed by 66c628acc8be71a92e801179431e4b938b857b3d. Thanks for the log!
