Bug #1708

mon/PGMonitor.cc: 218: FAILED assert(paxos->get_version() + 1 == pending_inc.version)

Added by Josh Pieper over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Monitor
Target version:
Start date: 11/10/2011
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Running ceph version from git: a3dd5bd67ba19aae51a51318138ef10213a91449
Slaves are all ubuntu 11.10, 3.0.0-12
Filesystem is ext4

I have a 3-slave cluster, with each slave running osd, mds, and mon. I had a qemu instance using rbd across the cluster and was testing failover, using /etc/init.d/ceph stop/start to stop and start individual nodes. It worked a few times, but at one point the mon process on one of the slaves crashed.

The mon.0.log is attached.

The assert and backtrace:

2011-11-10 16:34:46.211114 7ffc6d4d0700 mon.0@0(leader) e1 handle_command mon_command(health v 0) v1
2011-11-10 16:34:52.792849 7ffc6cccf700 log [INF] : mon.0@0 won leader election with quorum 0,1
2011-11-10 16:34:58.979151 7ffc6d4d0700 log [INF] : mds.? 192.168.122.74:6800/9301 up:boot
2011-11-10 16:34:58.983151 7ffc6d4d0700 mon.0@0(leader) e1 handle_command mon_command(health v 0) v1
mon/PGMonitor.cc: In function 'virtual void PGMonitor::encode_pending(ceph::bufferlist&)', in thread '7ffc6d4d0700'
mon/PGMonitor.cc: 218: FAILED assert(paxos->get_version() + 1 == pending_inc.version)
 ceph version 0.37-364-ga3dd5bd (commit:a3dd5bd67ba19aae51a51318138ef10213a91449)
 1: (PGMonitor::encode_pending(ceph::buffer::list&)+0x108) [0x4c9fd8]
 2: (PaxosService::propose_pending()+0xd2) [0x48e532]
 3: (PGMonitor::check_osd_map(unsigned int)+0xca0) [0x4d0520]
 4: (Context::complete(int)+0xa) [0x478ffa]
 5: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xca) [0x47a52a]
 6: (Paxos::handle_accept(MMonPaxos*)+0x5d8) [0x488948]
 7: (Paxos::dispatch(PaxosServiceMessage*)+0x23b) [0x48b66b]
 8: (Monitor::_ms_dispatch(Message*)+0xb99) [0x478409]
 9: (Monitor::ms_dispatch(Message*)+0x35) [0x4832b5]
 10: (SimpleMessenger::dispatch_entry()+0x84b) [0x57612b]
 11: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x462e9c]
 12: (()+0x7efc) [0x7ffc70d95efc]
 13: (clone()+0x6d) [0x7ffc6f7cf89d]

ceph.conf View (598 Bytes) Josh Pieper, 11/10/2011 01:45 PM

mon.0.log View (14.6 KB) Josh Pieper, 11/10/2011 01:45 PM

mon.0.log.bz2 (599 KB) Josh Pieper, 11/17/2011 05:18 PM

Associated revisions

Revision 66c628ac (diff)
Added by Sage Weil over 7 years ago

mon: don't propose new state from update_from_paxos

Proposing a new state from within update_from_paxos() confuses some callers,
like PaxosService::_active(). Instead, do it in the on_active() callback.
This also lets us collapse the check_osd_map() caller into on_active(),
and makes it happen on leaders and peons alike, which ought to avoid some
of the pg creation lag we see sometimes (presumably when the osds have
sessions with peons instead of the leader).

Fixes: #1708
Signed-off-by: Sage Weil <>
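
The idea in this commit message can be illustrated with a small toy model. This is only a sketch, not the Ceph code: ToyPaxos, ToyService, eager_flow() and deferred_flow() are hypothetical names, and the ordering in which the pending increment goes stale before the update path runs is an assumption made for illustration. The only piece taken from the report is the invariant itself, paxos->get_version() + 1 == pending_inc.version.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for the Paxos commit counter (hypothetical, not Ceph's Paxos class).
struct ToyPaxos {
    uint64_t version = 0;
    uint64_t get_version() const { return version; }
    void commit() { ++version; }
};

// Toy service whose pending increment must stay exactly one ahead of Paxos.
struct ToyService {
    ToyPaxos& paxos;
    uint64_t pending_version;
    bool propose_wanted = false;
    int proposals = 0;

    explicit ToyService(ToyPaxos& p)
        : paxos(p), pending_version(p.get_version() + 1) {}

    // Same condition as the failed assert in the report.
    bool invariant_holds() const {
        return paxos.get_version() + 1 == pending_version;
    }

    // Start a fresh pending increment on top of the committed version.
    void create_pending() { pending_version = paxos.get_version() + 1; }

    // Refuses to propose when the invariant is broken (the real code asserts).
    bool propose_pending() {
        if (!invariant_holds()) return false;
        ++proposals;
        return true;
    }

    // Deferred style: only record that a proposal is wanted.
    void update_from_paxos() { propose_wanted = true; }

    // Rebuilds the pending state first, then proposes if something is queued.
    bool on_active() {
        create_pending();
        if (!propose_wanted) return true;
        propose_wanted = false;
        return propose_pending();
    }
};

// Assumed bad ordering: propose from the update path before pending is rebuilt.
bool eager_flow() {
    ToyPaxos paxos;
    ToyService svc(paxos);
    paxos.commit();                // version goes 0 -> 1, pending is still 1
    return svc.propose_pending();  // 1 + 1 != 1, so the proposal is rejected
}

// Ordering described in the commit message: defer the proposal to on_active().
bool deferred_flow() {
    ToyPaxos paxos;
    ToyService svc(paxos);
    paxos.commit();                // version goes 0 -> 1
    svc.update_from_paxos();       // nothing is proposed yet
    return svc.on_active();        // pending rebuilt to 2, then proposed
}
```

In this toy model eager_flow() returns false, because the pending version is stale when the proposal is attempted, while deferred_flow() returns true, because on_active() rebuilds the pending state before proposing. Whether the real crash follows exactly this ordering is not established by the report.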

History

#1 Updated by Sage Weil over 7 years ago

  • Category set to Monitor
  • Status changed from New to In Progress
  • Assignee set to Sage Weil
  • Priority changed from Normal to High
  • Target version set to v0.39

#2 Updated by Sage Weil over 7 years ago

  • Status changed from In Progress to Can't reproduce

I fixed a number of bugs in this area, and there was a big refactor. Can you retest the latest and see if you run into problems?

#3 Updated by Josh Pieper over 7 years ago

Yes, I still get the problem with an updated master 6bc9a544b62bb21f6ee7ef51bfbe9111f7add9cb

I had monitor debugging on this time; the full log will be attached. A snippet is inlined:

{{{
2011-11-17 19:50:33.188458 7f60ccb7b700 mon.0@0(leader).pg v2758 no change in pool 2 rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 128 pgp_num 128 lpg_num 2 lpgp_num 2 last_change 1 owner 0
2011-11-17 19:50:33.188517 7f60ccb7b700 mon.0@0(leader).pg v2758 register_new_pgs registered 0 new pgs, removed 0 uncreated pgs
2011-11-17 19:50:33.188524 7f60ccb7b700 mon.0@0(leader).paxosservice(pgmap) propose_pending
2011-11-17 19:50:33.188528 7f60ccb7b700 mon.0@0(leader).pg v2758 encode_pending v 2759
mon/PGMonitor.cc: In function 'virtual void PGMonitor::encode_pending(ceph::bufferlist&)', in thread '7f60ccb7b700'
mon/PGMonitor.cc: 216: FAILED assert(paxos->get_version() + 1 == pending_inc.version)
ceph version 0.38-190-g6bc9a54 (6bc9a544b62bb21f6ee7ef51bfbe9111f7add9cb)
1: (PGMonitor::encode_pending(ceph::buffer::list&)+0x108) [0x4cdf98]
2: (PaxosService::propose_pending()+0xd2) [0x4923a2]
3: (PGMonitor::check_osd_map(unsigned int)+0xcb0) [0x4d42a0]
4: (Context::complete(int)+0xa) [0x47c10a]
5: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xca) [0x47da6a]
6: (Paxos::handle_accept(MMonPaxos*)+0x5d8) [0x48dd38]
7: (Paxos::dispatch(PaxosServiceMessage*)+0x23b) [0x48f4db]
8: (Monitor::_ms_dispatch(Message*)+0xcbf) [0x47b64f]
9: (Monitor::ms_dispatch(Message*)+0x35) [0x486405]
10: (SimpleMessenger::dispatch_entry()+0x84b) [0x583f0b]
11: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x46610c]
12: (()+0x7efc) [0x7f60d0645efc]
13: (clone()+0x6d) [0x7f60cee7a89d]
}}}

#4 Updated by Sage Weil over 7 years ago

  • Status changed from Can't reproduce to Resolved

This latest variation should be fixed by 66c628acc8be71a92e801179431e4b938b857b3d. Thanks for the log!
