Bug #4050
recovery assert failure, osd/PG.cc: 6255: FAILED assert(query.query.type == pg_query_t::MISSING)
Description
2013-02-07 20:58:49.461754 7f518f18c700 -1 osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::ReplicaActive::react(const PG::MQuery&)' thread 7f518f18c700 time 2013-02-07 20:58:49.460049
osd/PG.cc: 6255: FAILED assert(query.query.type == pg_query_t::MISSING)
ceph version 0.56.2-17-g200d5e2 (200d5e2da5ab7a6292f3174b5a38510630e2c91f)
1: /usr/bin/ceph-osd() [0x68cb84]
2: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list4<boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xc8) [0x6d8c28]
3: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x81) [0x6d8d41]
4: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x10a) [0x6dbdaa]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x6c141b]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x6c16f1]
7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x347) [0x6821d7]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c8) [0x62c238]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x10) [0x662530]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x82a3d6]
11: (ThreadPool::WorkThread::entry()+0x10) [0x82c200]
12: (()+0x7e9a) [0x7f51a021fe9a]
13: (clone()+0x6d) [0x7f519e5b8cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Ian Colle about 11 years ago
Any update on this? Should we downgrade?
Updated by Ian Colle about 11 years ago
- Status changed from 12 to New
- Priority changed from Urgent to Normal
Updated by Samuel Just about 11 years ago
- Priority changed from Normal to Urgent
Reproduced it by accident.
osd.2 (primary):
2013-03-13 18:09:58.201224 7f038ebab700 1 -- 10.214.131.37:6809/22140 --> 10.214.131.37:6801/18277 -- pg_query(1.5,1.6,1.7,1.8,1.b,2.4,2.5,2.6,2.7,2.a epoch 219) v2 -- ?+0 0x2b0fa80 con 0x2896c60
2013-03-13 18:09:59.606409 7f038ebab700 1 -- 10.214.131.37:6809/22140 --> 10.214.131.37:6801/18277 -- pg_query(1.5,1.6,1.7,1.8,1.b,2.4,2.5,2.6,2.7,2.a epoch 222) v2 -- ?+0 0x2cc51c0 con 0x2896c60
2013-03-13 18:10:10.741770 7f038f3ac700 1 -- 10.214.131.37:6809/22140 --> osd.0 10.214.131.37:6801/18277 -- pg_log(1.6 epoch 223 query_epoch 223) v3 -- ?+0 0x290d680
osd.0 (replica):
-4883> 2013-03-13 18:10:10.760991 7f372cf3a700 1 -- 10.214.131.37:6801/18277 <== osd.2 10.214.131.37:6809/22140 6 ==== pg_query(1.5,1.6,1.7,1.8,1.b,2.4,2.5,2.6,2.7,2.a epoch 219) v2 ==== 1262+0+0 (3455695585 0 0) 0x3d581c0 con 0x26d0160
-1831> 2013-03-13 18:10:10.825331 7f372cf3a700 1 -- 10.214.131.37:6801/18277 <== osd.2 10.214.131.37:6809/22140 25 ==== pg_log(1.6 epoch 223 query_epoch 223) v3 ==== 600+0+0 (1810365554 0 0) 0x3b58b00 con 0x27ee840
-1540> 2013-03-13 18:10:10.827229 7f372cf3a700 1 -- 10.214.131.37:6801/18277 <== osd.2 10.214.131.37:6809/22140 13 ==== pg_query(1.5,1.6,1.7,1.8,1.b,2.4,2.5,2.6,2.7,2.a epoch 222) v2 ==== 1262+0+0 (3247665410 0 0) 0x3f641c0 con 0x26d0160
Primary sends query(info, 219), query(info, 222), log(223)
Replica sees query(info, 219), log(223), query(info, 222)
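The reordering above can be replayed with a toy model. This is only a sketch, not Ceph code: the Msg and ToyReplica names, the "Stray" starting state, and the rule that receiving the primary's log moves the replica into ReplicaActive are assumptions made for illustration. The only part taken from this report is the check that a query handled in ReplicaActive must be of type MISSING.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Msg:
    kind: str        # "query" or "log"
    epoch: int
    qtype: str = ""  # "INFO" or "MISSING"; only meaningful for queries


class ToyReplica:
    """Hypothetical stand-in for the replica's peering state machine."""

    def __init__(self):
        self.state = "Stray"  # assumed pre-activation state

    def handle(self, msg):
        if msg.kind == "log":
            # Assumption: the primary's log activates the replica.
            self.state = "ReplicaActive"
        elif msg.kind == "query" and self.state == "ReplicaActive":
            # Mirrors the failed check reported at osd/PG.cc:6255.
            if msg.qtype != "MISSING":
                raise AssertionError(
                    f"{msg.qtype} query from epoch {msg.epoch} in ReplicaActive"
                )


def replay(messages):
    """Return the index of the message that trips the check, or None."""
    replica = ToyReplica()
    for index, msg in enumerate(messages):
        try:
            replica.handle(msg)
        except AssertionError:
            return index
    return None


# Order in which the primary sent the messages.
SENT = [Msg("query", 219, "INFO"), Msg("query", 222, "INFO"), Msg("log", 223)]
# Order in which the replica saw them.
SEEN = [Msg("query", 219, "INFO"), Msg("log", 223), Msg("query", 222, "INFO")]
```

Under these assumptions, replay(SENT) runs to completion, while replay(SEEN) stops at the third message: the query(info, 222) that was delivered after log(223).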
roles:
- - mon.0
- osd.0
- osd.1
- osd.2
- osd.3
- client.0
#- - osd.4
- - osd.5
- - client.1
overrides:
ceph:
valgrind: - osd:
- - --tool=memcheck
- path: /home/samuelj/ceph2
- branch: wip_sam_test
branch: wip_omap_snaps - branch: master
fs: xfs
log-whitelist:
- clocks not synchronized
conf: - global:
- ms inject socket failures: 500
osd:
lockdep : false
debug osd : 20
debug ms : 1
debug filestore : 20
debug journal : 20
debug objecter : 20
debug client : 20
debug optracker : 20
osd max backfills : 20
osd recovery max chunk : 1000
osd recovery max active : 50
filestore debug verify split : true
osd debug verify snaps on info : true
journal write header frequency : 200 - osd recover clone overlap : false
- filestore btrfs snap : 0
tasks:
- install: null
- ceph: null
- thrashosds:
    chance_down: 70
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
- rados:
    runs: 100
    clients:
    - client.0
    objects: 1000
    object_size: 4194
    op_weights:
      write: 100
      delete: 50
      read: 100
      snap_create: 10
      rollback: 20
      snap_remove: 8
      setattr: 100
      rmattr: 50
    ops: 3000
Updated by Samuel Just about 11 years ago
- Status changed from New to Pending Backport
Updated by Samuel Just about 11 years ago
- Status changed from Pending Backport to Resolved