Bug #4050

closed

recovery assert failure, osd/PG.cc: 6255: FAILED assert(query.query.type == pg_query_t::MISSING)

Added by Samuel Just about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: bobtail
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-02-07 20:58:49.461754 7f518f18c700 -1 osd/PG.cc: In function 'boost::statechart::result PG::RecoveryState::ReplicaActive::react(const PG::MQuery&)' thread 7f518f18c700 time 2013-02-07 20:58:49.460049
osd/PG.cc: 6255: FAILED assert(query.query.type == pg_query_t::MISSING)

ceph version 0.56.2-17-g200d5e2 (200d5e2da5ab7a6292f3174b5a38510630e2c91f)
1: /usr/bin/ceph-osd() [0x68cb84]
2: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list4<boost::statechart::custom_reaction<PG::MQuery>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Activate> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0xc8) [0x6d8c28]
3: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x81) [0x6d8d41]
4: (boost::statechart::simple_state<PG::RecoveryState::RepNotRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x10a) [0x6dbdaa]
5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x6c141b]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x6c16f1]
7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x347) [0x6821d7]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c8) [0x62c238]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x10) [0x662530]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x82a3d6]
11: (ThreadPool::WorkThread::entry()+0x10) [0x82c200]
12: (()+0x7e9a) [0x7f51a021fe9a]
13: (clone()+0x6d) [0x7f519e5b8cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
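
As the NOTE says, the bracketed addresses are only meaningful next to the matching binary. A minimal sketch of pulling the frame numbers and addresses out of a trace like the one above, so they can be handed to a symbolizer; the helper name `frame_addresses` and the `SAMPLE` string are illustrative and not part of Ceph, and the regex assumes each frame sits on a single line:

```python
import re
from typing import List, Tuple

# Matches one backtrace frame such as
#   "1: /usr/bin/ceph-osd() [0x68cb84]"
# capturing the frame index and the bracketed return address.
FRAME_RE = re.compile(r"^\s*(\d+):\s.*\[(0x[0-9a-fA-F]+)\]\s*$")


def frame_addresses(trace: str) -> List[Tuple[int, str]]:
    """Return (frame index, address) pairs for every frame line in `trace`."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


# Three frames copied from the trace in this report.
SAMPLE = """\
 1: /usr/bin/ceph-osd() [0x68cb84]
 7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x347) [0x6821d7]
 12: (()+0x7e9a) [0x7f51a021fe9a]
"""

if __name__ == "__main__":
    for index, address in frame_addresses(SAMPLE):
        print(index, address)
```

The addresses could then be passed to a tool such as `addr2line -e <executable>`, provided the executable is the exact 0.56.2-17-g200d5e2 build. Frames 12 and 13 carry much higher addresses and most likely belong to shared libraries rather than ceph-osd itself.
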
#1

Updated by Ian Colle about 11 years ago

Any update on this? Should we downgrade?

#2

Updated by Ian Colle about 11 years ago

  • Status changed from 12 to New
  • Priority changed from Urgent to Normal
#3

Updated by Samuel Just about 11 years ago

  • Priority changed from Normal to Urgent

Reproduced it by accident.

osd.2 (primary):
2013-03-13 18:09:58.201224 7f038ebab700 1 -- 10.214.131.37:6809/22140 --> 10.214.131.37:6801/18277 -- pg_query(1.5,1.6,1.7,1.8,1.b,2.4,2.5,2.6,2.7,2.a epoch 219) v2 -- ?+0 0x2b0fa80 con 0x2896c60
2013-03-13 18:09:59.606409 7f038ebab700 1 -- 10.214.131.37:6809/22140 --> 10.214.131.37:6801/18277 -- pg_query(1.5,1.6,1.7,1.8,1.b,2.4,2.5,2.6,2.7,2.a epoch 222) v2 -- ?+0 0x2cc51c0 con 0x2896c60
2013-03-13 18:10:10.741770 7f038f3ac700 1 -- 10.214.131.37:6809/22140 --> osd.0 10.214.131.37:6801/18277 -- pg_log(1.6 epoch 223 query_epoch 223) v3 -- ?+0 0x290d680

osd.0 (replica):
4883> 2013-03-13 18:10:10.760991 7f372cf3a700 1 -- 10.214.131.37:6801/18277 <== osd.2 10.214.131.37:6809/22140 6 ==== pg_query(1.5,1.6,1.7,1.8,1.b,2.4,2.5,2.6,2.7,2.a epoch 219) v2 ==== 1262+0+0 (3455695585 0 0) 0x3d581c0 con 0x26d0160
1831> 2013-03-13 18:10:10.825331 7f372cf3a700 1 -- 10.214.131.37:6801/18277 <== osd.2 10.214.131.37:6809/22140 25 ==== pg_log(1.6 epoch 223 query_epoch 223) v3 ==== 600+0+0 (1810365554 0 0) 0x3b58b00 con 0x27ee840
1540> 2013-03-13 18:10:10.827229 7f372cf3a700 1 -- 10.214.131.37:6801/18277 <== osd.2 10.214.131.37:6809/22140 13 ==== pg_query(1.5,1.6,1.7,1.8,1.b,2.4,2.5,2.6,2.7,2.a epoch 222) v2 ==== 1262+0+0 (3247665410 0 0) 0x3f641c0 con 0x26d0160

Primary sends query(info, 219), query(info, 222), log(223)
Replica sees query(info, 219), log(223), query(info, 222)
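
The reordering above can be replayed in a small sketch. This is a toy model, not the real PG state machine: `Msg`, `ToyReplica`, `reordered` and `replay` are hypothetical names, and the model assumes that the pg_log is what moves the replica into ReplicaActive, where an info query trips the assert from the trace.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Msg:
    kind: str   # "query_info" or "log"
    epoch: int


# Order in which osd.2 (primary) sent the messages, per the excerpt above.
SENT = [Msg("query_info", 219), Msg("query_info", 222), Msg("log", 223)]
# Order in which osd.0 (replica) dispatched them.
RECEIVED = [Msg("query_info", 219), Msg("log", 223), Msg("query_info", 222)]


def reordered(sent: List[Msg], received: List[Msg]) -> List[Msg]:
    """Return messages delivered after a message that was sent later."""
    position = {msg: i for i, msg in enumerate(sent)}
    late, highest = [], -1
    for msg in received:
        if position[msg] < highest:
            late.append(msg)
        highest = max(highest, position[msg])
    return late


class ToyReplica:
    """Two-state stand-in for the replica; an assumption, not Ceph code."""

    def __init__(self) -> None:
        self.active = False

    def handle(self, msg: Msg) -> None:
        if msg.kind == "log":
            # Assumption: receiving the log activates the replica.
            self.active = True
        elif msg.kind == "query_info" and self.active:
            # Stand-in for assert(query.query.type == pg_query_t::MISSING).
            raise AssertionError("info query received while active")


def replay(messages: List[Msg]) -> Optional[Msg]:
    """Feed messages to a fresh replica; return the one that trips the assert."""
    replica = ToyReplica()
    for msg in messages:
        try:
            replica.handle(msg)
        except AssertionError:
            return msg
    return None


if __name__ == "__main__":
    print("late:", reordered(SENT, RECEIVED))
    print("send order trips on:", replay(SENT))
    print("receive order trips on:", replay(RECEIVED))
```

Under these assumptions the send order replays cleanly, while the receive order fails on the epoch 222 info query, which matches the summary above. Note that in the replica log the pg_log arrived on con 0x27ee840 while both queries arrived on con 0x26d0160.
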

interactive-on-error: true
roles:
- - mon.0
  - osd.0
  - osd.1
  - osd.2
  - osd.3
  - client.0
#- - osd.4
#  - osd.5
#  - client.1
overrides:
  ceph:
    valgrind:
#      osd:
#      - --tool=memcheck
#    path: /home/samuelj/ceph2
#    branch: wip_sam_test
    branch: wip_omap_snaps
#    branch: master
    fs: xfs
    log-whitelist:
    - clocks not synchronized
    conf:
#      global:
#        ms inject socket failures: 500
      osd:
        lockdep : false
        debug osd : 20
        debug ms : 1
        debug filestore : 20
        debug journal : 20
        debug objecter : 20
        debug client : 20
        debug optracker : 20
        osd max backfills : 20
        osd recovery max chunk : 1000
        osd recovery max active : 50
        filestore debug verify split : true
        osd debug verify snaps on info : true
        journal write header frequency : 200
#        osd recover clone overlap : false
#        filestore btrfs snap : 0
tasks:
- install: null
- ceph: null
- thrashosds:
    chance_down: 70
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
- rados:
    runs: 100
    clients:
    - client.0
    objects: 1000
    object_size: 4194
    op_weights:
      write: 100
      delete: 50
      read: 100
      snap_create: 10
      rollback: 20
      snap_remove: 8
      setattr: 100
      rmattr: 50
    ops: 3000
#4

Updated by Samuel Just about 11 years ago

logs in ubuntu@plana03:~/bug_4050/

#5

Updated by Samuel Just about 11 years ago

Fix pending merge of wip_4196

#6

Updated by Samuel Just about 11 years ago

  • Status changed from New to Pending Backport
#7

Updated by Samuel Just about 11 years ago

  • Status changed from Pending Backport to Resolved