Bug #15391
ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty()): old watch timeout tries to queue repop on replica
Description
Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-04-05_02:10:01-upgrade:hammer-x-jewel-distro-basic-openstack/
Job: 29491
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-04-05_02:10:01-upgrade:hammer-x-jewel-distro-basic-openstack/29491/teuthology.log
2016-04-05T08:29:41.100 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-3, ['dump_ops_in_flight']
2016-04-05T08:29:41.359 INFO:tasks.ceph.osd.3.target087062.stderr:2016-04-05 08:29:41.321997 7f983e2938c0 -1 osd.3 3158 log_to_monitors {default=true}
2016-04-05T08:29:45.533 INFO:tasks.ceph.osd.2.target087063.stderr:osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t&)' thread 7f1fa34e3700 time 2016-04-05 08:29:45.426142
2016-04-05T08:29:45.533 INFO:tasks.ceph.osd.2.target087063.stderr:osd/ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty())
2016-04-05T08:29:45.535 INFO:tasks.ceph.osd.2.target087063.stderr: ceph version 10.1.0-445-g370e4f7 (370e4f773a5347a2d0c0493ccf3dc55909b75bce)
2016-04-05T08:29:45.536 INFO:tasks.ceph.osd.2.target087063.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f1fc271516b]
2016-04-05T08:29:45.536 INFO:tasks.ceph.osd.2.target087063.stderr: 2: (()+0x4b8f74) [0x7f1fc21eff74]
2016-04-05T08:29:45.536 INFO:tasks.ceph.osd.2.target087063.stderr: 3: (ReplicatedPG::handle_watch_timeout(std::shared_ptr<Watch>)+0x125) [0x7f1fc21fe375]
2016-04-05T08:29:45.536 INFO:tasks.ceph.osd.2.target087063.stderr: 4: (HandleDelayedWatchTimeout::finish(int)+0xd3) [0x7f1fc213e2e3]
2016-04-05T08:29:45.537 INFO:tasks.ceph.osd.2.target087063.stderr: 5: (Context::complete(int)+0x9) [0x7f1fc20ead29]
2016-04-05T08:29:45.537 INFO:tasks.ceph.osd.2.target087063.stderr: 6: (ReplicatedPG::finish_degraded_object(hobject_t const&)+0x354) [0x7f1fc22429e4]
2016-04-05T08:29:45.537 INFO:tasks.ceph.osd.2.target087063.stderr: 7: (ReplicatedPG::on_change(ObjectStore::Transaction*)+0x2ba) [0x7f1fc224353a]
2016-04-05T08:29:45.538 INFO:tasks.ceph.osd.2.target087063.stderr: 8: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x7bd) [0x7f1fc219a0bd]
2016-04-05T08:29:45.538 INFO:tasks.ceph.osd.2.target087063.stderr: 9: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f0) [0x7f1fc219b100]
2016-04-05T08:29:45.538 INFO:tasks.ceph.osd.2.target087063.stderr: 10: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1fc) [0x7f1fc21ce9fc]
2016-04-05T08:29:45.539 INFO:tasks.ceph.osd.2.target087063.stderr: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f1fc21b5e9b]
2016-04-05T08:29:45.539 INFO:tasks.ceph.osd.2.target087063.stderr: 12: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd8) [0x7f1fc21b6008]
2016-04-05T08:29:45.539 INFO:tasks.ceph.osd.2.target087063.stderr: 13: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x499) [0x7f1fc218fde9]
2016-04-05T08:29:45.539 INFO:tasks.ceph.osd.2.target087063.stderr: 14: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2b9) [0x7f1fc20ca3e9]
2016-04-05T08:29:45.540 INFO:tasks.ceph.osd.2.target087063.stderr: 15: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1a1) [0x7f1fc20de051]
2016-04-05T08:29:45.540 INFO:tasks.ceph.osd.2.target087063.stderr: 16: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f1fc21269b2]
2016-04-05T08:29:45.540 INFO:tasks.ceph.osd.2.target087063.stderr: 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x7f1fc27066ce]
2016-04-05T08:29:45.540 INFO:tasks.ceph.osd.2.target087063.stderr: 18: (ThreadPool::WorkThread::entry()+0x10) [0x7f1fc27075b0]
2016-04-05T08:29:45.541 INFO:tasks.ceph.osd.2.target087063.stderr: 19: (()+0x8182) [0x7f1fc09d8182]
2016-04-05T08:29:45.541 INFO:tasks.ceph.osd.2.target087063.stderr: 20: (clone()+0x6d) [0x7f1fbeb0647d]
2016-04-05T08:29:45.541 INFO:tasks.ceph.osd.2.target087063.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Related issues
History
#1 Updated by Yuri Weinstein almost 8 years ago
- Description updated (diff)
#2 Updated by Yuri Weinstein almost 8 years ago
- Status changed from New to 12
#3 Updated by Sage Weil almost 8 years ago
- Subject changed from "ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty())" in upgrade:hammer-x-jewel-distro-basic-openstack to ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty()): old watch timeout tries to queue repop on replica
- Status changed from 12 to Fix Under Review
- Backport set to hammer
-3189> 2016-04-05 08:29:45.426046 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] finish_degraded_object 32:93e5b521:::notify.7:head Watch(70752016,client.34804) HandleWatchTimeoutDelayed
-3186> 2016-04-05 08:29:45.426070 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] -
-3184> 2016-04-05 08:29:45.426083 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] handle_watch_timeout obc 0x7f1fd22a9c00 Watch(70752016,client.34804) HandleWatchTimeoutDelayed
-3181> 2016-04-05 08:29:45.426097 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] handle_watch_timeout waiting for degraded on obj 32:93e5b521:::notify
-3179> 2016-04-05 08:29:45.426107 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] finish_degraded_object 32:93e5b521:::notify.7:head
-3177> 2016-04-05 08:29:45.426117 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] -
-3175> 2016-04-05 08:29:45.426128 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] handle_watch_timeout obc 0x7f1fd22a9c00
... crash.
handle_watch_timeout assumes it is primary and active; in this case, neither is true. I think a simple is_active check is sufficient to catch this cleanup case from on_change.
#4 Updated by Sage Weil almost 8 years ago
- Assignee set to Sage Weil
#5 Updated by Yuri Weinstein almost 8 years ago
#6 Updated by Yuri Weinstein almost 8 years ago
- ceph-qa-suite upgrade/infernalis-x added
In infernalis-x as well
Run: http://pulpito.ceph.com/teuthology-2016-04-10_02:10:02-upgrade:infernalis-x-jewel-distro-basic-vps/
Job: 119573
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-04-10_02:10:02-upgrade:infernalis-x-jewel-distro-basic-vps/119573/teuthology.log
2016-04-10T07:11:21.866 INFO:tasks.ceph.osd.2.vpm040.stderr:osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t&)' thread 7f5607eef700 time 2016-04-10 14:11:21.766679
2016-04-10T07:11:21.867 INFO:tasks.ceph.osd.2.vpm040.stderr:osd/ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty())
2016-04-10T07:11:21.932 INFO:tasks.ceph.osd.2.vpm040.stderr: ceph version 10.1.1 (ce50389b773fe7f72fca40a3dd69cfe6613eaeb1)
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f5627118f4b]
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 2: (()+0x4b8924) [0x7f5626bf3924]
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 3: (ReplicatedPG::handle_watch_timeout(std::shared_ptr<Watch>)+0x125) [0x7f5626c01d25]
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 4: (HandleDelayedWatchTimeout::finish(int)+0xd3) [0x7f5626b41a53]
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 5: (Context::complete(int)+0x9) [0x7f5626aeebd9]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 6: (ReplicatedPG::finish_degraded_object(hobject_t const&)+0x354) [0x7f5626c46a84]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 7: (ReplicatedPG::on_change(ObjectStore::Transaction*)+0x2ba) [0x7f5626c475da]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 8: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x7bd) [0x7f5626b9d85d]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 9: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f0) [0x7f5626b9e8a0]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 10: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1fc) [0x7f5626bd229c]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f5626bb972b]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 12: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd8) [0x7f5626bb9898]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 13: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x499) [0x7f5626b93589]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 14: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2b9) [0x7f5626ace279]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 15: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1a1) [0x7f5626ae1ee1]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 16: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f5626b2a162]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x7f562710a4ae]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 18: (ThreadPool::WorkThread::entry()+0x10) [0x7f562710b390]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 19: (()+0x8182) [0x7f56253dc182]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 20: (clone()+0x6d) [0x7f562350a47d]
2016-04-10T07:11:21.937 INFO:tasks.ceph.osd.2.vpm040.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#7 Updated by Sage Weil almost 8 years ago
- Status changed from Fix Under Review to Pending Backport
#8 Updated by Nathan Cutler almost 8 years ago
- Copied to Backport #15525: hammer: ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty()): old watch timeout tries to queue repop on replica added
#9 Updated by Nathan Cutler over 7 years ago
- Status changed from Pending Backport to Resolved