Bug #15391

ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty()): old watch timeout tries to queue repop on replica

Added by Yuri Weinstein over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/hammer-x, upgrade/infernalis-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-04-05_02:10:01-upgrade:hammer-x-jewel-distro-basic-openstack/
Job: 29491
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-04-05_02:10:01-upgrade:hammer-x-jewel-distro-basic-openstack/29491/teuthology.log

2016-04-05T08:29:41.100 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-3, ['dump_ops_in_flight']
2016-04-05T08:29:41.359 INFO:tasks.ceph.osd.3.target087062.stderr:2016-04-05 08:29:41.321997 7f983e2938c0 -1 osd.3 3158 log_to_monitors {default=true}
2016-04-05T08:29:45.533 INFO:tasks.ceph.osd.2.target087063.stderr:osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t&)' thread 7f1fa34e3700 time 2016-04-05 08:29:45.426142
2016-04-05T08:29:45.533 INFO:tasks.ceph.osd.2.target087063.stderr:osd/ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty())
2016-04-05T08:29:45.535 INFO:tasks.ceph.osd.2.target087063.stderr: ceph version 10.1.0-445-g370e4f7 (370e4f773a5347a2d0c0493ccf3dc55909b75bce)
2016-04-05T08:29:45.536 INFO:tasks.ceph.osd.2.target087063.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f1fc271516b]
2016-04-05T08:29:45.536 INFO:tasks.ceph.osd.2.target087063.stderr: 2: (()+0x4b8f74) [0x7f1fc21eff74]
2016-04-05T08:29:45.536 INFO:tasks.ceph.osd.2.target087063.stderr: 3: (ReplicatedPG::handle_watch_timeout(std::shared_ptr<Watch>)+0x125) [0x7f1fc21fe375]
2016-04-05T08:29:45.536 INFO:tasks.ceph.osd.2.target087063.stderr: 4: (HandleDelayedWatchTimeout::finish(int)+0xd3) [0x7f1fc213e2e3]
2016-04-05T08:29:45.537 INFO:tasks.ceph.osd.2.target087063.stderr: 5: (Context::complete(int)+0x9) [0x7f1fc20ead29]
2016-04-05T08:29:45.537 INFO:tasks.ceph.osd.2.target087063.stderr: 6: (ReplicatedPG::finish_degraded_object(hobject_t const&)+0x354) [0x7f1fc22429e4]
2016-04-05T08:29:45.537 INFO:tasks.ceph.osd.2.target087063.stderr: 7: (ReplicatedPG::on_change(ObjectStore::Transaction*)+0x2ba) [0x7f1fc224353a]
2016-04-05T08:29:45.538 INFO:tasks.ceph.osd.2.target087063.stderr: 8: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x7bd) [0x7f1fc219a0bd]
2016-04-05T08:29:45.538 INFO:tasks.ceph.osd.2.target087063.stderr: 9: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f0) [0x7f1fc219b100]
2016-04-05T08:29:45.538 INFO:tasks.ceph.osd.2.target087063.stderr: 10: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1fc) [0x7f1fc21ce9fc]
2016-04-05T08:29:45.539 INFO:tasks.ceph.osd.2.target087063.stderr: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f1fc21b5e9b]
2016-04-05T08:29:45.539 INFO:tasks.ceph.osd.2.target087063.stderr: 12: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd8) [0x7f1fc21b6008]
2016-04-05T08:29:45.539 INFO:tasks.ceph.osd.2.target087063.stderr: 13: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x499) [0x7f1fc218fde9]
2016-04-05T08:29:45.539 INFO:tasks.ceph.osd.2.target087063.stderr: 14: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2b9) [0x7f1fc20ca3e9]
2016-04-05T08:29:45.540 INFO:tasks.ceph.osd.2.target087063.stderr: 15: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1a1) [0x7f1fc20de051]
2016-04-05T08:29:45.540 INFO:tasks.ceph.osd.2.target087063.stderr: 16: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f1fc21269b2]
2016-04-05T08:29:45.540 INFO:tasks.ceph.osd.2.target087063.stderr: 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x7f1fc27066ce]
2016-04-05T08:29:45.540 INFO:tasks.ceph.osd.2.target087063.stderr: 18: (ThreadPool::WorkThread::entry()+0x10) [0x7f1fc27075b0]
2016-04-05T08:29:45.541 INFO:tasks.ceph.osd.2.target087063.stderr: 19: (()+0x8182) [0x7f1fc09d8182]
2016-04-05T08:29:45.541 INFO:tasks.ceph.osd.2.target087063.stderr: 20: (clone()+0x6d) [0x7f1fbeb0647d]
2016-04-05T08:29:45.541 INFO:tasks.ceph.osd.2.target087063.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues

Copied to Ceph - Backport #15525: hammer: ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty()): old watch timeout tries to queue repop on replica Resolved

History

#1 Updated by Yuri Weinstein over 6 years ago

  • Description updated (diff)

#3 Updated by Sage Weil over 6 years ago

  • Subject changed from "ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty())" in upgrade:hammer-x-jewel-distro-basic-openstack to ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty()): old watch timeout tries to queue repop on replica
  • Status changed from 12 to Fix Under Review
  • Backport set to hammer

-3189> 2016-04-05 08:29:45.426046 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] finish_degraded_object 32:93e5b521:::notify.7:head
-3186> 2016-04-05 08:29:45.426070 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] - Watch(70752016,client.34804) HandleWatchTimeoutDelayed
-3184> 2016-04-05 08:29:45.426083 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] handle_watch_timeout obc 0x7f1fd22a9c00
-3181> 2016-04-05 08:29:45.426097 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] handle_watch_timeout waiting for degraded on obj 32:93e5b521:::notify
-3179> 2016-04-05 08:29:45.426107 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] finish_degraded_object 32:93e5b521:::notify.7:head
-3177> 2016-04-05 08:29:45.426117 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] - Watch(70752016,client.34804) HandleWatchTimeoutDelayed
-3175> 2016-04-05 08:29:45.426128 7f1fa34e3700 10 osd.2 pg_epoch: 3160 pg[32.1( v 3040'6 (0'0,3040'6] local-les=3158 n=1 ec=3031 les/c/f 3158/3150/0 3160/3160/3130) [2] r=0 lpr=3160 pi=3149-3159/3 rops=1 crt=3040'6 mlcod 0'0 inactive NIBBLEWISE] handle_watch_timeout obc 0x7f1fd22a9c00

... crash.

handle_watch_timeout assumes the PG is primary and active; in this case, neither is true. I think a simple is_active check is sufficient to catch this cleanup case from on_change.

https://github.com/ceph/ceph/pull/8505
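The proposed guard can be sketched as a minimal, self-contained C++ example. This is an illustration of the pattern only, not Ceph's actual code: `PGStub` and its members are hypothetical stand-ins; see the pull request above for the real change.

```cpp
// Sketch of the is_active guard: when the delayed watch timeout fires
// during on_change() cleanup, the PG is no longer primary+active, so the
// timeout is dropped instead of trying to queue a repop (which would hit
// the empty-actingbackfill assert on a replica or resetting PG).
#include <cassert>

struct PGStub {           // hypothetical stand-in for a ReplicatedPG
  bool active = false;    // PG state, cleared during peering reset
  bool primary = false;
  int repops_queued = 0;

  bool is_active() const { return active && primary; }

  // Returns true if a repop was queued, false if the stale timeout
  // was ignored.
  bool handle_watch_timeout() {
    if (!is_active()) {
      // Cleanup path: old watch timeout delivered via on_change() while
      // the PG is being reset. Bail out early rather than touch
      // per-interval state such as actingbackfill.
      return false;
    }
    ++repops_queued;  // stand-in for queueing the repop on the primary
    return true;
  }
};
```

With this guard, the `HandleDelayedWatchTimeout` context completed from `finish_degraded_object()` during `on_change()` becomes a no-op instead of tripping the assert.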

#4 Updated by Sage Weil over 6 years ago

  • Assignee set to Sage Weil

#6 Updated by Yuri Weinstein over 6 years ago

  • ceph-qa-suite upgrade/infernalis-x added

In infernalis-x as well
Run: http://pulpito.ceph.com/teuthology-2016-04-10_02:10:02-upgrade:infernalis-x-jewel-distro-basic-vps/
Job: 119573
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-04-10_02:10:02-upgrade:infernalis-x-jewel-distro-basic-vps/119573/teuthology.log

2016-04-10T07:11:21.866 INFO:tasks.ceph.osd.2.vpm040.stderr:osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t&)' thread 7f5607eef700 time 2016-04-10 14:11:21.766679
2016-04-10T07:11:21.867 INFO:tasks.ceph.osd.2.vpm040.stderr:osd/ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty())
2016-04-10T07:11:21.932 INFO:tasks.ceph.osd.2.vpm040.stderr: ceph version 10.1.1 (ce50389b773fe7f72fca40a3dd69cfe6613eaeb1)
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f5627118f4b]
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 2: (()+0x4b8924) [0x7f5626bf3924]
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 3: (ReplicatedPG::handle_watch_timeout(std::shared_ptr<Watch>)+0x125) [0x7f5626c01d25]
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 4: (HandleDelayedWatchTimeout::finish(int)+0xd3) [0x7f5626b41a53]
2016-04-10T07:11:21.933 INFO:tasks.ceph.osd.2.vpm040.stderr: 5: (Context::complete(int)+0x9) [0x7f5626aeebd9]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 6: (ReplicatedPG::finish_degraded_object(hobject_t const&)+0x354) [0x7f5626c46a84]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 7: (ReplicatedPG::on_change(ObjectStore::Transaction*)+0x2ba) [0x7f5626c475da]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 8: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x7bd) [0x7f5626b9d85d]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 9: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f0) [0x7f5626b9e8a0]
2016-04-10T07:11:21.934 INFO:tasks.ceph.osd.2.vpm040.stderr: 10: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1fc) [0x7f5626bd229c]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f5626bb972b]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 12: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd8) [0x7f5626bb9898]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 13: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x499) [0x7f5626b93589]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 14: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2b9) [0x7f5626ace279]
2016-04-10T07:11:21.935 INFO:tasks.ceph.osd.2.vpm040.stderr: 15: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1a1) [0x7f5626ae1ee1]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 16: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f5626b2a162]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x7f562710a4ae]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 18: (ThreadPool::WorkThread::entry()+0x10) [0x7f562710b390]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 19: (()+0x8182) [0x7f56253dc182]
2016-04-10T07:11:21.936 INFO:tasks.ceph.osd.2.vpm040.stderr: 20: (clone()+0x6d) [0x7f562350a47d]
2016-04-10T07:11:21.937 INFO:tasks.ceph.osd.2.vpm040.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#7 Updated by Sage Weil over 6 years ago

  • Status changed from Fix Under Review to Pending Backport

#8 Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #15525: hammer: ReplicatedPG.cc: 430: FAILED assert(!actingbackfill.empty()): old watch timeout tries to queue repop on replica added

#9 Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved
