Bug #17830
osd-scrub-repair.sh is failing (intermittently?) on Jenkins
Status: Closed
Description
For instance, https://jenkins.ceph.com/job/ceph-pull-requests/13489/console
An issue with this test was also reported at http://www.spinics.net/lists/ceph-devel/msg33453.html
Perhaps related to https://github.com/ceph/ceph/pull/11449
Updated by David Zafman over 7 years ago
This is caused by an osd crashing. It is not related to any minor test changes.
0> 2016-11-04 13:29:30.532311 7f967ce8b700 -1 /home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h: In function 'ObjectContext::~ObjectContext()' thread 7f967ce8b700 time 2016-11-04 13:29:30.527794
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h: 4287: FAILED assert(rwstate.empty())
ceph version 11.0.2-1088-gfef0b21 (fef0b216957755f40d2c3796f56470e51629fb8d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9694c6871b]
2: (std::_Sp_counted_deleter<ObjectContext*, SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::Cleanup, std::allocator<int>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x1f9) [0x7f96947e18c9]
3: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x7f96945f0229]
4: (SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::clear()+0x152) [0x7f96947f1302]
5: (ReplicatedPG::on_change(ObjectStore::Transaction*)+0x56f) [0x7f96947bea2f]
6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x7c2) [0x7f96946f8d82]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f0) [0x7f96946f9dc0]
8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1fc) [0x7f9694734d0c]
9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f969471618b]
10: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd8) [0x7f9694716308]
11: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x499) [0x7f96946ee959]
12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2b9) [0x7f9694632a79]
13: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1bf) [0x7f969464c5ef]
14: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f9694697b72]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0x7f9694c6f1e5]
16: (ThreadPool::WorkThread::entry()+0x10) [0x7f9694c701b0]
17: (()+0x8182) [0x7f9692ab4182]
18: (clone()+0x6d) [0x7f96914234ad]
Updated by David Zafman over 7 years ago
- Status changed from New to In Progress
Here is the portion of the test that ran. When osd.0 went down to perform the ceph-objectstore-tool list-attrs, the primary osd.1 was handling the advance map and crashed.
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:187: TEST_auto_repair_erasure_coded: objectstore_tool testdir/osd-scrub-repair 2 SOMETHING remove
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:189: TEST_auto_repair_erasure_coded: get_pg ecpool SOMETHING
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:189: TEST_auto_repair_erasure_coded: local pgid=2.7
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:190: TEST_auto_repair_erasure_coded: get_last_scrub_stamp 2.7
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:190: TEST_auto_repair_erasure_coded: wait_for_scrub 2.7 '2016-11-04 13:29:14.921440'
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:191: TEST_auto_repair_erasure_coded: wait_for_clean
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: get_not_primary ecpool SOMETHING
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: objectstore_tool testdir/osd-scrub-repair 0 SOMETHING list-attrs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: return 1
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:31: run: return 1
Updated by Sage Weil over 7 years ago
- Has duplicate Bug #17860: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) added
Updated by Loïc Dachary over 7 years ago
https://github.com/ceph/ceph/pull/11979/commits/8854cca4164f9184cc549ba0b90b44515933de8c disables osd-scrub-repair.sh until this is fixed, because it fails most of the time. To verify that a fix is good, it is recommended to run the test in a loop on a heavily loaded machine. For instance, running ./run-make-check.sh at the same time is enough to recreate conditions similar to Jenkins on a machine such as rex001.
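For illustration, a reproduction loop along these lines could be used; the invocation path and the iteration count are assumptions for the sketch, not taken from this ticket:

# Illustrative reproduction loop: run the test repeatedly while the machine
# is otherwise loaded, e.g. by a concurrent ./run-make-check.sh in another
# terminal. Paths and iteration count are assumptions, not from this ticket.
cd src || exit 1
for i in $(seq 1 50); do
    echo "=== osd-scrub-repair.sh iteration $i ==="
    bash test/osd/osd-scrub-repair.sh || { echo "FAILED on iteration $i"; exit 1; }
done
echo "no failure after 50 iterations"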
Updated by Kefu Chai over 7 years ago
- Status changed from In Progress to Resolved
Updated by Kefu Chai over 7 years ago
- Has duplicate deleted (Bug #17860: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()))
Updated by Kefu Chai over 7 years ago
- Related to Bug #17857: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) added
Updated by Loïc Dachary over 7 years ago
- Status changed from Resolved to 12
Updated by Loïc Dachary over 7 years ago
I propose to temporarily disable it while it is being worked on: https://github.com/ceph/ceph/pull/12058
Updated by Loïc Dachary over 7 years ago
For the record, over 25 failed make check runs were restarted because of the eio failure.
Updated by David Zafman over 7 years ago
More fixes, and the test re-enabled: https://github.com/ceph/ceph/pull/12072
Updated by Loïc Dachary over 7 years ago
https://jenkins.ceph.com/job/ceph-pull-requests/14906/console from https://github.com/ceph/ceph/pull/12061
It times out with no sign of misbehavior.
156/156 Test #137: osd-scrub-repair.sh .....................***Timeout 3600.02 sec
...
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1567: TEST_corrupt_scrub_erasure: test 1 = 1
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1569: TEST_corrupt_scrub_erasure: jq -r '.[0]' td/osd-scrub-repair/json
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1569: TEST_corrupt_scrub_erasure: test 2.0 = 2.0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1571: TEST_corrupt_scrub_erasure: rados list-inconsistent-obj 2.0
A previous run measured:
156/156 Test #137: osd-scrub-repair.sh ..................... Passed 811.17 sec
so it would have to be running on a machine that is more than four times slower (3600 / 811 ≈ 4.4) to hit the timeout. While this does not seem impossible, it does not seem likely.
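If this recurs, one way to check whether the rados call itself is what hangs is to probe it against the still-running test cluster with a timeout. This is an illustrative command only; the ceph.conf path is an assumption based on the td/osd-scrub-repair test directory seen in the trace above:

# Illustrative probe; the ceph.conf path is assumed from the test directory.
timeout 120 rados -c td/osd-scrub-repair/ceph.conf list-inconsistent-obj 2.0 > /tmp/inconsistent.json
echo "rados exit status: $?"   # 124 means the command itself hit the timeout
jq . /tmp/inconsistent.json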
Updated by Loïc Dachary over 7 years ago
https://jenkins.ceph.com/job/ceph-pull-requests/14911/console from https://github.com/ceph/ceph/pull/12081
times out at exactly the same place:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:504: TEST_corrupt_scrub_replicated: test 1 = 1
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:506: TEST_corrupt_scrub_replicated: jq -r '.[0]' td/osd-scrub-repair/json
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:506: TEST_corrupt_scrub_replicated: test 2.0 = 2.0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:508: TEST_corrupt_scrub_replicated: rados list-inconsistent-obj 2.0
In the logs:
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:897: objectstore_tool: local dir=td/osd-scrub-repair
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:898: objectstore_tool: shift
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:899: objectstore_tool: local id=1
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:900: objectstore_tool: shift
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:901: objectstore_tool: local osd_data=td/osd-scrub-repair/1
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:903: objectstore_tool: kill_daemons td/osd-scrub-repair TERM osd.1
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons: shopt -q -o xtrace
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1416: run_in_background: return 0
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1445: wait_background: for pid in '$pids'
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1446: wait_background: wait 14606
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1453: wait_background: eval 'pids='\'''\'''
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1453: wait_background: pids=
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1455: wait_background: return 0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:96: corrupt_and_repair_two: return_code=0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:97: corrupt_and_repair_two: '[' 0 -ne 0 ']'
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons: echo true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons: local trace=true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:254: kill_daemons: true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:254: kill_daemons: shopt -u -o xtrace
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:270: kill_daemons: return 0
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:905: objectstore_tool: ceph-objectstore-tool --data-path td/osd-scrub-repair/1 --journal-path td/osd-scrub-repair/1/jour
This does not look right, because
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1455: wait_background: return 0
even though
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons: echo true
appears later in the logs. That may just be because the output is buffered, though, and not because it actually happened afterwards.
https://github.com/ceph/ceph/pull/12183 should clarify this.
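For illustration, here is a minimal sketch (not the real run_in_background/wait_background from ceph-helpers.sh) of why lines from a pid-prefixed background job, like the "14607:" lines above, can show up in the merged log after parent lines that were produced later:

# Illustrative sketch only -- NOT the real helpers from ceph-helpers.sh.
# The child's xtrace goes through its own pipe (the sed that adds the pid
# prefix); that pipe's output is block-buffered when it is not a tty, so it
# may be flushed after the parent has already written later trace lines.
pids=""

run_in_background_sketch() {
    local tag=$1; shift
    { set -x; "$@"; } 2>&1 | sed "s/^/$tag: /" &
    pids="$pids $!"
}

wait_background_sketch() {
    local pid
    for pid in $pids; do
        wait "$pid" || return 1
    done
    pids=""
}

set -x
run_in_background_sketch 14607 sleep 1
# The parent's trace of the wait below can appear in the log before the
# child's earlier trace lines, which only show up once the pipe flushes.
wait_background_sketch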
Updated by Samuel Just over 7 years ago
- Priority changed from Immediate to Urgent
Updated by Greg Farnum almost 7 years ago
- Related to Bug #20242: Make osd-scrub-repair.sh unit test run faster added
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
David, do we have any idea why this is failing? I'm not getting any idea from what's in the comments here.
Updated by David Zafman almost 7 years ago
- Status changed from 12 to Can't reproduce
Haven't been seeing this at all, so I'm closing for now.