
Bug #17830

osd-scrub-repair.sh is failing (intermittently?) on Jenkins

Added by Greg Farnum almost 3 years ago. Updated about 2 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
11/08/2016
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

o.txt.gz (263 KB) Loic Dachary, 11/24/2016 03:52 PM

o2.txt.gz (305 KB) Loic Dachary, 11/24/2016 04:53 PM


Related issues

Related to Ceph - Bug #17857: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) Resolved 11/10/2016
Related to RADOS - Bug #20242: Make osd-scrub-repair.sh unit test run faster Resolved 06/09/2017

History

#1 Updated by Samuel Just almost 3 years ago

  • Assignee set to David Zafman

#2 Updated by David Zafman almost 3 years ago

This is caused by an OSD crashing; it is not related to any minor test changes.

0> 2016-11-04 13:29:30.532311 7f967ce8b700 -1 /home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h: In function 'ObjectContext::~ObjectContext()' thread 7f967ce8b700 time 2016-11-04 13:29:30.527794
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h: 4287: FAILED assert(rwstate.empty())
ceph version 11.0.2-1088-gfef0b21 (fef0b216957755f40d2c3796f56470e51629fb8d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9694c6871b]
2: (std::_Sp_counted_deleter<ObjectContext*, SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::Cleanup, std::allocator<int>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x1f9) [0x7f96947e18c9]
3: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x7f96945f0229]
4: (SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::clear()+0x152) [0x7f96947f1302]
5: (ReplicatedPG::on_change(ObjectStore::Transaction*)+0x56f) [0x7f96947bea2f]
6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x7c2) [0x7f96946f8d82]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f0) [0x7f96946f9dc0]
8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1fc) [0x7f9694734d0c]
9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f969471618b]
10: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd8) [0x7f9694716308]
11: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x499) [0x7f96946ee959]
12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2b9) [0x7f9694632a79]
13: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1bf) [0x7f969464c5ef]
14: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f9694697b72]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0x7f9694c6f1e5]
16: (ThreadPool::WorkThread::entry()+0x10) [0x7f9694c701b0]
17: (()+0x8182) [0x7f9692ab4182]
18: (clone()+0x6d) [0x7f96914234ad]

#3 Updated by David Zafman almost 3 years ago

  • Status changed from New to In Progress

Here is the portion of the test that ran. When osd.0 was taken down to run the ceph-objectstore-tool list-attrs, the primary, osd.1, was handling the advance map and crashed (a sketch of that stop/inspect/restart pattern follows the trace below).

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:187: TEST_auto_repair_erasure_coded: objectstore_tool testdir/osd-scrub-repair 2 SOMETHING remove
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:189: TEST_auto_repair_erasure_coded: get_pg ecpool SOMETHING
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:189: TEST_auto_repair_erasure_coded: local pgid=2.7
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:190: TEST_auto_repair_erasure_coded: get_last_scrub_stamp 2.7
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:190: TEST_auto_repair_erasure_coded: wait_for_scrub 2.7 '2016-11-04 13:29:14.921440'
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:191: TEST_auto_repair_erasure_coded: wait_for_clean
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: get_not_primary ecpool SOMETHING
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: objectstore_tool testdir/osd-scrub-repair 0 SOMETHING list-attrs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: return 1
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:31: run: return 1
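
For reference, a rough sketch of the stop/inspect/restart pattern the objectstore_tool helper performs, based on the ceph-helpers.sh trace quoted later in this ticket; the journal path and the activate_osd restart call are assumptions, not verbatim test code:

dir=testdir/osd-scrub-repair
id=0                                               # the non-primary OSD chosen by get_not_primary

kill_daemons "$dir" TERM osd.$id                   # stop osd.0 (ceph-helpers.sh helper)
ceph-objectstore-tool --data-path "$dir/$id" \
                      --journal-path "$dir/$id/journal" \
                      SOMETHING list-attrs          # inspect the object's attrs offline
activate_osd "$dir" "$id"                           # bring osd.0 back up

Stopping and restarting osd.0 is what forces the new map interval during which the primary, osd.1, hit the assert above.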

#4 Updated by Sage Weil almost 3 years ago

  • Duplicated by Bug #17860: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) added

#6 Updated by Loic Dachary almost 3 years ago

https://github.com/ceph/ceph/pull/11979/commits/8854cca4164f9184cc549ba0b90b44515933de8c disables osd-scrub-repair.sh until this is fixed, because it fails most of the time. To verify a fix is good, it is recommended to run the test in a loop on a machine that has a high load. For instance, running ./run-make-check.sh at the same time is enough to recreate conditions similar to Jenkins on a machine such as rex001; a sketch of that reproduction loop is below.
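
A minimal sketch of the loop, assuming the test is driven through ctest as in the timeout output quoted further down (the build directory and iteration count are arbitrary choices):

cd ceph/build                                      # assumed build tree location

# in another shell, keep the machine loaded, e.g.:
#   cd ceph && ./run-make-check.sh

for i in $(seq 1 50); do
    echo "=== iteration $i ==="
    ctest -R osd-scrub-repair --output-on-failure || { echo "failed on iteration $i"; break; }
done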

#7 Updated by Kefu Chai almost 3 years ago

  • Status changed from In Progress to Resolved

#8 Updated by Kefu Chai almost 3 years ago

  • Duplicated by deleted (Bug #17860: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()))

#9 Updated by Kefu Chai almost 3 years ago

  • Related to Bug #17857: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) added

#10 Updated by Loic Dachary almost 3 years ago

  • Status changed from Resolved to Verified

#14 Updated by Loic Dachary almost 3 years ago

I propose to temporarily disable it while it is worked on: https://github.com/ceph/ceph/pull/12058

#18 Updated by Loic Dachary almost 3 years ago

For the record, there have been over 25 occurrences of make check failing and being restarted because of the eio failure.

#19 Updated by David Zafman almost 3 years ago

More fixes, and the test re-enabled: https://github.com/ceph/ceph/pull/12072

#20 Updated by David Zafman almost 3 years ago

  • Status changed from Verified to Resolved

#21 Updated by Loic Dachary almost 3 years ago

  • Status changed from Resolved to Verified

#22 Updated by Loic Dachary almost 3 years ago

  • Status changed from Verified to Resolved

#23 Updated by Loic Dachary almost 3 years ago

  • File o.txt.gz added
  • Status changed from Resolved to Verified

https://jenkins.ceph.com/job/ceph-pull-requests/14906/console from https://github.com/ceph/ceph/pull/12061

It times out with no sign of misbehavior.

156/156 Test #137: osd-scrub-repair.sh .....................***Timeout 3600.02 sec
...
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1567: TEST_corrupt_scrub_erasure:  test 1 = 1
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1569: TEST_corrupt_scrub_erasure:  jq -r '.[0]' td/osd-scrub-repair/json
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1569: TEST_corrupt_scrub_erasure:  test 2.0 = 2.0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1571: TEST_corrupt_scrub_erasure:  rados list-inconsistent-obj 2.0

A previous run measured:

156/156 Test #137: osd-scrub-repair.sh .....................   Passed  811.17 sec

so it would have to be running on a machine that is more than four times slower (3600 / 811 ≈ 4.4) to time out. While this does not seem impossible, it does not seem likely.

#24 Updated by Loic Dachary almost 3 years ago

https://jenkins.ceph.com/job/ceph-pull-requests/14911/console from https://github.com/ceph/ceph/pull/12081

It times out at exactly the same place:

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:504: TEST_corrupt_scrub_replicated:  test 1 = 1
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:506: TEST_corrupt_scrub_replicated:  jq -r '.[0]' td/osd-scrub-repair/json
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:506: TEST_corrupt_scrub_replicated:  test 2.0 = 2.0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:508: TEST_corrupt_scrub_replicated:  rados list-inconsistent-obj 2.0

In the logs

14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:897: objectstore_tool:  local dir=td/osd-scrub-repair
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:898: objectstore_tool:  shift
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:899: objectstore_tool:  local id=1
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:900: objectstore_tool:  shift
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:901: objectstore_tool:  local osd_data=td/osd-scrub-repair/1
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:903: objectstore_tool:  kill_daemons td/osd-scrub-repair TERM osd.1
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  shopt -q -o xtrace
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1416: run_in_background:  return 0
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1445: wait_background:  for pid in '$pids'
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1446: wait_background:  wait 14606
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1453: wait_background:  eval 'pids='\'''\'''
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1453: wait_background:  pids=
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1455: wait_background:  return 0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:96: corrupt_and_repair_two:  return_code=0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:97: corrupt_and_repair_two:  '[' 0 -ne 0 ']'
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  echo true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  local trace=true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:254: kill_daemons:  true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:254: kill_daemons:  shopt -u -o xtrace
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:270: kill_daemons:  return 0
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:905: objectstore_tool:  ceph-objectstore-tool --data-path td/osd-scrub-repair/1 --journal-path td/osd-scrub-repair/1/jour

This does not look right, because

/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1455: wait_background:  return 0

even though
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  echo true

shows later in the logs. That may be because the output is buffered rather than because it actually happens afterwards, though; a small illustration of the effect follows below.

https://github.com/ceph/ceph/pull/12183 should clarify this.
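
A hypothetical illustration of that buffering effect (this is not ceph-helpers.sh itself; the function names are made up): when background output goes through a prefixing filter into the shared log file, the filter's stdio buffer can hold lines until it exits, so they land in the log after foreground lines that were actually printed later.

{
    run_bg() {
        echo "kill_daemons: echo true"       # produced early...
        echo "kill_daemons: return 0"
        sleep 2                              # ...but keeps the pipe open for a while
    }
    run_bg | sed 's/^/14607: /' &            # sed block-buffers when writing to a file
    bg=$!

    echo "wait_background: wait $bg"
    sleep 1
    echo "wait_background: return 0"         # reaches the log first
    wait "$bg"
} > combined.log 2>&1
cat combined.log                             # the "14607:" lines typically appear last

So the position of the prefixed lines in the Jenkins log does not prove they were executed after wait_background returned.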

#25 Updated by Samuel Just over 2 years ago

  • Priority changed from Immediate to Urgent

#26 Updated by Samuel Just over 2 years ago

  • Priority changed from Urgent to High

#27 Updated by Greg Farnum about 2 years ago

  • Related to Bug #20242: Make osd-scrub-repair.sh unit test run faster added

#28 Updated by Greg Farnum about 2 years ago

  • Project changed from Ceph to RADOS

David, do we have any idea why this is failing? I'm not getting any idea from what's in the comments here.

#29 Updated by David Zafman about 2 years ago

  • Status changed from Verified to Can't reproduce

Haven't been seeing this at all, so I'm closing for now.
