Bug #17830 (closed)

osd-scrub-repair.sh is failing (intermittently?) on Jenkins

Added by Greg Farnum over 7 years ago. Updated almost 7 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Files

o.txt.gz (263 KB), Loïc Dachary, 11/24/2016 03:52 PM
o2.txt.gz (305 KB), Loïc Dachary, 11/24/2016 04:53 PM

Related issues 2 (0 open, 2 closed)

Related to Ceph - Bug #17857: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) (Resolved, Kefu Chai, 11/10/2016)

Related to RADOS - Bug #20242: Make osd-scrub-repair.sh unit test run faster (Resolved, David Zafman, 06/09/2017)
Actions #1

Updated by Samuel Just over 7 years ago

  • Assignee set to David Zafman
Actions #2

Updated by David Zafman over 7 years ago

This is caused by an OSD crash; it is not related to any minor test changes. The assert fires in ObjectContext::~ObjectContext() because the context still holds entries in its rwstate (outstanding read/write locks) when SharedLRU::clear() drops the object context cache during ReplicatedPG::on_change(), as the backtrace below shows.

0> 2016-11-04 13:29:30.532311 7f967ce8b700 -1 /home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h: In function 'ObjectContext::~ObjectContext()' thread 7f967ce8b700 time 2016-11-04 13:29:30.527794
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h: 4287: FAILED assert(rwstate.empty())
ceph version 11.0.2-1088-gfef0b21 (fef0b216957755f40d2c3796f56470e51629fb8d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9694c6871b]
2: (std::_Sp_counted_deleter<ObjectContext*, SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::Cleanup, std::allocator<int>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x1f9) [0x7f96947e18c9]
3: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x7f96945f0229]
4: (SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::clear()+0x152) [0x7f96947f1302]
5: (ReplicatedPG::on_change(ObjectStore::Transaction*)+0x56f) [0x7f96947bea2f]
6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x7c2) [0x7f96946f8d82]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f0) [0x7f96946f9dc0]
8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1fc) [0x7f9694734d0c]
9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f969471618b]
10: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd8) [0x7f9694716308]
11: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x499) [0x7f96946ee959]
12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2b9) [0x7f9694632a79]
13: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1bf) [0x7f969464c5ef]
14: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f9694697b72]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0x7f9694c6f1e5]
16: (ThreadPool::WorkThread::entry()+0x10) [0x7f9694c701b0]
17: (()+0x8182) [0x7f9692ab4182]
18: (clone()+0x6d) [0x7f96914234ad]
Actions #3

Updated by David Zafman over 7 years ago

  • Status changed from New to In Progress

Here is the portion of the test that ran. When osd.0 was taken down so that ceph-objectstore-tool could run list-attrs, the primary (osd.1) was handling the advance map and crashed. (A sketch of the helper's stop/run/restart pattern follows the trace below.)

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:187: TEST_auto_repair_erasure_coded: objectstore_tool testdir/osd-scrub-repair 2 SOMETHING remove
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:189: TEST_auto_repair_erasure_coded: get_pg ecpool SOMETHING
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:189: TEST_auto_repair_erasure_coded: local pgid=2.7
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:190: TEST_auto_repair_erasure_coded: get_last_scrub_stamp 2.7
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:190: TEST_auto_repair_erasure_coded: wait_for_scrub 2.7 '2016-11-04 13:29:14.921440'
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:191: TEST_auto_repair_erasure_coded: wait_for_clean
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: get_not_primary ecpool SOMETHING
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: objectstore_tool testdir/osd-scrub-repair 0 SOMETHING list-attrs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: return 1
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:31: run: return 1
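
For reference, the objectstore_tool helper follows a stop/run/restart pattern: the OSD has to be down before ceph-objectstore-tool can open its object store, and it is brought back up afterwards. A simplified sketch, not the actual helper; kill_daemons is the ceph-helpers.sh name visible in the traces later in this ticket, while activate_osd and the journal path are assumptions:

objectstore_tool() {
    local dir=$1 ; shift     # e.g. testdir/osd-scrub-repair
    local id=$1 ; shift      # e.g. 0
    local osd_data=$dir/$id

    # The OSD must be stopped before the tool can open its object store.
    kill_daemons $dir TERM osd.$id || return 1
    ceph-objectstore-tool \
        --data-path $osd_data \
        --journal-path $osd_data/journal \
        "$@" || return 1
    activate_osd $dir $id || return 1    # bring the OSD back up
}

Invoked as in the trace above, e.g. objectstore_tool testdir/osd-scrub-repair 0 SOMETHING list-attrs.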

Actions #4

Updated by Sage Weil over 7 years ago

  • Has duplicate Bug #17860: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) added
Actions #6

Updated by Loïc Dachary over 7 years ago

https://github.com/ceph/ceph/pull/11979/commits/8854cca4164f9184cc549ba0b90b44515933de8c disables osd-scrub-repair.sh until this is fixed, because it fails most of the time. To verify that a fix is good, it is recommended to run the test in a loop on a machine under high load; for instance, running ./run-make-check.sh at the same time is enough to recreate conditions similar to Jenkins on a machine such as rex001. A minimal version of that loop is sketched below.
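
A minimal way to follow that advice, assuming two terminals on the loaded machine (the direct test invocation and paths are illustrative, based on the log paths above):

# Terminal 1: generate background load, as suggested above.
./run-make-check.sh

# Terminal 2: run the test in a loop until it fails.
cd src
while bash test/osd/osd-scrub-repair.sh ; do
    echo "passed, retrying under load"
done
echo "failure reproduced"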

Actions #7

Updated by Kefu Chai over 7 years ago

  • Status changed from In Progress to Resolved
Actions #8

Updated by Kefu Chai over 7 years ago

  • Has duplicate deleted (Bug #17860: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()))
Actions #9

Updated by Kefu Chai over 7 years ago

  • Related to Bug #17857: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) added
Actions #14

Updated by Loïc Dachary over 7 years ago

I propose to temporarily disable it while it is worked on: https://github.com/ceph/ceph/pull/12058

Actions #18

Updated by Loïc Dachary over 7 years ago

For the record, more than 25 failed make check runs were restarted because of the EIO failure.

Actions #19

Updated by David Zafman over 7 years ago

More fixes, and the test is re-enabled: https://github.com/ceph/ceph/pull/12072

Actions #20

Updated by David Zafman over 7 years ago

  • Status changed from 12 to Resolved
Actions #21

Updated by Loïc Dachary over 7 years ago

  • Status changed from Resolved to 12
Actions #22

Updated by Loïc Dachary over 7 years ago

  • Status changed from 12 to Resolved
Actions #23

Updated by Loïc Dachary over 7 years ago

https://jenkins.ceph.com/job/ceph-pull-requests/14906/console from https://github.com/ceph/ceph/pull/12061

It times out with no sign of misbehavior.

156/156 Test #137: osd-scrub-repair.sh .....................***Timeout 3600.02 sec
...
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1567: TEST_corrupt_scrub_erasure:  test 1 = 1
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1569: TEST_corrupt_scrub_erasure:  jq -r '.[0]' td/osd-scrub-repair/json
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1569: TEST_corrupt_scrub_erasure:  test 2.0 = 2.0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1571: TEST_corrupt_scrub_erasure:  rados list-inconsistent-obj 2.0

A previous run measured:

156/156 Test #137: osd-scrub-repair.sh .....................   Passed  811.17 sec

so to hit the one-hour timeout it would have to be running on a machine roughly 4.4 times slower (3600.02 / 811.17 ≈ 4.4). While this does not seem impossible, it does not seem likely.

Actions #24

Updated by Loïc Dachary over 7 years ago

https://jenkins.ceph.com/job/ceph-pull-requests/14911/console from https://github.com/ceph/ceph/pull/12081

It times out at exactly the same place:

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:504: TEST_corrupt_scrub_replicated:  test 1 = 1
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:506: TEST_corrupt_scrub_replicated:  jq -r '.[0]' td/osd-scrub-repair/json
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:506: TEST_corrupt_scrub_replicated:  test 2.0 = 2.0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:508: TEST_corrupt_scrub_replicated:  rados list-inconsistent-obj 2.0

In the logs

14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:897: objectstore_tool:  local dir=td/osd-scrub-repair
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:898: objectstore_tool:  shift
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:899: objectstore_tool:  local id=1
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:900: objectstore_tool:  shift
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:901: objectstore_tool:  local osd_data=td/osd-scrub-repair/1
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:903: objectstore_tool:  kill_daemons td/osd-scrub-repair TERM osd.1
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  shopt -q -o xtrace
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1416: run_in_background:  return 0
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1445: wait_background:  for pid in '$pids'
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1446: wait_background:  wait 14606
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1453: wait_background:  eval 'pids='\'''\'''
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1453: wait_background:  pids=
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1455: wait_background:  return 0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:96: corrupt_and_repair_two:  return_code=0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:97: corrupt_and_repair_two:  '[' 0 -ne 0 ']'
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  echo true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  local trace=true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:254: kill_daemons:  true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:254: kill_daemons:  shopt -u -o xtrace
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:270: kill_daemons:  return 0
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:905: objectstore_tool:  ceph-objectstore-tool --data-path td/osd-scrub-repair/1 --journal-path td/osd-scrub-repair/1/jour

This does not look right, because

/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1455: wait_background:  return 0

even though
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  echo true

shows up later in the logs. That may be because the output is buffered, though, rather than because it actually happened afterwards.

https://github.com/ceph/ceph/pull/12183 should clarify this.
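
To see why buffering alone can produce that ordering, here is a self-contained demo (nothing Ceph-specific; python3 is used only as a conveniently block-buffered writer): a block-buffered process's lines reach a shared log only when its buffer flushes, so they can land after lines from events that actually happened later.

log=$(mktemp)

# Writer A prints first, but its stdout is a file (not a tty), so it is
# block-buffered and only flushes when the process exits a second later.
python3 -c 'import time; print("A: happened first"); time.sleep(1)' >>"$log" &

sleep 0.2
echo "B: happened second" >>"$log"   # the shell builtin writes immediately

wait
cat "$log"   # usually shows B before A, despite the real event order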

Actions #25

Updated by Samuel Just over 7 years ago

  • Priority changed from Immediate to Urgent
Actions #26

Updated by Samuel Just over 7 years ago

  • Priority changed from Urgent to High
Actions #27

Updated by Greg Farnum almost 7 years ago

  • Related to Bug #20242: Make osd-scrub-repair.sh unit test run faster added
Actions #28

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS

David, do we have any idea why this is failing? I'm not getting any ideas from what's in the comments here.

Actions #29

Updated by David Zafman almost 7 years ago

  • Status changed from 12 to Can't reproduce

Haven't been seeing this at all, so I'm closing for now.
