
Bug #17830

osd-scrub-repair.sh is failing (intermittently?) on Jenkins

Added by Greg Farnum almost 3 years ago. Updated about 2 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
11/08/2016
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

o.txt.gz (263 KB) Loic Dachary, 11/24/2016 03:52 PM

o2.txt.gz (305 KB) Loic Dachary, 11/24/2016 04:53 PM


Related issues

Related to Ceph - Bug #17857: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) Resolved 11/10/2016
Related to RADOS - Bug #20242: Make osd-scrub-repair.sh unit test run faster Resolved 06/09/2017

History

#1 Updated by Samuel Just almost 3 years ago

  • Assignee set to David Zafman

#2 Updated by David Zafman almost 3 years ago

This is caused by an OSD crashing; it is not related to any minor test changes.

0> 2016-11-04 13:29:30.532311 7f967ce8b700 -1 /home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h: In function 'ObjectContext::~ObjectContext()' thread 7f967ce8b700 time 2016-11-04 13:29:30.527794
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h: 4287: FAILED assert(rwstate.empty())
ceph version 11.0.2-1088-gfef0b21 (fef0b216957755f40d2c3796f56470e51629fb8d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9694c6871b]
2: (std::_Sp_counted_deleter<ObjectContext*, SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::Cleanup, std::allocator<int>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x1f9) [0x7f96947e18c9]
3: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x39) [0x7f96945f0229]
4: (SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::clear()+0x152) [0x7f96947f1302]
5: (ReplicatedPG::on_change(ObjectStore::Transaction*)+0x56f) [0x7f96947bea2f]
6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x7c2) [0x7f96946f8d82]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x4f0) [0x7f96946f9dc0]
8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1fc) [0x7f9694734d0c]
9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x7f969471618b]
10: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd8) [0x7f9694716308]
11: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x499) [0x7f96946ee959]
12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2b9) [0x7f9694632a79]
13: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1bf) [0x7f969464c5ef]
14: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f9694697b72]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0x7f9694c6f1e5]
16: (ThreadPool::WorkThread::entry()+0x10) [0x7f9694c701b0]
17: (()+0x8182) [0x7f9692ab4182]
18: (clone()+0x6d) [0x7f96914234ad]

#3 Updated by David Zafman almost 3 years ago

  • Status changed from New to In Progress

Here is the portion of the test that ran. When osd.0 was taken down to run the ceph-objectstore-tool list-attrs, the primary, osd.1, was handling the advance map and crashed (a sketch of that stop/inspect/restart pattern follows the trace below).

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:187: TEST_auto_repair_erasure_coded: objectstore_tool testdir/osd-scrub-repair 2 SOMETHING remove
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:189: TEST_auto_repair_erasure_coded: get_pg ecpool SOMETHING
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:189: TEST_auto_repair_erasure_coded: local pgid=2.7
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:190: TEST_auto_repair_erasure_coded: get_last_scrub_stamp 2.7
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:190: TEST_auto_repair_erasure_coded: wait_for_scrub 2.7 '2016-11-04 13:29:14.921440'
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:191: TEST_auto_repair_erasure_coded: wait_for_clean
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: get_not_primary ecpool SOMETHING
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: objectstore_tool testdir/osd-scrub-repair 0 SOMETHING list-attrs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:193: TEST_auto_repair_erasure_coded: return 1
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:31: run: return 1
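
For reference, a rough sketch of the stop/inspect/restart pattern the objectstore_tool helper performs, based on the ceph-helpers.sh trace quoted later in this ticket; the journal path and the activate_osd restart call are assumptions, not verbatim test code:

dir=testdir/osd-scrub-repair
id=0                                               # the non-primary OSD chosen by get_not_primary

kill_daemons "$dir" TERM osd.$id                   # stop osd.0 (ceph-helpers.sh helper)
ceph-objectstore-tool --data-path "$dir/$id" \
                      --journal-path "$dir/$id/journal" \
                      SOMETHING list-attrs          # inspect the object's attrs offline
activate_osd "$dir" "$id"                           # bring osd.0 back up

Stopping and restarting osd.0 is what forces the new map interval during which the primary, osd.1, hit the assert above.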

#4 Updated by Sage Weil almost 3 years ago

  • Duplicated by Bug #17860: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) added

#6 Updated by Loic Dachary almost 3 years ago

https://github.com/ceph/ceph/pull/11979/commits/8854cca4164f9184cc549ba0b90b44515933de8c disables osd-scrub-repair.sh until this is fixed, because it fails most of the time. To verify a fix is good, it is recommended to run the test in a loop on a machine that has a high load. For instance, running ./run-make-check.sh at the same time is enough to recreate conditions similar to Jenkins on a machine such as rex001; a sketch of that reproduction loop is below.
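
A minimal sketch of the loop, assuming the test is driven through ctest as in the timeout output quoted further down (the build directory and iteration count are arbitrary choices):

cd ceph/build                                      # assumed build tree location

# in another shell, keep the machine loaded, e.g.:
#   cd ceph && ./run-make-check.sh

for i in $(seq 1 50); do
    echo "=== iteration $i ==="
    ctest -R osd-scrub-repair --output-on-failure || { echo "failed on iteration $i"; break; }
done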

#7 Updated by Kefu Chai almost 3 years ago

  • Status changed from In Progress to Resolved

#8 Updated by Kefu Chai almost 3 years ago

  • Duplicated by deleted (Bug #17860: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()))

#9 Updated by Kefu Chai almost 3 years ago

  • Related to Bug #17857: osd/osd_types.h: 4287: FAILED assert(rwstate.empty()) added

#10 Updated by Loic Dachary almost 3 years ago

  • Status changed from Resolved to Verified

#14 Updated by Loic Dachary almost 3 years ago

I propose to temporarily disable it while it is worked on: https://github.com/ceph/ceph/pull/12058

#18 Updated by Loic Dachary almost 3 years ago

For the record, there have been over 25 occurrences of make check failing and being restarted because of the eio failure.

#19 Updated by David Zafman almost 3 years ago

More fixes, and the test re-enabled: https://github.com/ceph/ceph/pull/12072

#20 Updated by David Zafman almost 3 years ago

  • Status changed from Verified to Resolved

#21 Updated by Loic Dachary almost 3 years ago

  • Status changed from Resolved to Verified

#22 Updated by Loic Dachary almost 3 years ago

  • Status changed from Verified to Resolved

#23 Updated by Loic Dachary almost 3 years ago

  • File o.txt.gz added
  • Status changed from Resolved to Verified

https://jenkins.ceph.com/job/ceph-pull-requests/14906/console from https://github.com/ceph/ceph/pull/12061

It times out with no sign of misbehavior.

156/156 Test #137: osd-scrub-repair.sh .....................***Timeout 3600.02 sec
...
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1567: TEST_corrupt_scrub_erasure:  test 1 = 1
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1569: TEST_corrupt_scrub_erasure:  jq -r '.[0]' td/osd-scrub-repair/json
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1569: TEST_corrupt_scrub_erasure:  test 2.0 = 2.0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:1571: TEST_corrupt_scrub_erasure:  rados list-inconsistent-obj 2.0

A previous run measured:

156/156 Test #137: osd-scrub-repair.sh .....................   Passed  811.17 sec

so it would have to be running on a machine that is more than four times slower (3600 / 811 ≈ 4.4) to time out. While this does not seem impossible, it does not seem likely.

#24 Updated by Loic Dachary almost 3 years ago

https://jenkins.ceph.com/job/ceph-pull-requests/14911/console from https://github.com/ceph/ceph/pull/12081

It times out at exactly the same place:

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:504: TEST_corrupt_scrub_replicated:  test 1 = 1
//home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:506: TEST_corrupt_scrub_replicated:  jq -r '.[0]' td/osd-scrub-repair/json
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:506: TEST_corrupt_scrub_replicated:  test 2.0 = 2.0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:508: TEST_corrupt_scrub_replicated:  rados list-inconsistent-obj 2.0

In the logs

14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:897: objectstore_tool:  local dir=td/osd-scrub-repair
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:898: objectstore_tool:  shift
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:899: objectstore_tool:  local id=1
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:900: objectstore_tool:  shift
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:901: objectstore_tool:  local osd_data=td/osd-scrub-repair/1
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:903: objectstore_tool:  kill_daemons td/osd-scrub-repair TERM osd.1
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  shopt -q -o xtrace
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1416: run_in_background:  return 0
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1445: wait_background:  for pid in '$pids'
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1446: wait_background:  wait 14606
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1453: wait_background:  eval 'pids='\'''\'''
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1453: wait_background:  pids=
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1455: wait_background:  return 0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:96: corrupt_and_repair_two:  return_code=0
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/osd/osd-scrub-repair.sh:97: corrupt_and_repair_two:  '[' 0 -ne 0 ']'
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  echo true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  local trace=true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:254: kill_daemons:  true
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:254: kill_daemons:  shopt -u -o xtrace
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:270: kill_daemons:  return 0
14607: /home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:905: objectstore_tool:  ceph-objectstore-tool --data-path td/osd-scrub-repair/1 --journal-path td/osd-scrub-repair/1/jour

This does not look right, because

/home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:1455: wait_background:  return 0

even though
14607: //home/jenkins-build/build/workspace/ceph-pull-requests/qa/workunits/ceph-helpers.sh:253: kill_daemons:  echo true

shows later in the logs. That may be because the output is buffered rather than because it actually happens afterwards, though; a small illustration of the effect follows below.

https://github.com/ceph/ceph/pull/12183 should clarify this.
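
A hypothetical illustration of that buffering effect (this is not ceph-helpers.sh itself; the function names are made up): when background output goes through a prefixing filter into the shared log file, the filter's stdio buffer can hold lines until it exits, so they land in the log after foreground lines that were actually printed later.

{
    run_bg() {
        echo "kill_daemons: echo true"       # produced early...
        echo "kill_daemons: return 0"
        sleep 2                              # ...but keeps the pipe open for a while
    }
    run_bg | sed 's/^/14607: /' &            # sed block-buffers when writing to a file
    bg=$!

    echo "wait_background: wait $bg"
    sleep 1
    echo "wait_background: return 0"         # reaches the log first
    wait "$bg"
} > combined.log 2>&1
cat combined.log                             # the "14607:" lines typically appear last

So the position of the prefixed lines in the Jenkins log does not prove they were executed after wait_background returned.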

#25 Updated by Samuel Just over 2 years ago

  • Priority changed from Immediate to Urgent

#26 Updated by Samuel Just over 2 years ago

  • Priority changed from Urgent to High

#27 Updated by Greg Farnum about 2 years ago

  • Related to Bug #20242: Make osd-scrub-repair.sh unit test run faster added

#28 Updated by Greg Farnum about 2 years ago

  • Project changed from Ceph to RADOS

David, do we have any idea why this is failing? I'm not getting any idea from what's in the comments here.

#29 Updated by David Zafman about 2 years ago

  • Status changed from Verified to Can't reproduce

Haven't been seeing this at all, so I'm closing for now.
