Bug #16641

hammer-backports: OSD crash in *lost-unfound* tests with msgr-failures

Added by Samuel Just almost 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-07-03T19:55:06.259 INFO:teuthology.orchestra.run.smithi066:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight'
2016-07-03T19:55:06.648 INFO:teuthology.orchestra.run.smithi066.stderr:no valid command found; 10 closest matches:
2016-07-03T19:55:06.648 INFO:teuthology.orchestra.run.smithi066.stderr:config set <var> <val> [<val>...]
2016-07-03T19:55:06.649 INFO:teuthology.orchestra.run.smithi066.stderr:version
2016-07-03T19:55:06.649 INFO:teuthology.orchestra.run.smithi066.stderr:git_version
2016-07-03T19:55:06.649 INFO:teuthology.orchestra.run.smithi066.stderr:help
2016-07-03T19:55:06.649 INFO:teuthology.orchestra.run.smithi066.stderr:config show
2016-07-03T19:55:06.649 INFO:teuthology.orchestra.run.smithi066.stderr:get_command_descriptions
2016-07-03T19:55:06.650 INFO:teuthology.orchestra.run.smithi066.stderr:config get <var>
2016-07-03T19:55:06.650 INFO:teuthology.orchestra.run.smithi066.stderr:perfcounters_dump
2016-07-03T19:55:06.650 INFO:teuthology.orchestra.run.smithi066.stderr:2
2016-07-03T19:55:06.650 INFO:teuthology.orchestra.run.smithi066.stderr:config diff
2016-07-03T19:55:06.650 INFO:teuthology.orchestra.run.smithi066.stderr:admin_socket: invalid command
2016-07-03T19:55:06.654 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-0, ['dump_ops_in_flight']

sjust@teuthology:/a/smithfarm-2016-07-02_01:06:27-rados-hammer-backports---basic-smithi/288070
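The "invalid command" reply followed by "waiting on admin_socket for osd-0" is teuthology polling the daemon's admin socket until it answers. A minimal sketch of that retry pattern (hypothetical helper names, not the actual ceph_manager code):

```python
def make_restarting_daemon(polls_until_ready):
    """Fake admin socket for a daemon that is still initializing.

    For the first few polls it answers the way a freshly (re)started
    ceph-osd does for commands like dump_ops_in_flight -- only a minimal
    command set is registered, so richer commands are rejected.
    """
    state = {"polls": 0}

    def run_command():
        state["polls"] += 1
        if state["polls"] <= polls_until_ready:
            return "admin_socket: invalid command"
        return '{"num_ops": 0, "ops": []}'

    return run_command


def wait_on_admin_socket(run_command, attempts=10):
    """Poll an admin-socket command until the daemon accepts it."""
    for _ in range(attempts):
        out = run_command()
        if out != "admin_socket: invalid command":
            return out
    raise RuntimeError("daemon did not register the command in time")
```

So a transient "invalid command" during a restart is expected; only a daemon that never starts answering indicates a problem.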

Actions #1

Updated by David Zafman almost 8 years ago

  • Project changed from 23 to Ceph
Actions #2

Updated by David Zafman almost 8 years ago

  • Is duplicate of Backport #16225: hammer: SIGABRT in TrackedOp::dump() via dump_ops_in_flight() added
Actions #3

Updated by Nathan Cutler almost 8 years ago

David, in this job the command "ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight" returns "admin_socket: invalid command" and immediately afterwards osd.1 aborts with "ceph-osd: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:162: T* boost::intrusive_ptr<T>::operator->() const [with T = Connection]: Assertion `px != 0' failed."

This intrusive_ptr assertion is something we're seeing in multiple jobs in the current hammer_backports integration round.

Can you explain how this duplicates #16225?

Actions #4

Updated by Nathan Cutler almost 8 years ago

Another oddity is that the "admin_socket: invalid command" occurs at 2016-07-03T19:58:26.213, then there is this message:

2016-07-03T19:58:26.704 INFO:tasks.ceph.osd.0.smithi066.stderr:2016-07-04 02:58:27.209311 7f6f82919700 -1 osd.0 27 heartbeat_check: no reply from osd.1 since back 2016-07-04 02:58:04.691817 front 2016-07-04 02:58:04.691817 (cutoff 2016-07-04 02:58:07.209309)

(which would appear to indicate that osd.0 heartbeat_check is several hours in the future?)

and then the intrusive_ptr assertion happens on osd.1.
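The apparent jump "into the future" is consistent with a plain timezone mismatch rather than clock drift: teuthology prefixes lines with its own local time while the daemon embeds the remote node's clock. A quick check of the two timestamps above (helper name hypothetical):

```python
from datetime import datetime


def offset_hours(teuthology_ts, daemon_ts):
    """Hours between a teuthology line prefix and a daemon-embedded timestamp.

    If the lab host and the test node simply log in different timezones
    (e.g. UTC-7 locally vs. UTC on the node), the daemon appears to run
    hours ahead even though both clocks agree.
    """
    a = datetime.strptime(teuthology_ts, "%Y-%m-%dT%H:%M:%S.%f")
    b = datetime.strptime(daemon_ts, "%Y-%m-%d %H:%M:%S.%f")
    return (b - a).total_seconds() / 3600.0
```

For the heartbeat_check line above, `offset_hours("2016-07-03T19:58:26.704", "2016-07-04 02:58:27.209311")` comes out at almost exactly 7 hours.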

Actions #5

Updated by Nathan Cutler almost 8 years ago

Also, the intrusive_ptr assertion stacktrace in this bug starts in ThreadPool::WorkThread::entry() (and AdminSocket is not involved), while in #16225 it starts in AdminSocket::entry().

Actions #6

Updated by David Zafman almost 8 years ago

  • Is duplicate of deleted (Backport #16225: hammer: SIGABRT in TrackedOp::dump() via dump_ops_in_flight())
Actions #7

Updated by David Zafman almost 8 years ago

I'm convinced now that this isn't a duplicate of #16225. The heartbeat checks and admin socket failures would be caused by OSD crashes.

I've noticed that teuthology timestamps have always been skewed from the wall clock. I would have expected them to be in GMT, and therefore 7 or 8 hours ahead, but they are behind instead, which is strange.

How did you determine it was an intrusive_ptr crash? The stack trace I see is:


 ceph version 0.94.7-122-gf222b64 (f222b64febaaef8c5077f4ffc243598764bb8627)
 1: ceph-osd() [0xab84ea]
 2: (()+0x10340) [0x7fa760f37340]
 3: (gsignal()+0x39) [0x7fa75f3d8cc9]
 4: (abort()+0x148) [0x7fa75f3dc0d8]
 5: (()+0x2fb86) [0x7fa75f3d1b86]
 6: (()+0x2fc32) [0x7fa75f3d1c32]
 7: ceph-osd() [0x7a9763]
 8: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x9df) [0x7e4a2f]
 9: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1f4) [0x820fb4]
 10: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x80c4eb]
 11: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1ce) [0x7b918e]
 12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c0) [0x6ab900]
 13: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x704b78]
 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xba08a6]
 15: (ThreadPool::WorkThread::entry()+0x10) [0xba1950]
 16: (()+0x8182) [0x7fa760f2f182]
 17: (clone()+0x6d) [0x7fa75f49c47d]

Why does the log have blank output every other line? Is there some logging change that could explain the output and crashes?

2016-07-04 02:58:29.948616 7fa755a0c700 20 journal do_write queueing finishers through seq 255
2016-07-04 02:58:29.948616 7fa755a0c700 20
2016-07-04 02:58:29.948619 7fa755a0c700 10 journal queue_completions_thru seq 255 queueing seq 255 0x42ae0e0 lat 0.000461
2016-07-04 02:58:29.948619 7fa755a0c700 10
2016-07-04 02:58:29.948628 7fa755a0c700  5 journal put_throttle finished 1 ops and 956 bytes, now 0 ops and 0 bytes
2016-07-04 02:58:29.948628 7fa755a0c700  5
2016-07-04 02:58:29.948633 7fa755a0c700 20 journal write_thread_entry going to sleep
2016-07-04 02:58:29.948633 7fa755a0c700 20
2016-07-04 02:58:29.950637 7fa747895700 -1 *** Caught signal (Aborted) **
 in thread 7fa747895700
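The every-other-line blanks would be explained by a stray trailing newline on each log message. A toy illustration (not the actual Ceph logging code):

```python
def render_log(entries, tid="7fa755a0c700", level="20"):
    """Toy renderer: prefix each line of each message with timestamp/tid/level.

    If a message accidentally carries a trailing "\n" (as a bad backport
    could introduce), the embedded newline splits it into the real payload
    plus an empty continuation, and the prefixing turns that empty half
    into a timestamp-only line -- the alternating blank pattern above.
    """
    out = []
    for ts, msg in entries:
        for part in msg.split("\n"):
            out.append(f"{ts} {tid} {level} {part}".rstrip())
    return out
```

One entry with a trailing newline thus renders as two lines sharing the same timestamp and thread id, the second one empty after the log level.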
Actions #8

Updated by Nathan Cutler almost 8 years ago

  • Subject changed from hammer: dump_ops_in_flight apparently failing on hammer? to hammer-backports: OSD crash in ec-lost-unfound test with msgr-failures/many.yaml

@David: Regarding the newline in the logs, that was caused by an error in one of the backports, which is now fixed.

I've been looking at this tonight and here's what I've ascertained:

1. it's reproducible in the current hammer-backports branch by running e.g. --suite rados --suite-branch hammer --ceph hammer-backports --filter "rados/singleton/{all/ec-lost-unfound.yaml fs/xfs.yaml msgr-failures/many.yaml}"

2. it has nothing to do with intrusive_ptr (that regression has since been fixed in hammer-backports)

3. it has nothing to do with the "admin_socket: invalid command dump_ops_in_flight" message; that is just part of the OSD restart logic

4. what appears to happen is that the OSD crashes shortly after being restarted. For example in the "rados/singleton/{all/ec-lost-unfound.yaml fs/xfs.yaml msgr-failures/many.yaml}" test the OSD crashes after the "ceph -- pg 1.b list_missing '{}'" command is issued.

The crash dump is:

2016-07-23T17:28:28.151 INFO:tasks.ec_lost_unfound:listing missing/lost in 1.b state active+recovering+undersized+degraded
2016-07-23T17:28:28.151 INFO:teuthology.orchestra.run.target158069068184:Running: u"adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -- pg 1.b list_missing '{}'" 
2016-07-23T17:28:28.201 INFO:tasks.ceph.osd.0.target158069068184.stderr:osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::op_applied(const eversion_t&)' thread 7fac01bab700 time 2016-07-23 17:28:28.218203
2016-07-23T17:28:28.201 INFO:tasks.ceph.osd.0.target158069068184.stderr:osd/ReplicatedPG.cc: 7321: FAILED assert(applied_version <= info.last_update)
2016-07-23T17:28:28.201 INFO:tasks.ceph.osd.0.target158069068184.stderr: ceph version 0.94.7-159-g2ee8cd6 (2ee8cd65a68e1b799d1bfef309cd07a63e3d55da)
2016-07-23T17:28:28.201 INFO:tasks.ceph.osd.0.target158069068184.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbdc715]
2016-07-23T17:28:28.201 INFO:tasks.ceph.osd.0.target158069068184.stderr: 2: (ReplicatedPG::op_applied(eversion_t const&)+0x3db) [0x83cb3b]
2016-07-23T17:28:28.201 INFO:tasks.ceph.osd.0.target158069068184.stderr: 3: (ECBackend::sub_write_applied(unsigned long, eversion_t)+0x3e) [0xa39f6e]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 4: (SubWriteApplied::finish(int)+0x6d) [0xa4e5ad]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 5: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 6: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af534]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 7: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 8: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x70cddc]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 9: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6c8fe9]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 10: (Finisher::finisher_thread_entry()+0x168) [0xb01e28]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 11: (()+0x7dc5) [0x7fac0e541dc5]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: 12: (clone()+0x6d) [0x7fac0d024ced]
2016-07-23T17:28:28.202 INFO:tasks.ceph.osd.0.target158069068184.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr:2016-07-23 17:28:28.221429 7fac01bab700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::op_applied(const eversion_t&)' thread 7fac01bab700 time 2016-07-23 17:28:28.218203
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr:osd/ReplicatedPG.cc: 7321: FAILED assert(applied_version <= info.last_update)
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr:
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr: ceph version 0.94.7-159-g2ee8cd6 (2ee8cd65a68e1b799d1bfef309cd07a63e3d55da)
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbdc715]
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr: 2: (ReplicatedPG::op_applied(eversion_t const&)+0x3db) [0x83cb3b]
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr: 3: (ECBackend::sub_write_applied(unsigned long, eversion_t)+0x3e) [0xa39f6e]
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr: 4: (SubWriteApplied::finish(int)+0x6d) [0xa4e5ad]
2016-07-23T17:28:28.203 INFO:tasks.ceph.osd.0.target158069068184.stderr: 5: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr: 6: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af534]
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr: 7: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr: 8: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x70cddc]
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr: 9: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6c8fe9]
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr: 10: (Finisher::finisher_thread_entry()+0x168) [0xb01e28]
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr: 11: (()+0x7dc5) [0x7fac0e541dc5]
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr: 12: (clone()+0x6d) [0x7fac0d024ced]
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-07-23T17:28:28.204 INFO:tasks.ceph.osd.0.target158069068184.stderr:
2016-07-23T17:28:28.311 INFO:tasks.ceph.osd.0.target158069068184.stderr:     0> 2016-07-23 17:28:28.221429 7fac01bab700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::op_applied(const eversion_t&)' thread 7fac01bab700 time 2016-07-23 17:28:28.218203
2016-07-23T17:28:28.312 INFO:tasks.ceph.osd.0.target158069068184.stderr:osd/ReplicatedPG.cc: 7321: FAILED assert(applied_version <= info.last_update)
2016-07-23T17:28:28.312 INFO:tasks.ceph.osd.0.target158069068184.stderr:
2016-07-23T17:28:28.312 INFO:tasks.ceph.osd.0.target158069068184.stderr: ceph version 0.94.7-159-g2ee8cd6 (2ee8cd65a68e1b799d1bfef309cd07a63e3d55da)
2016-07-23T17:28:28.312 INFO:tasks.ceph.osd.0.target158069068184.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbdc715]
2016-07-23T17:28:28.312 INFO:tasks.ceph.osd.0.target158069068184.stderr: 2: (ReplicatedPG::op_applied(eversion_t const&)+0x3db) [0x83cb3b]
2016-07-23T17:28:28.312 INFO:tasks.ceph.osd.0.target158069068184.stderr: 3: (ECBackend::sub_write_applied(unsigned long, eversion_t)+0x3e) [0xa39f6e]
2016-07-23T17:28:28.312 INFO:tasks.ceph.osd.0.target158069068184.stderr: 4: (SubWriteApplied::finish(int)+0x6d) [0xa4e5ad]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: 5: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: 6: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af534]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: 7: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: 8: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x70cddc]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: 9: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6c8fe9]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: 10: (Finisher::finisher_thread_entry()+0x168) [0xb01e28]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: 11: (()+0x7dc5) [0x7fac0e541dc5]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: 12: (clone()+0x6d) [0x7fac0d024ced]
2016-07-23T17:28:28.313 INFO:tasks.ceph.osd.0.target158069068184.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-07-23T17:28:28.314 INFO:tasks.ceph.osd.0.target158069068184.stderr:
2016-07-23T17:28:28.314 INFO:tasks.ceph.osd.0.target158069068184.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
2016-07-23T17:28:28.314 INFO:tasks.ceph.osd.0.target158069068184.stderr:*** Caught signal (Aborted) **
2016-07-23T17:28:28.314 INFO:tasks.ceph.osd.0.target158069068184.stderr: in thread 7fac01bab700
2016-07-23T17:28:28.314 INFO:tasks.ceph.osd.0.target158069068184.stderr: ceph version 0.94.7-159-g2ee8cd6 (2ee8cd65a68e1b799d1bfef309cd07a63e3d55da)
2016-07-23T17:28:28.314 INFO:tasks.ceph.osd.0.target158069068184.stderr: 1: ceph-osd() [0xadc6f2]
2016-07-23T17:28:28.314 INFO:tasks.ceph.osd.0.target158069068184.stderr: 2: (()+0xf100) [0x7fac0e549100]
2016-07-23T17:28:28.315 INFO:tasks.ceph.osd.0.target158069068184.stderr: 3: (gsignal()+0x37) [0x7fac0cf635f7]
2016-07-23T17:28:28.315 INFO:tasks.ceph.osd.0.target158069068184.stderr: 4: (abort()+0x148) [0x7fac0cf64ce8]
2016-07-23T17:28:28.316 INFO:tasks.ceph.osd.0.target158069068184.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fac0d8689d5]
2016-07-23T17:28:28.316 INFO:tasks.ceph.osd.0.target158069068184.stderr: 6: (()+0x5e946) [0x7fac0d866946]
2016-07-23T17:28:28.316 INFO:tasks.ceph.osd.0.target158069068184.stderr: 7: (()+0x5e973) [0x7fac0d866973]
2016-07-23T17:28:28.316 INFO:tasks.ceph.osd.0.target158069068184.stderr: 8: (()+0x5eb93) [0x7fac0d866b93]
2016-07-23T17:28:28.316 INFO:tasks.ceph.osd.0.target158069068184.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbdc90a]
2016-07-23T17:28:28.316 INFO:tasks.ceph.osd.0.target158069068184.stderr: 10: (ReplicatedPG::op_applied(eversion_t const&)+0x3db) [0x83cb3b]
2016-07-23T17:28:28.317 INFO:tasks.ceph.osd.0.target158069068184.stderr: 11: (ECBackend::sub_write_applied(unsigned long, eversion_t)+0x3e) [0xa39f6e]
2016-07-23T17:28:28.317 INFO:tasks.ceph.osd.0.target158069068184.stderr: 12: (SubWriteApplied::finish(int)+0x6d) [0xa4e5ad]
2016-07-23T17:28:28.317 INFO:tasks.ceph.osd.0.target158069068184.stderr: 13: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.318 INFO:tasks.ceph.osd.0.target158069068184.stderr: 14: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af534]
2016-07-23T17:28:28.318 INFO:tasks.ceph.osd.0.target158069068184.stderr: 15: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.318 INFO:tasks.ceph.osd.0.target158069068184.stderr: 16: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x70cddc]
2016-07-23T17:28:28.318 INFO:tasks.ceph.osd.0.target158069068184.stderr: 17: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6c8fe9]
2016-07-23T17:28:28.318 INFO:tasks.ceph.osd.0.target158069068184.stderr: 18: (Finisher::finisher_thread_entry()+0x168) [0xb01e28]
2016-07-23T17:28:28.322 INFO:tasks.ceph.osd.0.target158069068184.stderr: 19: (()+0x7dc5) [0x7fac0e541dc5]
2016-07-23T17:28:28.322 INFO:tasks.ceph.osd.0.target158069068184.stderr: 20: (clone()+0x6d) [0x7fac0d024ced]
2016-07-23T17:28:28.322 INFO:tasks.ceph.osd.0.target158069068184.stderr:2016-07-23 17:28:28.339725 7fac01bab700 -1 *** Caught signal (Aborted) **
2016-07-23T17:28:28.322 INFO:tasks.ceph.osd.0.target158069068184.stderr: in thread 7fac01bab700
2016-07-23T17:28:28.322 INFO:tasks.ceph.osd.0.target158069068184.stderr:
2016-07-23T17:28:28.322 INFO:tasks.ceph.osd.0.target158069068184.stderr: ceph version 0.94.7-159-g2ee8cd6 (2ee8cd65a68e1b799d1bfef309cd07a63e3d55da)
2016-07-23T17:28:28.322 INFO:tasks.ceph.osd.0.target158069068184.stderr: 1: ceph-osd() [0xadc6f2]
2016-07-23T17:28:28.322 INFO:tasks.ceph.osd.0.target158069068184.stderr: 2: (()+0xf100) [0x7fac0e549100]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 3: (gsignal()+0x37) [0x7fac0cf635f7]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 4: (abort()+0x148) [0x7fac0cf64ce8]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fac0d8689d5]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 6: (()+0x5e946) [0x7fac0d866946]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 7: (()+0x5e973) [0x7fac0d866973]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 8: (()+0x5eb93) [0x7fac0d866b93]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbdc90a]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 10: (ReplicatedPG::op_applied(eversion_t const&)+0x3db) [0x83cb3b]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 11: (ECBackend::sub_write_applied(unsigned long, eversion_t)+0x3e) [0xa39f6e]
2016-07-23T17:28:28.323 INFO:tasks.ceph.osd.0.target158069068184.stderr: 12: (SubWriteApplied::finish(int)+0x6d) [0xa4e5ad]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: 13: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: 14: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af534]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: 15: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: 16: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x70cddc]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: 17: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6c8fe9]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: 18: (Finisher::finisher_thread_entry()+0x168) [0xb01e28]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: 19: (()+0x7dc5) [0x7fac0e541dc5]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: 20: (clone()+0x6d) [0x7fac0d024ced]
2016-07-23T17:28:28.324 INFO:tasks.ceph.osd.0.target158069068184.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-07-23T17:28:28.326 INFO:tasks.ceph.osd.0.target158069068184.stderr:
2016-07-23T17:28:28.326 INFO:tasks.ceph.osd.0.target158069068184.stderr:     0> 2016-07-23 17:28:28.339725 7fac01bab700 -1 *** Caught signal (Aborted) **
2016-07-23T17:28:28.328 INFO:tasks.ceph.osd.0.target158069068184.stderr: in thread 7fac01bab700
2016-07-23T17:28:28.329 INFO:tasks.ceph.osd.0.target158069068184.stderr:
2016-07-23T17:28:28.329 INFO:tasks.ceph.osd.0.target158069068184.stderr: ceph version 0.94.7-159-g2ee8cd6 (2ee8cd65a68e1b799d1bfef309cd07a63e3d55da)
2016-07-23T17:28:28.329 INFO:tasks.ceph.osd.0.target158069068184.stderr: 1: ceph-osd() [0xadc6f2]
2016-07-23T17:28:28.329 INFO:tasks.ceph.osd.0.target158069068184.stderr: 2: (()+0xf100) [0x7fac0e549100]
2016-07-23T17:28:28.329 INFO:tasks.ceph.osd.0.target158069068184.stderr: 3: (gsignal()+0x37) [0x7fac0cf635f7]
2016-07-23T17:28:28.329 INFO:tasks.ceph.osd.0.target158069068184.stderr: 4: (abort()+0x148) [0x7fac0cf64ce8]
2016-07-23T17:28:28.329 INFO:tasks.ceph.osd.0.target158069068184.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fac0d8689d5]
2016-07-23T17:28:28.329 INFO:tasks.ceph.osd.0.target158069068184.stderr: 6: (()+0x5e946) [0x7fac0d866946]
2016-07-23T17:28:28.330 INFO:tasks.ceph.osd.0.target158069068184.stderr: 7: (()+0x5e973) [0x7fac0d866973]
2016-07-23T17:28:28.330 INFO:tasks.ceph.osd.0.target158069068184.stderr: 8: (()+0x5eb93) [0x7fac0d866b93]
2016-07-23T17:28:28.330 INFO:tasks.ceph.osd.0.target158069068184.stderr: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbdc90a]
2016-07-23T17:28:28.330 INFO:tasks.ceph.osd.0.target158069068184.stderr: 10: (ReplicatedPG::op_applied(eversion_t const&)+0x3db) [0x83cb3b]
2016-07-23T17:28:28.330 INFO:tasks.ceph.osd.0.target158069068184.stderr: 11: (ECBackend::sub_write_applied(unsigned long, eversion_t)+0x3e) [0xa39f6e]
2016-07-23T17:28:28.332 INFO:tasks.ceph.osd.0.target158069068184.stderr: 12: (SubWriteApplied::finish(int)+0x6d) [0xa4e5ad]
2016-07-23T17:28:28.332 INFO:tasks.ceph.osd.0.target158069068184.stderr: 13: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.332 INFO:tasks.ceph.osd.0.target158069068184.stderr: 14: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af534]
2016-07-23T17:28:28.332 INFO:tasks.ceph.osd.0.target158069068184.stderr: 15: (Context::complete(int)+0x9) [0x6c86f9]
2016-07-23T17:28:28.332 INFO:tasks.ceph.osd.0.target158069068184.stderr: 16: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x70cddc]
2016-07-23T17:28:28.332 INFO:tasks.ceph.osd.0.target158069068184.stderr: 17: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6c8fe9]
2016-07-23T17:28:28.333 INFO:tasks.ceph.osd.0.target158069068184.stderr: 18: (Finisher::finisher_thread_entry()+0x168) [0xb01e28]
2016-07-23T17:28:28.333 INFO:tasks.ceph.osd.0.target158069068184.stderr: 19: (()+0x7dc5) [0x7fac0e541dc5]
2016-07-23T17:28:28.333 INFO:tasks.ceph.osd.0.target158069068184.stderr: 20: (clone()+0x6d) [0x7fac0d024ced]
2016-07-23T17:28:28.333 INFO:tasks.ceph.osd.0.target158069068184.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-07-23T17:28:28.333 INFO:tasks.ceph.osd.0.target158069068184.stderr:

5. This crash doesn't cause the test to fail. The test continues for another ~2 minutes, during which the other OSDs start missing the crashed OSD in their heartbeat_check.

6. Finally the test fails on this assert:

2016-07-23T17:30:11.305 INFO:tasks.ec_lost_unfound:{u'num_unfound': 0, u'objects': [], u'num_missing': 0, u'offset': {u'hash': 0, u'max': 0, u'oid': u'', u'namespace': u'', u'snapid': 0, u'key': u'', u'pool': -1}}
2016-07-23T17:30:11.305 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/ubuntu/teuthology/teuthology/run_tasks.py", line 66, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/ubuntu/teuthology/teuthology/run_tasks.py", line 45, in run_one_task
    return fn(**kwargs)
  File "/home/ubuntu/src/ceph-qa-suite_wip-hammer-openstack-fixes/tasks/ec_lost_unfound.py", line 135, in task
    assert m['num_unfound'] == pg['stat_sum']['num_objects_unfound']
AssertionError
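The failing assertion compares two views of the same counter. As a standalone sketch (field names from the log above; helper name hypothetical):

```python
def unfound_counts_agree(list_missing, pg_stats):
    """The invariant ec_lost_unfound.py asserts, extracted as a check.

    `ceph pg <pgid> list_missing` reports num_unfound directly, while the
    PG stats carry it as stat_sum/num_objects_unfound; the test expects
    the two to match. Here list_missing came back empty (num_unfound == 0)
    because the OSD that knew about the objects had crashed, so if the PG
    stats still count unfound objects the assertion fires.
    """
    return list_missing["num_unfound"] == pg_stats["stat_sum"]["num_objects_unfound"]
```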
Actions #9

Updated by Nathan Cutler almost 8 years ago

A very similar and reproducible crash happens in the following tests:

  • rados/singleton/{all/lost-unfound-delete.yaml fs/xfs.yaml msgr-failures/few.yaml}
  • rados/singleton/{all/ec-lost-unfound.yaml fs/xfs.yaml msgr-failures/few.yaml}
  • rados/singleton/{all/lost-unfound.yaml fs/xfs.yaml msgr-failures/few.yaml}
Actions #10

Updated by Nathan Cutler almost 8 years ago

  • Subject changed from hammer-backports: OSD crash in ec-lost-unfound test with msgr-failures/many.yaml to hammer-backports: OSD crash in *lost-unfound* tests with msgr-failures
Actions #12

Updated by Nathan Cutler almost 8 years ago

I guess this commit might be involved, and should be dropped from the PR?

https://github.com/ceph/ceph-qa-suite/pull/1089/commits/139c1007190174ac161fdbd88dd43b7cb13e15e7

(Note that the test runs include this ceph-qa-suite PR, which has not yet been merged)

Actions #14

Updated by Nathan Cutler almost 8 years ago

  • Status changed from New to Resolved