Bug #8736

closed

thrash and scrub combination leads to error

Added by Loïc Dachary almost 10 years ago. Updated over 9 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In http://pulpito.ceph.com/loic-2014-07-02_23:05:05-upgrade:firefly-x:stress-split-firefly-testing-basic-vps/338904/, OSD 1 is killed by the thrasher:

2014-07-02T21:58:51.862 INFO:teuthology.task.thrashosds.thrasher:Killing osd 1, live_osds are [5, 4, 3, 1, 2, 0]

but the kill fails:
2014-07-02T22:27:58.598 ERROR:teuthology.run_tasks:Manager failed: thrashosds
...
CommandFailedError: Command failed on vpm070 with status 1: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f -i 1'

Immediately after that, scrub tries to run on osd.1 (although it probably should not, because the OSD is not in) and fails:
2014-07-02T22:28:05.339 INFO:teuthology.orchestra.run.vpm075:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd scrub osd.1'
2014-07-02T22:28:05.614 INFO:teuthology.orchestra.run.vpm075.stderr:Error EAGAIN: osd.1 is not up
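
To look for the same pattern in other teuthology logs, something like the following could help. It is only a rough sketch: the line layout (timestamp, level, logger, message) is inferred from the excerpts above, and the function name find_scrub_on_killed_osd is made up for illustration and is not part of teuthology.

```python
import re

# Assumed line shape, inferred only from the excerpts in this report:
#   <ISO timestamp> <LEVEL>:<logger and message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) (?P<level>[A-Z]+):(?P<rest>.*)$"
)
KILL_RE = re.compile(r"Killing osd (?P<osd>\d+),")
NOT_UP_RE = re.compile(r"Error EAGAIN: osd\.(?P<osd>\d+) is not up")


def find_scrub_on_killed_osd(lines):
    """Return (kill_ts, error_ts, osd_id) for each 'is not up' error
    reported for an OSD that the thrasher had earlier announced killing."""
    killed = {}  # osd id -> timestamp of the thrasher's kill message
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip())
        if not m:
            continue  # tracebacks and other continuation lines
        rest = m.group("rest")
        kill = KILL_RE.search(rest)
        if kill:
            killed[int(kill.group("osd"))] = m.group("ts")
            continue
        err = NOT_UP_RE.search(rest)
        if err and int(err.group("osd")) in killed:
            osd = int(err.group("osd"))
            hits.append((killed[osd], m.group("ts"), osd))
    return hits


# The two relevant lines quoted in this report.
SAMPLE = [
    "2014-07-02T21:58:51.862 INFO:teuthology.task.thrashosds.thrasher:Killing osd 1, live_osds are [5, 4, 3, 1, 2, 0]",
    "2014-07-02T22:28:05.614 INFO:teuthology.orchestra.run.vpm075.stderr:Error EAGAIN: osd.1 is not up",
]

if __name__ == "__main__":
    for kill_ts, err_ts, osd in find_scrub_on_killed_osd(SAMPLE):
        print(f"osd.{osd}: kill announced at {kill_ts}, 'not up' error at {err_ts}")
```

This only correlates the two messages by OSD id; it does not check whether the OSD was revived in between, so a hit is a hint to read the log, not a diagnosis.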


Related issues 2 (0 open, 2 closed)

Related to Ceph - Bug #9158: osd crashed in upgrade:dumpling-x:stress-split-master-distro-basic-vps suite (Duplicate, 08/18/2014)
Is duplicate of Ceph - Bug #8777: osd/PGLog.h: 88: FAILED assert(rollback_info_trimmed_to_riter == log.rbegin()) (Resolved, Samuel Just, 07/08/2014)

Actions #1

Updated by Ian Colle over 9 years ago

  • Assignee set to Yuri Weinstein
Actions #2

Updated by Loïc Dachary over 9 years ago

http://pulpito.ceph.com/loic-2014-08-04_15:06:02-upgrade:firefly-x:stress-split-wip-8475-testing-basic-plana/396887/

2014-08-04T12:37:47.478 INFO:teuthology.orchestra.run.plana89.stderr:Error EAGAIN: osd.5 is not up

Actions #3

Updated by Yuri Weinstein over 9 years ago

  • Project changed from teuthology to Ceph
  • Assignee changed from Yuri Weinstein to Ian Colle

This needs to be prioritized.

Confirmed, logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-08-21_11:40:02-upgrade:dumpling-x:stress-split-master-distro-basic-vps/439533/

It's an osd.5 crash, coredump in ceph-osd.5.log.gz

903270073:2014-08-21 19:47:59.612165 7f59303eb700 -1 *** Caught signal (Aborted) **
903270147- in thread 7f59303eb700
903270171-
903270172- ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)
903270247- 1: ceph-osd() [0x9a8a0a]
903270273- 2: (()+0xfcb0) [0x7f5949bd8cb0]
903270306- 3: (gsignal()+0x35) [0x7f59484c34f5]
903270344- 4: (abort()+0x17b) [0x7f59484c6c5b]
903270381- 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5948e1669d]
903270451- 6: (()+0xb5846) [0x7f5948e14846]
903270485- 7: (()+0xb5873) [0x7f5948e14873]
903270519- 8: (()+0xb596e) [0x7f5948e1496e]
903270553- 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0xa8cf7f]
903270645- 10: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x8c8) [0x7881d8]
903270721- 11: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x182) [0x7b6d32]
903271148- 12: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x79a7bb]
903271386- 13: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x79ab11]
903271627- 14: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x303) [0x752753]
903271736- 15: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2ce) [0x65720e]
903271856- 16: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x6a9982]
903271972- 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa7c6b6]
903272040- 18: (ThreadPool::WorkThread::entry()+0x10) [0xa7f760]
903272095- 19: (()+0x7e9a) [0x7f5949bd0e9a]
903272129- 20: (clone()+0x6d) [0x7f594858173d]
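
For comparing this backtrace against other reports, the frame that called the failed assert can be pulled out mechanically. The following is a sketch under assumptions: the frame layout ("N: symbol [0xaddr]", optionally preceded by a grep byte offset as in the paste above) is taken from this excerpt only, and the helper names are invented here.

```python
import re

# One backtrace frame, optionally prefixed by a grep byte offset
# such as "903270645-" (layout assumed from the paste above).
FRAME_RE = re.compile(r"^(?:\d+[-:])?\s*(?P<n>\d+): (?P<sym>.*) \[0x[0-9a-f]+\]$")


def parse_frames(lines):
    """Return [(frame_number, symbol_text), ...] for lines that look like frames."""
    frames = []
    for line in lines:
        m = FRAME_RE.match(line.rstrip())
        if m:
            frames.append((int(m.group("n")), m.group("sym")))
    return frames


def strip_offset(sym):
    """Turn '(Foo::bar(int)+0x8c8)' into 'Foo::bar(int)'."""
    if sym.startswith("(") and sym.endswith(")"):
        sym = sym[1:-1]
    return sym.rsplit("+", 1)[0]


def assertion_caller(lines):
    """Return the symbol of the frame just below __ceph_assert_fail, or None."""
    frames = parse_frames(lines)
    for i, (_, sym) in enumerate(frames):
        if "__ceph_assert_fail" in sym and i + 1 < len(frames):
            return strip_offset(frames[i + 1][1])
    return None


# A few frames copied from the backtrace above.
SAMPLE = [
    "903270073:2014-08-21 19:47:59.612165 7f59303eb700 -1 *** Caught signal (Aborted) **",
    "903270519- 8: (()+0xb596e) [0x7f5948e1496e]",
    "903270553- 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0xa8cf7f]",
    "903270645- 10: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x8c8) [0x7881d8]",
    "903271972- 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa7c6b6]",
]

if __name__ == "__main__":
    print(assertion_caller(SAMPLE))
```

On the frames above this picks out frame 10, PG::RecoveryState::Stray::react(PG::MLogRec const&). The backtrace does not include the assert's file and line, so those still have to be read from the log itself.
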
Actions #4

Updated by Ian Colle over 9 years ago

  • Assignee changed from Ian Colle to Yuri Weinstein
Actions #5

Updated by Sage Weil over 9 years ago

  • Priority changed from Normal to Urgent
  • Source changed from other to Q/A
Actions #6

Updated by Sage Weil over 9 years ago

  • Status changed from New to Duplicate
  • Assignee deleted (Yuri Weinstein)

ha, it's the riter bug. #8777
