Bug #3142
osd: crash induced by fsx workload (Closed)
Description
kernel: &id001
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
      global:
        ms inject socket failures: 5000
      osd:
        debug osd: 20
        debug ms: 1
    fs: ext4
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000
Files
Updated by Sage Weil over 11 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-09-11_02:00:03-regression-testing-testing-basic/20743
Updated by Dan Mick over 11 years ago
Attempting a bisect from master to stable. Using
cd /src/ceph/ceph
git describe
make distclean && ./do_autogen.sh && make -j 16
/src/ceph/teuthology/virtualenv/bin/teuthology --lock ~/src/ceph/teuthology/fsx.yaml || exit 127
as the command for git bisect run.
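For reference, git bisect run classifies each commit by the script's exit status: 0 means good, 125 means skip the commit (e.g. it won't build), and 1 through 127 (except 125) mean bad; anything 128 or above aborts the bisect, which is why the command above clamps failures with || exit 127. A minimal sketch of that convention (the function name is illustrative, not from the ticket):

```shell
#!/bin/sh
# Sketch of the exit-code convention `git bisect run` expects:
#   0 = good, 125 = skip (untestable commit), 1-127 except 125 = bad,
#   128+ = abort the whole bisect.
# bisect_exit_code is a hypothetical helper, not from the ticket.
bisect_exit_code() {
    build_status=$1   # exit status of the build step
    test_status=$2    # exit status of the test step
    if [ "$build_status" -ne 0 ]; then
        echo 125      # could not build: ask bisect to skip this commit
    elif [ "$test_status" -ne 0 ]; then
        echo 127      # test failed: mark commit bad (matches `|| exit 127` above)
    else
        echo 0        # test passed: mark commit good
    fi
}
```

In a real run-script this value would become the script's own exit status; clamping failures to 127 keeps a test that dies from a signal (status 128+) from aborting the whole bisect instead of just marking the commit bad.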
Updated by Sage Weil over 11 years ago
ubuntu@teuthology:/a/teuthology-2012-09-21_19:00:08-regression-master-testing-gcov/27383
Updated by Sage Weil over 11 years ago
- Status changed from New to 12
heap corruption? this hardly narrows it down, but from ubuntu@teuthology:/a/teuthology-2012-09-22_19:00:05-regression-master-testing-gcov/27938
2012-09-22 23:17:46.933546 7f0d9171f700 -1 *** Caught signal (Segmentation fault) ** in thread 7f0d9171f700
ceph version 0.51-690-g720a301 (commit:720a30173dc73b4e696ba4b8e0c977dd4f4db858)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a]
 2: (()+0xfcb0) [0x7f0da27aecb0]
 3: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7f0da1854df7]
 4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x107) [0x7f0da1855167]
 5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x5d) [0x7f0da1857cad]
 6: (tc_new()+0x486) [0x7f0da1866c76]
 7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3579) [0x5ba899]
 8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x482) [0x6f45a2]
 9: (OSD::dequeue_op(PG*)+0x40f) [0x614c7f]
 10: (OSD::OpWQ::_process(PG*)+0x15) [0x67bf45]
 11: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x672292]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x73d) [0x9017bd]
 13: (ThreadPool::WorkThread::entry()+0x18) [0x9051d8]
 14: (Thread::_entry_func(void*)+0x12) [0x8f3492]
 15: (()+0x7e9a) [0x7f0da27a6e9a]
 16: (clone()+0x6d) [0x7f0da0b4a4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
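Each frame of the dump carries its raw return address in brackets (e.g. [0x82865a] in frame 1); with the unstripped ceph-osd binary those can be resolved with addr2line, or matched against objdump -rdS output as the NOTE suggests. A small helper to pull the address out of a frame line (the helper name is ours, not a ceph tool):

```shell
#!/bin/sh
# Extract the bracketed address from a ceph backtrace frame such as
#   "1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a]"
# so it can be fed to addr2line. frame_addr is a hypothetical helper name.
frame_addr() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(0x[0-9a-fA-F]*\)\]$/\1/p'
}

# Typical use against the unstripped binary (path assumed):
#   addr2line -Cfe ceph-osd "$(frame_addr '1: ceph-osd() [0x82865a]')"
```

Note that for frames 3-6 the addresses fall inside libtcmalloc, so addr2line would need that shared object (and its load offset) rather than the ceph-osd binary.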
maybe we should reproduce with the notcmalloc gitbuilder and see if we get a more usable core file
Updated by Sage Weil over 11 years ago
- Assignee changed from Dan Mick to Sage Weil
Updated by Sage Weil over 11 years ago
- File osd.2.log.gz osd.2.log.gz added
i got a log for this one:
-19> 2012-10-19 14:24:20.776308 7f63fbdcd700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)' thread 7f63fbdcd700 time 2012-10-19 14:24:20.774375
osd/ReplicatedPG.cc: 3268: FAILED assert(obc->unconnected_watchers.count(entity))
ceph version 0.53-393-g50bb659 (commit:50bb65963c16bcf892157bd19a308ae593215f84)
 1: (ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)+0x2200) [0x56cb80]
 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x623) [0x590f03]
 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1dc5) [0x5942e5]
 4: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x325) [0x66e2a5]
 5: (OSD::dequeue_op(PG*)+0x2fd) [0x5ceb0d]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x542) [0x7f32a2]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f5240]
 8: (()+0x7e9a) [0x7f640d157e9a]
 9: (clone()+0x6d) [0x7f640b4fb4bd]
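When digging through an attached log like osd.2.log.gz for a failure like this, it is handy to pull the FAILED assert line out together with the dumped events that precede it. A small sketch (helper name and the 3-line context depth are illustrative; zcat -f also passes uncompressed files through):

```shell
#!/bin/sh
# Print "FAILED assert" lines plus 3 preceding lines of context from an osd
# log such as osd.2.log.gz. assert_context is a hypothetical helper name;
# `zcat -f` decompresses gzipped input and passes plain text through as-is.
assert_context() {
    log=$1
    zcat -f "$log" | grep -B 3 'FAILED assert'
}
```

For example, `assert_context osd.2.log.gz` would surface the assert at osd/ReplicatedPG.cc:3268 along with the dump lines just before it.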
Updated by Sage Weil over 11 years ago
- Status changed from 12 to 7
fix for the watcher thing merged to next branch, yay! hopefully that was the root cause for the mysterious nightly failures with bogus core files too.