Bug #3142

osd: crash induced by fsx workload

Added by Sage Weil 8 months ago. Updated 7 months ago.

Status: Resolved
Start date: 09/12/2012
Priority: Urgent
Due date:
Assignee: Sage Weil
% Done: 0%
Category: OSD
Spent time: -
Target version: -
Source: Q/A
Severity:
Backport:
Reviewed:
Tags:

Description

kernel: &id001
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
      global:
        ms inject socket failures: 5000
      osd:
        debug osd: 20
        debug ms: 1
    fs: ext4
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000

osd.2.log.gz (223 KB) Sage Weil, 10/20/2012 01:32 pm

History

#1 Updated by Sage Weil 8 months ago

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-09-11_02:00:03-regression-testing-testing-basic/20743

#2 Updated by Dan Mick 8 months ago

  • Assignee set to Dan Mick

#3 Updated by Dan Mick 8 months ago

Attempting a bisect from master to stable. Using

cd /src/ceph/ceph
git describe
make distclean && ./do_autogen.sh && make -j 16
/src/ceph/teuthology/virtualenv/bin/teuthology --lock ~/src/ceph/teuthology/fsx.yaml || exit 127

as the command for `git bisect run`.
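
A possible refinement of the bisect command above, sketched under assumptions. `git bisect run` treats exit code 0 as good, 125 as "cannot test, skip this commit", and any other code from 1 to 127 as bad. If the commands are run as a plain script, a failed build would not stop it before the teuthology run, so a commit that does not compile could end up being judged on a stale or missing binary. The helper name `bisect_step` and its two arguments are hypothetical and not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helper for a `git bisect run` script. The name bisect_step
# and its arguments are illustrative; they do not come from the report.
#
# Exit codes understood by `git bisect run`:
#   0   -> commit is good
#   125 -> commit cannot be tested, skip it
#   1   -> commit is bad (any code 1..127 except 125 means bad)
bisect_step() {
    build_cmd=$1
    test_cmd=$2

    # A commit that does not build says nothing about the crash: skip it.
    sh -c "$build_cmd" || return 125

    # A failing test run marks the commit as bad.
    sh -c "$test_cmd" || return 1

    return 0
}
```

In a bisect script this could be called with the build line and the teuthology line from above as its two arguments, followed by `exit $?`. Infrastructure failures in the test step (for example, a machine lock that cannot be acquired) would still be counted as bad, so results near such failures deserve a second look.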

#4 Updated by Sage Weil 8 months ago

ubuntu@teuthology:/a/teuthology-2012-09-21_19:00:08-regression-master-testing-gcov/27383

#5 Updated by Sage Weil 8 months ago

  • Status changed from New to Verified

Heap corruption? This hardly narrows it down, but from ubuntu@teuthology:/a/teuthology-2012-09-22_19:00:05-regression-master-testing-gcov/27938:

2012-09-22 23:17:46.933546 7f0d9171f700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f0d9171f700

 ceph version 0.51-690-g720a301 (commit:720a30173dc73b4e696ba4b8e0c977dd4f4db858)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a]
 2: (()+0xfcb0) [0x7f0da27aecb0]
 3: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7f0da1854df7]
 4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x107) [0x7f0da1855167]
 5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x5d) [0x7f0da1857cad]
 6: (tc_new()+0x486) [0x7f0da1866c76]
 7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3579) [0x5ba899]
 8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x482) [0x6f45a2]
 9: (OSD::dequeue_op(PG*)+0x40f) [0x614c7f]
 10: (OSD::OpWQ::_process(PG*)+0x15) [0x67bf45]
 11: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x672292]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x73d) [0x9017bd]
 13: (ThreadPool::WorkThread::entry()+0x18) [0x9051d8]
 14: (Thread::_entry_func(void*)+0x12) [0x8f3492]
 15: (()+0x7e9a) [0x7f0da27a6e9a]
 16: (clone()+0x6d) [0x7f0da0b4a4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Maybe we should reproduce with the notcmalloc gitbuilder and see if we get a more usable core file.

#6 Updated by Sage Weil 7 months ago

  • Assignee changed from Dan Mick to Sage Weil

#7 Updated by Sage Weil 7 months ago

I got a log for this failure:

   -19> 2012-10-19 14:24:20.776308 7f63fbdcd700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)' thread 7f63fbdcd700 time 2012-10-19 14:24:20.774375
osd/ReplicatedPG.cc: 3268: FAILED assert(obc->unconnected_watchers.count(entity))

 ceph version 0.53-393-g50bb659 (commit:50bb65963c16bcf892157bd19a308ae593215f84)
 1: (ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)+0x2200) [0x56cb80]
 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x623) [0x590f03]
 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1dc5) [0x5942e5]
 4: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x325) [0x66e2a5]
 5: (OSD::dequeue_op(PG*)+0x2fd) [0x5ceb0d]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x542) [0x7f32a2]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f5240]
 8: (()+0x7e9a) [0x7f640d157e9a]
 9: (clone()+0x6d) [0x7f640b4fb4bd]

#8 Updated by Sage Weil 7 months ago

  • Status changed from Verified to Testing

The fix for the unconnected_watchers assert is merged to the next branch, yay! Hopefully that was also the root cause of the mysterious nightly failures with bogus core files.

#9 Updated by Sage Weil 7 months ago

  • Status changed from Testing to Resolved
