Bug #3142: osd: crash induced by fsx workload
Status: Closed
Added by Sage Weil over 11 years ago.
Updated over 11 years ago.
Description
    kernel: &id001
      kdb: true
      branch: testing
    nuke-on-error: true
    overrides:
      ceph:
        conf:
          client:
            rbd cache: true
          global:
            ms inject socket failures: 5000
          osd:
            debug osd: 20
            debug ms: 1
        fs: ext4
        log-whitelist:
        - slow request
    roles:
    - - mon.a
      - osd.0
      - osd.1
      - osd.2
    - - mds.a
      - osd.3
      - osd.4
      - osd.5
    - - client.0
    tasks:
    - chef: null
    - clock: null
    - ceph:
        log-whitelist:
        - wrongly marked me down
        - objects unfound and apparently lost
    - thrashosds:
        timeout: 1200
    - rbd_fsx:
        clients:
        - client.0
        ops: 2000
Files
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-09-11_02:00:03-regression-testing-testing-basic/20743
Attempting a bisect from master to stable, using

    cd /src/ceph/ceph
    git describe
    make distclean && ./do_autogen.sh && make -j 16
    /src/ceph/teuthology/virtualenv/bin/teuthology --lock ~/src/ceph/teuthology/fsx.yaml || exit 127

as the command for git bisect run.
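The loop above is what git bisect run automates: git checks out each candidate commit, runs the given command, and uses its exit status (non-zero means bad, 125 means skip) to narrow the range. A minimal self-contained sketch with a throwaway repo and a fabricated "bug" at revision 4 (the real run uses the build-and-teuthology command above):

```shell
# Build a throwaway repo with five commits; the "bug" appears at rev 4.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5; do
  echo "rev $i" > file
  if [ "$i" -ge 4 ]; then echo bug >> file; fi   # hypothetical regression
  git add file
  git commit -qm "rev $i"
done

# bad = HEAD, good = the first commit; the test command exits non-zero
# exactly when the bug is present, just like the teuthology job above.
git bisect start HEAD HEAD~4
result=$(git bisect run sh -c '! grep -q bug file')
echo "$result" | grep 'is the first bad commit'
```

Because the test command's exit status is all bisect sees, the `|| exit 127` in the real command above just makes a teuthology failure an unambiguous "bad" verdict.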
ubuntu@teuthology:/a/teuthology-2012-09-21_19:00:08-regression-master-testing-gcov/27383
- Status changed from New to 12
Heap corruption? This hardly narrows it down, but from ubuntu@teuthology:/a/teuthology-2012-09-22_19:00:05-regression-master-testing-gcov/27938:
2012-09-22 23:17:46.933546 7f0d9171f700 -1 *** Caught signal (Segmentation fault) **
in thread 7f0d9171f700
ceph version 0.51-690-g720a301 (commit:720a30173dc73b4e696ba4b8e0c977dd4f4db858)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a]
2: (()+0xfcb0) [0x7f0da27aecb0]
3: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7f0da1854df7]
4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x107) [0x7f0da1855167]
5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x5d) [0x7f0da1857cad]
6: (tc_new()+0x486) [0x7f0da1866c76]
7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3579) [0x5ba899]
8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x482) [0x6f45a2]
9: (OSD::dequeue_op(PG*)+0x40f) [0x614c7f]
10: (OSD::OpWQ::_process(PG*)+0x15) [0x67bf45]
11: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x672292]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x73d) [0x9017bd]
13: (ThreadPool::WorkThread::entry()+0x18) [0x9051d8]
14: (Thread::_entry_func(void*)+0x12) [0x8f3492]
15: (()+0x7e9a) [0x7f0da27a6e9a]
16: (clone()+0x6d) [0x7f0da0b4a4bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Maybe we should reproduce with the notcmalloc gitbuilder and see if we get a more usable core file.
- Assignee changed from Dan Mick to Sage Weil
I got a log for this one:
-19> 2012-10-19 14:24:20.776308 7f63fbdcd700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)' thread 7f63fbdcd700 time 2012-10-19 14:24:20.774375
osd/ReplicatedPG.cc: 3268: FAILED assert(obc->unconnected_watchers.count(entity))
ceph version 0.53-393-g50bb659 (commit:50bb65963c16bcf892157bd19a308ae593215f84)
1: (ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)+0x2200) [0x56cb80]
2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x623) [0x590f03]
3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1dc5) [0x5942e5]
4: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x325) [0x66e2a5]
5: (OSD::dequeue_op(PG*)+0x2fd) [0x5ceb0d]
6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x542) [0x7f32a2]
7: (ThreadPool::WorkThread::entry()+0x10) [0x7f5240]
8: (()+0x7e9a) [0x7f640d157e9a]
9: (clone()+0x6d) [0x7f640b4fb4bd]
- Status changed from 12 to 7
The fix for the watcher thing is merged to the next branch, yay! Hopefully that was also the root cause of the mysterious nightly failures with bogus core files.
- Status changed from 7 to Resolved