Bug #3142
osd: crash induced by fsx workload (Closed)
Description
kernel: &id001
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
      global:
        ms inject socket failures: 5000
      osd:
        debug osd: 20
        debug ms: 1
    fs: ext4
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000
Files
Updated by Sage Weil over 11 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-09-11_02:00:03-regression-testing-testing-basic/20743
Updated by Dan Mick over 11 years ago
Attempting a bisect from master to stable. Using
cd /src/ceph/ceph
git describe
make distclean && ./do_autogen.sh && make -j 16
/src/ceph/teuthology/virtualenv/bin/teuthology --lock ~/src/ceph/teuthology/fsx.yaml || exit 127
as the command for git bisect run.
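For reference, git bisect run classifies each commit by the script's exit status: 0 means good, 125 means skip the commit (e.g. it won't build), and 1 through 127 (except 125) mean bad; anything 128 or above aborts the bisect, which is why the command above clamps failures with || exit 127. A minimal sketch of that convention (the function name is illustrative, not from the ticket):

```shell
#!/bin/sh
# Sketch of the exit-code convention `git bisect run` expects:
#   0 = good, 125 = skip (untestable commit), 1-127 except 125 = bad,
#   128+ = abort the whole bisect.
# bisect_exit_code is a hypothetical helper, not from the ticket.
bisect_exit_code() {
    build_status=$1   # exit status of the build step
    test_status=$2    # exit status of the test step
    if [ "$build_status" -ne 0 ]; then
        echo 125      # could not build: ask bisect to skip this commit
    elif [ "$test_status" -ne 0 ]; then
        echo 127      # test failed: mark commit bad (matches `|| exit 127` above)
    else
        echo 0        # test passed: mark commit good
    fi
}
```

In a real run-script this value would become the script's own exit status; clamping failures to 127 keeps a test that dies from a signal (status 128+) from aborting the whole bisect instead of just marking the commit bad.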
Updated by Sage Weil over 11 years ago
ubuntu@teuthology:/a/teuthology-2012-09-21_19:00:08-regression-master-testing-gcov/27383
Updated by Sage Weil over 11 years ago
- Status changed from New to 12
heap corruption? this hardly narrows it down, but from ubuntu@teuthology:/a/teuthology-2012-09-22_19:00:05-regression-master-testing-gcov/27938
2012-09-22 23:17:46.933546 7f0d9171f700 -1 *** Caught signal (Segmentation fault) ** in thread 7f0d9171f700
ceph version 0.51-690-g720a301 (commit:720a30173dc73b4e696ba4b8e0c977dd4f4db858)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a]
 2: (()+0xfcb0) [0x7f0da27aecb0]
 3: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7f0da1854df7]
 4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x107) [0x7f0da1855167]
 5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x5d) [0x7f0da1857cad]
 6: (tc_new()+0x486) [0x7f0da1866c76]
 7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3579) [0x5ba899]
 8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x482) [0x6f45a2]
 9: (OSD::dequeue_op(PG*)+0x40f) [0x614c7f]
 10: (OSD::OpWQ::_process(PG*)+0x15) [0x67bf45]
 11: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x672292]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x73d) [0x9017bd]
 13: (ThreadPool::WorkThread::entry()+0x18) [0x9051d8]
 14: (Thread::_entry_func(void*)+0x12) [0x8f3492]
 15: (()+0x7e9a) [0x7f0da27a6e9a]
 16: (clone()+0x6d) [0x7f0da0b4a4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
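Each frame of the dump carries its raw return address in brackets (e.g. [0x82865a] in frame 1); with the unstripped ceph-osd binary those can be resolved with addr2line, or matched against objdump -rdS output as the NOTE suggests. A small helper to pull the address out of a frame line (the helper name is ours, not a ceph tool):

```shell
#!/bin/sh
# Extract the bracketed address from a ceph backtrace frame such as
#   "1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a]"
# so it can be fed to addr2line. frame_addr is a hypothetical helper name.
frame_addr() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(0x[0-9a-fA-F]*\)\]$/\1/p'
}

# Typical use against the unstripped binary (path assumed):
#   addr2line -Cfe ceph-osd "$(frame_addr '1: ceph-osd() [0x82865a]')"
```

Note that for frames 3-6 the addresses fall inside libtcmalloc, so addr2line would need that shared object (and its load offset) rather than the ceph-osd binary.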
maybe we should reproduce with the notcmalloc gitbuilder and see if we get a more usable core file
Updated by Sage Weil over 11 years ago
- Assignee changed from Dan Mick to Sage Weil
Updated by Sage Weil over 11 years ago
- File osd.2.log.gz osd.2.log.gz added
i got a log for this one:
-19> 2012-10-19 14:24:20.776308 7f63fbdcd700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)' thread 7f63fbdcd700 time 2012-10-19 14:24:20.774375
osd/ReplicatedPG.cc: 3268: FAILED assert(obc->unconnected_watchers.count(entity))
ceph version 0.53-393-g50bb659 (commit:50bb65963c16bcf892157bd19a308ae593215f84)
 1: (ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)+0x2200) [0x56cb80]
 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x623) [0x590f03]
 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1dc5) [0x5942e5]
 4: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x325) [0x66e2a5]
 5: (OSD::dequeue_op(PG*)+0x2fd) [0x5ceb0d]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x542) [0x7f32a2]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f5240]
 8: (()+0x7e9a) [0x7f640d157e9a]
 9: (clone()+0x6d) [0x7f640b4fb4bd]
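When digging through an attached log like osd.2.log.gz for a failure like this, it is handy to pull the FAILED assert line out together with the dumped events that precede it. A small sketch (helper name and the 3-line context depth are illustrative; zcat -f also passes uncompressed files through):

```shell
#!/bin/sh
# Print "FAILED assert" lines plus 3 preceding lines of context from an osd
# log such as osd.2.log.gz. assert_context is a hypothetical helper name;
# `zcat -f` decompresses gzipped input and passes plain text through as-is.
assert_context() {
    log=$1
    zcat -f "$log" | grep -B 3 'FAILED assert'
}
```

For example, `assert_context osd.2.log.gz` would surface the assert at osd/ReplicatedPG.cc:3268 along with the dump lines just before it.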
Updated by Sage Weil over 11 years ago
- Status changed from 12 to 7
fix for the watcher thing merged to next branch, yay! hopefully that was the root cause for the mysterious nightly failures with bogus core files too.