Bug #3142
osd: crash induced by fsx workload
| Status: | Resolved | Start date: | 09/12/2012 |
|---|---|---|---|
| Priority: | Urgent | Due date: | |
| Assignee: | Sage Weil | % Done: | 0% |
| Category: | OSD | Spent time: | - |
| Target version: | - | | |
| Source: | Q/A | Severity: | |
| Backport: | | Reviewed: | |
| Tags: | | | |
Description
kernel: &id001
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
      global:
        ms inject socket failures: 5000
      osd:
        debug osd: 20
        debug ms: 1
    fs: ext4
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000
History
#5 Updated by Sage Weil 8 months ago
- Status changed from New to Verified
Heap corruption? This hardly narrows it down, but from ubuntu@teuthology:/a/teuthology-2012-09-22_19:00:05-regression-master-testing-gcov/27938:
2012-09-22 23:17:46.933546 7f0d9171f700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f0d9171f700
 ceph version 0.51-690-g720a301 (commit:720a30173dc73b4e696ba4b8e0c977dd4f4db858)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a]
 2: (()+0xfcb0) [0x7f0da27aecb0]
 3: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7f0da1854df7]
 4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x107) [0x7f0da1855167]
 5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x5d) [0x7f0da1857cad]
 6: (tc_new()+0x486) [0x7f0da1866c76]
 7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3579) [0x5ba899]
 8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x482) [0x6f45a2]
 9: (OSD::dequeue_op(PG*)+0x40f) [0x614c7f]
 10: (OSD::OpWQ::_process(PG*)+0x15) [0x67bf45]
 11: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x672292]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x73d) [0x9017bd]
 13: (ThreadPool::WorkThread::entry()+0x18) [0x9051d8]
 14: (Thread::_entry_func(void*)+0x12) [0x8f3492]
 15: (()+0x7e9a) [0x7f0da27a6e9a]
 16: (clone()+0x6d) [0x7f0da0b4a4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Maybe we should reproduce with the notcmalloc gitbuilder and see if we get a more usable core file.
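For triage, the backtrace in update #5 can be split into structured frames so that the tcmalloc frames (3 to 6) are easy to separate from the OSD frames. The following is only a minimal sketch: `Frame`, `parse_backtrace`, and `SAMPLE` are hypothetical names introduced here for illustration and are not part of Ceph or teuthology. It assumes frames have the `N: (symbol+0xOFFSET) [0xADDRESS]` shape seen in the log above.

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    index: int              # frame number as printed in the log
    symbol: str             # demangled symbol, or the raw location if no offset
    offset: Optional[int]   # offset into the symbol, when present
    address: int            # absolute address from the square brackets


# A frame starts with "N: " followed by "(" or "/" and ends with "[0x...]".
# Requiring "(" or "/" avoids matching "3268: FAILED assert(...)" lines.
FRAME_RE = re.compile(r"(?<!\S)(\d+): ([(/].*?) \[(0x[0-9a-f]+)\]")
# "(symbol+0xoffset)" form used for most frames.
SYMBOL_RE = re.compile(r"\((.*)\+(0x[0-9a-f]+)\)")


def parse_backtrace(text: str) -> List[Frame]:
    """Extract frames from a ceph-osd crash dump (hypothetical helper)."""
    frames = []
    for match in FRAME_RE.finditer(text):
        index, location, address = match.groups()
        sym = SYMBOL_RE.fullmatch(location)
        if sym:
            symbol, offset = sym.group(1), int(sym.group(2), 16)
        else:
            symbol, offset = location, None
        frames.append(Frame(int(index), symbol, offset, int(address, 16)))
    return frames


# First three frames of the backtrace quoted in this ticket.
SAMPLE = (
    "1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a] "
    "2: (()+0xfcb0) [0x7f0da27aecb0] "
    "3: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7f0da1854df7]"
)

if __name__ == "__main__":
    for frame in parse_backtrace(SAMPLE):
        print(frame.index, frame.symbol, hex(frame.address))
```

The parser works on both the single-line and one-frame-per-line forms of the dump, since it keys on the frame shape rather than on line breaks.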
#7 Updated by Sage Weil 7 months ago
- File osd.2.log.gz added
I got a log for:
-19> 2012-10-19 14:24:20.776308 7f63fbdcd700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)' thread 7f63fbdcd700 time 2012-10-19 14:24:20.774375
osd/ReplicatedPG.cc: 3268: FAILED assert(obc->unconnected_watchers.count(entity))
 ceph version 0.53-393-g50bb659 (commit:50bb65963c16bcf892157bd19a308ae593215f84)
 1: (ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)+0x2200) [0x56cb80]
 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x623) [0x590f03]
 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1dc5) [0x5942e5]
 4: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x325) [0x66e2a5]
 5: (OSD::dequeue_op(PG*)+0x2fd) [0x5ceb0d]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x542) [0x7f32a2]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f5240]
 8: (()+0x7e9a) [0x7f640d157e9a]
 9: (clone()+0x6d) [0x7f640b4fb4bd]
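When scanning many OSD logs for this failure, it can help to pull out the source file, line number, and asserted expression from the `FAILED assert(...)` line. The sketch below is one possible approach, not anything shipped with Ceph: `FailedAssert`, `find_failed_assert`, and `SAMPLE_LOG` are hypothetical names, and it assumes the `file: line: FAILED assert(expr)` layout shown in the log above. Parentheses are counted by hand because the asserted expression can itself contain calls.

```python
import re
from typing import NamedTuple, Optional


class FailedAssert(NamedTuple):
    source_file: str
    line: int
    expression: str


# Matches the prefix "path/File.cc: 3268: FAILED assert(".
HEAD_RE = re.compile(r"(\S+): (\d+): FAILED assert\(")


def find_failed_assert(text: str) -> Optional[FailedAssert]:
    """Return the first failed assertion in a log excerpt, if any."""
    head = HEAD_RE.search(text)
    if head is None:
        return None
    start = head.end()
    depth = 1  # we are just inside the opening "(" of assert(
    for pos in range(start, len(text)):
        if text[pos] == "(":
            depth += 1
        elif text[pos] == ")":
            depth -= 1
            if depth == 0:
                return FailedAssert(head.group(1), int(head.group(2)), text[start:pos])
    return None  # unbalanced parentheses: treat as no match


# Excerpt of the log line quoted in this ticket.
SAMPLE_LOG = (
    "time 2012-10-19 14:24:20.774375 osd/ReplicatedPG.cc: 3268: "
    "FAILED assert(obc->unconnected_watchers.count(entity)) "
    "ceph version 0.53-393-g50bb659"
)

if __name__ == "__main__":
    print(find_failed_assert(SAMPLE_LOG))
```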