Bug #5239

closed

osd: Segmentation fault in ceph-osd / tcmalloc

Added by Emil Renner Berthing almost 11 years ago. Updated over 10 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We're still experiencing segmentation faults in the ceph-osd daemons from the 0.61.2-1~bpo70+1 Debian packages.
The crash appears to happen inside tcmalloc when it is called from LevelDB. It occurs on all the OSD servers and seems to be more frequent under load.

The issue was reported on the mailing list here:
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15146

Initially we thought it was related to our use of very large objects, but the daemons keep crashing even when the cluster contains only
data from the following rados benchmark runs:

while true; do rados -p benchmarks -b 4096 bench 3600 write -t 64 --no-cleanup; sleep 1; done

and
while true; do rados -p benchmarks -b 4194304 bench 3600 write -t 64 --no-cleanup; sleep 1; done
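
For anyone trying to reproduce this, the two loops differ only in the block size, so they can be generated from one place. This is just a sketch: `bench_cmd` is a hypothetical helper name, not part of Ceph, and it only prints the command lines with the same pool, duration and concurrency as above.

```shell
#!/bin/sh
# bench_cmd is a hypothetical helper (not a Ceph tool): it prints the
# rados bench command used in this report for a given block size in bytes.
bench_cmd() {
    block_size="$1"
    echo "rados -p benchmarks -b ${block_size} bench 3600 write -t 64 --no-cleanup"
}

# Print the two workloads from the report: 4 KiB and 4 MiB writes.
for bs in 4096 4194304; do
    bench_cmd "$bs"
done
```

To actually generate load, each printed command would be run inside the same `while true; do ...; sleep 1; done` loop as above.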

Here are some stats on the cluster:
- each server has 64 GB of RAM,
- there are 12 OSDs per server and now 216 OSDs in total (earlier we only had 132 OSDs),
- each OSD uses around 1.5 GB of memory,
- there are now 33792 PGs (earlier we had 18432 PGs),
- all drives are 4 TB, with an XFS-formatted partition at sdx1 and a 10 GB journal at sdx2,
- the filesystems are mounted as xfs with options (rw,noatime,attr2,noquota),
- we don't use snapshots.
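
As a rough sanity check of these numbers, the arithmetic below estimates per-server OSD memory and PGs per OSD. It is only a sketch: the 1.5 GB figure is approximated as 1536 MiB, and the PG ratio ignores the replica count, which is not stated here.

```shell
#!/bin/sh
# Figures taken from the list above; variable names are illustrative only.
osds_per_server=12
osd_mem_mib=1536        # "around 1.5 GB" per OSD, approximated as 1536 MiB
server_ram_mib=65536    # 64 GB per server
total_osds=216
total_pgs=33792
old_osds=132
old_pgs=18432

# Memory used by all OSD daemons on one server, and its share of RAM.
osd_mem_total_mib=$((osds_per_server * osd_mem_mib))
osd_mem_percent=$((osd_mem_total_mib * 100 / server_ram_mib))

# Number of servers and PGs per OSD (integer division, replicas not counted).
servers=$((total_osds / osds_per_server))
pgs_per_osd=$((total_pgs / total_osds))
old_pgs_per_osd=$((old_pgs / old_osds))

echo "OSD memory per server: ${osd_mem_total_mib} MiB (${osd_mem_percent}% of RAM)"
echo "Servers: ${servers}"
echo "PGs per OSD: ${pgs_per_osd} now, ${old_pgs_per_osd} earlier"
```

By this estimate the OSD daemons account for well under half of each server's RAM, so the servers do not look memory-starved on paper.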

Backtrace from the coredump:

Core was generated by `/usr/bin/ceph-osd -i 130 --pid-file /var/run/ceph/osd.130.pid -c /etc/ceph/ceph'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f6a64e3eefb in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) backtrace
#0 0x00007f6a64e3eefb in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x0000000000853a89 in reraise_fatal (signum=11) at global/signal_handler.cc:58
#2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104
#3 <signal handler called>
#4 0x00007f6a640596f3 in do_malloc (size=364131408) at src/tcmalloc.cc:1059
#5 cpp_alloc (nothrow=false, size=364131408) at src/tcmalloc.cc:1354
#6 tc_new (size=364131408) at src/tcmalloc.cc:1530
#7 0x00007f6a59e90c10 in ?? ()
#8 0x0000000015b43450 in ?? ()
#9 0x00007f6a63e09b21 in ?? () from /usr/lib/x86_64-linux-gnu/libleveldb.so.1
#10 0x00007f6a63e06ba8 in ?? () from /usr/lib/x86_64-linux-gnu/libleveldb.so.1
#11 0x00007f6a63df24d4 in ?? () from /usr/lib/x86_64-linux-gnu/libleveldb.so.1
#12 0x0000000000840977 in LevelDBStore::LevelDBWholeSpaceIteratorImpl::lower_bound (this=0x1ec1b6c0, prefix=..., to=...) at os/LevelDBStore.h:204
#13 0x000000000083f351 in LevelDBStore::get (this=<optimized out>, prefix=..., keys=..., out=0x7f6a59e90f60) at os/LevelDBStore.cc:106
#14 0x0000000000838449 in DBObjectMap::_lookup_map_header (this=this@entry=0x207b600, hoid=...) at os/DBObjectMap.cc:1080
#15 0x00000000008386f4 in DBObjectMap::lookup_create_map_header (this=this@entry=0x207b600, hoid=..., t=...) at os/DBObjectMap.cc:1146
#16 0x0000000000838c61 in DBObjectMap::set_keys (this=0x207b600, hoid=..., set=..., spos=0x7f6a59e91400) at os/DBObjectMap.cc:504
#17 0x00000000007f4380 in FileStore::_omap_setkeys (this=this@entry=0x2092000, cid=..., hoid=..., aset=..., spos=...) at os/FileStore.cc:4754
#18 0x000000000080f720 in FileStore::_do_transaction (this=this@entry=0x2092000, t=..., op_seq=op_seq@entry=22064536, trans_num=trans_num@entry=0) at os/FileStore.cc:2586
#19 0x0000000000812999 in FileStore::_do_transactions (this=this@entry=0x2092000, tls=..., op_seq=22064536, handle=handle@entry=0x7f6a59e91b80) at os/FileStore.cc:2151
#20 0x0000000000812b2e in FileStore::_do_op (this=0x2092000, osr=<optimized out>, handle=...) at os/FileStore.cc:1985
#21 0x00000000008f52ea in ThreadPool::worker (this=0x2092a08, wt=0x20a6480) at common/WorkQueue.cc:119
#22 0x00000000008f6590 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:316
#23 0x00007f6a64e36b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#24 0x00007f6a63372a7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#25 0x0000000000000000 in ?? ()
(gdb)
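
One detail in frames #4 to #6 is the `size=364131408` argument. The conversion below just puts that number in more readable units; whether it is a genuine request from LevelDB or an already corrupted value cannot be told from this backtrace alone.

```shell
#!/bin/sh
# Size argument passed to tc_new/do_malloc in frames #4-#6 of the backtrace.
alloc_bytes=364131408

# Integer conversion to MiB (1 MiB = 1024 * 1024 bytes).
alloc_mib=$((alloc_bytes / 1024 / 1024))

echo "tc_new request: ${alloc_bytes} bytes (~${alloc_mib} MiB)"
```

That is roughly 347 MiB in a single allocation made while positioning an iterator in `lower_bound`.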

The log from the same crashed server is attached.


Files

ceph-osd.130.log.gz (230 KB): Log from crashed OSD daemon. Emil Renner Berthing, 06/03/2013 08:37 AM
ceph-osd.5.log (2.88 MB): Log of new type of crashes. Emil Renner Berthing, 06/28/2013 06:08 AM

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #5301: mon: leveldb crash in tcmalloc (Can't reproduce, 06/11/2013)
