Bug #9582 (closed)

librados: segmentation fault on timeout

Added by Matthias Kiefer over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: librados
Target version: -
% Done: 0%
Source: Support
Tags:
Backport: firefly, dumpling
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Summary: If librados is configured with rados_osd_op_timeout, timeouts will sometimes result in a segmentation fault.
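
For reference, here is a minimal sketch of the configuration in question using the librados C API (our application actually sets the option through rados-java, which forwards it to librados); the ceph.conf path and default client id below are assumptions for illustration only:

/* Minimal sketch: enable a 2-second OSD op timeout on a librados client. */
#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;
    int err;

    /* Create a cluster handle with the default client id (assumption). */
    err = rados_create(&cluster, NULL);
    if (err < 0) { fprintf(stderr, "rados_create: %d\n", err); return 1; }

    /* Load the cluster configuration; the path is an assumption. */
    err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    if (err < 0) { fprintf(stderr, "rados_conf_read_file: %d\n", err); return 1; }

    /* The 2-second OSD op timeout described in this report. */
    err = rados_conf_set(cluster, "rados_osd_op_timeout", "2");
    if (err < 0) { fprintf(stderr, "rados_conf_set: %d\n", err); return 1; }

    err = rados_connect(cluster);
    if (err < 0) { fprintf(stderr, "rados_connect: %d\n", err); return 1; }

    /* Reads and writes issued through io contexts on this handle are
       expected to fail with a timeout error instead of blocking once an
       OSD op exceeds the configured timeout. */

    rados_shutdown(cluster);
    return 0;
}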

Problem description: We configured librados with a rados_osd_op_timeout of 2 seconds on a cluster with 1440 OSDs. For testing purposes we put load on the cluster via 8 Tomcat webapps (using librados via rados-java) on 8 hosts, each doing about 15 reads/second and about 4 writes/second to the Ceph cluster. Object sizes are 0-4 MB. Under normal conditions timeouts happen from time to time and everything works as expected. However, in situations where the Ceph cluster becomes unresponsive (due to high CPU load on the OSD hosts), the Tomcat instances crash randomly inside librados. I was able to get the following stacktrace of a crash from a core dump:

#0  ceph_crc32c_le_intel (crc=0, data=0xc6bc3c0 <Address 0xc6bc3c0 out of bounds>, length=<optimized out>) at common/crc32c-intel.c:58
#1  0x00007fc715afd185 in ceph_crc32c_le (length=14245, data=0xc6bc3c0 <Address 0xc6bc3c0 out of bounds>, crc=0) at ./include/crc32c.h:16
#2  ceph::buffer::list::crc32c (this=this@entry=0x7fc6c5fde9a0, crc=crc@entry=0) at ./include/buffer.h:428
#3  0x00007fc715af6f32 in decode_message (cct=0x7fc71d180c90, header=..., footer=..., front=..., middle=..., data=...) at msg/Message.cc:267
#4  0x00007fc715b3e170 in Pipe::read_message (this=this@entry=0x51e10c0, pm=pm@entry=0x7fc6c5fded38, auth_handler=auth_handler@entry=0x3b12ed0) at msg/Pipe.cc:1920
#5  0x00007fc715b4fdc1 in Pipe::reader (this=0x51e10c0) at msg/Pipe.cc:1447
#6  0x00007fc715b542dd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:49
#7  0x00007fc738097b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007fc7379c6e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x0000000000000000 in ?? ()

OS: Debian GNU/Linux 7.6 (wheezy)
System: cpuinfo is attached
librados: dumpling branch at commit 3f020443c8d92e61d8593049147a79a6696c9c93, installed from the Debian package http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/dumpling/pool/main/c/ceph/librados2_0.67.10-13-g3f02044-1wheezy_amd64.deb (chosen because it includes a fix for http://tracker.ceph.com/issues/9362).

Please let me know if you need more information. I can also provide the full core dump if needed.


Files

cpuinfo (10.9 KB) - CPU info of the system where the crash happened. Matthias Kiefer, 09/24/2014 07:38 AM

Related issues (6: 0 open, 6 closed)

Related to Ceph - Bug #9613: "Segmentation fault" in upgrade:dumpling-giant-x:parallel-giant-distro-basic-multi run (Duplicate, 09/27/2014)
Related to Ceph - Bug #9650: RWTimer cancel_event is racy (Resolved, 10/02/2014)
Related to Ceph - Bug #5925: hung ceph_test_rados_delete_pools_parallel (Can't reproduce, David Zafman, 08/09/2013)
Related to Ceph - Bug #9845: hung ceph_test_rados_delete_pools_parallel (Resolved, David Zafman, 08/09/2013)
Has duplicate Ceph - Bug #9417: "Segmentation fault" in upgrade:dumpling-giant-x-master-distro-basic-vps run (Duplicate, Josh Durgin, 09/10/2014)
Has duplicate Ceph - Bug #9515: "Segmentation fault (ceph_test_rados_api_io)" in upgrade:dumpling-giant-x:parallel-giant-distro-basic-vps run (Duplicate, 09/17/2014)
