Bug #9582
librados: segmentation fault on timeout
Description
Summary: If you configure librados with rados_osd_op_timeout, timeouts will sometimes result in a segmentation fault.
Problem description: We configured librados with a rados_osd_op_timeout of 2 seconds on a cluster with 1440 OSDs. For testing purposes we put load on the cluster via 8 Tomcat webapps (using librados via rados-java) on 8 hosts, each doing about 15 reads/second and about 4 writes/second to the Ceph cluster. Object sizes are 0-4 MB. In normal situations timeouts happen from time to time and everything works as expected. However, in situations where the Ceph cluster becomes unresponsive (due to high CPU load on the OSD hosts), the Tomcat instances crash randomly inside librados. I was able to get the following stack trace of a crash from a core dump:
#0 ceph_crc32c_le_intel (crc=0, data=0xc6bc3c0 <Address 0xc6bc3c0 out of bounds>, length=<optimized out>) at common/crc32c-intel.c:58
#1 0x00007fc715afd185 in ceph_crc32c_le (length=14245, data=0xc6bc3c0 <Address 0xc6bc3c0 out of bounds>, crc=0) at ./include/crc32c.h:16
#2 ceph::buffer::list::crc32c (this=this@entry=0x7fc6c5fde9a0, crc=crc@entry=0) at ./include/buffer.h:428
#3 0x00007fc715af6f32 in decode_message (cct=0x7fc71d180c90, header=..., footer=..., front=..., middle=..., data=...) at msg/Message.cc:267
#4 0x00007fc715b3e170 in Pipe::read_message (this=this@entry=0x51e10c0, pm=pm@entry=0x7fc6c5fded38, auth_handler=auth_handler@entry=0x3b12ed0) at msg/Pipe.cc:1920
#5 0x00007fc715b4fdc1 in Pipe::reader (this=0x51e10c0) at msg/Pipe.cc:1447
#6 0x00007fc715b542dd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:49
#7 0x00007fc738097b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007fc7379c6e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()
OS: Debian GNU/Linux 7.6 (wheezy)
System: cpuinfo is attached
librados: dumpling branch at commit 3f020443c8d92e61d8593049147a79a6696c9c93, installed from the Debian package http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/dumpling/pool/main/c/ceph/librados2_0.67.10-13-g3f02044-1wheezy_amd64.deb (chosen because it includes a fix for http://tracker.ceph.com/issues/9362).
Please let me know if you need more information. I can also provide the full core dump if you need it.
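(For context, a minimal sketch of how a client might enable the per-op timeout described in this report via the librados C API; the reporter actually goes through rados-java, and the client id, config path, and error handling below are illustrative placeholders, not part of the report:)

// Minimal librados client sketch: connect with rados_osd_op_timeout set to 2 seconds.
#include <rados/librados.h>
#include <stdio.h>

int main() {
    rados_t cluster;
    if (rados_create(&cluster, "admin") < 0) {             // connect as client.admin (placeholder)
        fprintf(stderr, "rados_create failed\n");
        return 1;
    }
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");  // default config path (placeholder)
    // The setting from the report: give up on an OSD op after 2 seconds.
    rados_conf_set(cluster, "rados_osd_op_timeout", "2");
    if (rados_connect(cluster) < 0) {
        fprintf(stderr, "rados_connect failed\n");
        rados_shutdown(cluster);
        return 1;
    }
    // ... perform reads/writes via an ioctx; timed-out ops return an error
    // instead of blocking indefinitely ...
    rados_shutdown(cluster);
    return 0;
}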
Associated revisions
osdc/Objecter: only post_rx_buffer if no op timeout
If we post an rx buffer and there is a timeout, the revocation can happen
while the reader has consumed the buffers but before it has decoded and
constructed the message. In particular, we calculate a crc32c over the
data portion of the message after we've taken the buffers and dropped the
lock.
Instead of fixing this race (for example, by reverifying rx_buffers under
the lock while calculating the crc.. bleh), just skip the rx buffer
optimization entirely when a timeout is present.
Note that this doesn't cover the op_cancel() paths, but none of those users
provide static buffers to read into.
Fixes: #9582
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
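(A rough, hypothetical sketch of the approach this commit describes, not the actual Objecter diff; Op, maybe_post_rx_buffer, and the field names below are simplified placeholders:)

// Idea from the commit message above: only hand the caller's static receive
// buffer to the messenger (the zero-copy rx-buffer optimization) when no
// per-op timeout is configured, so a timeout can never revoke a buffer the
// reader thread is still decoding. Names here are placeholders, not Ceph code.
#include <chrono>
#include <optional>
#include <vector>

struct Op {
    std::vector<char> *outbl = nullptr;                   // caller-provided read buffer
    std::optional<std::chrono::seconds> osd_op_timeout;   // rados_osd_op_timeout, if set
};

// Returns true if the zero-copy rx buffer would be posted; false means the
// messenger allocates its own buffer and data is copied out after decode.
bool maybe_post_rx_buffer(const Op &op) {
    if (op.osd_op_timeout)
        return false;                                      // timeout set: skip the optimization (the fix)
    if (op.outbl == nullptr)
        return false;                                      // nothing to post
    // ... the real code would call con->post_rx_buffer(tid, ...) here ...
    return true;
}

int main() {
    std::vector<char> buf(4096);
    Op with_timeout;
    with_timeout.outbl = &buf;
    with_timeout.osd_op_timeout = std::chrono::seconds(2);
    Op no_timeout;
    no_timeout.outbl = &buf;
    // With a timeout the buffer is never posted; without one it is.
    return (!maybe_post_rx_buffer(with_timeout) && maybe_post_rx_buffer(no_timeout)) ? 0 : 1;
}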
osdc/Objecter: only post_rx_buffer if no op timeout
(two backport commits of 126d0b30e990519b8f845f99ba893fdcd56de447; same commit message, Fixes: #9582, and Signed-off-by: Sage Weil <sage@redhat.com> as above)
osdc/Objecter: use SafeTimer; make callbacks race-tolerant
The RWTimer event cancellation is racy. Instead, just make all of our
callbacks tolerate cancellation races. This is already true of most of
them (in fact, they are probably broken because they try to take a write
lock while holding a read lock). Fix C_CancelOp so that it calls the
other op_cancel (that takes a tid).
Then switch the RWTimer back to a SafeTimer. Put it in unsafe callbacks
mode because we don't want to introduce lock cycles with timer_lock.
Fixes: #9582
See also: #9650
Signed-off-by: Sage Weil <sage@redhat.com>
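(A small, hypothetical model of a cancellation-race-tolerant timeout callback in the spirit of the tid-based op_cancel mentioned above; the types and names are simplified stand-ins, not Objecter code:)

// The callback identifies the op only by its tid and re-checks the in-flight
// table under the lock, so firing after the op has already completed (or been
// cancelled by someone else) is a harmless no-op.
#include <cerrno>
#include <cstdint>
#include <iostream>
#include <map>
#include <mutex>

struct Op { int result = 0; };

std::mutex table_lock;
std::map<std::uint64_t, Op> inflight;          // tid -> in-flight op

// Cancel by tid; tolerant of the op having already completed.
void op_cancel(std::uint64_t tid, int err) {
    std::lock_guard<std::mutex> g(table_lock);
    auto it = inflight.find(tid);
    if (it == inflight.end())
        return;                                 // reply raced with the timer: nothing to do
    it->second.result = err;
    inflight.erase(it);
}

// What the timer would invoke when the per-op timeout expires for `tid`.
void timeout_callback(std::uint64_t tid) {
    op_cancel(tid, -ETIMEDOUT);
}

int main() {
    inflight.emplace(1, Op{});
    inflight.erase(1);                          // the reply "wins" the race
    timeout_callback(1);                        // late timer fire is now a no-op
    std::cout << "in-flight ops: " << inflight.size() << "\n";
    return 0;
}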
History
#1 Updated by JuanJose Galvez almost 9 years ago
- Source changed from other to Support
#2 Updated by Greg Farnum almost 9 years ago
Okay, looks like this is another race:
1) The message is coming in over the wire, and the Pipe grabs a preallocated buffer to store the message data in
2) The op gets cancelled and any remaining buffers are revoked
3) The on-commit calls get run, and presumably the client code zaps the buffers
4) The Pipe runs decode_message() and the bufferlists are pointing at invalid memory.
The problem is basically that the messenger doesn't have a real cancellation mechanism, and the Objecter is trying to layer one on top of it, which leaves some pretty rough edges. I think we might need the messenger to indicate, on op cancellation, whether it succeeded in revoking the buffers; if it failed, the Objecter would need to wait to see the reply and give up on the cancel. The reserved buffers aren't part of the normal mechanism, though, so somebody would need to remember that extra state.
Or the messenger needs to grow a proper op cancellation system that tracks which ops are currently being satisfied and refuses to revoke them...
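(To make the interleaving in steps 1-4 above concrete, here is a small self-contained model; the types and locking are simplified stand-ins for the Pipe/Objecter code, and it only marks the dangerous window rather than actually touching freed memory:)

// Simplified model of the race. The real reader keeps a raw pointer into the
// posted buffer and does its crc32c work after dropping the lock, which is
// exactly the window in which cancellation can revoke and free that buffer.
#include <cstddef>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct PostedRxBuffer {
    std::vector<char> data = std::vector<char>(14245, 'x');  // length from the trace above
    bool revoked = false;
};

std::mutex msgr_lock;       // stands in for the messenger/pipe lock
PostedRxBuffer rx;

void reader_thread() {
    const char *ptr;
    std::size_t len;
    {   // Step 1: take the pre-posted client buffer under the lock.
        std::lock_guard<std::mutex> g(msgr_lock);
        ptr = rx.data.data();
        len = rx.data.size();
    }
    // Steps 2-3 can run here. Step 4 (decode_message -> crc32c) then walks `ptr`
    // without any further check; in this model we re-check so we can report it.
    std::lock_guard<std::mutex> g(msgr_lock);
    if (rx.revoked)
        std::cout << "real code would crc32c a revoked buffer (" << len << " bytes)\n";
    else
        std::cout << "crc32c over " << len << " bytes at " << static_cast<const void *>(ptr) << "\n";
}

void cancel_thread() {
    // Steps 2-3: the op timeout fires, the buffer is revoked and handed back,
    // and the client frees or reuses it.
    std::lock_guard<std::mutex> g(msgr_lock);
    rx.revoked = true;
    rx.data = std::vector<char>();   // client memory gone
}

int main() {
    std::thread r(reader_thread), c(cancel_thread);
    r.join();
    c.join();
    return 0;
}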
#3 Updated by Sage Weil almost 9 years ago
- Assignee set to Sage Weil
I'm going to see if we can just skip the rx_buffers zero-copy paths when a timeout is present.
#4 Updated by Sage Weil almost 9 years ago
- Status changed from New to Fix Under Review
- Priority changed from High to Urgent
#5 Updated by Sage Weil almost 9 years ago
- Backport set to firefly, dumpling
#6 Updated by Greg Farnum almost 9 years ago
- Status changed from Fix Under Review to Pending Backport
#7 Updated by Sage Weil almost 9 years ago
Merged in giant and dumpling; still need to merge the firefly backport.
#8 Updated by Matthias Kiefer almost 9 years ago
Thanks for your investigations and the quick fix! We have not been able to test this fix yet, but I will report back the results as soon as possible.
#10 Updated by Sage Weil almost 9 years ago
- Status changed from Pending Backport to In Progress
Hmm, several failures on giant:
ubuntu@teuthology:/var/lib/teuthworker/archive/samuelj-2014-10-01_18:59:42-rados-giant-wip-testing-old-vanilla-basic-multi/524638
#11 Updated by Sage Weil almost 9 years ago
- Status changed from In Progress to Fix Under Review
#12 Updated by Sage Weil almost 9 years ago
- Status changed from Fix Under Review to Pending Backport
#13 Updated by Sage Weil almost 9 years ago
- Status changed from Pending Backport to Resolved
I believe all patches affecting firefly and dumpling have been backported.