Project

General

Profile

Actions

Bug #9356

closed

ceph_test_rados_striper_api_aio Segmentation faults

Added by David Zafman over 9 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
librados
Target version:
-
% Done:

100%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

mail thread


(gdb) run
Starting program: /home/dzafman/ceph2/src/.libs/lt-ceph_test_rados_striper_api_aio
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[New Thread 0x7ffff3a6a700 (LWP 53422)]
[New Thread 0x7ffff3269700 (LWP 53423)]
[New Thread 0x7ffff2a68700 (LWP 53424)]
[New Thread 0x7ffff2267700 (LWP 53425)]
[New Thread 0x7ffff1a66700 (LWP 53426)]
[New Thread 0x7ffff1265700 (LWP 53427)]
[New Thread 0x7ffff0a64700 (LWP 53428)]
[New Thread 0x7fffebfff700 (LWP 53429)]
[New Thread 0x7fffeb7fe700 (LWP 53430)]
[New Thread 0x7ffff7fda700 (LWP 53431)]
[New Thread 0x7ffff0263700 (LWP 53434)]
[New Thread 0x7fffeaffd700 (LWP 53435)]
[New Thread 0x7fffea7fc700 (LWP 53436)]
[ RUN      ] StriperTest.SimpleWrite
[New Thread 0x7ffff0162700 (LWP 53437)]
[New Thread 0x7fffe9ffb700 (LWP 53440)]
[       OK ] StriperTest.SimpleWrite (1220 ms)
[ RUN      ] StriperTest.WaitForSafe
[New Thread 0x7fffe9efa700 (LWP 53441)]
[New Thread 0x7fffe9df9700 (LWP 53444)]
[       OK ] StriperTest.WaitForSafe (165 ms)
[ RUN      ] StriperTest.RoundTrip
[New Thread 0x7fffe9cf8700 (LWP 53445)]
[New Thread 0x7fffe9bf7700 (LWP 53448)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffea7fc700 (LWP 53436)]
0x00007ffff59d3c9a in Mutex::Lock (this=0x68c410, no_lockdep=<optimized out>) at common/Mutex.cc:89
89        if (logger && cct && cct->_conf->mutex_perf_counter)
(gdb) bt
#0  0x00007ffff59d3c9a in Mutex::Lock (this=0x68c410, no_lockdep=<optimized out>) at common/Mutex.cc:89
#1  0x00007ffff557ac5b in libradosstriper::MultiAioCompletionImpl::safe_request (this=0x68c410, r=128) at libradosstriper/MultiAioCompletionImpl.cc:38
#2  0x00007ffff595c42d in librados::C_AioSafe::finish (this=<optimized out>, r=<optimized out>) at ./librados/AioCompletionImpl.h:199
#3  0x00007ffff5938b79 in Context::complete (this=0x7fffbc000c10, r=<optimized out>) at ./include/Context.h:64
#4  0x00007ffff59f8c20 in Finisher::finisher_thread_entry (this=0x663e70) at common/Finisher.cc:59
#5  0x00007ffff5351e9a in start_thread (arg=0x7fffea7fc700) at pthread_create.c:308
#6  0x00007ffff4b6873d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Actions #1

Updated by David Zafman over 9 years ago

Same issue with the ceph_test_rados_striper_api_io test.

$ gdb .libs/lt-ceph_test_rados_striper_api_io
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/&gt;...
Reading symbols from /src/ceph/src/.libs/lt-ceph_test_rados_striper_api_io...done.
(gdb) run
Starting program: /src/ceph/src/.libs/lt-ceph_test_rados_striper_api_io
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running main() from gtest_main.cc
[==========] Running 26 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 13 tests from StriperTest
[New Thread 0x7ffff3a62700 (LWP 56880)]
[New Thread 0x7ffff3261700 (LWP 56881)]
[New Thread 0x7ffff2a60700 (LWP 56882)]
[New Thread 0x7ffff225f700 (LWP 56883)]
[New Thread 0x7ffff1a5e700 (LWP 56884)]
[New Thread 0x7ffff125d700 (LWP 56885)]
[New Thread 0x7ffff0a5c700 (LWP 56886)]
[New Thread 0x7fffebfff700 (LWP 56887)]
[New Thread 0x7fffeb7fe700 (LWP 56888)]
[New Thread 0x7ffff7fda700 (LWP 56889)]
[New Thread 0x7ffff025b700 (LWP 56892)]
[New Thread 0x7fffeaffd700 (LWP 56893)]
[New Thread 0x7fffea7fc700 (LWP 56894)]
[ RUN ] StriperTest.SimpleWrite
[New Thread 0x7ffff015a700 (LWP 56895)]
[New Thread 0x7fffe9ffb700 (LWP 56898)]
[ OK ] StriperTest.SimpleWrite (1298 ms)
[ RUN ] StriperTest.Stat
[ OK ] StriperTest.Stat (79 ms)
[ RUN ] StriperTest.RoundTrip
[New Thread 0x7fffe9efa700 (LWP 56899)]
[New Thread 0x7fffe9df9700 (LWP 56902)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffea7fc700 (LWP 56894)]
0x00007ffff59d1d2a in Mutex::Lock (this=0x6876e0, no_lockdep=<optimized out>) at common/Mutex.cc:89
89 if (logger && cct && cct->_conf->mutex_perf_counter)
(gdb) bt
#0 0x00007ffff59d1d2a in Mutex::Lock (this=0x6876e0, no_lockdep=<optimized out>) at common/Mutex.cc:89
#1 0x00007ffff5572e8b in libradosstriper::MultiAioCompletionImpl::safe_request (this=0x6876e0, r=128) at libradosstriper/MultiAioCompletionImpl.cc:38
#2 0x00007ffff5957f6d in librados::C_AioSafe::finish (this=<optimized out>, r=<optimized out>) at ./librados/AioCompletionImpl.h:199
#3 0x00007ffff5934079 in Context::complete (this=0x7fffc4000fc0, r=<optimized out>) at ./include/Context.h:64
#4 0x00007ffff59f6cb0 in Finisher::finisher_thread_entry (this=0x66f4b0) at common/Finisher.cc:59
#5 0x00007ffff5349e9a in start_thread (arg=0x7fffea7fc700) at pthread_create.c:308
#6 0x00007ffff4b6073d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7 0x0000000000000000 in ?? ()

Actions #2

Updated by Loïc Dachary over 9 years ago

$ libtool --mode=execute valgrind --tool=memcheck ceph_test_rados_striper_api_aio
==19740== Memcheck, a memory error detector
==19740== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==19740== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==19740== Command: /home/loic/software/ceph/ceph/src/.libs/lt-ceph_test_rados_striper_api_aio
==19740== 
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1479 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (126 ms)
[ RUN      ] StriperTest.RoundTrip
==19740== Thread 14:
==19740== Invalid read of size 8
==19740==    at 0x8215E19: ??? (RadosStriperImpl.cc:397)
==19740==    by 0x5A82BC0: librados::C_AioSafe::finish(int) (AioCompletionImpl.h:199)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740==  Address 0xa5732b0 is 0 bytes inside a block of size 24 free'd
==19740==    at 0x4C2BB5C: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19740==    by 0x8215FA4: ??? (RadosStriperImpl.cc:431)
==19740==    by 0x5A82AB6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
==19740== Invalid free() / delete / delete[] / realloc()
==19740==    at 0x4C2BB5C: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19740==    by 0x8215E40: ??? (RadosStriperImpl.cc:398)
==19740==    by 0x5A82BC0: librados::C_AioSafe::finish(int) (AioCompletionImpl.h:199)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740==  Address 0xa5732b0 is 0 bytes inside a block of size 24 free'd
==19740==    at 0x4C2BB5C: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19740==    by 0x8215FA4: ??? (RadosStriperImpl.cc:431)
==19740==    by 0x5A82AB6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
[       OK ] StriperTest.RoundTrip (342 ms)
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (195 ms)
[ RUN      ] StriperTest.IsComplete
[       OK ] StriperTest.IsComplete (2163 ms)
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (2633 ms)
[ RUN      ] StriperTest.RoundTripAppend
[       OK ] StriperTest.RoundTripAppend (350 ms)
[ RUN      ] StriperTest.Flush
[       OK ] StriperTest.Flush (185 ms)
[ RUN      ] StriperTest.RoundTripWriteFull
[       OK ] StriperTest.RoundTripWriteFull (453 ms)
[----------] 9 tests from StriperTest (7944 ms total)

[----------] 9 tests from StriperTestPP
[ RUN      ] StriperTestPP.SimpleWritePP
[       OK ] StriperTestPP.SimpleWritePP (1330 ms)
[ RUN      ] StriperTestPP.WaitForSafePP
[       OK ] StriperTestPP.WaitForSafePP (121 ms)
[ RUN      ] StriperTestPP.RoundTripPP
[       OK ] StriperTestPP.RoundTripPP (184 ms)
[ RUN      ] StriperTestPP.RoundTripPP2
[       OK ] StriperTestPP.RoundTripPP2 (197 ms)
[ RUN      ] StriperTestPP.IsCompletePP
[       OK ] StriperTestPP.IsCompletePP (2636 ms)
==19740== Conditional jump or move depends on uninitialised value(s)
==19740==    at 0x84358E7: sem_post (sem_post.S:38)
==19740==    by 0x5A82AB6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
==19740== Conditional jump or move depends on uninitialised value(s)
==19740==    at 0x84358F0: sem_post (sem_post.S:46)
==19740==    by 0x5A82AB6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
==19740== Conditional jump or move depends on uninitialised value(s)
==19740==    at 0x84358F7: sem_post (sem_post.S:49)
==19740==    by 0x5A82AB6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
==19740== Syscall param futex(op) contains uninitialised byte(s)
==19740==    at 0x843590D: sem_post (sem_post.S:57)
==19740==    by 0x5A82AB6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
==19740== Conditional jump or move depends on uninitialised value(s)
==19740==    at 0x84358E7: sem_post (sem_post.S:38)
==19740==    by 0x5A82BC0: librados::C_AioSafe::finish(int) (AioCompletionImpl.h:199)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
==19740== Conditional jump or move depends on uninitialised value(s)
==19740==    at 0x84358F0: sem_post (sem_post.S:46)
==19740==    by 0x5A82BC0: librados::C_AioSafe::finish(int) (AioCompletionImpl.h:199)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
==19740== Conditional jump or move depends on uninitialised value(s)
==19740==    at 0x84358F7: sem_post (sem_post.S:49)
==19740==    by 0x5A82BC0: librados::C_AioSafe::finish(int) (AioCompletionImpl.h:199)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
==19740== Syscall param futex(op) contains uninitialised byte(s)
==19740==    at 0x843590D: sem_post (sem_post.S:57)
==19740==    by 0x5A82BC0: librados::C_AioSafe::finish(int) (AioCompletionImpl.h:199)
==19740==    by 0x5A3E0AC: Context::complete(int) (Context.h:64)
==19740==    by 0x5B4CBE2: Finisher::finisher_thread_entry() (Finisher.cc:59)
==19740==    by 0x5A6B045: Finisher::FinisherThread::entry() (Finisher.h:46)
==19740==    by 0x5B75070: Thread::entry_wrapper() (Thread.cc:60)
==19740==    by 0x5B74FED: Thread::_entry_func(void*) (Thread.cc:45)
==19740==    by 0x842F181: start_thread (pthread_create.c:312)
==19740==    by 0x8C5A30C: clone (clone.S:111)
==19740== 
[ RUN      ] StriperTestPP.IsSafePP
[       OK ] StriperTestPP.IsSafePP (4051 ms)
[ RUN      ] StriperTestPP.RoundTripAppendPP
[       OK ] StriperTestPP.RoundTripAppendPP (299 ms)
[ RUN      ] StriperTestPP.FlushPP
[       OK ] StriperTestPP.FlushPP (134 ms)
[ RUN      ] StriperTestPP.RoundTripWriteFullPP
[       OK ] StriperTestPP.RoundTripWriteFullPP (306 ms)
[----------] 9 tests from StriperTestPP (9264 ms total)

[----------] Global test environment tear-down
[==========] 18 tests from 2 test cases ran. (20872 ms total)
[  PASSED  ] 18 tests.
==19740== 
==19740== HEAP SUMMARY:
==19740==     in use at exit: 0 bytes in 0 blocks
==19740==   total heap usage: 39,551 allocs, 39,565 frees, 5,616,851 bytes allocated
==19740== 
==19740== All heap blocks were freed -- no leaks are possible
==19740== 
==19740== For counts of detected and suppressed errors, rerun with: -v
==19740== Use --track-origins=yes to see where uninitialised values come from
==19740== ERROR SUMMARY: 36 errors from 10 contexts (suppressed: 2 from 2)
Actions #3

Updated by Loïc Dachary over 9 years ago

libtool --mode=execute valgrind --tool=memcheck ceph_test_rados_striper_api_aio 
./check_version ./.git_version
./.git_version is up to date.
==13552== Memcheck, a memory error detector
==13552== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==13552== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==13552== Command: /home/loic/software/ceph/ceph/src/.libs/lt-ceph_test_rados_striper_api_aio
==13552== 
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1441 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (154 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (337 ms)
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (204 ms)
[ RUN      ] StriperTest.IsComplete
[       OK ] StriperTest.IsComplete (2649 ms)
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (3134 ms)
[ RUN      ] StriperTest.RoundTripAppend
[       OK ] StriperTest.RoundTripAppend (335 ms)
[ RUN      ] StriperTest.Flush
[       OK ] StriperTest.Flush (194 ms)
[ RUN      ] StriperTest.RoundTripWriteFull
[       OK ] StriperTest.RoundTripWriteFull (457 ms)
[----------] 9 tests from StriperTest (8923 ms total)

[----------] 9 tests from StriperTestPP
[ RUN      ] StriperTestPP.SimpleWritePP
[       OK ] StriperTestPP.SimpleWritePP (1436 ms)
[ RUN      ] StriperTestPP.WaitForSafePP
[       OK ] StriperTestPP.WaitForSafePP (114 ms)
[ RUN      ] StriperTestPP.RoundTripPP
[       OK ] StriperTestPP.RoundTripPP (184 ms)
[ RUN      ] StriperTestPP.RoundTripPP2
[       OK ] StriperTestPP.RoundTripPP2 (233 ms)
[ RUN      ] StriperTestPP.IsCompletePP
[       OK ] StriperTestPP.IsCompletePP (4075 ms)
[ RUN      ] StriperTestPP.IsSafePP
[       OK ] StriperTestPP.IsSafePP (1556 ms)
[ RUN      ] StriperTestPP.RoundTripAppendPP
[       OK ] StriperTestPP.RoundTripAppendPP (348 ms)
[ RUN      ] StriperTestPP.FlushPP
[       OK ] StriperTestPP.FlushPP (178 ms)
[ RUN      ] StriperTestPP.RoundTripWriteFullPP
[       OK ] StriperTestPP.RoundTripWriteFullPP (426 ms)
[----------] 9 tests from StriperTestPP (8561 ms total)

[----------] Global test environment tear-down
[==========] 18 tests from 2 test cases ran. (21293 ms total)
[  PASSED  ] 18 tests.
==13552== 
==13552== HEAP SUMMARY:
==13552==     in use at exit: 0 bytes in 0 blocks
==13552==   total heap usage: 39,758 allocs, 39,758 frees, 5,669,950 bytes allocated
==13552== 
==13552== All heap blocks were freed -- no leaks are possible
==13552== 
==13552== For counts of detected and suppressed errors, rerun with: -v
==13552== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
Actions #4

Updated by Loïc Dachary over 9 years ago

  • Category set to librados
  • Status changed from New to Fix Under Review
  • Assignee set to Loïc Dachary
Actions #5

Updated by Loïc Dachary over 9 years ago

Something else is still crashing the test but not with valgrind. A race condition of some kind ?

loic@fold:~/software/ceph/ceph/src$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1697 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (142 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (191 ms)
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (188 ms)
[ RUN      ] StriperTest.IsComplete
sStriperTest.IsComplete (162 ms)
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (1631 ms)
[ RUN      ] StriperTest.RoundTripAppend
[       OK ] StriperTest.RoundTripAppend (154 ms)
[ RUN      ] StriperTest.Flush
[       OK ] StriperTest.Flush (70 ms)
[ RUN      ] StriperTest.RoundTripWriteFull
[       OK ] StriperTest.RoundTripWriteFull (215 ms)
[----------] 9 tests from StriperTest (4454 ms total)

[----------] 9 tests from StriperTestPP
[ RUN      ] StriperTestPP.SimpleWritePP
[       OK ] StriperTestPP.SimpleWritePP (1312 ms)
[ RUN      ] StriperTestPP.WaitForSafePP
[       OK ] StriperTestPP.WaitForSafePP (37 ms)
[ RUN      ] StriperTestPP.RoundTripPP
Segmentation fault (core dumped)
loic@fold:~/software/ceph/ceph/src$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1428 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (48 ms)
[ RUN      ] StriperTest.RoundTrip
Segmentation fault (core dumped)
loic@fold:~/software/ceph/ceph/src$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1298 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (53 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (78 ms)
[ RUN      ] StriperTest.RoundTrip2
sStriperTest.RoundTrip2 (110 ms)
[ RUN      ] StriperTest.IsComplete
StriperTest.IsComplete (66 ms)
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (74 ms)
[ RUN      ] StriperTest.RoundTripAppend
[       OK ] StriperTest.RoundTripAppend (144 ms)
[ RUN      ] StriperTest.Flush
[       OK ] StriperTest.Flush (100 ms)
[ RUN      ] StriperTest.RoundTripWriteFull
[       OK ] StriperTest.RoundTripWriteFull (311 ms)
[----------] 9 tests from StriperTest (2237 ms total)

[----------] 9 tests from StriperTestPP
[ RUN      ] StriperTestPP.SimpleWritePP
[       OK ] StriperTestPP.SimpleWritePP (1313 ms)
[ RUN      ] StriperTestPP.WaitForSafePP
Segmentation fault (core dumped)
loic@fold:~/software/ceph/ceph/src$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1296 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (59 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (62 ms)
[ RUN      ] StriperTest.RoundTrip2
Segmentation fault (core dumped)
loic@fold:~/software/ceph/ceph/src$ 

Actions #6

Updated by Loïc Dachary over 9 years ago

  • Priority changed from Normal to High

It would probably be a good thing to add these tests to make check so they are caught by the gitbuilder ?

Actions #7

Updated by Sebastien Ponce over 9 years ago

I'm having a look at it. From what I can see so far, we are somehow reusing a deallocated object.

Actions #8

Updated by Sebastien Ponce over 9 years ago

There is indeed a missing lock in RadosStriperImpl, that should serialize rados_req_read_safe and rados_req_read_complete to avoid race condition. Right now, if these 2 methods are called concurrently on the same request, you may get the crash you experienced.
Loic, thanks a lot for having posted your valgrind findinds. I could not make it fail myself so I would not have been able to understand it without your post.
I'll provide a fix in the next hours.

Actions #9

Updated by Sebastien Ponce over 9 years ago

I've just submitted a fix via pull request 2419. I copy here the comment of the pool request :
"This definitely fixes an existing race condition involved in issue 9356.
But as I could not manage to exercise this race condition in my environment, I cannot 100% guarantee that it completely fixes issue 9356"

So if you manage to reproduce it, please test this new code and keep me updated

Actions #10

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to In Progress
  • Assignee changed from Loïc Dachary to Sage Weil

It improves the situation but there still are issues. I believe you can reproduce this on a ubuntu 14.04. Here are traces of executions running your branch:

$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1375 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (71 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (82 ms)
*** Error in `/home/loic/software/ceph/ceph/src/.libs/lt-ceph_test_rados_striper_api_aio': free(): invalid pointer: 0x000000000047200b ***
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (83 ms)
[ RUN      ] StriperTest.IsComplete
[       OK ] StriperTest.IsComplete (103 ms)
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (57 ms)
Segmentation fault (core dumped)
loic@fold:~/software/ceph/ceph/src$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1317 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (56 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (70 ms)
Segmentation fault (core dumped)

and
$ make ceph_test_rados_striper_api_aio && libtool --mode=execute valgrind --tool=memcheck ceph_test_rados_striper_api_aio 
./check_version ./.git_version
./.git_version is up to date.
==31725== Memcheck, a memory error detector
==31725== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==31725== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==31725== Command: /home/loic/software/ceph/ceph/src/.libs/lt-ceph_test_rados_striper_api_aio
==31725== 
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (1440 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (141 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (327 ms)
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (165 ms)
[ RUN      ] StriperTest.IsComplete
[       OK ] StriperTest.IsComplete (7246 ms)
==31725== Thread 14:
==31725== Invalid read of size 4
==31725==    at 0x84358E0: sem_post (sem_post.S:33)
==31725==    by 0x5A82AA6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==31725==    by 0x5A3E09C: Context::complete(int) (Context.h:64)
==31725==    by 0x5B4CCA0: Finisher::finisher_thread_entry() (Finisher.cc:59)
==31725==    by 0x5A6B035: Finisher::FinisherThread::entry() (Finisher.h:46)
==31725==    by 0x5B7512E: Thread::entry_wrapper() (Thread.cc:60)
==31725==    by 0x5B750AB: Thread::_entry_func(void*) (Thread.cc:45)
==31725==    by 0x842F181: start_thread (pthread_create.c:312)
==31725==    by 0x8C5A30C: clone (clone.S:111)
==31725==  Address 0xffefffcc0 is not stack'd, malloc'd or (recently) free'd
==31725== 
==31725== Invalid read of size 4
==31725==    at 0x84358EC: sem_post (sem_post.S:40)
==31725==    by 0x5A82AA6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==31725==    by 0x5A3E09C: Context::complete(int) (Context.h:64)
==31725==    by 0x5B4CCA0: Finisher::finisher_thread_entry() (Finisher.cc:59)
==31725==    by 0x5A6B035: Finisher::FinisherThread::entry() (Finisher.h:46)
==31725==    by 0x5B7512E: Thread::entry_wrapper() (Thread.cc:60)
==31725==    by 0x5B750AB: Thread::_entry_func(void*) (Thread.cc:45)
==31725==    by 0x842F181: start_thread (pthread_create.c:312)
==31725==    by 0x8C5A30C: clone (clone.S:111)
==31725==  Address 0xffefffcc0 is not stack'd, malloc'd or (recently) free'd
==31725== 
==31725== Invalid read of size 8
==31725==    at 0x84358F2: sem_post (sem_post.S:48)
==31725==    by 0x5A82AA6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==31725==    by 0x5A3E09C: Context::complete(int) (Context.h:64)
==31725==    by 0x5B4CCA0: Finisher::finisher_thread_entry() (Finisher.cc:59)
==31725==    by 0x5A6B035: Finisher::FinisherThread::entry() (Finisher.h:46)
==31725==    by 0x5B7512E: Thread::entry_wrapper() (Thread.cc:60)
==31725==    by 0x5B750AB: Thread::_entry_func(void*) (Thread.cc:45)
==31725==    by 0x842F181: start_thread (pthread_create.c:312)
==31725==    by 0x8C5A30C: clone (clone.S:111)
==31725==  Address 0xffefffcc8 is not stack'd, malloc'd or (recently) free'd
==31725== 
==31725== Invalid read of size 4
==31725==    at 0x8435903: sem_post (sem_post.S:53)
==31725==    by 0x5A82AA6: librados::C_AioComplete::finish(int) (AioCompletionImpl.h:180)
==31725==    by 0x5A3E09C: Context::complete(int) (Context.h:64)
==31725==    by 0x5B4CCA0: Finisher::finisher_thread_entry() (Finisher.cc:59)
==31725==    by 0x5A6B035: Finisher::FinisherThread::entry() (Finisher.h:46)
==31725==    by 0x5B7512E: Thread::entry_wrapper() (Thread.cc:60)
==31725==    by 0x5B750AB: Thread::_entry_func(void*) (Thread.cc:45)
==31725==    by 0x842F181: start_thread (pthread_create.c:312)
==31725==    by 0x8C5A30C: clone (clone.S:111)
==31725==  Address 0xffefffcc4 is not stack'd, malloc'd or (recently) free'd
==31725== 
==31725== Invalid read of size 8
==31725==    at 0x84358F2: sem_post (sem_post.S:48)
==31725==    by 0x5A82BB0: librados::C_AioSafe::finish(int) (AioCompletionImpl.h:199)
==31725==    by 0x5A3E09C: Context::complete(int) (Context.h:64)
==31725==    by 0x5B4CCA0: Finisher::finisher_thread_entry() (Finisher.cc:59)
==31725==    by 0x5A6B035: Finisher::FinisherThread::entry() (Finisher.h:46)
==31725==    by 0x5B7512E: Thread::entry_wrapper() (Thread.cc:60)
==31725==    by 0x5B750AB: Thread::_entry_func(void*) (Thread.cc:45)
==31725==    by 0x842F181: start_thread (pthread_create.c:312)
==31725==    by 0x8C5A30C: clone (clone.S:111)
==31725==  Address 0xffefffcc8 is not stack'd, malloc'd or (recently) free'd
==31725== 
==31725== Invalid read of size 4
==31725==    at 0x8435903: sem_post (sem_post.S:53)
==31725==    by 0x5A82BB0: librados::C_AioSafe::finish(int) (AioCompletionImpl.h:199)
==31725==    by 0x5A3E09C: Context::complete(int) (Context.h:64)
==31725==    by 0x5B4CCA0: Finisher::finisher_thread_entry() (Finisher.cc:59)
==31725==    by 0x5A6B035: Finisher::FinisherThread::entry() (Finisher.h:46)
==31725==    by 0x5B7512E: Thread::entry_wrapper() (Thread.cc:60)
==31725==    by 0x5B750AB: Thread::_entry_func(void*) (Thread.cc:45)
==31725==    by 0x842F181: start_thread (pthread_create.c:312)
==31725==    by 0x8C5A30C: clone (clone.S:111)
==31725==  Address 0xffefffcc4 is not stack'd, malloc'd or (recently) free'd
==31725== 
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (5181 ms)
[ RUN      ] StriperTest.RoundTripAppend
[       OK ] StriperTest.RoundTripAppend (320 ms)
[ RUN      ] StriperTest.Flush
[       OK ] StriperTest.Flush (197 ms)
[ RUN      ] StriperTest.RoundTripWriteFull
[       OK ] StriperTest.RoundTripWriteFull (459 ms)
[----------] 9 tests from StriperTest (15494 ms total)

[----------] 9 tests from StriperTestPP
[ RUN      ] StriperTestPP.SimpleWritePP
[       OK ] StriperTestPP.SimpleWritePP (1349 ms)
[ RUN      ] StriperTestPP.WaitForSafePP
[       OK ] StriperTestPP.WaitForSafePP (107 ms)
[ RUN      ] StriperTestPP.RoundTripPP
[       OK ] StriperTestPP.RoundTripPP (209 ms)
[ RUN      ] StriperTestPP.RoundTripPP2
[       OK ] StriperTestPP.RoundTripPP2 (191 ms)
[ RUN      ] StriperTestPP.IsCompletePP
[       OK ] StriperTestPP.IsCompletePP (3890 ms)
[ RUN      ] StriperTestPP.IsSafePP
[       OK ] StriperTestPP.IsSafePP (6817 ms)
[ RUN      ] StriperTestPP.RoundTripAppendPP
[       OK ] StriperTestPP.RoundTripAppendPP (286 ms)
[ RUN      ] StriperTestPP.FlushPP
[       OK ] StriperTestPP.FlushPP (194 ms)
[ RUN      ] StriperTestPP.RoundTripWriteFullPP
[       OK ] StriperTestPP.RoundTripWriteFullPP (405 ms)
[----------] 9 tests from StriperTestPP (13450 ms total)

[----------] Global test environment tear-down
[==========] 18 tests from 2 test cases ran. (32744 ms total)
[  PASSED  ] 18 tests.
==31725== 
==31725== HEAP SUMMARY:
==31725==     in use at exit: 0 bytes in 0 blocks
==31725==   total heap usage: 39,781 allocs, 39,781 frees, 5,670,408 bytes allocated
==31725== 
==31725== All heap blocks were freed -- no leaks are possible
==31725== 
==31725== For counts of detected and suppressed errors, rerun with: -v
==31725== ERROR SUMMARY: 7 errors from 6 contexts (suppressed: 2 from 2)

thanks for looking into this !

Actions #11

Updated by Sebastien Ponce over 9 years ago

I've now installed an ubuntu 14.04 but could still not make it fail.
Even valgrind has a perfectly clean output.
I believe this is a race condition and I do not have the right machine to trigger it (too few cores ? I have a virtual 2 core client). Or maybe the problem is triggered by server side behavior.

Would you manage to reproduce with gdb and send me a backtrace of all threads ?

Actions #12

Updated by Sage Weil over 9 years ago

  • Assignee changed from Sage Weil to Loïc Dachary
Actions #13

Updated by Loïc Dachary over 9 years ago

Using the same source tree with the same kernel but inside an ubuntu 14.04 docker container, I was not able to reproduce the problem (i.e. valgrind is clean).

Actions #14

Updated by Loïc Dachary over 9 years ago

I reproduced the above valgrind output a few minutes ago on my development laptop. After upgrading from

ii  valgrind                 1:3.9.0-4ubuntu4  amd64

to
ii  valgrind                 1:3.10~20140411-0 amd64

to match the version from the linux container, valgrind no longer complains. The version of libaio are the same in the container and my laptop
ii  libaio-dev               0.3.109-4         amd64
ii  libaio1:amd64            0.3.109-4         amd64

Actions #15

Updated by Loïc Dachary over 9 years ago

Running the test in the container still fails.

$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (4643 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (362 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (2372 ms)
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (740 ms)
[ RUN      ] StriperTest.IsComplete
[       OK ] StriperTest.IsComplete (439 ms)
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (772 ms)
[ RUN      ] StriperTest.RoundTripAppend
[       OK ] StriperTest.RoundTripAppend (876 ms)
[ RUN      ] StriperTest.Flush
[       OK ] StriperTest.Flush (514 ms)
[ RUN      ] StriperTest.RoundTripWriteFull
[       OK ] StriperTest.RoundTripWriteFull (1095 ms)
[----------] 9 tests from StriperTest (11813 ms total)

[----------] 9 tests from StriperTestPP
[ RUN      ] StriperTestPP.SimpleWritePP
[       OK ] StriperTestPP.SimpleWritePP (5364 ms)
[ RUN      ] StriperTestPP.WaitForSafePP
[       OK ] StriperTestPP.WaitForSafePP (328 ms)
[ RUN      ] StriperTestPP.RoundTripPP
Illegal instruction (core dumped)
loic@0a2ad46b4fb5:~/software/ceph/ceph/src$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (3825 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (572 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (705 ms)
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (474 ms)
[ RUN      ] StriperTest.IsComplete
[       OK ] StriperTest.IsComplete (454 ms)
[ RUN      ] StriperTest.IsSafe
$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (4643 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (362 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (2372 ms)
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (740 ms)
[ RUN      ] StriperTest.IsComplete
[       OK ] StriperTest.IsComplete (439 ms)
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (772 ms)
[ RUN      ] StriperTest.RoundTripAppend
[       OK ] StriperTest.RoundTripAppend (876 ms)
[ RUN      ] StriperTest.Flush
[       OK ] StriperTest.Flush (514 ms)
[ RUN      ] StriperTest.RoundTripWriteFull
[       OK ] StriperTest.RoundTripWriteFull (1095 ms)
[----------] 9 tests from StriperTest (11813 ms total)

[----------] 9 tests from StriperTestPP
[ RUN      ] StriperTestPP.SimpleWritePP
[       OK ] StriperTestPP.SimpleWritePP (5364 ms)
[ RUN      ] StriperTestPP.WaitForSafePP
[       OK ] StriperTestPP.WaitForSafePP (328 ms)
[ RUN      ] StriperTestPP.RoundTripPP
Illegal instruction (core dumped)
loic@0a2ad46b4fb5:~/software/ceph/ceph/src$ ./ceph_test_rados_striper_api_aio
Running main() from gtest_main.cc
[==========] Running 18 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 9 tests from StriperTest
[ RUN      ] StriperTest.SimpleWrite
[       OK ] StriperTest.SimpleWrite (3825 ms)
[ RUN      ] StriperTest.WaitForSafe
[       OK ] StriperTest.WaitForSafe (572 ms)
[ RUN      ] StriperTest.RoundTrip
[       OK ] StriperTest.RoundTrip (705 ms)
[ RUN      ] StriperTest.RoundTrip2
[       OK ] StriperTest.RoundTrip2 (474 ms)
[ RUN      ] StriperTest.IsComplete
[       OK ] StriperTest.IsComplete (454 ms)
[ RUN      ] StriperTest.IsSafe
[       OK ] StriperTest.IsSafe (515 ms)
[ RUN      ] StriperTest.RoundTripAppend
[       OK ] StriperTest.RoundTripAppend (800 ms)
[ RUN      ] StriperTest.Flush
[       OK ] StriperTest.Flush (507 ms)
[ RUN      ] StriperTest.RoundTripWriteFull
[       OK ] StriperTest.RoundTripWriteFull (1461 ms)
[----------] 9 tests from StriperTest (9313 ms total)

[----------] 9 tests from StriperTestPP
[ RUN      ] StriperTestPP.SimpleWritePP
[       OK ] StriperTestPP.SimpleWritePP (4206 ms)
[ RUN      ] StriperTestPP.WaitForSafePP
[       OK ] StriperTestPP.WaitForSafePP (910 ms)
[ RUN      ] StriperTestPP.RoundTripPP
[       OK ] StriperTestPP.RoundTripPP (664 ms)
[ RUN      ] StriperTestPP.RoundTripPP2
Segmentation fault (core dumped)

Actions #16

Updated by Loïc Dachary over 9 years ago

To make sure this is not environmental problem I clone a clean copy from your branch and removed .ccache entirely.

Actions #17

Updated by Loïc Dachary over 9 years ago

Working in the container

Reading symbols from /home/loic/software/ceph/ceph-sponce/src/.libs/lt-ceph_test_rados_striper_api_aio...done.
(gdb) run
Starting program: /home/loic/software/ceph/ceph-sponce/src/.libs/lt-ceph_test_rados_striper_api_aio
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
No stack.
(gdb) info thread
No threads.
(gdb) run
Starting program: /home/loic/software/ceph/ceph-sponce/src/.libs/lt-ceph_test_rados_striper_api_aio
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
No stack.
(gdb) info thread
No threads.
(gdb) 

this is weird.

Actions #18

Updated by Loïc Dachary over 9 years ago

Using

./autogen.sh ; ./configure

the ./ceph_test_rados_striper_api_aio does not core dump. There is a combination of ./configure flags that creates the problem.

Actions #19

Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to 4
Actions #20

Updated by Loïc Dachary over 9 years ago

  • Priority changed from High to Normal

Demoting to Normal because it only happens in debug mode.

Actions #21

Updated by Sebastien Ponce over 9 years ago

I've finally managed to reproduce it, thanks to Loic : the trick was Ubuntu + debug mode. Maybe you also need more than 2 cores. Not sure about that.

Actions #22

Updated by Sebastien Ponce over 9 years ago

Damn... I was a bit too fast when I thought I was reproducing the issue !
I was indeed reproducing the original one, which I could not reproduce so far. But when I apply the fix I provided, then everything is fine, and valgrind is clean (I must say that I've got 1:3.10~20140411-0 that Loic confirmed to not expose any issue.

Actions #23

Updated by Loïc Dachary over 9 years ago

I'm able to reproduce the problem with 0daddfbf1164d6ba3f38eee29d2f11acfa62f2b6 from your tree https://github.com/sponce/ceph.git which is the head of wip-9356

Actions #24

Updated by Loïc Dachary over 9 years ago

  • Description updated (diff)
Actions #25

Updated by Sebastien Ponce over 9 years ago

Finally I can reproduce it ! I know, I've already said that and was wrong...
Actually I still don't manage with the raw code, but adding a sleep(1) in the right place of libradosstriper::RadosStriperImpl::internal_aio_write makes it happen all the time :-)

Program received signal SIGSEGV, Segmentation fault.
0x00000000006bd270 in ?? ()
(gdb) bt
#0  0x00000000006bd270 in ?? ()
#1  0x00007ffff4821fa0 in libradosstriper::RadosStriperImpl::WriteCompletionData::safe (this=0x6b9970, r=0) at libradosstriper/RadosStriperImpl.cc:177
#2  0x00007ffff4824bd8 in striper_write_aio_req_safe (c=0x6c35e0, arg=0x6b9970) at libradosstriper/RadosStriperImpl.cc:696
#3  0x00007ffff482abff in libradosstriper::MultiAioCompletionImpl::safe (this=0x6c35e0) at ./libradosstriper/MultiAioCompletionImpl.h:169
#4  0x00007ffff482aa20 in libradosstriper::MultiAioCompletionImpl::finish_adding_requests (this=0x6c35e0) at libradosstriper/MultiAioCompletionImpl.cc:59
#5  0x00007ffff48250ae in libradosstriper::RadosStriperImpl::internal_aio_write (this=0x6b3120, soid=..., c=0x6c35e0, bl=..., len=128, off=0, layout=...)
    at libradosstriper/RadosStriperImpl.cc:769
#6  0x00007ffff4824d0b in libradosstriper::RadosStriperImpl::aio_write_in_open_object (this=0x6b3120, soid=..., c=0x6b9750, layout=..., lockCookie=..., 
    bl=..., len=128, off=0) at libradosstriper/RadosStriperImpl.cc:718
#7  0x00007ffff4822d34 in libradosstriper::RadosStriperImpl::aio_write (this=0x6b3120, soid=..., c=0x6b9750, bl=..., len=128, off=0)
    at libradosstriper/RadosStriperImpl.cc:350
#8  0x00007ffff481fadb in rados_striper_aio_write (striper=0x6b3120, soid=0x46f524 "StriperTest", completion=0x6b9750, 
    buf=0x7fffffffe360 '\314' <repeats 128 times>, "\360\343\377\377\377\177", len=128, off=0) at libradosstriper/libradosstriper.cc:571
#9  0x000000000043d70b in StriperTest_SimpleWrite_Test::TestBody (this=0x6b2eb0) at test/libradosstriper/aio.cc:52
#10 0x000000000044e1c2 in testing::Test::Run (this=0x6b2eb0) at ./src/gtest.cc:2095
#11 0x000000000044e72c in testing::internal::TestInfoImpl::Run (this=0x68fc00) at ./src/gtest.cc:2314
#12 0x000000000044ec35 in testing::TestCase::Run (this=0x690050) at ./src/gtest.cc:2420
#13 0x0000000000453136 in testing::internal::UnitTestImpl::RunAllTests (this=0x68fd60) at ./src/gtest.cc:4024
#14 0x00000000004520ec in testing::UnitTest::Run (this=0x68dca0 <testing::UnitTest::GetInstance()::instance>) at ./src/gtest.cc:3687
#15 0x000000000046c5ab in main (argc=1, argv=0x7fffffffe658) at src/gtest_main.cc:38

And valgrind :

==9297== Invalid read of size 8
==9297==    at 0x81F6F77: libradosstriper::RadosStriperImpl::WriteCompletionData::safe(int) (RadosStriperImpl.cc:177)
==9297==    by 0x81F9BD7: ??? (RadosStriperImpl.cc:696)
==9297==    by 0x81FFBFE: libradosstriper::MultiAioCompletionImpl::safe() (MultiAioCompletionImpl.h:169)
==9297==    by 0x81FFA1F: libradosstriper::MultiAioCompletionImpl::finish_adding_requests() (MultiAioCompletionImpl.cc:59)
==9297==    by 0x81FA0AD: libradosstriper::RadosStriperImpl::internal_aio_write(std::string const&, libradosstriper::MultiAioCompletionImpl*, ceph::buffer::list
 const&, unsigned long, unsigned long, ceph_file_layout const&) (RadosStriperImpl.cc:769)
==9297==    by 0x81F9D0A: libradosstriper::RadosStriperImpl::aio_write_in_open_object(std::string const&, librados::AioCompletionImpl*, ceph_file_layout const&,
 std::string const&, ceph::buffer::list const&, unsigned long, unsigned long) (RadosStriperImpl.cc:718)
==9297==    by 0x81F7D33: libradosstriper::RadosStriperImpl::aio_write(std::string const&, librados::AioCompletionImpl*, ceph::buffer::list const&, unsigned lon
g, unsigned long) (RadosStriperImpl.cc:350)
==9297==    by 0x81F4ADA: rados_striper_aio_write (libradosstriper.cc:571)
==9297==    by 0x43D70A: StriperTest_SimpleWrite_Test::TestBody() (aio.cc:52)
==9297==    by 0x44E1C1: testing::Test::Run() (gtest.cc:2095)
==9297==    by 0x44E72B: testing::internal::TestInfoImpl::Run() (gtest.cc:2314)
==9297==    by 0x44EC34: testing::TestCase::Run() (gtest.cc:2420)
==9297==    by 0x453135: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:4024)
==9297==    by 0x4520EB: testing::UnitTest::Run() (gtest.cc:3687)
==9297==    by 0x46C5AA: main (gtest_main.cc:38)
==9297==  Address 0xabc2fe8 is 40 bytes inside a block of size 48 free'd
==9297==    at 0x4C2C2BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==9297==    by 0x81F6F61: libradosstriper::RadosStriperImpl::WriteCompletionData::~WriteCompletionData() (RadosStriperImpl.cc:174)
==9297==    by 0x81F9B86: ??? (RadosStriperImpl.cc:685)
==9297==    by 0x81FFB70: libradosstriper::MultiAioCompletionImpl::complete() (MultiAioCompletionImpl.h:161)
==9297==    by 0x81FFA05: libradosstriper::MultiAioCompletionImpl::finish_adding_requests() (MultiAioCompletionImpl.cc:57)
==9297==    by 0x81FA0AD: libradosstriper::RadosStriperImpl::internal_aio_write(std::string const&, libradosstriper::MultiAioCompletionImpl*, ceph::buffer::list
 const&, unsigned long, unsigned long, ceph_file_layout const&) (RadosStriperImpl.cc:769)
==9297==    by 0x81F9D0A: libradosstriper::RadosStriperImpl::aio_write_in_open_object(std::string const&, librados::AioCompletionImpl*, ceph_file_layout const&,
 std::string const&, ceph::buffer::list const&, unsigned long, unsigned long) (RadosStriperImpl.cc:718)
==9297==    by 0x81F7D33: libradosstriper::RadosStriperImpl::aio_write(std::string const&, librados::AioCompletionImpl*, ceph::buffer::list const&, unsigned lon
g, unsigned long) (RadosStriperImpl.cc:350)
==9297==    by 0x81F4ADA: rados_striper_aio_write (libradosstriper.cc:571)
==9297==    by 0x43D70A: StriperTest_SimpleWrite_Test::TestBody() (aio.cc:52)
==9297==    by 0x44E1C1: testing::Test::Run() (gtest.cc:2095)
==9297==    by 0x44E72B: testing::internal::TestInfoImpl::Run() (gtest.cc:2314)
==9297==    by 0x44EC34: testing::TestCase::Run() (gtest.cc:2420)
==9297==    by 0x453135: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:4024)
==9297==    by 0x4520EB: testing::UnitTest::Run() (gtest.cc:3687)
==9297==    by 0x46C5AA: main (gtest_main.cc:38)

So clearly a double delete, and it should now be easy to fix it.

Actions #26

Updated by Sebastien Ponce over 9 years ago

I've just updated pull request 2419 with a more complete fix for the issue.
I was now able to reproduce 100% when my sleep(1) was in place, and these fixes make it disappear completely.

Actions #27

Updated by Loïc Dachary over 9 years ago

  • Status changed from 4 to Resolved
  • % Done changed from 0 to 100
Actions #28

Updated by Kefu Chai about 6 years ago

  • Status changed from Resolved to New
  • Assignee deleted (Loïc Dachary)

i am able to reproduce this issue. at least 1 out of 3 times will trigger a segfault.

(gdb) bt
#0  __new_sem_post (sem=0x7ffff7ff4000) at sem_post.c:36
#1  0x00005555555d5779 in set_completion_complete (cb=0x555555a473f0, arg=0x7fffffffdaa0) at /var/ceph/ceph/src/test/libradosstriper/aio.cc:39
#2  0x00007ffff7a674f2 in librados::C_AioComplete::finish (this=0x555555a3b140, r=0) at /var/ceph/ceph/src/librados/AioCompletionImpl.h:169
#3  0x00007ffff7a60797 in Context::complete (this=0x555555a3b140, r=0) at /var/ceph/ceph/src/include/Context.h:77
#4  0x00007fffedea9425 in Finisher::finisher_thread_entry (this=0x5555559451c8) at /var/ceph/ceph/src/common/Finisher.cc:68
#5  0x00007ffff7a86b1a in Finisher::FinisherThread::entry (this=0x555555945308) at /var/ceph/ceph/src/common/Finisher.h:60
#6  0x00007fffee29b333 in Thread::entry_wrapper (this=0x555555945308) at /var/ceph/ceph/src/common/Thread.cc:79
#7  0x00007fffee29b268 in Thread::_entry_func (arg=0x555555945308) at /var/ceph/ceph/src/common/Thread.cc:59
#8  0x00007ffff761851a in start_thread (arg=0x7fffc97fa700) at pthread_create.c:465
#9  0x00007fffea5c43ef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) p *test
$2 = {m_sem = 0x7fffffffdad7, m_complete = 176, m_safe = 219}

Actions #29

Updated by Sage Weil almost 3 years ago

  • Status changed from New to Closed
Actions

Also available in: Atom PDF