Bug #17561
transient jerasure unit test failures
Status: Closed
Description
https://github.com/ceph/ceph/pull/11426 hit this failure, with the following output:
79/146 Test #86: unittest_erasure_code_plugin_jerasure ...***Exception: SegFault 0.10 sec
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ErasureCodePlugin
[ RUN      ] ErasureCodePlugin.factory
[       OK ] ErasureCodePlugin.factory (3 ms)
[----------] 1 test from ErasureCodePlugin (3 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3 ms total)
[  PASSED  ] 1 test.
2016-10-11 12:42:37.598277 7f5762be7d40 -1 did not load config file, using default settings.
load: jerasure
*** Caught signal (Segmentation fault) **
 in thread 7f5762be7d40 thread_name:unittest_erasur
An attempt to reproduce it in master as of 61310d41f307ac07ff81f19ec1926434a0ced713 did not reproduce the problem; it may be necessary to run the test in a loop.
loic@63fd9120625d:~/ceph--loic/build$ ctest -R unittest_erasure_code_plugin_jerasure
Test project /home/loic/ceph--loic/build
    Start 86: unittest_erasure_code_plugin_jerasure
1/1 Test #86: unittest_erasure_code_plugin_jerasure ...   Passed    0.01 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.03 sec
loic@63fd9120625d:~/ceph--loic/build$ git rev-parse HEAD
61310d41f307ac07ff81f19ec1926434a0ced713
loic@63fd9120625d:~/ceph--loic/build$
Updated by Loïc Dachary over 7 years ago
Ran ctest -R unittest_erasure_code_plugin_jerasure a few thousand times in a loop and did not get an error.
Updated by Loïc Dachary over 7 years ago
Attempt to reproduce: https://github.com/ceph/ceph/pull/11691
Updated by Loïc Dachary over 7 years ago
Never happened before https://github.com/ceph/ceph/pull/11086 was merged
Updated by Loïc Dachary over 7 years ago
src/test/erasure-code/CMakeLists.txt
target_link_libraries(unittest_erasure_code_plugin_jerasure global osd ec_jerasure common)
add_dependencies(unittest_erasure_code_plugin_jerasure ec_jerasure)
Since jerasure is dynamically loaded at runtime, there is no need to link the test with ec_jerasure.
Updated by Loïc Dachary over 7 years ago
run-make-check.sh on CentOS7 in a loop to generate load
test/docker-test.sh --os-type ubuntu --os-version 14.04 --shell ./run-make-check.sh
cd build
ulimit -c unlimited
rm -f src/test/erasure-code/core.*
export PATH=/tmp/ceph-disk-virtualenv/bin:$(pwd)/bin:$PATH
while : ; do echo -n . ; ctest -R unittest_erasure_code_plugin_jerasure >& out || break ; done
gdb bin/unittest_erasure_code_plugin_jerasure src/test/erasure-code/core.*
set height 0
thread apply all bt
Updated by Loïc Dachary over 7 years ago
----------------------------------------------------------
89/150 Testing: unittest_erasure_code_plugin_jerasure
89/150 Test: unittest_erasure_code_plugin_jerasure
Command: "/home/loic/ceph-ubuntu-14.04-loic/build/bin/unittest_erasure_code_plugin_jerasure"
Directory: /home/loic/ceph-ubuntu-14.04-loic/build/src/test/erasure-code
"unittest_erasure_code_plugin_jerasure" start time: Nov 01 00:44 UTC
Output:
----------------------------------------------------------
2016-11-01 00:44:58.786912 7fd549f81d40 -1 did not load config file, using default settings.
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ErasureCodePlugin
[ RUN      ] ErasureCodePlugin.factory
[       OK ] ErasureCodePlugin.factory (5 ms)
[----------] 1 test from ErasureCodePlugin (5 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (5 ms total)
[  PASSED  ] 1 test.
load: jerasure
*** Caught signal (Segmentation fault) **
 in thread 7fd549f81d40 thread_name:
<end of output>
Test time = 0.13 sec
----------------------------------------------------------
Test Failed.
"unittest_erasure_code_plugin_jerasure" end time: Nov 01 00:44 UTC
"unittest_erasure_code_plugin_jerasure" time elapsed: 00:00:00
----------------------------------------------------------
End testing: Nov 01 00:44 UTC
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/loic/ceph-ubuntu-14.04-loic/build/bin/unittest_erasure_code_plugin_jerasu'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd5496cc1fb in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
37      ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
(gdb) bt
#0  0x00007fd5496cc1fb in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007fd549c1eb52 in reraise_fatal (signum=11) at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=11) at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x00007fd549c283c1 in hint_size (this=0x7fd553bc30a0) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Entry.h:65
#5  ceph::logging::Log::_flush (this=this@entry=0x7fd553bc6000, t=t@entry=0x7ffc8fc01ea0, requeue=requeue@entry=0x7fd553bc6110, crash=crash@entry=false) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:305
#6  0x00007fd549c28a39 in ceph::logging::Log::flush (this=0x7fd553bc6000) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:282
#7  0x00007fd549c28c21 in ceph::logging::log_on_exit (p=0x7fd553b96048) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:45
#8  0x00007fd549c29a16 in OnExitManager::~OnExitManager (this=0x7fd549fa6a00 <ceph::logging::exit_callbacks>, __in_chrg=<optimized out>) at /home/loic/ceph-ubuntu-14.04-loic/src/include/on_exit.h:26
#9  0x00007fd5478f01a9 in __run_exit_handlers (status=0, listp=0x7fd547c726c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#10 0x00007fd5478f01f5 in __GI_exit (status=<optimized out>) at exit.c:104
#11 0x00007fd5478d5f4c in __libc_start_main (main=0x7fd549bef1a0 <main(int, char**)>, argc=1, argv=0x7ffc8fc02048, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc8fc02038) at libc-start.c:321
#12 0x00007fd549bf0cd3 in _start ()
(gdb) info reg
rax            0x0                  0
rbx            0xb                  11
rcx            0xffffffffffffffff   -1
rdx            0xb                  11
rsi            0x5ff9               24569
rdi            0x5ff9               24569
rbp            0x7ffc8fc01370       0x7ffc8fc01370
rsp            0x7ffc8fc00978       0x7ffc8fc00978
r8             0x7fd549f81d40       140554045758784
r9             0x7ffc8fc00f70       140722720214896
r10            0x7ffc8fc00740       140722720212800
r11            0x206                518
r12            0x7fd549fa68f8       140554045909240
r13            0x7fd549f81d40       140554045758784
r14            0x0                  0
r15            0x0                  0
rip            0x7fd5496cc1fb       0x7fd5496cc1fb <raise+43>
eflags         0x206                [ PF IF ]
cs             0x33                 51
ss             0x2b                 43
ds             0x0                  0
es             0x0                  0
fs             0x0                  0
gs             0x0                  0
(gdb) x/i $pc
=> 0x7fd5496cc1fb <raise+43>:   cmp $0xfffffffffffff000,%rax
Updated by Brad Hubbard over 7 years ago
I don't think this can be the same failure, since the fatal signal handler was called on this occasion but does not appear to have been called in the jenkins failures.
Updated by Loïc Dachary over 7 years ago
thread apply all bt
Thread 3 (Thread 0x7f1873bc9700 (LWP 2353)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f1877d5f8ca in WaitUntil (when=..., mutex=..., this=0x7f1881038080) at /home/loic/ceph-ubuntu-14.04-loic/src/common/Cond.h:72
#2  WaitInterval (interval=..., mutex=..., cct=<optimized out>, this=0x7f1881038080) at /home/loic/ceph-ubuntu-14.04-loic/src/common/Cond.h:81
#3  CephContextServiceThread::entry (this=0x7f1881038000) at /home/loic/ceph-ubuntu-14.04-loic/src/common/ceph_context.cc:99
#4  0x00007f18777f2184 in start_thread (arg=0x7f1873bc9700) at pthread_create.c:312
#5  0x00007f1875adc37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f1874b46700 (LWP 2352)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f1877d56cdb in ceph::logging::Log::entry (this=0x7f1880fe2000) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:459
#2  0x00007f18777f2184 in start_thread (arg=0x7f1874b46700) at pthread_create.c:312
#3  0x00007f1875adc37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f18780afd40 (LWP 2351)):
#0  0x00007f18777fa1fb in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f1877d4cb52 in reraise_fatal (signum=11) at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=11) at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x00007f1877d563c1 in hint_size (this=0x7f1880fdf0a0) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Entry.h:65
#5  ceph::logging::Log::_flush (this=this@entry=0x7f1880fe2000, t=t@entry=0x7ffe0a70fb90, requeue=requeue@entry=0x7f1880fe2110, crash=crash@entry=false) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:305
#6  0x00007f1877d56a39 in ceph::logging::Log::flush (this=0x7f1880fe2000) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:282
#7  0x00007f1877d56c21 in ceph::logging::log_on_exit (p=0x7f1880fb2048) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:45
#8  0x00007f1877d57a16 in OnExitManager::~OnExitManager (this=0x7f18780d4a00 <ceph::logging::exit_callbacks>, __in_chrg=<optimized out>) at /home/loic/ceph-ubuntu-14.04-loic/src/include/on_exit.h:26
#9  0x00007f1875a1e1a9 in __run_exit_handlers (status=0, listp=0x7f1875da06c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#10 0x00007f1875a1e1f5 in __GI_exit (status=<optimized out>) at exit.c:104
#11 0x00007f1875a03f4c in __libc_start_main (main=0x7f1877d1d1a0 <main(int, char**)>, argc=1, argv=0x7ffe0a70fd38, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe0a70fd28) at libc-start.c:321
#12 0x00007f1877d1ecd3 in _start ()
(gdb) f 4
#4  0x00007f1877d563c1 in hint_size (this=0x7f1880fdf0a0) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Entry.h:65
65          if (size > __atomic_load_n(m_exp_len, __ATOMIC_RELAXED)) {
(gdb) x/i $pc
=> 0x7f1877d563c1 <ceph::logging::Log::_flush(ceph::logging::EntryQueue*, ceph::logging::EntryQueue*, bool)+257>:       mov (%rdx),%rcx
(gdb) info reg
rax            0xbf                 191
rbx            0x7f1880fdf0a0       139743220068512
rcx            0x0                  0
rdx            0x7f18733c7748       139742989285192
rsi            0x0                  0
rdi            0x7f1880fdf0c0       139743220068544
rbp            0x7ffe0a70fb70       0x7ffe0a70fb70
rsp            0x7ffe0a70fa50       0x7ffe0a70fa50
r8             0x0                  0
r9             0x0                  0
r10            0x7ffe0a70f980       140729073596800
r11            0x7f18777f43d0       139743060771792
r12            0x7f1880fe2000       139743220080640
r13            0x7f18780afd40       139743069928768
r14            0x0                  0
r15            0x0                  0
rip            0x7f1877d563c1       0x7f1877d563c1 <ceph::logging::Log::_flush(ceph::logging::EntryQueue*, ceph::logging::EntryQueue*, bool)+257>
eflags         0x202                [ IF ]
cs             0x33                 51
ss             0x2b                 43
ds             0x0                  0
es             0x0                  0
fs             0x0                  0
gs             0x0                  0
Updated by Kefu Chai over 7 years ago
Looking at "#define dout_impl(cct, sub, ...)" in dout.h, _log_exp_length is allocated as a static variable in the caller of dout(), and this static variable will be destroyed when the program terminates. exit_callbacks will also be destroyed when the program terminates. It seems _log_exp_length will not outlive exit_callbacks: as per n3337 3.6.3 [1], _log_exp_length will be destroyed before exit_callbacks. So, no wonder, when the unittest tries to flush the log entries in the queue before leaving the world, it kills itself reading the no-longer-existent _log_exp_length.
[1] C++ standard n3337, section 3.6.3:
If the completion of the constructor or dynamic initialization of an object with thread storage duration is sequenced before that of another, the completion of the destructor of the second is sequenced before the initiation of the destructor of the first. If the completion of the constructor or dynamic initialization of an object with static storage duration is sequenced before that of another, the completion of the destructor of the second is sequenced before the initiation of the destructor of the first. [Note: This definition permits concurrent destruction. — end note ] If an object is initialized statically, the object is destroyed in the same order as if the object was dynamically initialized.
Updated by Loïc Dachary over 7 years ago
Thread 3 (Thread 0x7f39c9e7fd40 (LWP 2941)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f39c95c5649 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f39c95c5470 in __GI___pthread_mutex_lock (mutex=0x7f39d3b32060) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f39c9b26904 in ceph::logging::Log::flush (this=0x7f39d3b32000, exiting=exiting@entry=true) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:273
#4  0x00007f39c9b26b96 in ceph::logging::log_on_exit (p=0x7f39d3b02048) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:45
#5  0x00007f39c9b279a6 in OnExitManager::~OnExitManager (this=0x7f39c9ea4a00 <ceph::logging::exit_callbacks>, __in_chrg=<optimized out>) at /home/loic/ceph-ubuntu-14.04-loic/src/include/on_exit.h:26
#6  0x00007f39c77ef1a9 in __run_exit_handlers (status=0, listp=0x7f39c7b716c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#7  0x00007f39c77ef1f5 in __GI_exit (status=<optimized out>) at exit.c:104
#8  0x00007f39c77d4f4c in __libc_start_main (main=0x7f39c9aee100 <main(int, char**)>, argc=1, argv=0x7ffe02004678, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe02004668) at libc-start.c:321
#9  0x00007f39c9aefc33 in _start ()

Thread 2 (Thread 0x7f39c599a700 (LWP 2943)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f39c9b2f86a in WaitUntil (when=..., mutex=..., this=0x7f39d3b88080) at /home/loic/ceph-ubuntu-14.04-loic/src/common/Cond.h:72
#2  WaitInterval (interval=..., mutex=..., cct=<optimized out>, this=0x7f39d3b88080) at /home/loic/ceph-ubuntu-14.04-loic/src/common/Cond.h:81
#3  CephContextServiceThread::entry (this=0x7f39d3b88000) at /home/loic/ceph-ubuntu-14.04-loic/src/common/ceph_context.cc:99
#4  0x00007f39c95c3184 in start_thread (arg=0x7f39c599a700) at pthread_create.c:312
#5  0x00007f39c78ad37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f39c6917700 (LWP 2942)):
#0  0x00007f39c95cb1fb in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f39c9b1ca92 in reraise_fatal (signum=11) at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=11) at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x00007f39c9b2661e in hint_size (this=0x7f39d3b2f0a0) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Entry.h:65
#5  ceph::logging::Log::_flush (this=this@entry=0x7f39d3b32000, t=t@entry=0x7f39c6916890, requeue=requeue@entry=0x7f39d3b32110, crash=crash@entry=false, exiting=exiting@entry=false) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:306
#6  0x00007f39c9b269a0 in ceph::logging::Log::flush (this=this@entry=0x7f39d3b32000, exiting=exiting@entry=false) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:282
#7  0x00007f39c9b26c00 in ceph::logging::Log::entry (this=0x7f39d3b32000) at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:454
#8  0x00007f39c95c3184 in start_thread (arg=0x7f39c6917700) at pthread_create.c:312
#9  0x00007f39c78ad37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread (3) starts deallocating globals, including the static size_t _log_exp_length from dout.h, after which OnExitManager::~OnExitManager is called. It proceeds to flush but blocks on the flush mutex because another flush() is in progress in thread (1). The flush() in thread (1) runs into an entry that references the static size_t _log_exp_length already deallocated by thread (3), and crashes with sig=11 as a result.
Since there is no way to guarantee that OnExitManager::~OnExitManager will be called before other globals are destroyed, the other threads must join() before it happens; otherwise they may reference the deleted static size_t _log_exp_length.
Updated by Kefu Chai over 7 years ago
Also, I don't understand how _log_exp_length helps with performance; asking on https://github.com/ceph/ceph/pull/6641, where the change was introduced.
Updated by Loïc Dachary over 7 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai over 7 years ago
- Status changed from Fix Under Review to 12
- Assignee changed from Loïc Dachary to Kefu Chai
The test failure is worked around by https://github.com/ceph/ceph/pull/11721, but let's copy it to a new issue so we keep in mind that a complete fix is still needed.
Updated by Kefu Chai over 7 years ago
- Copied to Bug #17762: transient jerasure unit test failures added
Updated by Kefu Chai over 7 years ago
- Status changed from 12 to Pending Backport
Updated by Loïc Dachary over 7 years ago
- Copied to Backport #17768: jewel: transient jerasure unit test failures added
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved
- Backport deleted (jewel)
This is rare enough that a backport is not necessary.