Bug #17561

transient jerasure unit test failures

Added by Loïc Dachary about 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://github.com/ceph/ceph/pull/11426 experienced it, with the following output:

 79/146 Test  #86: unittest_erasure_code_plugin_jerasure ...***Exception: SegFault  0.10 sec
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ErasureCodePlugin
[ RUN      ] ErasureCodePlugin.factory
[       OK ] ErasureCodePlugin.factory (3 ms)
[----------] 1 test from ErasureCodePlugin (3 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3 ms total)
[  PASSED  ] 1 test.
2016-10-11 12:42:37.598277 7f5762be7d40 -1 did not load config file, using default settings.
load: jerasure *** Caught signal (Segmentation fault) **
 in thread 7f5762be7d40 thread_name:unittest_erasur

An attempt to reproduce it on master as of 61310d41f307ac07ff81f19ec1926434a0ced713 did not reproduce the problem; it may be necessary to run the test in a loop:

loic@63fd9120625d:~/ceph--loic/build$ ctest -R unittest_erasure_code_plugin_jerasure
Test project /home/loic/ceph--loic/build
    Start 86: unittest_erasure_code_plugin_jerasure
1/1 Test #86: unittest_erasure_code_plugin_jerasure ...   Passed    0.01 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.03 sec
loic@63fd9120625d:~/ceph--loic/build$ git rev-parse HEAD
61310d41f307ac07ff81f19ec1926434a0ced713
loic@63fd9120625d:~/ceph--loic/build$

Related issues

Copied to Ceph - Bug #17762: transient jerasure unit test failures Resolved 10/13/2016
Copied to Ceph - Backport #17768: jewel: transient jerasure unit test failures Resolved

History

#1 Updated by Loïc Dachary about 6 years ago

Ran ctest -R unittest_erasure_code_plugin_jerasure a few thousand times in a loop and did not get an error.

#3 Updated by Loïc Dachary about 6 years ago

This never happened before https://github.com/ceph/ceph/pull/11086 was merged.

#5 Updated by Loïc Dachary about 6 years ago

src/test/erasure-code/CMakeLists.txt

target_link_libraries(unittest_erasure_code_plugin_jerasure
  global
  osd
  ec_jerasure
  common)
add_dependencies(unittest_erasure_code_plugin_jerasure
  ec_jerasure)

Since jerasure is going to be dynamically loaded, there is no need to link with ec_jerasure.
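Given that the plugin is loaded at runtime via dlopen, a trimmed-down rule could look like the following sketch (not necessarily the fix that was merged) — the direct link is dropped while the add_dependencies line is kept so the plugin .so still exists when the test loads it:

```cmake
# Sketch: no direct link against ec_jerasure, since the plugin is
# dlopen()ed at runtime rather than resolved at link time.
target_link_libraries(unittest_erasure_code_plugin_jerasure
  global
  osd
  common)
# Keep the build-order dependency so the plugin shared object is
# built before the test tries to load it.
add_dependencies(unittest_erasure_code_plugin_jerasure
  ec_jerasure)
```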

#6 Updated by Loïc Dachary about 6 years ago

run-make-check.sh on CentOS7 in a loop to generate load:

test/docker-test.sh --os-type ubuntu --os-version 14.04 --shell
./run-make-check.sh
cd build
ulimit -c unlimited
rm -f src/test/erasure-code/core.*
export PATH=/tmp/ceph-disk-virtualenv/bin:$(pwd)/bin:$PATH
while : ; do echo -n . ; ctest -R  unittest_erasure_code_plugin_jerasure >& out || break ; done
gdb bin/unittest_erasure_code_plugin_jerasure src/test/erasure-code/core.*
set height 0
thread apply  all bt

#7 Updated by Loïc Dachary about 6 years ago

----------------------------------------------------------
89/150 Testing: unittest_erasure_code_plugin_jerasure
89/150 Test: unittest_erasure_code_plugin_jerasure
Command: "/home/loic/ceph-ubuntu-14.04-loic/build/bin/unittest_erasure_code_plugin_jerasure" 
Directory: /home/loic/ceph-ubuntu-14.04-loic/build/src/test/erasure-code
"unittest_erasure_code_plugin_jerasure" start time: Nov 01 00:44 UTC
Output:
----------------------------------------------------------
2016-11-01 00:44:58.786912 7fd549f81d40 -1 did not load config file, using default settings.
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ErasureCodePlugin
[ RUN      ] ErasureCodePlugin.factory
[       OK ] ErasureCodePlugin.factory (5 ms)
[----------] 1 test from ErasureCodePlugin (5 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (5 ms total)
[  PASSED  ] 1 test.
load: jerasure *** Caught signal (Segmentation fault) **
 in thread 7fd549f81d40 thread_name:
<end of output>
Test time =   0.13 sec
----------------------------------------------------------
Test Failed.
"unittest_erasure_code_plugin_jerasure" end time: Nov 01 00:44 UTC
"unittest_erasure_code_plugin_jerasure" time elapsed: 00:00:00
----------------------------------------------------------

End testing: Nov 01 00:44 UTC

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/loic/ceph-ubuntu-14.04-loic/build/bin/unittest_erasure_code_plugin_jerasu'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd5496cc1fb in raise (sig=11)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
37      ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
(gdb) bt
bt
#0  0x00007fd5496cc1fb in raise (sig=11)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007fd549c1eb52 in reraise_fatal (signum=11)
    at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=11)
    at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x00007fd549c283c1 in hint_size (this=0x7fd553bc30a0)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Entry.h:65
#5  ceph::logging::Log::_flush (this=this@entry=0x7fd553bc6000,
    t=t@entry=0x7ffc8fc01ea0, requeue=requeue@entry=0x7fd553bc6110,
    crash=crash@entry=false)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:305
#6  0x00007fd549c28a39 in ceph::logging::Log::flush (this=0x7fd553bc6000)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:282
#7  0x00007fd549c28c21 in ceph::logging::log_on_exit (p=0x7fd553b96048)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:45
#8  0x00007fd549c29a16 in OnExitManager::~OnExitManager (
    this=0x7fd549fa6a00 <ceph::logging::exit_callbacks>,
    __in_chrg=<optimized out>)
    at /home/loic/ceph-ubuntu-14.04-loic/src/include/on_exit.h:26
#9  0x00007fd5478f01a9 in __run_exit_handlers (status=0,
    listp=0x7fd547c726c8 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#10 0x00007fd5478f01f5 in __GI_exit (status=<optimized out>) at exit.c:104
#11 0x00007fd5478d5f4c in __libc_start_main (
    main=0x7fd549bef1a0 <main(int, char**)>, argc=1, argv=0x7ffc8fc02048,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7ffc8fc02038) at libc-start.c:321
#12 0x00007fd549bf0cd3 in _start ()
(gdb) info reg
info reg
rax            0x0      0
rbx            0xb      11
rcx            0xffffffffffffffff       -1
rdx            0xb      11
rsi            0x5ff9   24569
rdi            0x5ff9   24569
rbp            0x7ffc8fc01370   0x7ffc8fc01370
rsp            0x7ffc8fc00978   0x7ffc8fc00978
r8             0x7fd549f81d40   140554045758784
r9             0x7ffc8fc00f70   140722720214896
r10            0x7ffc8fc00740   140722720212800
r11            0x206    518
r12            0x7fd549fa68f8   140554045909240
r13            0x7fd549f81d40   140554045758784
r14            0x0      0
r15            0x0      0
rip            0x7fd5496cc1fb   0x7fd5496cc1fb <raise+43>
eflags         0x206    [ PF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb) x/i $pc
x/i $pc
=> 0x7fd5496cc1fb <raise+43>:   cmp    $0xfffffffffffff000,%rax

#8 Updated by Brad Hubbard about 6 years ago

I don't think this can be the same failure, since the fatal signal handler was called on this occasion but does not appear to have been called in the Jenkins failures.

#9 Updated by Loïc Dachary about 6 years ago

thread apply all bt

Thread 3 (Thread 0x7f1873bc9700 (LWP 2353)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f1877d5f8ca in WaitUntil (when=..., mutex=..., this=0x7f1881038080)
    at /home/loic/ceph-ubuntu-14.04-loic/src/common/Cond.h:72
#2  WaitInterval (interval=..., mutex=..., cct=<optimized out>,
    this=0x7f1881038080)
    at /home/loic/ceph-ubuntu-14.04-loic/src/common/Cond.h:81
#3  CephContextServiceThread::entry (this=0x7f1881038000)
    at /home/loic/ceph-ubuntu-14.04-loic/src/common/ceph_context.cc:99
#4  0x00007f18777f2184 in start_thread (arg=0x7f1873bc9700)
    at pthread_create.c:312
#5  0x00007f1875adc37d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f1874b46700 (LWP 2352)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f1877d56cdb in ceph::logging::Log::entry (this=0x7f1880fe2000)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:459
#2  0x00007f18777f2184 in start_thread (arg=0x7f1874b46700)
    at pthread_create.c:312
#3  0x00007f1875adc37d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f18780afd40 (LWP 2351)):
#0  0x00007f18777fa1fb in raise (sig=11)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f1877d4cb52 in reraise_fatal (signum=11)
    at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=11)
    at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x00007f1877d563c1 in hint_size (this=0x7f1880fdf0a0)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Entry.h:65
#5  ceph::logging::Log::_flush (this=this@entry=0x7f1880fe2000,
    t=t@entry=0x7ffe0a70fb90, requeue=requeue@entry=0x7f1880fe2110,
    crash=crash@entry=false)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:305
#6  0x00007f1877d56a39 in ceph::logging::Log::flush (this=0x7f1880fe2000)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:282
#7  0x00007f1877d56c21 in ceph::logging::log_on_exit (p=0x7f1880fb2048)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:45
#8  0x00007f1877d57a16 in OnExitManager::~OnExitManager (
    this=0x7f18780d4a00 <ceph::logging::exit_callbacks>,
    __in_chrg=<optimized out>)
    at /home/loic/ceph-ubuntu-14.04-loic/src/include/on_exit.h:26
#9  0x00007f1875a1e1a9 in __run_exit_handlers (status=0,
    listp=0x7f1875da06c8 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#10 0x00007f1875a1e1f5 in __GI_exit (status=<optimized out>) at exit.c:104
#11 0x00007f1875a03f4c in __libc_start_main (
    main=0x7f1877d1d1a0 <main(int, char**)>, argc=1, argv=0x7ffe0a70fd38,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7ffe0a70fd28) at libc-start.c:321
#12 0x00007f1877d1ecd3 in _start ()
(gdb) f 4
#4  0x00007f1877d563c1 in hint_size (this=0x7f1880fdf0a0)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Entry.h:65
65            if (size > __atomic_load_n(m_exp_len, __ATOMIC_RELAXED)) {
(gdb) x/i $pc
=> 0x7f1877d563c1 <ceph::logging::Log::_flush(ceph::logging::EntryQueue*, ceph::logging::EntryQueue*, bool)+257>:       mov    (%rdx),%rcx
(gdb) info reg
rax            0xbf     191
rbx            0x7f1880fdf0a0   139743220068512
rcx            0x0      0
rdx            0x7f18733c7748   139742989285192
rsi            0x0      0
rdi            0x7f1880fdf0c0   139743220068544
rbp            0x7ffe0a70fb70   0x7ffe0a70fb70
rsp            0x7ffe0a70fa50   0x7ffe0a70fa50
r8             0x0      0
r9             0x0      0
r10            0x7ffe0a70f980   140729073596800
r11            0x7f18777f43d0   139743060771792
r12            0x7f1880fe2000   139743220080640
r13            0x7f18780afd40   139743069928768
r14            0x0      0
r15            0x0      0
rip            0x7f1877d563c1   0x7f1877d563c1 <ceph::logging::Log::_flush(ceph::logging::EntryQueue*, ceph::logging::EntryQueue*, bool)+257>
eflags         0x202    [ IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

#10 Updated by Kefu Chai about 6 years ago

Looking at "#define dout_impl(cct, sub, ...)" in dout.h, _log_exp_length is allocated as a static variable in the caller of dout(), and this static variable is destroyed when the program terminates. exit_callbacks is also destroyed when the program terminates. It seems _log_exp_length does not outlive exit_callbacks: per n3337, section 3.6.3 [1], _log_exp_length will be destroyed before exit_callbacks. So no wonder that when the unit test tries to flush the log entries still in the queue before leaving the world, it kills itself reading the no-longer-existent _log_exp_length.


[1] C++ standard n3337, section 3.6.3

If the completion of the constructor or dynamic initialization of an object with thread storage duration is sequenced before that of another, the completion of the destructor of the second is sequenced before the initiation of the destructor of the first. If the completion of the constructor or dynamic initialization of an object with static storage duration is sequenced before that of another, the completion of the destructor of the second is sequenced before the initiation of the destructor of the first. [Note: This definition permits concurrent destruction. — end note ] If an object is initialized statically, the object is destroyed in the same order as if the object was dynamically initialized.

#11 Updated by Kefu Chai about 6 years ago

  • Backport set to jewel

#12 Updated by Loïc Dachary about 6 years ago

Thread 3 (Thread 0x7f39c9e7fd40 (LWP 2941)):
#0  __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f39c95c5649 in _L_lock_909 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f39c95c5470 in __GI___pthread_mutex_lock (mutex=0x7f39d3b32060)
    at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f39c9b26904 in ceph::logging::Log::flush (this=0x7f39d3b32000,
    exiting=exiting@entry=true)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:273
#4  0x00007f39c9b26b96 in ceph::logging::log_on_exit (p=0x7f39d3b02048)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:45
#5  0x00007f39c9b279a6 in OnExitManager::~OnExitManager (
    this=0x7f39c9ea4a00 <ceph::logging::exit_callbacks>,
    __in_chrg=<optimized out>)
    at /home/loic/ceph-ubuntu-14.04-loic/src/include/on_exit.h:26
#6  0x00007f39c77ef1a9 in __run_exit_handlers (status=0,
    listp=0x7f39c7b716c8 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#7  0x00007f39c77ef1f5 in __GI_exit (status=<optimized out>) at exit.c:104
#8  0x00007f39c77d4f4c in __libc_start_main (
    main=0x7f39c9aee100 <main(int, char**)>, argc=1, argv=0x7ffe02004678,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7ffe02004668) at libc-start.c:321
#9  0x00007f39c9aefc33 in _start ()

Thread 2 (Thread 0x7f39c599a700 (LWP 2943)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f39c9b2f86a in WaitUntil (when=..., mutex=..., this=0x7f39d3b88080)
    at /home/loic/ceph-ubuntu-14.04-loic/src/common/Cond.h:72
#2  WaitInterval (interval=..., mutex=..., cct=<optimized out>,
    this=0x7f39d3b88080)
    at /home/loic/ceph-ubuntu-14.04-loic/src/common/Cond.h:81
#3  CephContextServiceThread::entry (this=0x7f39d3b88000)
    at /home/loic/ceph-ubuntu-14.04-loic/src/common/ceph_context.cc:99
#4  0x00007f39c95c3184 in start_thread (arg=0x7f39c599a700)
    at pthread_create.c:312
#5  0x00007f39c78ad37d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f39c6917700 (LWP 2942)):
#0  0x00007f39c95cb1fb in raise (sig=11)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f39c9b1ca92 in reraise_fatal (signum=11)
    at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=11)
    at /home/loic/ceph-ubuntu-14.04-loic/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x00007f39c9b2661e in hint_size (this=0x7f39d3b2f0a0)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Entry.h:65
#5  ceph::logging::Log::_flush (this=this@entry=0x7f39d3b32000,
    t=t@entry=0x7f39c6916890, requeue=requeue@entry=0x7f39d3b32110,
    crash=crash@entry=false, exiting=exiting@entry=false)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:306
#6  0x00007f39c9b269a0 in ceph::logging::Log::flush (
    this=this@entry=0x7f39d3b32000, exiting=exiting@entry=false)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:282
#7  0x00007f39c9b26c00 in ceph::logging::Log::entry (this=0x7f39d3b32000)
    at /home/loic/ceph-ubuntu-14.04-loic/src/log/Log.cc:454
#8  0x00007f39c95c3184 in start_thread (arg=0x7f39c6917700)
    at pthread_create.c:312
#9  0x00007f39c78ad37d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 starts deallocating globals, including the static size_t _log_exp_length from dout.h, after which OnExitManager::~OnExitManager is called. It proceeds to flush but blocks on the flush mutex because another flush() is in progress in thread 1. The flush() in thread 1 then runs into an entry that references the static size_t _log_exp_length already deallocated by thread 3, and the result is sig=11.

Since there is no way to guarantee that OnExitManager::~OnExitManager will be called before other globals are destroyed, the other threads must join() before that happens; otherwise they may reference the deleted static size_t _log_exp_length.

#13 Updated by Kefu Chai about 6 years ago

Also, I don't understand how _log_exp_length helps with performance. Asking on https://github.com/ceph/ceph/pull/6641, where the change was introduced.

#14 Updated by Loïc Dachary about 6 years ago

  • Status changed from 12 to Fix Under Review

#15 Updated by Kefu Chai about 6 years ago

  • Status changed from Fix Under Review to 12
  • Assignee changed from Loïc Dachary to Kefu Chai

The test failure is worked around by https://github.com/ceph/ceph/pull/11721, but let's copy this to a new issue so we keep in mind that a complete fix is still needed.

#16 Updated by Kefu Chai about 6 years ago

  • Copied to Bug #17762: transient jerasure unit test failures added

#17 Updated by Kefu Chai about 6 years ago

  • Status changed from 12 to Pending Backport

#18 Updated by Loïc Dachary about 6 years ago

  • Copied to Backport #17768: jewel: transient jerasure unit test failures added

#19 Updated by Loïc Dachary about 6 years ago

  • Status changed from Pending Backport to Resolved
  • Backport deleted (jewel)

This is rare enough that a backport is not necessary.
