Project

General

Profile

Bug #42026

test: LibCephFS.ShutdownRace segfaults (msgr v2 related part)

Added by Patrick Donnelly 10 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
nautilus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Pull request ID:
Crash signature:

Description

2019-09-24T10:04:18.579 INFO:tasks.workunit.client.0.smithi026.stdout:[ RUN      ] LibCephFS.ShutdownRace
2019-09-24T10:04:18.935 INFO:teuthology.orchestra.run.smithi026:Running:
2019-09-24T10:04:18.935 INFO:teuthology.orchestra.run.smithi026:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-09-24T10:04:18.939 INFO:teuthology.orchestra.run.smithi033:Running:
2019-09-24T10:04:18.939 INFO:teuthology.orchestra.run.smithi033:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-09-24T10:04:37.483 DEBUG:teuthology.orchestra.run:got remote process result: 141
2019-09-24T10:04:37.483 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...
2019-09-24T10:04:37.483 INFO:teuthology.orchestra.run.smithi026:Running:
2019-09-24T10:04:37.484 INFO:teuthology.orchestra.run.smithi026:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-09-24T10:04:37.701 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 437, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test libcephfs/test.sh) on smithi026 with status 141: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=e5b766ed46c976b9ad65d1ddccb3a3c03cb4823e TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh' 

From: /ceph/teuthology-archive/pdonnell-2019-09-24_08:40:13-fs-master-distro-basic-smithi/4330830/teuthology.log

also a similar glibc corruption found:

2019-09-22T21:30:49.800 INFO:tasks.workunit.client.0.smithi026.stdout:[ RUN      ] LibCephFS.ShutdownRace
2019-09-22T21:30:52.068 INFO:teuthology.orchestra.run.smithi026:Running:
2019-09-22T21:30:52.068 INFO:teuthology.orchestra.run.smithi026:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-09-22T21:30:52.089 INFO:teuthology.orchestra.run.smithi096:Running:
2019-09-22T21:30:52.089 INFO:teuthology.orchestra.run.smithi096:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2019-09-22T21:31:14.724 INFO:tasks.workunit.client.0.smithi026.stderr:*** Error in `ceph_test_libcephfs': corrupted size vs. prev_size: 0x00007f440cacc230 ***
2019-09-22T21:31:14.733 INFO:tasks.workunit.client.0.smithi026.stderr:======= Backtrace: =========
2019-09-22T21:31:14.733 INFO:tasks.workunit.client.0.smithi026.stderr:/lib64/libc.so.6(+0x80fc7)[0x7f44301bdfc7]
2019-09-22T21:31:14.734 INFO:tasks.workunit.client.0.smithi026.stderr:/lib64/libc.so.6(+0x8159e)[0x7f44301be59e]
2019-09-22T21:31:14.734 INFO:tasks.workunit.client.0.smithi026.stderr:/lib64/libcrypto.so.10(CRYPTO_free+0x1d)[0x7f44315f480d]
2019-09-22T21:31:14.734 INFO:tasks.workunit.client.0.smithi026.stderr:/lib64/libcrypto.so.10(lh_free+0x57)[0x7f44316ac637]
2019-09-22T21:31:14.734 INFO:tasks.workunit.client.0.smithi026.stderr:/lib64/libcrypto.so.10(+0x128c60)[0x7f44316aec60]
2019-09-22T21:31:14.735 INFO:tasks.workunit.client.0.smithi026.stderr:/usr/lib64/ceph/libceph-common.so.0(_ZN4ceph6crypto8shutdownEb+0x91)[0x7f4432f89891]
2019-09-22T21:31:14.736 INFO:tasks.workunit.client.0.smithi026.stderr:/usr/lib64/ceph/libceph-common.so.0(_ZN11CephContextD1Ev+0x25f)[0x7f4432f8121f]
2019-09-22T21:31:14.736 INFO:tasks.workunit.client.0.smithi026.stderr:/usr/lib64/ceph/libceph-common.so.0(_ZN11CephContext3putEv+0x19)[0x7f4432f81419]
2019-09-22T21:31:14.736 INFO:tasks.workunit.client.0.smithi026.stderr:/lib64/libcephfs.so.2(ceph_shutdown+0xd1)[0x7f44328f7031]
2019-09-22T21:31:14.736 INFO:tasks.workunit.client.0.smithi026.stderr:ceph_test_libcephfs(+0x73946)[0x55c72bf5b946]
2019-09-22T21:31:14.736 INFO:tasks.workunit.client.0.smithi026.stderr:/usr/lib64/ceph/libceph-common.so.0(+0x81643f)[0x7f443341543f]
2019-09-22T21:31:14.736 INFO:tasks.workunit.client.0.smithi026.stderr:/lib64/libpthread.so.0(+0x7ea5)[0x7f44326b9ea5]
2019-09-22T21:31:14.736 INFO:tasks.workunit.client.0.smithi026.stderr:/lib64/libc.so.6(clone+0x6d)[0x7f443023b8cd]

From: /ceph/teuthology-archive/pdonnell-2019-09-22_10:05:30-fs-wip-pdonnell-testing-20190922.042910-distro-basic-smithi/4326224/teuthology.log

shutdownrace-valgrind.out (75 KB) Jeff Layton, 09/24/2019 12:19 PM

tsan.out.gz - TSAN output (46.6 KB) Jeff Layton, 09/30/2019 06:26 PM


Related issues

Related to Messengers - Bug #43070: CephxSessionHandler::_calc_signature segv Resolved
Duplicated by Messengers - Bug #40026: segfaults in ProtocolV2::write_message() Duplicate
Duplicated by Ceph - Bug #45523: mgr and mon crashed Duplicate
Copied to Ceph - Backport #42856: nautilus: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part) Resolved

History

#1 Updated by Jeff Layton 10 months ago

Running the test under valgrind seems to find something:

$  valgrind --max-threads=2048 ./bin/ceph_test_libcephfs --gtest_filter=LibCephFS.ShutdownRace
==16056== Memcheck, a memory error detector
==16056== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==16056== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==16056== Command: ./bin/ceph_test_libcephfs --gtest_filter=LibCephFS.ShutdownRace
==16056== 
Note: Google Test filter = LibCephFS.ShutdownRace
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LibCephFS
[ RUN      ] LibCephFS.ShutdownRace
==16056== Warning: invalid file descriptor 1031 in syscall socket()
==16056== Warning: invalid file descriptor 1031 in syscall socket()
==16056== Warning: invalid file descriptor 1031 in syscall socket()
==16056== Warning: invalid file descriptor 1031 in syscall openat()
==16056== Warning: invalid file descriptor 1032 in syscall pipe2()
==16056== Warning: invalid file descriptor 1032 in syscall socket()
==16056== Warning: invalid file descriptor 1032 in syscall socket()
==16056== Warning: invalid file descriptor 1032 in syscall pipe2()
==16056== Warning: invalid file descriptor 1032 in syscall socket()
==16056== Warning: invalid file descriptor 1032 in syscall socket()
==16056== Warning: invalid file descriptor 1032 in syscall openat()
==16056== Warning: invalid file descriptor 1032 in syscall socket()
==16056== Warning: invalid file descriptor 1032 in syscall pipe2()
==16056== Warning: invalid file descriptor 1032 in syscall epoll_create()
==16056== Warning: invalid file descriptor 1032 in syscall epoll_create()
==16056== Warning: invalid file descriptor 1032 in syscall epoll_create()
==16056== Warning: invalid file descriptor 1031 in syscall openat()
==16056== Thread 704 msgr-worker-1:
==16056== Invalid read of size 4
==16056==    at 0x4DE054E: EventCenter::create_file_event(int, int, EventCallback*) (Event.cc:229)
==16056==    by 0x4DE1220: EventCenter::set_owner() (Event.cc:202)
==16056==    by 0x4DE6171: operator() (Stack.cc:45)
==16056==    by 0x4DE6171: std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (std_function.h:300)
==16056==    by 0xDB6B6F3: ??? (in /usr/lib64/libstdc++.so.6.0.27)
==16056==    by 0xD6384BF: start_thread (in /usr/lib64/libpthread-2.29.so)
==16056==    by 0xDEEC552: clone (in /usr/lib64/libc-2.29.so)
==16056==  Address 0x1c0cd3a28 is 24 bytes before a block of size 120,000 alloc'd
==16056==    at 0x4838E86: operator new(unsigned long) (vg_replace_malloc.c:344)
==16056==    by 0x4DE1ED8: allocate (new_allocator.h:114)
==16056==    by 0x4DE1ED8: allocate (alloc_traits.h:444)
==16056==    by 0x4DE1ED8: _M_allocate (stl_vector.h:343)
==16056==    by 0x4DE1ED8: std::vector<EventCenter::FileEvent, std::allocator<EventCenter::FileEvent> >::_M_default_append(unsigned long) (vector.tcc:635)
==16056==    by 0x4DE155E: resize (stl_vector.h:937)
==16056==    by 0x4DE155E: EventCenter::init(int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Event.cc:138)
==16056==    by 0x4DE55E4: NetworkStack::NetworkStack(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Stack.cc:120)
==16056==    by 0x4DE3BF8: PosixNetworkStack::PosixNetworkStack(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (PosixStack.cc:288)
==16056==    by 0x4DE5144: construct<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (new_allocator.h:147)
==16056==    by 0x4DE5144: construct<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (alloc_traits.h:484)
==16056==    by 0x4DE5144: _Sp_counted_ptr_inplace<CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:548)
==16056==    by 0x4DE5144: __shared_count<PosixNetworkStack, std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:679)
==16056==    by 0x4DE5144: __shared_ptr<std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:1344)
==16056==    by 0x4DE5144: shared_ptr<std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:359)
==16056==    by 0x4DE5144: allocate_shared<PosixNetworkStack, std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:702)
==16056==    by 0x4DE5144: make_shared<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:718)
==16056==    by 0x4DE5144: NetworkStack::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Stack.cc:69)
==16056==    by 0x4D8B6ED: ready (AsyncMessenger.cc:254)
==16056==    by 0x4D8B6ED: ready (AsyncMessenger.cc:252)
==16056==    by 0x4D8B6ED: AsyncMessenger::AsyncMessenger(CephContext*, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) (AsyncMessenger.cc:291)
==16056==    by 0x4D71CF6: Messenger::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, unsigned long) (Messenger.cc:31)
==16056==    by 0x4D720A4: Messenger::create_client_messenger(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (Messenger.cc:17)
==16056==    by 0x4E503AF: MonClient::get_monmap_and_config() (MonClient.cc:123)
==16056==    by 0xD4CE668: ceph_mount_info::init() (libcephfs.cc:88)
==16056==    by 0xD4CB05F: mount (libcephfs.cc:157)
==16056==    by 0xD4CB05F: ceph_mount (libcephfs.cc:532)
==16056== 
==16056== Invalid read of size 4
==16056==    at 0x4DE0FD4: EventCenter::create_file_event(int, int, EventCallback*) (Event.cc:238)
==16056==    by 0x4DE1220: EventCenter::set_owner() (Event.cc:202)
==16056==    by 0x4DE6171: operator() (Stack.cc:45)
==16056==    by 0x4DE6171: std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (std_function.h:300)
==16056==    by 0xDB6B6F3: ??? (in /usr/lib64/libstdc++.so.6.0.27)
==16056==    by 0xD6384BF: start_thread (in /usr/lib64/libpthread-2.29.so)
==16056==    by 0xDEEC552: clone (in /usr/lib64/libc-2.29.so)
==16056==  Address 0x1c0cd3a28 is 24 bytes before a block of size 120,000 alloc'd
==16056==    at 0x4838E86: operator new(unsigned long) (vg_replace_malloc.c:344)
==16056==    by 0x4DE1ED8: allocate (new_allocator.h:114)
==16056==    by 0x4DE1ED8: allocate (alloc_traits.h:444)
==16056==    by 0x4DE1ED8: _M_allocate (stl_vector.h:343)
==16056==    by 0x4DE1ED8: std::vector<EventCenter::FileEvent, std::allocator<EventCenter::FileEvent> >::_M_default_append(unsigned long) (vector.tcc:635)
==16056==    by 0x4DE155E: resize (stl_vector.h:937)
==16056==    by 0x4DE155E: EventCenter::init(int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Event.cc:138)
==16056==    by 0x4DE55E4: NetworkStack::NetworkStack(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Stack.cc:120)
==16056==    by 0x4DE3BF8: PosixNetworkStack::PosixNetworkStack(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (PosixStack.cc:288)
==16056==    by 0x4DE5144: construct<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (new_allocator.h:147)
==16056==    by 0x4DE5144: construct<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (alloc_traits.h:484)
==16056==    by 0x4DE5144: _Sp_counted_ptr_inplace<CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:548)
==16056==    by 0x4DE5144: __shared_count<PosixNetworkStack, std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:679)
==16056==    by 0x4DE5144: __shared_ptr<std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:1344)
==16056==    by 0x4DE5144: shared_ptr<std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:359)
==16056==    by 0x4DE5144: allocate_shared<PosixNetworkStack, std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:702)
==16056==    by 0x4DE5144: make_shared<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:718)
==16056==    by 0x4DE5144: NetworkStack::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Stack.cc:69)
==16056==    by 0x4D8B6ED: ready (AsyncMessenger.cc:254)
==16056==    by 0x4D8B6ED: ready (AsyncMessenger.cc:252)
==16056==    by 0x4D8B6ED: AsyncMessenger::AsyncMessenger(CephContext*, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) (AsyncMessenger.cc:291)
==16056==    by 0x4D71CF6: Messenger::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, unsigned long) (Messenger.cc:31)
==16056==    by 0x4D720A4: Messenger::create_client_messenger(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (Messenger.cc:17)
==16056==    by 0x4E503AF: MonClient::get_monmap_and_config() (MonClient.cc:123)
==16056==    by 0xD4CE668: ceph_mount_info::init() (libcephfs.cc:88)
==16056==    by 0xD4CB05F: mount (libcephfs.cc:157)
==16056==    by 0xD4CB05F: ceph_mount (libcephfs.cc:532)
==16056== 
==16056== Warning: invalid file descriptor 1031 in syscall socket()
==16056== Warning: invalid file descriptor 1031 in syscall socket()
==16056== 
==16056== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==16056==  Access not within mapped region at address 0xFFFFFFFFFFFFFFE8
==16056==    at 0x4DE054E: EventCenter::create_file_event(int, int, EventCallback*) (Event.cc:229)
==16056==    by 0x4DE1220: EventCenter::set_owner() (Event.cc:202)
==16056==    by 0x4DE6171: operator() (Stack.cc:45)
==16056==    by 0x4DE6171: std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (std_function.h:300)
==16056==    by 0xDB6B6F3: ??? (in /usr/lib64/libstdc++.so.6.0.27)
==16056==    by 0xD6384BF: start_thread (in /usr/lib64/libpthread-2.29.so)
==16056==    by 0xDEEC552: clone (in /usr/lib64/libc-2.29.so)
==16056==  If you believe this happened as a result of a stack
==16056==  overflow in your program's main thread (unlikely but
==16056==  possible), you can try to increase the size of the
==16056==  main thread stack using the --main-stacksize= flag.
==16056==  The main thread stack size used in this run was 8388608.
==16056== 
==16056== HEAP SUMMARY:
==16056==     in use at exit: 1,469,586,494 bytes in 463,433 blocks
==16056==   total heap usage: 2,192,606 allocs, 1,729,173 frees, 1,584,768,580 bytes allocated
==16056== 
==16056== LEAK SUMMARY:
==16056==    definitely lost: 0 bytes in 0 blocks
==16056==    indirectly lost: 0 bytes in 0 blocks
==16056==      possibly lost: 3,936,500 bytes in 2,331 blocks
==16056==    still reachable: 1,465,649,994 bytes in 461,102 blocks
==16056==         suppressed: 0 bytes in 0 blocks
==16056== Rerun with --leak-check=full to see details of leaked memory
==16056== 
==16056== For lists of detected and suppressed errors, rerun with: -s
==16056== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

#2 Updated by Jeff Layton 10 months ago

Attaching more verbose output. The "invalid file descriptor" messages hint that we're probably operating on fds that have been closed by another thread (or maybe something has scribbled over the fd array).

==18519== Warning: invalid file descriptor 1031 in syscall pipe2()
==18519==    at 0xDEDDDDB: pipe2 (in /usr/lib64/libc-2.29.so)
==18519==    by 0x4DE1586: EventCenter::init(int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Event.cc:145)
==18519==    by 0x4DE55E4: NetworkStack::NetworkStack(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Stack.cc:120)
==18519==    by 0x4DE3BF8: PosixNetworkStack::PosixNetworkStack(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (PosixStack.cc:288)
==18519==    by 0x4DE5144: construct<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (new_allocator.h:147)
==18519==    by 0x4DE5144: construct<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (alloc_traits.h:484)
==18519==    by 0x4DE5144: _Sp_counted_ptr_inplace<CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:548)
==18519==    by 0x4DE5144: __shared_count<PosixNetworkStack, std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:679)
==18519==    by 0x4DE5144: __shared_ptr<std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr_base.h:1344)
==18519==    by 0x4DE5144: shared_ptr<std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:359)
==18519==    by 0x4DE5144: allocate_shared<PosixNetworkStack, std::allocator<PosixNetworkStack>, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:702)
==18519==    by 0x4DE5144: make_shared<PosixNetworkStack, CephContext*&, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (shared_ptr.h:718)
==18519==    by 0x4DE5144: NetworkStack::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Stack.cc:69)
==18519==    by 0x4D8B6ED: ready (AsyncMessenger.cc:254)
==18519==    by 0x4D8B6ED: ready (AsyncMessenger.cc:252)
==18519==    by 0x4D8B6ED: AsyncMessenger::AsyncMessenger(CephContext*, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) (AsyncMessenger.cc:291)
==18519==    by 0x4D71CF6: Messenger::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, unsigned long) (Messenger.cc:31)
==18519==    by 0x4D720A4: Messenger::create_client_messenger(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (Messenger.cc:17)
==18519==    by 0x4E503AF: MonClient::get_monmap_and_config() (MonClient.cc:123)
==18519==    by 0xD4CE668: ceph_mount_info::init() (libcephfs.cc:88)
==18519==    by 0xD4CB05F: mount (libcephfs.cc:157)
==18519==    by 0xD4CB05F: ceph_mount (libcephfs.cc:532)
==18519==    by 0x17AECD: shutdown_racer_func() (test.cc:1992)

I don't know if this is the cause of the crashes in teuthology but it is a potential culprit.

In any case, I suspect that maybe some global (not per-cct) context has been introduced recently and we're now tripping over the case where threads are concurrently trying to create and destroy those objects. This makes it look like it's something down in the socket handling, but it could be that we need to serialize something at a higher level.

#3 Updated by Jeff Layton 10 months ago

  • Project changed from fs to Ceph
  • Severity changed from 3 - minor to 2 - major

Moving this to the larger ceph project.

I suspect this is indicative of races in the network setup/teardown code, though I don't know the code that well and it could be a problem elsewhere. Either way, it doesn't seem to be cephfs specific.

I've rolled a librados-based test that is a little lighter-weight:

https://github.com/ceph/ceph/pull/30552

It seems to exhibit the same problem when run under valgrind.

#4 Updated by Jeff Layton 9 months ago

I cut down the number of threads in the rados version of the test to 4, and then built a binary with libtsan support. It shows some interesting potential data races, but without knowing more about the internals, it's hard to know whether they are real or are false positives.

#5 Updated by Jeff Layton 9 months ago

ASAN caught something too:

=================================================================
==12025==ERROR: AddressSanitizer: heap-use-after-free on address 0x6030009c7c30 at pc 0x7f8eff9055e5 bp 0x7f8ee9dd24c0 sp 0x7f8ee9dd24b0
WRITE of size 8 at 0x6030009c7c30 thread T232 (msgr-worker-1)
    #0 0x7f8eff9055e4 in ceph::buffer::v14_2_0::list::buffers_t::push_back(ceph::buffer::v14_2_0::ptr_node&) /home/jlayton/git/ceph/src/include/buffer.h:545
    #1 0x7f8eff9055e4 in ceph::buffer::v14_2_0::list::refill_append_space(unsigned int) /home/jlayton/git/ceph/src/common/buffer.cc:1406
    #2 0x7f8eff908aa1 in ceph::buffer::v14_2_0::list::append_hole(unsigned int) /home/jlayton/git/ceph/src/common/buffer.cc:1518
    #3 0x7f8eff877943 in ceph::crypto::onwire::AES128GCM_OnWireTxHandler::authenticated_encrypt_final() /home/jlayton/git/ceph/src/msg/async/crypto_onwire.cc:120
    #4 0x7f8eff7a48e1 in ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::get_buffer(ceph::crypto::onwire::rxtx_t&) /home/jlayton/git/ceph/src/msg/async/frames_v2.h:271
    #5 0x7f8eff7a48e1 in ProtocolV2::write_message(Message*, bool) /home/jlayton/git/ceph/src/msg/async/ProtocolV2.cc:512
    #6 0x7f8eff819695 in ProtocolV2::write_event() /home/jlayton/git/ceph/src/msg/async/ProtocolV2.cc:622
    #7 0x7f8eff6a53ce in AsyncConnection::handle_write() /home/jlayton/git/ceph/src/msg/async/AsyncConnection.cc:710
    #8 0x7f8eff852196 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) /home/jlayton/git/ceph/src/msg/async/Event.cc:433
    #9 0x7f8eff86f919 in operator() /home/jlayton/git/ceph/src/msg/async/Stack.cc:53
    #10 0x7f8eff86f919 in _M_invoke /usr/include/c++/9/bits/std_function.h:300
    #11 0x7f8efe23b6f3  (/lib64/libstdc++.so.6+0xd76f3)
    #12 0x7f8f08edc4bf in start_thread (/lib64/libpthread.so.0+0x84bf)
    #13 0x7f8efdf38552 in __clone (/lib64/libc.so.6+0xfc552)

0x6030009c7c30 is located 0 bytes inside of 24-byte region [0x6030009c7c30,0x6030009c7c48)
freed by thread T20 here:
    #0 0x7f8f094fb0b5 in operator delete(void*, unsigned long) (/lib64/libasan.so.5+0x1110b5)
    #1 0x7f8eff87912e in ceph::buffer::v14_2_0::ptr_node::disposer::operator()(ceph::buffer::v14_2_0::ptr_node*) /home/jlayton/git/ceph/src/include/buffer.h:393
    #2 0x7f8eff87912e in ceph::buffer::v14_2_0::ptr_node::disposer::operator()(ceph::buffer::v14_2_0::ptr_node*) /home/jlayton/git/ceph/src/include/buffer.h:391
    #3 0x7f8eff87912e in ceph::buffer::v14_2_0::list::buffers_t::clear_and_dispose() /home/jlayton/git/ceph/src/include/buffer.h:638
    #4 0x7f8eff87912e in ceph::buffer::v14_2_0::list::~list() /home/jlayton/git/ceph/src/include/buffer.h:985
    #5 0x7f8eff87912e in ceph::crypto::onwire::AES128GCM_OnWireTxHandler::~AES128GCM_OnWireTxHandler() /home/jlayton/git/ceph/src/msg/async/crypto_onwire.cc:61
    #6 0x7f8eff87912e in ceph::crypto::onwire::AES128GCM_OnWireTxHandler::~AES128GCM_OnWireTxHandler() /home/jlayton/git/ceph/src/msg/async/crypto_onwire.cc:63
    #7 0x7f8eff7b95a1 in std::default_delete<ceph::crypto::onwire::TxHandler>::operator()(ceph::crypto::onwire::TxHandler*) const /usr/include/c++/9/bits/unique_ptr.h:81
    #8 0x7f8eff7b95a1 in std::unique_ptr<ceph::crypto::onwire::TxHandler, std::default_delete<ceph::crypto::onwire::TxHandler> >::reset(ceph::crypto::onwire::TxHandler*) /usr/include/c++/9/bits/unique_ptr.h:394
    #9 0x7f8eff7b95a1 in ProtocolV2::reset_recv_state() /home/jlayton/git/ceph/src/msg/async/ProtocolV2.cc:227
    #10 0x7f8eff7ba543 in ProtocolV2::stop() /home/jlayton/git/ceph/src/msg/async/ProtocolV2.cc:173
    #11 0x7f8eff6a8ca4 in AsyncConnection::mark_down() /home/jlayton/git/ceph/src/msg/async/AsyncConnection.cc:704
    #12 0x7f8effa3ba61 in MonConnection::~MonConnection() /home/jlayton/git/ceph/src/mon/MonClient.cc:1527
    #13 0x7f8effa86905 in std::default_delete<MonConnection>::operator()(MonConnection*) const /usr/include/c++/9/bits/unique_ptr.h:81
    #14 0x7f8effa86905 in std::default_delete<MonConnection>::operator()(MonConnection*) const /usr/include/c++/9/bits/unique_ptr.h:75
    #15 0x7f8effa86905 in std::unique_ptr<MonConnection, std::default_delete<MonConnection> >::reset(MonConnection*) /usr/include/c++/9/bits/unique_ptr.h:394
    #16 0x7f8effa86905 in MonClient::shutdown() /home/jlayton/git/ceph/src/mon/MonClient.cc:490
    #17 0x7f8f09176636 in librados::v14_2_0::RadosClient::shutdown() /home/jlayton/git/ceph/src/librados/RadosClient.cc:379
    #18 0x7f8f08f6f2a3 in _rados_shutdown /home/jlayton/git/ceph/src/librados/librados_c.cc:188
    #19 0x55f6772f9c26 in shutdown_racer_func /home/jlayton/git/ceph/src/test/librados/misc.cc:322
    #20 0x7f8efe23b6f3  (/lib64/libstdc++.so.6+0xd76f3)

previously allocated by thread T232 (msgr-worker-1) here:
    #0 0x7f8f094f99d7 in operator new(unsigned long) (/lib64/libasan.so.5+0x10f9d7)
    #1 0x7f8eff90a600 in ceph::buffer::v14_2_0::ptr_node::create_hypercombined(ceph::unique_leakable_ptr<ceph::buffer::v14_2_0::raw>) /home/jlayton/git/ceph/src/common/buffer.cc:2200
    #2 0x7f8eff9103b5 in ceph::buffer::v14_2_0::ptr_node::create(ceph::unique_leakable_ptr<ceph::buffer::v14_2_0::raw>) /home/jlayton/git/ceph/src/include/buffer.h:402
    #3 0x7f8eff9103b5 in ceph::buffer::v14_2_0::list::reserve(unsigned long) /home/jlayton/git/ceph/src/common/buffer.cc:1281
    #4 0x7f8eff8732fb in ceph::crypto::onwire::AES128GCM_OnWireTxHandler::reset_tx_handler(std::initializer_list<unsigned int>) /home/jlayton/git/ceph/src/msg/async/crypto_onwire.cc:85
    #5 0x7f8eff7a4363 in void ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::reset_tx_handler<0ul, 1ul, 2ul, 3ul>(ceph::crypto::onwire::rxtx_t&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>) /home/jlayton/git/ceph/src/msg/async/frames_v2.h:234
    #6 0x7f8eff7a4363 in ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::get_buffer(ceph::crypto::onwire::rxtx_t&) /home/jlayton/git/ceph/src/msg/async/frames_v2.h:251
    #7 0x7f8eff7a4363 in ProtocolV2::write_message(Message*, bool) /home/jlayton/git/ceph/src/msg/async/ProtocolV2.cc:512
    #8 0x7f8eff819695 in ProtocolV2::write_event() /home/jlayton/git/ceph/src/msg/async/ProtocolV2.cc:622
    #9 0x7f8eff6a53ce in AsyncConnection::handle_write() /home/jlayton/git/ceph/src/msg/async/AsyncConnection.cc:710
    #10 0x7f8eff852196 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) /home/jlayton/git/ceph/src/msg/async/Event.cc:433
    #11 0x7f8eff86f919 in operator() /home/jlayton/git/ceph/src/msg/async/Stack.cc:53
    #12 0x7f8eff86f919 in _M_invoke /usr/include/c++/9/bits/std_function.h:300
    #13 0x7f8efe23b6f3  (/lib64/libstdc++.so.6+0xd76f3)

Thread T232 (msgr-worker-1) created by T20 here:
    #0 0x7f8f09424955 in pthread_create (/lib64/libasan.so.5+0x3a955)
    #1 0x7f8efe23b9b8 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib64/libstdc++.so.6+0xd79b8)
    #2 0x7f8eff86b48d in NetworkStack::start() /home/jlayton/git/ceph/src/msg/async/Stack.cc:137
    #3 0x7f8eff6f4fd3 in AsyncMessenger::AsyncMessenger(CephContext*, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) /home/jlayton/git/ceph/src/msg/async/AsyncMessenger.cc:293
    #4 0x7f8eff687c37 in Messenger::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, unsigned long) /home/jlayton/git/ceph/src/msg/Messenger.cc:31
    #5 0x7f8eff688855 in Messenger::create_client_messenger(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) /home/jlayton/git/ceph/src/msg/Messenger.cc:17
    #6 0x7f8effa880f9 in MonClient::get_monmap_and_config() /home/jlayton/git/ceph/src/mon/MonClient.cc:123
    #7 0x7f8f09178cfe in librados::v14_2_0::RadosClient::connect() /home/jlayton/git/ceph/src/librados/RadosClient.cc:237
    #8 0x7f8f08f6efc3 in _rados_connect /home/jlayton/git/ceph/src/librados/librados_c.cc:178
    #9 0x55f67746f493 in connect_cluster[abi:cxx11](void**) /home/jlayton/git/ceph/src/test/librados/test.cc:156
    #10 0x55f6772f9c8f in shutdown_racer_func /home/jlayton/git/ceph/src/test/librados/misc.cc:321
    #11 0x7f8efe23b6f3  (/lib64/libstdc++.so.6+0xd76f3)

Thread T20 created by T0 here:
    #0 0x7f8f09424955 in pthread_create (/lib64/libasan.so.5+0x3a955)
    #1 0x7f8efe23b9b8 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib64/libstdc++.so.6+0xd79b8)
    #2 0x55f67745e885 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2439
    #3 0x55f67745e885 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2475
    #4 0x55f677431eba in testing::Test::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2514
    #5 0x55f677431eba in testing::Test::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2504
    #6 0x55f6774321bc in testing::TestInfo::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2690
    #7 0x55f6774321bc in testing::TestInfo::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2663
    #8 0x55f6774323fe in testing::TestSuite::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2822
    #9 0x55f6774323fe in testing::TestSuite::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2801
    #10 0x55f67743379d in testing::internal::UnitTestImpl::RunAllTests() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:5332
    #11 0x55f67745f945 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2439
    #12 0x55f67745f945 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2475
    #13 0x55f677433d35 in testing::UnitTest::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:4920
    #14 0x55f6772f8234 in RUN_ALL_TESTS() /home/jlayton/git/ceph/src/googletest/googletest/include/gtest/gtest.h:2472
    #15 0x55f6772f8234 in main /home/jlayton/git/ceph/src/test/unit.cc:45
    #16 0x7f8efde5ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)

SUMMARY: AddressSanitizer: heap-use-after-free /home/jlayton/git/ceph/src/include/buffer.h:545 in ceph::buffer::v14_2_0::list::buffers_t::push_back(ceph::buffer::v14_2_0::ptr_node&)
Shadow bytes around the buggy address:
  0x0c0680130f30: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
  0x0c0680130f40: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa
  0x0c0680130f50: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
  0x0c0680130f60: fa fa fd fd fd fa fa fa fd fd fd fa fa fa 00 00
  0x0c0680130f70: 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa
=>0x0c0680130f80: 00 00 00 fa fa fa[fd]fd fd fa fa fa fd fd fd fa
  0x0c0680130f90: fa fa 00 00 00 fa fa fa fa fa fa fa fa fa fa fa
  0x0c0680130fa0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0680130fb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0680130fc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0680130fd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==12025==ABORTING

#6 Updated by Jeff Layton 9 months ago

FWIW, putting this in the global section of the config seems to work around a lot of the crashes. I think that suggests that this is somewhere down in the crypto code:

ms_mon_client_mode = crc

...UPDATE: I did hit this with that setting enabled, so there may be more than one bug in play here:

==20840==ERROR: AddressSanitizer: heap-use-after-free on address 0x60b0007debc0 at pc 0x7f3aac813b37 bp 0x7f395063b3e0 sp 0x7f395063b3d0
READ of size 8 at 0x60b0007debc0 thread T27841
    #0 0x7f3aac813b36 in ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::timer_thread() /home/jlayton/git/ceph/src/common/ceph_timer.h:112
    #1 0x7f3aa17eb6f3  (/lib64/libstdc++.so.6+0xd76f3)
    #2 0x7f3aac48c4bf in start_thread (/lib64/libpthread.so.0+0x84bf)
    #3 0x7f3aa14e8552 in __clone (/lib64/libc.so.6+0xfc552)

0x60b0007debc0 is located 0 bytes inside of 112-byte region [0x60b0007debc0,0x60b0007dec30)
freed by thread T301 here:
    #0 0x7f3aacaab0b5 in operator delete(void*, unsigned long) (/lib64/libasan.so.5+0x1110b5)
    #1 0x7f3aac8376cf in ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::cancel_event(unsigned long) /home/jlayton/git/ceph/src/common/ceph_timer.h:273
    #2 0x7f3aac78c6bb in Objecter::shutdown() /home/jlayton/git/ceph/src/osdc/Objecter.cc:515
    #3 0x7f3aac72661f in librados::v14_2_0::RadosClient::shutdown() /home/jlayton/git/ceph/src/librados/RadosClient.cc:375
    #4 0x7f3aac51f2a3 in _rados_shutdown /home/jlayton/git/ceph/src/librados/librados_c.cc:188
    #5 0x55d1a3750bf6 in shutdown_racer_func /home/jlayton/git/ceph/src/test/librados/misc.cc:322
    #6 0x7f3aa17eb6f3  (/lib64/libstdc++.so.6+0xd76f3)

previously allocated by thread T301 here:
    #0 0x7f3aacaa99d7 in operator new(unsigned long) (/lib64/libasan.so.5+0x10f9d7)
    #1 0x7f3aac75e42d in unsigned long ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::add_event<void (Objecter::*)(), Objecter*>(std::chrono::time_point<ceph::time_detail::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, void (Objecter::*&&)(), Objecter*&&) /home/jlayton/git/ceph/src/common/ceph_timer.h:213
    #2 0x7f3aac75e42d in unsigned long ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::add_event<void (Objecter::*)(), Objecter*>(std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, void (Objecter::*&&)(), Objecter*&&) /home/jlayton/git/ceph/src/common/ceph_timer.h:205
    #3 0x7f3aac75e42d in Objecter::start_tick() /home/jlayton/git/ceph/src/osdc/Objecter.cc:2109
    #4 0x7f3aac77581c in Objecter::start(OSDMap const*) /home/jlayton/git/ceph/src/osdc/Objecter.cc:412
    #5 0x7f3aac72af74 in librados::v14_2_0::RadosClient::connect() /home/jlayton/git/ceph/src/librados/RadosClient.cc:318
    #6 0x7f3aac51efc3 in _rados_connect /home/jlayton/git/ceph/src/librados/librados_c.cc:178
    #7 0x55d1a38c6333 in connect_cluster[abi:cxx11](void**) /home/jlayton/git/ceph/src/test/librados/test.cc:156
    #8 0x55d1a3750c5f in shutdown_racer_func /home/jlayton/git/ceph/src/test/librados/misc.cc:321
    #9 0x7f3aa17eb6f3  (/lib64/libstdc++.so.6+0xd76f3)

Thread T27841 created by T301 here:
    #0 0x7f3aac9d4955 in pthread_create (/lib64/libasan.so.5+0x3a955)
    #1 0x7f3aa17eb9b8 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib64/libstdc++.so.6+0xd79b8)
    #2 0x7f3aac729869 in librados::v14_2_0::RadosClient::connect() /home/jlayton/git/ceph/src/librados/RadosClient.cc:266
    #3 0x7f3aac51efc3 in _rados_connect /home/jlayton/git/ceph/src/librados/librados_c.cc:178
    #4 0x55d1a38c6333 in connect_cluster[abi:cxx11](void**) /home/jlayton/git/ceph/src/test/librados/test.cc:156
    #5 0x55d1a3750c5f in shutdown_racer_func /home/jlayton/git/ceph/src/test/librados/misc.cc:321
    #6 0x7f3aa17eb6f3  (/lib64/libstdc++.so.6+0xd76f3)

Thread T301 created by T0 here:
    #0 0x7f3aac9d4955 in pthread_create (/lib64/libasan.so.5+0x3a955)
    #1 0x7f3aa17eb9b8 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib64/libstdc++.so.6+0xd79b8)
    #2 0x55d1a38b5725 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2439
    #3 0x55d1a38b5725 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2475
    #4 0x55d1a3888d5a in testing::Test::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2514
    #5 0x55d1a3888d5a in testing::Test::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2504
    #6 0x55d1a388905c in testing::TestInfo::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2690
    #7 0x55d1a388905c in testing::TestInfo::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2663
    #8 0x55d1a388929e in testing::TestSuite::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2822
    #9 0x55d1a388929e in testing::TestSuite::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2801
    #10 0x55d1a388a63d in testing::internal::UnitTestImpl::RunAllTests() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:5332
    #11 0x55d1a38b67e5 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2439
    #12 0x55d1a38b67e5 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2475
    #13 0x55d1a388abd5 in testing::UnitTest::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:4920
    #14 0x55d1a374f204 in RUN_ALL_TESTS() /home/jlayton/git/ceph/src/googletest/googletest/include/gtest/gtest.h:2472
    #15 0x55d1a374f204 in main /home/jlayton/git/ceph/src/test/unit.cc:45
    #16 0x7f3aa140ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)

SUMMARY: AddressSanitizer: heap-use-after-free /home/jlayton/git/ceph/src/common/ceph_timer.h:112 in ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::timer_thread()
Shadow bytes around the buggy address:
  0x0c16800f3d20: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fa fa
  0x0c16800f3d30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3d40: fa fa fa fa fa fa fa fa fa fa fa fa 00 00 00 00
  0x0c16800f3d50: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa
  0x0c16800f3d60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c16800f3d70: fa fa fa fa fa fa fa fa[fd]fd fd fd fd fd fd fd
  0x0c16800f3d80: fd fd fd fd fd fd fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3d90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3da0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3db0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3dc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==20840==ABORTING

#8 Updated by Jeff Layton 9 months ago

  • Assignee set to Radoslaw Zarzynski

Reassigning this one to Radoslaw. I'll open a new tracker for the timer-related bug..

#9 Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from New to In Progress
  • Assignee deleted (Radoslaw Zarzynski)

#10 Updated by Radoslaw Zarzynski 9 months ago

  • Assignee set to Radoslaw Zarzynski

#11 Updated by Radoslaw Zarzynski 9 months ago

  • Subject changed from test: LibCephFS.ShutdownRace segfaults to test: LibCephFS.ShutdownRace segfaults (msgr v2 related part)

#12 Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from In Progress to Fix Under Review

#13 Updated by Kefu Chai 9 months ago

  • Related to Bug #40026: segfaults in ProtocolV2::write_message() added

#14 Updated by Brad Hubbard 9 months ago

  • Related to deleted (Bug #40026: segfaults in ProtocolV2::write_message())

#15 Updated by Brad Hubbard 9 months ago

  • Duplicated by Bug #40026: segfaults in ProtocolV2::write_message() added

#16 Updated by Sage Weil 9 months ago

  • Status changed from Fix Under Review to In Progress

#17 Updated by Nathan Cutler 9 months ago

  • Backport changed from nautilus,mimic to nautilus

Removing mimic backport because the subject line says "(msgr v2 related part)" and mimic does not have msgr v2

#18 Updated by chunsong feng 8 months ago

I also encountered the same problem, two threads operate on the same message, one thread in the ProtocolV2::write_message, and another thread executes ProtocolV2::stop. The solution is to add a lock in the ProtocolV2::write_message, so that another thread executes ProtocolV2::stop will wait for the lock, it will not crash.

Thread 13 (Thread 0xffff94d44200 (LWP 1533080)):
#0 0x0000ffff990a5618 in lll_lock_wait (futex=futex@entry=0xaaaadb98c298, private=0) at lowlevellock.c:46
#1 0x0000ffff9909e7d4 in _GI
_pthread_mutex_lock (mutex=0xaaaadb98c298) at pthread_mutex_lock.c:78
#2 0x0000ffff997ae400 in __gthread_mutex_lock (
_mutex=<optimized out>) at /usr/include/aarch64-linux-gnu/c++/9/bits/gthr-default.h:749
#3 std::mutex::lock (this=<optimized out>) at /usr/include/c++/9/bits/std_mutex.h:100
#4 ProtocolV2::write_message (this=this@entry=0xaaaadb8eb600, m=m@entry=0xaaaadbac3000, more=more@entry=false) at ./src/msg/async/ProtocolV2.cc:490
#5 0x0000ffff997c9290 in ProtocolV2::write_event (this=0xaaaadb8eb600) at ./src/msg/async/ProtocolV2.cc:622
#6 0x0000ffff9977aec8 in AsyncConnection::handle_write (this=0xaaaadb98c000) at /usr/include/c++/9/bits/unique_ptr.h:357
#7 0x0000ffff997d6978 in EventCenter::process_events (this=this@entry=0xaaaada096008, timeout_microseconds=<optimized out>,
working_dur=working_dur@entry=0xffff94d43880) at /usr/include/c++/9/bits/stl_deque.h:2110
#8 0x0000ffff997de610 in NetworkStack::<lambda()>::operator() (_closure=0xaaaada12fa78, __closure=0xaaaada12fa78) at ./src/msg/async/Stack.cc:53
#9 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (
_functor=...)
at /usr/include/c++/9/bits/std_function.h:300
#10 0x0000aaaad36803ac in std::function<void ()>::operator()() const ()
#11 0x0000aaaad39a9dd4 in void std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&) ()
#12 0x0000aaaad39a9d58 in std::__invoke_result<std::function<void ()>>::type std::__invoke<std::function<void ()>>(std::function<void ()>&&) ()
#13 0x0000aaaad39a9ce0 in void std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) ()
#14 0x0000aaaad39a9c98 in std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()() ()
#15 0x0000aaaad39a9c64 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run() ()
#16 0x0000ffff98f63ed4 in ?? () from target:/usr/lib/aarch64-linux-gnu/libstdc++.so.6
#17 0x0000ffff9909c088 in start_thread (arg=0xffffcf0b7d0f) at pthread_create.c:463
#18 0x0000ffff98d334ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 12 (Thread 0xffff8f539200 (LWP 1533079)):
#0 0x0000ffff990a5618 in lll_lock_wait (futex=futex@entry=0xaaaadb98c268, private=0) at lowlevellock.c:46
#1 0x0000ffff9909e7d4 in _GI
_pthread_mutex_lock (mutex=0xaaaadb98c268) at pthread_mutex_lock.c:78
#2 0x0000ffff997b41d8 in __gthread_mutex_lock (
_mutex=0xaaaadb98c268) at /usr/include/aarch64-linux-gnu/c++/9/bits/gthr-default.h:749
#3 std::mutex::lock (this=<optimized out>) at /usr/include/c++/9/bits/std_mutex.h:100
#4 std::lock_guard<std::mutex>::lock_guard (_m=..., this=<synthetic pointer>) at /usr/include/c++/9/bits/std_mutex.h:159
#5 ProtocolV2::stop (this=0xaaaadb8eb600) at ./src/msg/async/ProtocolV2.cc:171
#6 0x0000ffff997cc39c in ProtocolV2::handle_existing_connection (this=this@entry=0xaaaadba89080, existing=...) at /usr/include/c++/9/bits/unique_ptr.h:357
#7 0x0000ffff997cd8a0 in ProtocolV2::handle_client_ident (this=this@entry=0xaaaadba89080, payload=...) at ./src/msg/async/ProtocolV2.cc:2386
#8 0x0000ffff997cded8 in ProtocolV2::handle_frame_payload (this=this@entry=0xaaaadba89080) at ./src/msg/async/ProtocolV2.cc:1234
#9 0x0000ffff997ce0f4 in ProtocolV2::handle_read_frame_dispatch (this=0xaaaadba89080) at ./src/msg/async/ProtocolV2.cc:1134
#10 0x0000ffff997ce610 in ProtocolV2::handle_read_frame_epilogue_main (this=0xaaaadba89080, buffer=..., r=<optimized out>) at ./src/msg/async/ProtocolV2.cc:1360
#11 0x0000ffff997b4e28 in ProtocolV2::run_continuation (this=0xaaaadba89080, continuation=...) at ./src/msg/async/ProtocolV2.cc:45
#12 0x0000ffff9977c88c in std::function<void (char*, long)>::operator()(char*, long) const (
_args#1=0, _args#0=0xaaaadbaa6e00 "", this=0xaaaadb98e800)
at /usr/include/c++/9/bits/std_function.h:685
#13 AsyncConnection::process (this=0xaaaadb98e400) at ./src/msg/async/AsyncConnection.cc:456
#14 0x0000ffff997d62d4 in EventCenter::process_events (this=this@entry=0xaaaada095288, timeout_microseconds=<optimized out>,
working_dur=working_dur@entry=0xffff8f538880) at /usr/include/c++/9/bits/basic_ios.h:282
#15 0x0000ffff997de610 in NetworkStack::<lambda()>::operator() (
_closure=0xaaaada12fa48, _closure=0xaaaada12fa48) at ./src/msg/async/Stack.cc:53
#16 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (
_functor=...)
at /usr/include/c++/9/bits/std_function.h:300
#17 0x0000aaaad36803ac in std::function<void ()>::operator()() const ()
---Type <return> to continue, or q <return> to quit---
#18 0x0000aaaad39a9dd4 in void std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&) ()
#19 0x0000aaaad39a9d58 in std::__invoke_result<std::function<void ()>>::type std::__invoke<std::function<void ()>>(std::function<void ()>&&) ()
#20 0x0000aaaad39a9ce0 in void std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) ()
#21 0x0000aaaad39a9c98 in std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()() ()
#22 0x0000aaaad39a9c64 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run() ()
#23 0x0000ffff98f63ed4 in ?? () from target:/usr/lib/aarch64-linux-gnu/libstdc++.so.6
#24 0x0000ffff9909c088 in start_thread (arg=0xffffcf0b7d0f) at pthread_create.c:463
#25 0x0000ffff98d334ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

#19 Updated by Kefu Chai 8 months ago

  • Status changed from In Progress to Pending Backport

#20 Updated by Nathan Cutler 8 months ago

  • Copied to Backport #42856: nautilus: test: LibCephFS.ShutdownRace segfaults (msgr v2 related part) added

#21 Updated by Radoslaw Zarzynski 7 months ago

  • Related to Bug #43070: CephxSessionHandler::_calc_signature segv added

#22 Updated by David Zafman 5 months ago

  • Pull request ID set to 30717

#23 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#24 Updated by Brad Hubbard about 2 months ago

  • Duplicated by Bug #45523: mgr and mon crashed added

Also available in: Atom PDF