Project

General

Profile

Actions

Bug #42102

closed

use-after-free in Objecter timer handing

Added by Jeff Layton over 4 years ago. Updated over 2 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While hunting a crash in tracker #42026, I ran across this bug when testing with ASAN:

==20840==ERROR: AddressSanitizer: heap-use-after-free on address 0x60b0007debc0 at pc 0x7f3aac813b37 bp 0x7f395063b3e0 sp 0x7f395063b3d0
READ of size 8 at 0x60b0007debc0 thread T27841
    #0 0x7f3aac813b36 in ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::timer_thread() /home/jlayton/git/ceph/src/common/ceph_timer.h:112
    #1 0x7f3aa17eb6f3  (/lib64/libstdc++.so.6+0xd76f3)
    #2 0x7f3aac48c4bf in start_thread (/lib64/libpthread.so.0+0x84bf)
    #3 0x7f3aa14e8552 in __clone (/lib64/libc.so.6+0xfc552)

0x60b0007debc0 is located 0 bytes inside of 112-byte region [0x60b0007debc0,0x60b0007dec30)
freed by thread T301 here:
    #0 0x7f3aacaab0b5 in operator delete(void*, unsigned long) (/lib64/libasan.so.5+0x1110b5)
    #1 0x7f3aac8376cf in ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::cancel_event(unsigned long) /home/jlayton/git/ceph/src/common/ceph_timer.h:273
    #2 0x7f3aac78c6bb in Objecter::shutdown() /home/jlayton/git/ceph/src/osdc/Objecter.cc:515
    #3 0x7f3aac72661f in librados::v14_2_0::RadosClient::shutdown() /home/jlayton/git/ceph/src/librados/RadosClient.cc:375
    #4 0x7f3aac51f2a3 in _rados_shutdown /home/jlayton/git/ceph/src/librados/librados_c.cc:188
    #5 0x55d1a3750bf6 in shutdown_racer_func /home/jlayton/git/ceph/src/test/librados/misc.cc:322
    #6 0x7f3aa17eb6f3  (/lib64/libstdc++.so.6+0xd76f3)

previously allocated by thread T301 here:
    #0 0x7f3aacaa99d7 in operator new(unsigned long) (/lib64/libasan.so.5+0x10f9d7)
    #1 0x7f3aac75e42d in unsigned long ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::add_event<void (Objecter::*)(), Objecter*>(std::chrono::time_point<ceph::time_detail::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, void (Objecter::*&&)(), Objecter*&&) /home/jlayton/git/ceph/src/common/ceph_timer.h:213
    #2 0x7f3aac75e42d in unsigned long ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::add_event<void (Objecter::*)(), Objecter*>(std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, void (Objecter::*&&)(), Objecter*&&) /home/jlayton/git/ceph/src/common/ceph_timer.h:205
    #3 0x7f3aac75e42d in Objecter::start_tick() /home/jlayton/git/ceph/src/osdc/Objecter.cc:2109
    #4 0x7f3aac77581c in Objecter::start(OSDMap const*) /home/jlayton/git/ceph/src/osdc/Objecter.cc:412
    #5 0x7f3aac72af74 in librados::v14_2_0::RadosClient::connect() /home/jlayton/git/ceph/src/librados/RadosClient.cc:318
    #6 0x7f3aac51efc3 in _rados_connect /home/jlayton/git/ceph/src/librados/librados_c.cc:178
    #7 0x55d1a38c6333 in connect_cluster[abi:cxx11](void**) /home/jlayton/git/ceph/src/test/librados/test.cc:156
    #8 0x55d1a3750c5f in shutdown_racer_func /home/jlayton/git/ceph/src/test/librados/misc.cc:321
    #9 0x7f3aa17eb6f3  (/lib64/libstdc++.so.6+0xd76f3)

Thread T27841 created by T301 here:
    #0 0x7f3aac9d4955 in pthread_create (/lib64/libasan.so.5+0x3a955)
    #1 0x7f3aa17eb9b8 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib64/libstdc++.so.6+0xd79b8)
    #2 0x7f3aac729869 in librados::v14_2_0::RadosClient::connect() /home/jlayton/git/ceph/src/librados/RadosClient.cc:266
    #3 0x7f3aac51efc3 in _rados_connect /home/jlayton/git/ceph/src/librados/librados_c.cc:178
    #4 0x55d1a38c6333 in connect_cluster[abi:cxx11](void**) /home/jlayton/git/ceph/src/test/librados/test.cc:156
    #5 0x55d1a3750c5f in shutdown_racer_func /home/jlayton/git/ceph/src/test/librados/misc.cc:321
    #6 0x7f3aa17eb6f3  (/lib64/libstdc++.so.6+0xd76f3)

Thread T301 created by T0 here:
    #0 0x7f3aac9d4955 in pthread_create (/lib64/libasan.so.5+0x3a955)
    #1 0x7f3aa17eb9b8 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib64/libstdc++.so.6+0xd79b8)
    #2 0x55d1a38b5725 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2439
    #3 0x55d1a38b5725 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2475
    #4 0x55d1a3888d5a in testing::Test::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2514
    #5 0x55d1a3888d5a in testing::Test::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2504
    #6 0x55d1a388905c in testing::TestInfo::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2690
    #7 0x55d1a388905c in testing::TestInfo::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2663
    #8 0x55d1a388929e in testing::TestSuite::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2822
    #9 0x55d1a388929e in testing::TestSuite::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2801
    #10 0x55d1a388a63d in testing::internal::UnitTestImpl::RunAllTests() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:5332
    #11 0x55d1a38b67e5 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2439
    #12 0x55d1a38b67e5 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:2475
    #13 0x55d1a388abd5 in testing::UnitTest::Run() /home/jlayton/git/ceph/src/googletest/googletest/src/gtest.cc:4920
    #14 0x55d1a374f204 in RUN_ALL_TESTS() /home/jlayton/git/ceph/src/googletest/googletest/include/gtest/gtest.h:2472
    #15 0x55d1a374f204 in main /home/jlayton/git/ceph/src/test/unit.cc:45
    #16 0x7f3aa140ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)

SUMMARY: AddressSanitizer: heap-use-after-free /home/jlayton/git/ceph/src/common/ceph_timer.h:112 in ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::timer_thread()
Shadow bytes around the buggy address:
  0x0c16800f3d20: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fa fa
  0x0c16800f3d30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3d40: fa fa fa fa fa fa fa fa fa fa fa fa 00 00 00 00
  0x0c16800f3d50: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa
  0x0c16800f3d60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c16800f3d70: fa fa fa fa fa fa fa fa[fd]fd fd fd fd fd fd fd
  0x0c16800f3d80: fd fd fd fd fd fd fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3d90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3da0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3db0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c16800f3dc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==20840==ABORTING
Actions #1

Updated by Jeff Layton over 4 years ago

  • Assignee deleted (Jeff Layton)
Actions #2

Updated by Jeff Layton over 4 years ago

  • Severity changed from 3 - minor to 2 - major
Actions #3

Updated by Jeff Layton over 4 years ago

Found by running LibRadosMisc.ShutdownRace test built with -DWITH_ASAN=ON. I had to set:

ms_mon_client_mode = crc

...to work around the other AES-related bug in tracker #42026.

Actions #4

Updated by Jeff Layton over 4 years ago

  • Category set to common
Actions #5

Updated by Su Yue over 4 years ago

Jeff Layton wrote:

While hunting a crash in tracker #42026, I ran across this bug when testing with ASAN:

[...]

I don't know how Objecter works.
But from the code-wise:
ceph_timer.h: L131 timer_thread():
...
l.unlock();
<----Other thread may call cancel_event() to erase e?
e.f();
l.lock();
...

Actions #6

Updated by Su Yue over 4 years ago

Su Yue wrote:

Jeff Layton wrote:

While hunting a crash in tracker #42026, I ran across this bug when testing with ASAN:

[...]

I don't know how Objecter works.
But from the code-wise:
ceph_timer.h: L131 timer_thread():
...
l.unlock();
<----Other thread may call cancel_event() to erase e?

Correction: 'erase' -> 'delete'.

e.f();
l.lock();
...

Actions #7

Updated by Jeff Layton over 4 years ago

Su Yue wrote:

Jeff Layton wrote:

While hunting a crash in tracker #42026, I ran across this bug when testing with ASAN:

[...]

I don't know how Objecter works.
But from the code-wise:
ceph_timer.h: L131 timer_thread():
...
l.unlock();
<----Other thread may call cancel_event() to erase e?
e.f();
l.lock();
...

The timer handling code seems to use the event's presence in the "events" map as an indicator of whether it's safe to delete it. The event is removed from the map before the lock is dropped in timer_thread. cancel_event wouldn't be able to find it at that point to delete it.

I was wondering whether we might have a race with reschedule_me there, but that's only called by the Objecter when initialized is true, and that's set to false at the beginning of Objecter::shutdown.

I'm stumped.

Actions #8

Updated by Jeff Layton over 4 years ago

I will note that the test has to run for several minutes before the ASAN warning pops. ASAN does slow things down, but that suggests that the race might be tight.

One thing that seems a bit odd -- Objecter::start_tick() and Objecter::tick() both hold a shared_lock when they do the assignment of tick_event. shutdown() is the only other place that touches that variable, but it holds a unique_lock.

Could we be ending up in a situation where a tick_event is inserted into the maps, and the assignment is done, but then a racing thread overwrites that tick_event field with a 0? I may try to add some debug code into there to see if I can catch the race.

Actions #9

Updated by Jeff Layton over 4 years ago

I tested a Q&D patch that made start_tick take a unique_lock, but that didn't seem to fix the issue, so the race doesn't seem to be there.

Actions #10

Updated by Sage Weil over 4 years ago

2019-12-01T17:58:58.803 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc: [ RUN      ] LibRadosMisc.ShutdownRace
2019-12-01T17:58:58.804 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc: *** Caught signal (Segmentation fault) **
2019-12-01T17:58:58.804 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  in thread 7f36877f6700 thread_name:ceph_test_rados
2019-12-01T17:58:58.804 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  ceph version 15.0.0-7813-g16b8980 (16b8980c7cfc91c0afcd2122d91d2ac03ed7d399) octopus (dev)
2019-12-01T17:58:58.804 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  1: (()+0xf5f0) [0x7f36e63575f0]
2019-12-01T17:58:58.804 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  2: (()+0x16b566) [0x7f36db150566]
2019-12-01T17:58:58.804 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  3: (()+0x3aa869) [0x7f36dd29a869]
2019-12-01T17:58:58.804 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  4: (md_config_t::update_legacy_vals(ConfigValues&)+0x69) [0x7f36dd2a1a69]
2019-12-01T17:58:58.805 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  5: (md_config_t::md_config_t(ConfigValues&, ConfigTracker const&, bool)+0x1194b) [0x7f36dd2b35eb]
2019-12-01T17:58:58.805 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  6: (CephContext::CephContext(unsigned int, code_environment_t, int)+0x711) [0x7f36dd275b71]
2019-12-01T17:58:58.805 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  7: (common_preinit(CephInitParameters const&, code_environment_t, int)+0x4e) [0x7f36dd2993ae]
2019-12-01T17:58:58.805 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  8: (()+0x5c80a) [0x7f36e65c080a]
2019-12-01T17:58:58.805 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  9: (rados_create()+0x86) [0x7f36e65c0946]
2019-12-01T17:58:58.805 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  10: (connect_cluster(void**)+0xb4) [0x5556d3539c54]
2019-12-01T17:58:58.805 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  11: (()+0x3c5cd) [0x5556d34e35cd]
2019-12-01T17:58:58.806 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  12: (()+0x81da5f) [0x7f36dd70da5f]
2019-12-01T17:58:58.806 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  13: (()+0x7e65) [0x7f36e634fe65]
2019-12-01T17:58:58.806 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  14: (clone()+0x6d) [0x7f36db0e388d]
2019-12-01T17:58:58.806 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc: 2019-12-01T17:58:58.821+0000 7f36877f6700 -1 *** Caught signal (Segmentation fault) **
2019-12-01T17:58:58.806 INFO:tasks.workunit.client.0.smithi182.stdout:                 api_misc:  in thread 7f36877f6700 thread_name:ceph_test_rados

/a/sage-2019-12-01_16:04:14-rados-wip-sage4-testing-2019-11-30-0943-distro-basic-smithi/4558991
Actions #11

Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (common)
Actions #12

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Can't reproduce
Actions

Also available in: Atom PDF