Project

General

Profile

Bug #52566

[pwl ssd] assert in _aio_stop() during shutdown

Added by CONGMIN YIN over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After fix https://tracker.ceph.com/issues/52235 by moving finish_op to the end of call back func. If aio_thread call back run in m_async_op_tracker.wait_for_ops(next_ctx) call back, bdev->close() call _aio_stop, then assert in aio_thread.join(). bdev->close() is in aio_thread if it run in wait_for_ops call back, and it will assert when aio_thread join itself.


[ RUN      ] TestMockCacheSSDWriteLog.read_hit_part_ssd_cache
Failed to load class: cas (lib/libcls_cas.so): lib/libcls_cas.so: undefined symbol: _Z26cls_get_manifest_ref_countPvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
Failed to load class: cmpomap (lib/libcls_cmpomap.so): lib/libcls_cmpomap.so: undefined symbol: _Z28cls_cxx_map_get_vals_by_keysPvRKSt3setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4lessIS6_ESaIS6_EEPSt3mapIS6_N4ceph6buffer7v15_2_04listES8_SaISt4pairIKS6_SH_EEE
Failed to load class: fifo (lib/libcls_fifo.so): lib/libcls_fifo.so: undefined symbol: _Z20cls_gen_random_bytesPci
Failed to load class: log (lib/libcls_log.so): lib/libcls_log.so: undefined symbol: _Z24cls_cxx_map_write_headerPvPN4ceph6buffer7v15_2_04listE
Failed to load class: lua (lib/libcls_lua.so): lib/libcls_lua.so: undefined symbol: _Z19cls_current_versionPv
Failed to load class: rgw (lib/libcls_rgw.so): lib/libcls_rgw.so: undefined symbol: _Z19cls_current_versionPv
Failed to load class: test_remote_reads (lib/libcls_test_remote_reads.so): lib/libcls_test_remote_reads.so: undefined symbol: _Z14cls_cxx_gatherPvRKSt3setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4lessIS6_ESaIS6_EERKS6_PKcSG_RN4ceph6buffer7v15_2_04listE
Failed to load class: user (lib/libcls_user.so): lib/libcls_user.so: undefined symbol: _Z24cls_cxx_map_write_headerPvPN4ceph6buffer7v15_2_04listE
Thread::join(): pthread_join failed with error 35
../src/common/Thread.cc: In function 'int Thread::join(void**)' thread 7fdaeaffd700 time 2021-09-09T13:51:57.712515+0000
../src/common/Thread.cc: 182: FAILED ceph_assert(status == 0)
 ceph version 17.0.0-7399-gcdf3741806a (cdf3741806af5d104294113ec99033e56bad7cce) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12b) [0x7fdea42c3c92]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7fdea42c3ec8]
 3: (Thread::join(void**)+0xea) [0x7fdea428f1ce]
 4: (KernelDevice::_aio_stop()+0x56a) [0x7fdea4f77942]
 5: (KernelDevice::close()+0x1a9) [0x7fdea4f77b73]
 6: (librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::remove_pool_file()+0x34) [0x560b79152c12]
 7: (librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::shut_down(Context*)::{lambda(int)#4}::operator()(int) const+0x362) [0x560b791293b6]
 8: (LambdaContext<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::shut_down(Context*)::{lambda(int)#4}>::finish(int)+0xd) [0x560b79129579]
 9: (Context::complete(int)+0x9) [0x560b781eac3b]
 10: (librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::flush_dirty_entries(Context*)+0x301) [0x560b7910e7a3]
 11: (LambdaContext<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::shut_down(Context*)::{lambda(int)#5}>::finish(int)+0x29f) [0x560b7910eec3]
 12: (Context::complete(int)+0x9) [0x560b781eac3b]
 13: (std::_Function_handler<void (Context*, int), librbd::cache::pwl::TestMockCacheSSDWriteLog::expect_op_work_queue(librbd::MockImageCtx&)::{lambda(Context*, int)#1}>::_M_invoke(std::_Any_data const&, Context*&&, int&&)+0x12) [0x560b7914230d]
 14: (std::function<void (Context*, int)>::operator()(Context*, int) const+0x21) [0x560b782b98cf]
 15: (decltype (((forward<std::function<void (Context*, int)> const&>)({parm#1}))((get<0ul>)((forward<std::tuple<Context*, int> >)({parm#2})), (get<1ul>)((forward<std::tuple<Context*, int> >)({parm#2})))) testing::internal::ApplyImpl<std::function<void (Context*, int)> const&, std::tuple<Context*, int>, 0ul, 1ul>(std::function<void (Context*, int)> const&, std::tuple<Context*, int>&&, testing::internal::IndexSequence<0ul, 1ul>)+0x12) [0x560b782b98eb]
 16: (decltype (ApplyImpl((forward<std::function<void (Context*, int)> const&>)({parm#1}), (forward<std::tuple<Context*, int> >)({parm#2}), (testing::internal::MakeIndexSequenceImpl<std::tuple_size<std::remove_reference<std::tuple<Context*, int> >::type>::value>::type)())) testing::internal::Apply<std::function<void (Context*, int)> const&, std::tuple<Context*, int> >(std::function<void (Context*, int)> const&, std::tuple<Context*, int>&&)+0x9) [0x560b782b98f9]
 17: (testing::internal::FunctionMocker<void (Context*, int)>::UntypedPerformAction(void const*, void*) const+0x45) [0x560b782b9b07]
 18: (testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith(void*)+0x115f) [0x560b797b7937]
 19: (librbd::util::detail::C_AsyncCallback<librbd::MockContextWQ>::finish(int)+0x51) [0x560b781f0221]
 20: (Context::complete(int)+0x9) [0x560b781eac3b]
 21: (AsyncOpTracker::finish_op()+0x277) [0x7fdea4212c7f]
 22: (librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#2}::operator()(int) const+0x555) [0x560b79154e89]
 23: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#2}>::finish(int)+0xd) [0x560b79154eeb]
 24: (Context::complete(int)+0x9) [0x560b781eac3b]
 25: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::update_root_scheduled_ops()::{lambda(int)#4}>::finish(int)+0x1a3) [0x560b79159a8d]
 26: (Context::complete(int)+0x9) [0x560b781eac3b]
 27: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::update_root_scheduled_ops()::{lambda(int)#5}>::finish(int)+0xaa) [0x560b79159b66]
 28: (Context::complete(int)+0x9) [0x560b781eac3b]
 29: (librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::aio_cache_cb(void*, void*)+0x18) [0x560b791454f8]
 30: (KernelDevice::_aio_thread()+0x900) [0x7fdea4f75d24]
 31: (KernelDevice::AioCompletionThread::entry()+0xd) [0x7fdea4f8711d]
./seg-fault-test: line 6: 1134903 Aborted                 (core dumped) bin/unittest_librbd --gtest_filter='TestMockCacheSSDWriteLog.read_hit_part_ssd_cache'
segment fault


Related issues

Related to rbd - Bug #52235: [pwl] deadlock on AbstractWriteLog::m_lock during shutdown Resolved

History

#1 Updated by CONGMIN YIN over 2 years ago

  • Related to Bug #52235: [pwl] deadlock on AbstractWriteLog::m_lock during shutdown added

#2 Updated by CONGMIN YIN over 2 years ago

  • Description updated (diff)

#3 Updated by CONGMIN YIN over 2 years ago

  • Pull request ID set to 42950

#4 Updated by Deepika Upadhyay over 2 years ago

  • Description updated (diff)

#5 Updated by CONGMIN YIN over 2 years ago

  • Subject changed from pwl ssd] assert in _aio_stop() during shutdown to [pwl ssd] assert in _aio_stop() during shutdown

#6 Updated by Deepika Upadhyay over 2 years ago

  • Status changed from New to Fix Under Review

#7 Updated by Deepika Upadhyay over 2 years ago

@CONGMIN YIN is this reproducable using

-DWITH_RBD_SSD_CACHE=ON -DWITH_RBD_RWL=ON

then running the vstart cluster using:

MON=1 OSD=1 MGR=0 MDS=0 RGW=0 ../src/vstart.sh -n -d -o "debug librbd = 20
debug ms = 1
debug rbd = 20
rbd_default_features = 61
rbd_persistent_cache_mode = ssd
rbd_persistent_cache_path = /home/ideepika/ceph2/ceph/build/write_back_cache
rbd_persistent_cache_size = 1073741824
rbd_plugins = pwl_cache" 

running test(on master without this fix) using:

export RBD_FEATURES=61;bin/unittest_librbd --gtest_filter='TestMockCacheSSDWriteLog.*' --gtest_repeat=40

were you able to reproduce the bug, if so can you help me know what I might be missing?

#8 Updated by CONGMIN YIN over 2 years ago

were you able to reproduce the bug, if so can you help me know what I might be missing?

Yes, I still can reproduce it. This may be difficult to reproduce.
1. As I described, this will occur only after the issue https://tracker.ceph.com/issues/52235 was resolved. You need to apply the https://github.com/ceph/ceph/pull/42950/commits/7da4d71ead2bf1d7ca49fc2b60d7e12f2df9de47 to fix dead lock issue first.

2. when do_cmake, need add configure to enable pwl:

-DWITH_RBD_SSD_CACHE=ON -DWITH_RBD_RWL=ON

3. don't start any cluster, if there's ceph.conf exist in /etc/ceph/ceph.conf or build/ceph.conf, please remove, guarantee unitest uses the default configuration. I guess that outputting log will affect the timing of bug during shutdown.

4. run

cd build
ninja -j 64 unittest_librbd
bin/unittest_librbd --gtest_filter='TestMockCacheSSDWriteLog.read_hit_part_ssd_cache' --gtest_repeat=100

#9 Updated by Deepika Upadhyay over 2 years ago

was able to reproduce both deadlock and this issue; fix seems to work as well

#10 Updated by Deepika Upadhyay over 2 years ago

  • Status changed from Fix Under Review to Pending Backport

#11 Updated by Ilya Dryomov about 2 years ago

  • Status changed from Pending Backport to Resolved

Backported to pacific under https://tracker.ceph.com/issues/52235.

Also available in: Atom PDF