Bug #52566
[pwl ssd] assert in _aio_stop() during shutdown
0%
Description
After fix https://tracker.ceph.com/issues/52235 by moving finish_op to the end of call back func. If aio_thread call back run in m_async_op_tracker.wait_for_ops(next_ctx) call back, bdev->close() call _aio_stop, then assert in aio_thread.join(). bdev->close() is in aio_thread if it run in wait_for_ops call back, and it will assert when aio_thread join itself.
[ RUN ] TestMockCacheSSDWriteLog.read_hit_part_ssd_cache Failed to load class: cas (lib/libcls_cas.so): lib/libcls_cas.so: undefined symbol: _Z26cls_get_manifest_ref_countPvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE Failed to load class: cmpomap (lib/libcls_cmpomap.so): lib/libcls_cmpomap.so: undefined symbol: _Z28cls_cxx_map_get_vals_by_keysPvRKSt3setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4lessIS6_ESaIS6_EEPSt3mapIS6_N4ceph6buffer7v15_2_04listES8_SaISt4pairIKS6_SH_EEE Failed to load class: fifo (lib/libcls_fifo.so): lib/libcls_fifo.so: undefined symbol: _Z20cls_gen_random_bytesPci Failed to load class: log (lib/libcls_log.so): lib/libcls_log.so: undefined symbol: _Z24cls_cxx_map_write_headerPvPN4ceph6buffer7v15_2_04listE Failed to load class: lua (lib/libcls_lua.so): lib/libcls_lua.so: undefined symbol: _Z19cls_current_versionPv Failed to load class: rgw (lib/libcls_rgw.so): lib/libcls_rgw.so: undefined symbol: _Z19cls_current_versionPv Failed to load class: test_remote_reads (lib/libcls_test_remote_reads.so): lib/libcls_test_remote_reads.so: undefined symbol: _Z14cls_cxx_gatherPvRKSt3setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4lessIS6_ESaIS6_EERKS6_PKcSG_RN4ceph6buffer7v15_2_04listE Failed to load class: user (lib/libcls_user.so): lib/libcls_user.so: undefined symbol: _Z24cls_cxx_map_write_headerPvPN4ceph6buffer7v15_2_04listE Thread::join(): pthread_join failed with error 35 ../src/common/Thread.cc: In function 'int Thread::join(void**)' thread 7fdaeaffd700 time 2021-09-09T13:51:57.712515+0000 ../src/common/Thread.cc: 182: FAILED ceph_assert(status == 0) ceph version 17.0.0-7399-gcdf3741806a (cdf3741806af5d104294113ec99033e56bad7cce) quincy (dev) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12b) [0x7fdea42c3c92] 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7fdea42c3ec8] 3: (Thread::join(void**)+0xea) [0x7fdea428f1ce] 4: (KernelDevice::_aio_stop()+0x56a) [0x7fdea4f77942] 5: (KernelDevice::close()+0x1a9) [0x7fdea4f77b73] 6: (librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::remove_pool_file()+0x34) [0x560b79152c12] 7: (librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::shut_down(Context*)::{lambda(int)#4}::operator()(int) const+0x362) [0x560b791293b6] 8: (LambdaContext<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::shut_down(Context*)::{lambda(int)#4}>::finish(int)+0xd) [0x560b79129579] 9: (Context::complete(int)+0x9) [0x560b781eac3b] 10: (librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::flush_dirty_entries(Context*)+0x301) [0x560b7910e7a3] 11: (LambdaContext<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::shut_down(Context*)::{lambda(int)#5}>::finish(int)+0x29f) [0x560b7910eec3] 12: (Context::complete(int)+0x9) [0x560b781eac3b] 13: (std::_Function_handler<void (Context*, int), librbd::cache::pwl::TestMockCacheSSDWriteLog::expect_op_work_queue(librbd::MockImageCtx&)::{lambda(Context*, int)#1}>::_M_invoke(std::_Any_data const&, Context*&&, int&&)+0x12) [0x560b7914230d] 14: (std::function<void (Context*, int)>::operator()(Context*, int) const+0x21) [0x560b782b98cf] 15: (decltype (((forward<std::function<void (Context*, int)> const&>)({parm#1}))((get<0ul>)((forward<std::tuple<Context*, int> >)({parm#2})), (get<1ul>)((forward<std::tuple<Context*, int> >)({parm#2})))) testing::internal::ApplyImpl<std::function<void (Context*, int)> const&, std::tuple<Context*, int>, 0ul, 1ul>(std::function<void (Context*, int)> const&, std::tuple<Context*, int>&&, testing::internal::IndexSequence<0ul, 1ul>)+0x12) [0x560b782b98eb] 16: (decltype (ApplyImpl((forward<std::function<void (Context*, int)> const&>)({parm#1}), (forward<std::tuple<Context*, int> >)({parm#2}), (testing::internal::MakeIndexSequenceImpl<std::tuple_size<std::remove_reference<std::tuple<Context*, int> >::type>::value>::type)())) testing::internal::Apply<std::function<void (Context*, int)> const&, std::tuple<Context*, int> >(std::function<void (Context*, int)> const&, std::tuple<Context*, int>&&)+0x9) [0x560b782b98f9] 17: (testing::internal::FunctionMocker<void (Context*, int)>::UntypedPerformAction(void const*, void*) const+0x45) [0x560b782b9b07] 18: (testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith(void*)+0x115f) [0x560b797b7937] 19: (librbd::util::detail::C_AsyncCallback<librbd::MockContextWQ>::finish(int)+0x51) [0x560b781f0221] 20: (Context::complete(int)+0x9) [0x560b781eac3b] 21: (AsyncOpTracker::finish_op()+0x277) [0x7fdea4212c7f] 22: (librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#2}::operator()(int) const+0x555) [0x560b79154e89] 23: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#2}>::finish(int)+0xd) [0x560b79154eeb] 24: (Context::complete(int)+0x9) [0x560b781eac3b] 25: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::update_root_scheduled_ops()::{lambda(int)#4}>::finish(int)+0x1a3) [0x560b79159a8d] 26: (Context::complete(int)+0x9) [0x560b781eac3b] 27: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::update_root_scheduled_ops()::{lambda(int)#5}>::finish(int)+0xaa) [0x560b79159b66] 28: (Context::complete(int)+0x9) [0x560b781eac3b] 29: (librbd::cache::pwl::ssd::WriteLog<librbd::MockImageCtx>::aio_cache_cb(void*, void*)+0x18) [0x560b791454f8] 30: (KernelDevice::_aio_thread()+0x900) [0x7fdea4f75d24] 31: (KernelDevice::AioCompletionThread::entry()+0xd) [0x7fdea4f8711d] ./seg-fault-test: line 6: 1134903 Aborted (core dumped) bin/unittest_librbd --gtest_filter='TestMockCacheSSDWriteLog.read_hit_part_ssd_cache' segment fault
Related issues
History
#1 Updated by CONGMIN YIN over 2 years ago
- Related to Bug #52235: [pwl] deadlock on AbstractWriteLog::m_lock during shutdown added
#2 Updated by CONGMIN YIN over 2 years ago
- Description updated (diff)
#3 Updated by CONGMIN YIN over 2 years ago
- Pull request ID set to 42950
#4 Updated by Deepika Upadhyay over 2 years ago
- Description updated (diff)
#5 Updated by CONGMIN YIN over 2 years ago
- Subject changed from pwl ssd] assert in _aio_stop() during shutdown to [pwl ssd] assert in _aio_stop() during shutdown
#6 Updated by Deepika Upadhyay over 2 years ago
- Status changed from New to Fix Under Review
#7 Updated by Deepika Upadhyay over 2 years ago
@CONGMIN YIN is this reproducable using
-DWITH_RBD_SSD_CACHE=ON -DWITH_RBD_RWL=ON
then running the vstart cluster using:
MON=1 OSD=1 MGR=0 MDS=0 RGW=0 ../src/vstart.sh -n -d -o "debug librbd = 20 debug ms = 1 debug rbd = 20 rbd_default_features = 61 rbd_persistent_cache_mode = ssd rbd_persistent_cache_path = /home/ideepika/ceph2/ceph/build/write_back_cache rbd_persistent_cache_size = 1073741824 rbd_plugins = pwl_cache"
running test(on master without this fix) using:
export RBD_FEATURES=61;bin/unittest_librbd --gtest_filter='TestMockCacheSSDWriteLog.*' --gtest_repeat=40
were you able to reproduce the bug, if so can you help me know what I might be missing?
#8 Updated by CONGMIN YIN over 2 years ago
were you able to reproduce the bug, if so can you help me know what I might be missing?
Yes, I still can reproduce it. This may be difficult to reproduce.
1. As I described, this will occur only after the issue https://tracker.ceph.com/issues/52235 was resolved. You need to apply the https://github.com/ceph/ceph/pull/42950/commits/7da4d71ead2bf1d7ca49fc2b60d7e12f2df9de47 to fix dead lock issue first.
2. when do_cmake, need add configure to enable pwl:
-DWITH_RBD_SSD_CACHE=ON -DWITH_RBD_RWL=ON
3. don't start any cluster, if there's ceph.conf exist in /etc/ceph/ceph.conf or build/ceph.conf, please remove, guarantee unitest uses the default configuration. I guess that outputting log will affect the timing of bug during shutdown.
4. run
cd build
ninja -j 64 unittest_librbd
bin/unittest_librbd --gtest_filter='TestMockCacheSSDWriteLog.read_hit_part_ssd_cache' --gtest_repeat=100
#9 Updated by Deepika Upadhyay over 2 years ago
was able to reproduce both deadlock and this issue; fix seems to work as well
#10 Updated by Deepika Upadhyay over 2 years ago
- Status changed from Fix Under Review to Pending Backport
#11 Updated by Ilya Dryomov about 2 years ago
- Status changed from Pending Backport to Resolved
Backported to pacific under https://tracker.ceph.com/issues/52235.