Bug #63299

The lifecycle of SnapTrimObjSubEvent::WaitRepop should be extended in case of interruption

Added by Yingxin Cheng 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

See the heap-use-after-free issue in https://github.com/ceph/ceph/pull/53537#issuecomment-1770106603

It shows that OrderedConcurrentPhaseT::mutex is freed before the asynchronous calls to its unlock() method during OrderedConcurrentPhaseT::ExitBarrier::exit(), specifically at the code [1].

The reason is that SnapTrimObjSubEvent::repop (which is an OrderedConcurrentPhaseT) is attached to the operation (SnapTrimObjSubEvent) itself. So when an interruption happens inside the phase SnapTrimObjSubEvent::repop, the operation is interrupted and SnapTrimObjSubEvent is freed together with SnapTrimObjSubEvent::repop. But it is not guaranteed that the code at [1] is called before the destruction.

We need to find a way to either extend the life of SnapTrimObjSubEvent or change the ownership of SnapTrimObjSubEvent::repop accordingly.
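To illustrate the first option (extending the lifetime), here is a minimal sketch in plain C++, not the actual crimson/Seastar code and not necessarily what the eventual fix does. All names (Phase, Op, TaskQueue, exit_phase, run_demo) are hypothetical stand-ins, and std::shared_ptr is used here in place of crimson's own reference counting. The point is only that the deferred unlock holds a reference to the operation, so the embedded phase cannot be freed before the unlock runs.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for the phase: it owns the mutex-like state.
struct Phase {
  bool locked = false;
  void lock() { locked = true; }
  void unlock() { locked = false; }
};

static int live_ops = 0;  // counts Op objects that are currently alive

// Hypothetical stand-in for the operation that embeds the phase.
struct Op {
  Phase repop;
  Op() { ++live_ops; }
  ~Op() { --live_ops; }
};

// Stand-in for the reactor's queue of deferred continuations.
using TaskQueue = std::vector<std::function<void()>>;

// Queue the deferred unlock. The lambda captures the shared_ptr by value,
// so the Op (and its embedded Phase) stays alive until the task is destroyed.
void exit_phase(const std::shared_ptr<Op>& op, TaskQueue& queue) {
  queue.push_back([op] { op->repop.unlock(); });
}

// Returns (live_ops before the deferred task runs) * 10 + (live_ops afterwards).
int run_demo() {
  TaskQueue queue;
  {
    auto op = std::make_shared<Op>();
    op->repop.lock();
    exit_phase(op, queue);
  }  // "interruption": the last external reference is dropped here
  const int before = live_ops;  // the queued task still holds the Op
  for (auto& task : queue) task();
  queue.clear();  // destroying the tasks releases the Op
  return before * 10 + live_ops;
}
```

In this sketch the Op is still alive when the queued unlock runs and is destroyed only once the queue is cleared. Capturing a raw pointer or reference to the phase instead would reproduce the use-after-free pattern described above.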

Related osd log signature showing that there is a related interruption:

INFO  2023-10-18 03:23:24,506 [shard 2] osd - snaptrimobj_subevent(id=33613684, detail=SnapTrimObjSubEvent(coid=71:c9e278a4:::abc:3 snapid=3)): writing updated snapset on 71:c9e278a4:::abc:head, snapset is 3=[]:{}
DEBUG 2023-10-18 03:23:24,507 [shard 2] osd -  pg_epoch 361 pg[71.13( v 359'7 (0'0,359'7] local-lis/les=356/357 n=3 ec=356/356 lis/c=356/356 les/c/f=357/357/0 sis=356) [3,2] r=0 lpr=356 luod=360'9 lua=0'0 crt=361'11 mlcod 359'7 active+clean+snaptrim  ReplicatedBackend::_submit_transaction: object 71:c9e278a4:::abc:3
DEBUG 2023-10-18 03:23:26,513 [shard 2] osd -  pg_epoch 363 pg[71.13( v 359'7 (0'0,359'7] local-lis/les=356/357 n=3 ec=356/356 lis/c=356/356 les/c/f=357/357/0 sis=363 pruub=7.132057667s) [] r=-1 lpr=363 pi=[356,363)/1 crt=361'11 mlcod 0'0 snaptrim pruub 444.213714600s@  ObjectContextLoader::notify_on_change: interrupting obc: 71:c9e278a4:::abc:head
DEBUG 2023-10-18 03:23:26,976 [shard 2] osd -  pg_epoch 363 pg[71.13( v 359'7 (0'0,359'7] local-lis/les=356/357 n=3 ec=356/356 lis/c=356/356 les/c/f=357/357/0 sis=363 pruub=7.132057667s) [] r=-1 lpr=363 pi=[356,363)/1 crt=361'11 mlcod 0'0 snaptrim NOTIFY pruub 444.213714600s@  ObjectContextLoader::with_head_obc: released object 71:c9e278a4:::abc:head
DEBUG 2023-10-18 03:23:26,976 [shard 2] osd -  pg_epoch 363 pg[71.13( v 359'7 (0'0,359'7] local-lis/les=356/357 n=3 ec=356/356 lis/c=356/356 les/c/f=357/357/0 sis=363 pruub=7.132057667s) [] r=-1 lpr=363 pi=[356,363)/1 crt=361'11 mlcod 0'0 snaptrim NOTIFY pruub 444.213714600s@  ObjectContextLoader::with_head_obc: released object 71:c9e278a4:::abc:head
DEBUG 2023-10-18 03:23:27,129 [shard 2] osd - snaptrimobj_subevent(id=33613684, detail=SnapTrimObjSubEvent(coid=71:c9e278a4:::abc:3 snapid=3)): exit
...(heap-use-after-free)

[1] https://github.com/ceph/ceph/blob/dd37ce5506a559f0a02f3ef3da29bde06ee97205/src/crimson/common/operation.h#L699


Related issues 1 (1 open, 0 closed)

Related to crimson - Bug #63647: SnapTrimEvent AddressSanitizer: heap-use-after-free (In Progress, Samuel Just)

#2

Updated by Matan Breizman 6 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 54431
#3

Updated by Matan Breizman 6 months ago

  • Assignee set to Matan Breizman
#5

Updated by Matan Breizman 6 months ago

  • Related to Bug #63647: SnapTrimEvent AddressSanitizer: heap-use-after-free added
#6

Updated by Matan Breizman 6 months ago

  • Status changed from Fix Under Review to Resolved

Yingxin Cheng wrote:

Issue is still present after https://github.com/ceph/ceph/pull/54513

See https://pulpito.ceph.com/yingxin-2023-11-27_02:15:02-crimson-rados-wip-yingxin-crimson-improve-mempool5-distro-default-smithi/
7467449, 7467459

IIUC this is a different issue:

Reactor stalled for 254 ms on shard 0. Backtrace: 0x45d5d 0x2c67ec1e 0x2c67ffcc 0x2c68151a 0x2c68189e 0x2c6819e8 0x2c681e3e 0x54daf 0x295323b9ed25 0x295323ba178a 0x295323ba2410 0x29532163e2d3 0x2c2d1aa9 0x295323ba2684 0x295323b91be9 0x295323b91cb5 0x295323b83165 0x295323b84fea 0xd6280 0x32402 0xbd907 0xbd194 0xbdc7a 0x1ec16ac7 0x2111255e 0x211128b1 0x1e7e570c 0x1e80a8f9 0x1e8282f7 0x2c62295b 0x2c6bc51c 0x2c8da55e 0x2c8dc281 0x2c3354f2 0x2c3373fb 0x1eb826c8 0x3feaf 0x3ff5f 0x1e62ba44
kernel callstack:
    #0 0x55bcdf96eac7 in seastar::shared_mutex::unlock() (/usr/bin/ceph-osd+0x1ec16ac7)
    #1 0x55bce1e6a55e in auto seastar::futurize_invoke<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimObjSubEvent::WaitRepop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimObjSubEvent::WaitRepop>::BlockingEvent::Trigger<crimson::osd::SnapTrimObjSubEvent> >::exit()::{lambda()#1}&>(crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimObjSubEvent::WaitRepop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimObjSubEvent::WaitRepop>::BlockingEvent::Trigger<crimson::osd::SnapTrimObjSubEvent> >::exit()::{lambda()#1}&) (/usr/bin/ceph-osd+0x2111255e)
    #2 0x55bce1e6a8b1 in _ZN7seastar20noncopyable_functionIFNS_6futureIvEEvEE17direct_vtable_forIZNS2_4thenIZN7crimson23OrderedConcurrentPhaseTINS7_3osd19SnapTrimObjSubEvent9WaitRepopEE11ExitBarrierINSC_13BlockingEvent7TriggerISA_EEE4exitEvEUlvE_S2_EET0_OT_EUlDpOT_E_E4callEPKS4_ (/usr/bin/ceph-osd+0x211128b1)
    #3 0x55bcdf53d70c in auto seastar::internal::future_invoke<seastar::noncopyable_function<seastar::future<void> ()>&, seastar::internal::monostate>(seastar::noncopyable_function<seastar::future<void> ()>&, seastar::internal::monostate&&) (/usr/bin/ceph-osd+0x1e7e570c)
    #4 0x55bcdf5628f9 in void seastar::futurize<seastar::future<void> >::satisfy_with_result_of<seastar::future<void>::then_impl_nrvo<seastar::noncopyable_function<seastar::future<void> ()>, seastar::future<void> >(seastar::noncopyable_function<seastar::future<void> ()>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> ()>&, seastar::future_state<seastar::internal::monostate>&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> ()>&, seastar::future_state<seastar::internal::monostate>&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> ()>&&) (/usr/bin/ceph-osd+0x1e80a8f9)
    #5 0x55bcdf5802f7 in seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> ()>, seastar::future<void>::then_impl_nrvo<seastar::noncopyable_function<seastar::future<void> ()>, seastar::future<void> >(seastar::noncopyable_function<seastar::future<void> ()>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> ()>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() (/usr/bin/ceph-osd+0x1e8282f7)
    #6 0x55bced37a95b in seastar::reactor::run_tasks(seastar::reactor::task_queue&) (/usr/bin/ceph-osd+0x2c62295b)
    #7 0x55bced41451c in seastar::reactor::run_some_tasks() (/usr/bin/ceph-osd+0x2c6bc51c)
    #8 0x55bced63255e in seastar::reactor::do_run() (/usr/bin/ceph-osd+0x2c8da55e)
    #9 0x55bced634281 in seastar::reactor::run() (/usr/bin/ceph-osd+0x2c8dc281)
    #10 0x55bced08d4f2 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) (/usr/bin/ceph-osd+0x2c3354f2)
    #11 0x55bced08f3fb in seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) (/usr/bin/ceph-osd+0x2c3373fb)
    #12 0x55bcdf8da6c8 in main (/usr/bin/ceph-osd+0x1eb826c8)
    #13 0x7f0fe223feaf in __libc_start_call_main (/lib64/libc.so.6+0x3feaf)
    #14 0x7f0fe223ff5f in __libc_start_main_impl (/lib64/libc.so.6+0x3ff5f)
    #15 0x55bcdf383a44 in _start (/usr/bin/ceph-osd+0x1e62ba44)

0x6190002e05ec is located 364 bytes inside of 920-byte region [0x6190002e0480,0x6190002e0818)
freed by thread T0 here:
    #0 0x7f0fe48b73cf in operator delete(void*, unsigned long) (/lib64/libasan.so.6+0xb73cf)
    #1 0x55bce1f8d408 in crimson::osd::SnapTrimObjSubEvent::~SnapTrimObjSubEvent() (/usr/bin/ceph-osd+0x21235408)

previously allocated by thread T0 here:
    #0 0x7f0fe48b6367 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6367)
    #1 0x55bce1cdbad2 in auto crimson::osd::PerShardState::start_operation_may_interrupt<crimson::interruptible::interruptor<crimson::osd::IOInterruptCondition>, crimson::osd::SnapTrimObjSubEvent, boost::intrusive_ptr<crimson::osd::PG>&, hobject_t const&, snapid_t const&>(boost::intrusive_ptr<crimson::osd::PG>&, hobject_t const&, snapid_t const&) (/usr/bin/ceph-osd+0x20f83ad2)

SUMMARY: AddressSanitizer: heap-use-after-free (/usr/bin/ceph-osd+0x1ec16ac7) in seastar::shared_mutex::unlock()

This tracker refers to SnapTrimObjSubEvent issues only. Let's continue investigating the SnapTrimEvent issues in https://tracker.ceph.com/issues/63647.
