Bug #63647
SnapTrimEvent AddressSanitizer: heap-use-after-free
Status: Open
Description
Sanitized backtrace:
DEBUG 2023-11-16 08:42:48,441 [shard 0] osd - snaptrim_event(id=21122, detail=SnapTrimEvent(pgid=3.1 snapid=3cb needs_pause=1)): interrupted crimson::common::actingset_changed (acting set changed
Use-after-free backtrace (ceph-osd, userspace):
#0 0x55e310e0ace7 in seastar::shared_mutex::unlock() (/usr/bin/ceph-osd+0x1edd0ce7)
#1 0x55e313325d9c in auto seastar::futurize_invoke<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::BlockingEvent::Trigger<crimson::osd::SnapTrimEvent> >::exit()::{lambda()#1}&>(crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::BlockingEvent::Trigger<crimson::osd::SnapTrimEvent> >::exit()::{lambda()#1}&) (/usr/bin/ceph-osd+0x212ebd9c)
#2 0x55e3133260ef in _ZN7seastar20noncopyable_functionIFNS_6futureIvEEvEE17direct_vtable_forIZNS2_4thenIZN7crimson23OrderedConcurrentPhaseTINS7_3osd13SnapTrimEvent9WaitSubopEE11ExitBarrierINSC_13BlockingEvent7TriggerISA_EEE4exitEvEUlvE_S2_EET0_OT_EUlDpOT_E_E4callEPKS4_ (/usr/bin/ceph-osd+0x212ec0ef)
0x61500013365c is located 92 bytes inside of 472-byte region [0x615000133600,0x6150001337d8)
freed by thread T2 here:
#0 0x7fb345ab73cf in operator delete(void*, unsigned long) (/lib64/libasan.so.6+0xb73cf)
#1 0x55e313474863 in crimson::osd::SnapTrimEvent::~SnapTrimEvent() (/usr/bin/ceph-osd+0x2143a863)
previously allocated by thread T2 here:
#0 0x7fb345ab6367 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6367)
#1 0x55e31183ac18 in auto crimson::OperationRegistryI::create_operation<crimson::osd::SnapTrimEvent, crimson::osd::PG*, SnapMapper&, snapid_t const&, bool const&>(crimson::osd::PG*&&, SnapMapper&, snapid_t const&, bool const&) (/usr/bin/ceph-osd+0x1f800c18)
SUMMARY: AddressSanitizer: heap-use-after-free (/usr/bin/ceph-osd+0x1edd0ce7) in seastar::shared_mutex::unlock()
A fix attempt was introduced: https://github.com/ceph/ceph-ci/commit/abceb1652239629ed11187a5fc670a3b1a3a5bb1
However, this issue still appeared in newer tests:
https://pulpito.ceph.com/yingxin-2023-11-27_02:15:02-crimson-rados-wip-yingxin-crimson-improve-mempool5-distro-default-smithi/7467449
https://pulpito.ceph.com/yingxin-2023-11-27_02:15:02-crimson-rados-wip-yingxin-crimson-improve-mempool5-distro-default-smithi/7467459
Updated by Matan Breizman 5 months ago
- Related to Bug #63299: The lifecycle of SnapTrimObjSubEvent::WaitRepop should be extended in case of interruption added
Updated by Matan Breizman 5 months ago
- Status changed from New to In Progress
- Pull request ID set to 54688
Updated by Matan Breizman 5 months ago
Updated by Matan Breizman 5 months ago
- Has duplicate Bug #63845: crimson: use-after-free in seastar::shard_mutex::unlock() added
Updated by Samuel Just 2 months ago
- Related to Bug #64545: crimson: OrderedConcurrentPhase::ExitBarrier::exit() does not guarrantee that phase survives added
Updated by Samuel Just about 2 months ago
WaitSubop, WaitTrimTimer, and WaitRepop are pipeline stages local to the
operation. As such, they don't actually provide any ordering guarantees,
since only one operation will ever enter them. Rather, the intent is to hook
into the event system to expose information to an administrator.
This poses a problem for OrderedConcurrentPhase as it is currently implemented.
PipelineHandle::exit() is invoked prior to the op being destructed.
PipelineHandle::exit() does:
void exit() { barrier.reset(); }
For OrderedConcurrentPhase, ~ExitBarrier() invokes ExitBarrier::exit():
void exit() final {
  if (barrier) {
    assert(phase);
    assert(phase->core == seastar::this_shard_id());
    std::ignore = std::move(*barrier
    ).then([phase=this->phase] {
      phase->mutex.unlock();
    });
    barrier = std::nullopt;
    phase = nullptr;
  } else if (phase) {
    assert(phase->core == seastar::this_shard_id());
    phase->mutex.unlock();
    phase = nullptr;
  }
}
The problem comes in not waiting for the phase->mutex.unlock() to occur. For
SnapTrimEvent, phase is actually in the operation itself. It's possible for
that continuation
).then([phase=this->phase] { phase->mutex.unlock();
to occur after the last finally() in ShardServices::start_operation completes
and releases the final reference to SnapTrimEvent. This is harmless normally
provided that the PG or connection outlives it, but it's a problem for these
stages.
For now, let's just remove these stages. We can reintroduce another mechanism
later to set these event flags without an actual pipeline stage.
This is likely a bug even with pipelines not embedded in an operation, but we
can fix it later -- https://tracker.ceph.com/issues/64545.
Updated by Samuel Just about 2 months ago
Removing the above noop stages created a different problem.
SnapTrimEvent doesn't actually do or block on WaitForActive, RecoverMissing,
GetOBC, or Process. Entering Process, in particular, causes problems unless
we immediately leave it, as SnapTrimObjSubEvent needs to enter and leave it
to complete. Entering one of the noop stages alluded to above had the side
effect of exiting Process; without that exit, SnapTrimEvent and
SnapTrimObjSubEvent mutually block, preventing snap trim or client IO from
making progress.
SnapTrimObjSubEvent doesn't need WaitForActive, PGActivationBlocker,
or RecoverMissing.
Updated by Samuel Just about 2 months ago
I'm working on a minor refactor to handle the above issues.
Updated by Matan Breizman 12 days ago · Edited
- Assignee deleted (Matan Breizman)
- Priority changed from Normal to Urgent
- Pull request ID deleted (54688)