Bug #63647

open

SnapTrimEvent AddressSanitizer: heap-use-after-free

Added by Matan Breizman 5 months ago. Updated 2 days ago.

Status:
In Progress
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sanitized backtrace:

DEBUG 2023-11-16 08:42:48,441 [shard 0] osd - snaptrim_event(id=21122, detail=SnapTrimEvent(pgid=3.1 snapid=3cb needs_pause=1)): interrupted crimson::common::actingset_changed (acting set changed

Call stack:
    #0 0x55e310e0ace7 in seastar::shared_mutex::unlock() (/usr/bin/ceph-osd+0x1edd0ce7)
    #1 0x55e313325d9c in auto seastar::futurize_invoke<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::BlockingEvent::Trigger<crimson::osd::SnapTrimEvent> >::exit()::{lambda()#1}&>(crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::BlockingEvent::Trigger<crimson::osd::SnapTrimEvent> >::exit()::{lambda()#1}&) (/usr/bin/ceph-osd+0x212ebd9c)
    #2 0x55e3133260ef in _ZN7seastar20noncopyable_functionIFNS_6futureIvEEvEE17direct_vtable_forIZNS2_4thenIZN7crimson23OrderedConcurrentPhaseTINS7_3osd13SnapTrimEvent9WaitSubopEE11ExitBarrierINSC_13BlockingEvent7TriggerISA_EEE4exitEvEUlvE_S2_EET0_OT_EUlDpOT_E_E4callEPKS4_ (/usr/bin/ceph-osd+0x212ec0ef)
0x61500013365c is located 92 bytes inside of 472-byte region [0x615000133600,0x6150001337d8)
freed by thread T2 here:
    #0 0x7fb345ab73cf in operator delete(void*, unsigned long) (/lib64/libasan.so.6+0xb73cf)
    #1 0x55e313474863 in crimson::osd::SnapTrimEvent::~SnapTrimEvent() (/usr/bin/ceph-osd+0x2143a863)

previously allocated by thread T2 here:
    #0 0x7fb345ab6367 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6367)
    #1 0x55e31183ac18 in auto crimson::OperationRegistryI::create_operation<crimson::osd::SnapTrimEvent, crimson::osd::PG*, SnapMapper&, snapid_t const&, bool const&>(crimson::osd::PG*&&, SnapMapper&, snapid_t const&, bool const&) (/usr/bin/ceph-osd+0x1f800c18)
SUMMARY: AddressSanitizer: heap-use-after-free (/usr/bin/ceph-osd+0x1edd0ce7) in seastar::shared_mutex::unlock()

A fix attempt was introduced: https://github.com/ceph/ceph-ci/commit/abceb1652239629ed11187a5fc670a3b1a3a5bb1
However, this issue still appeared in newer tests:
https://pulpito.ceph.com/yingxin-2023-11-27_02:15:02-crimson-rados-wip-yingxin-crimson-improve-mempool5-distro-default-smithi/7467449
https://pulpito.ceph.com/yingxin-2023-11-27_02:15:02-crimson-rados-wip-yingxin-crimson-improve-mempool5-distro-default-smithi/7467459


Related issues 3 (1 open, 2 closed)

Related to crimson - Bug #63299: The lifecycle of SnapTrimObjSubEvent::WaitRepop should be extended in case of interruption (Resolved, Matan Breizman)

Related to crimson - Bug #64545: crimson: OrderedConcurrentPhase::ExitBarrier::exit() does not guarantee that phase survives (New, Samuel Just)

Has duplicate crimson - Bug #63845: crimson: use-after-free in seastar::shard_mutex::unlock() (Duplicate, Samuel Just)

Actions #1

Updated by Matan Breizman 5 months ago

  • Related to Bug #63299: The lifecycle of SnapTrimObjSubEvent::WaitRepop should be extended in case of interruption added
Actions #4

Updated by Matan Breizman 5 months ago

  • Status changed from New to In Progress
  • Pull request ID set to 54688
Actions #5

Updated by Matan Breizman 5 months ago

  • Assignee set to Matan Breizman
Actions #7

Updated by Matan Breizman 5 months ago

  • Has duplicate Bug #63845: crimson: use-after-free in seastar::shard_mutex::unlock() added
Actions #11

Updated by Samuel Just 2 months ago

  • Related to Bug #64545: crimson: OrderedConcurrentPhase::ExitBarrier::exit() does not guarantee that phase survives added
Actions #12

Updated by Samuel Just about 2 months ago

WaitSubop, WaitTrimTimer, and WaitRepop are pipeline stages local to the
operation. As such, they don't actually provide any ordering guarantees, as
only one operation will ever enter them. Rather, the intent is to hook into
the event system to expose information to an administrator.

This poses a problem for OrderedConcurrentPhase as it is currently implemented.
PipelineHandle::exit() is invoked prior to the op being destructed.
PipelineHandle::exit() does:

    void exit() {
        barrier.reset();
    }

For OrderedConcurrentPhase, ~ExitBarrier() invokes ExitBarrier::exit():

    void exit() final {
      if (barrier) {
        assert(phase);
        assert(phase->core == seastar::this_shard_id());
        std::ignore = std::move(*barrier
        ).then([phase=this->phase] {
          phase->mutex.unlock();
        });
        barrier = std::nullopt;
        phase = nullptr;
      } else if (phase) {
        assert(phase->core == seastar::this_shard_id());
        phase->mutex.unlock();
        phase = nullptr;
      }
    }

The problem is that exit() does not wait for the phase->mutex.unlock() to
occur. For SnapTrimEvent, the phase actually lives in the operation itself.
It's possible for that continuation

        ).then([phase=this->phase] {
          phase->mutex.unlock();

to run after the last finally() in ShardServices::start_operation completes
and releases the final reference to the SnapTrimEvent. Normally this is
harmless provided that the PG or connection outlives it, but it's a problem
for these stages.

For now, let's just remove these stages. We can reintroduce another mechanism
later to set these event flags without an actual pipeline stage.

This is likely a bug even with pipelines not embedded in an operation, but we
can fix it later -- https://tracker.ceph.com/issues/64545.

Actions #13

Updated by Samuel Just about 2 months ago

Removing the above noop stages created a different problem.

SnapTrimEvent doesn't actually do or block on WaitForActive, RecoverMissing,
GetOBC, or Process. Entering Process, in particular, causes problems unless we
immediately leave it, as SnapTrimObjSubEvent needs to enter and leave it to
complete. Entering one of the noop stages alluded to above had the side effect
of exiting Process -- without that exit, SnapTrimEvent and SnapTrimObjSubEvent
mutually block, preventing snap trim or client I/O from making progress.

SnapTrimObjSubEvent doesn't need WaitForActive, PGActivationBlocker,
or RecoverMissing.

Actions #14

Updated by Samuel Just about 2 months ago

I'm working on a minor refactor to handle the above issues.
