Bug #63422


librbd crash in journal discard wait_event

Added by Joshua Baergen 6 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: reef,quincy,pacific
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When enabling journal-based mirroring for 4 volumes, 2 VMs crashed in librbd (mix of original backtrace and translated backtrace below):

/home/ceph-build/ceph-build/Ubuntu/WORKDIR/ceph-16.2.13-133-ga5a112d/src/librbd/Journal.cc: In function 'librbd::Journal<ImageCtxT>::Future librbd::Journal<ImageCtxT>::wait_event(ceph::mutex&, uint64_t, Context*) [with ImageCtxT = librbd::ImageCtx; librbd::Journal<ImageCtxT>::Future = journal::Future; ceph::mutex = std::mutex; uint64_t = long unsigned int]' thread 7f9f2effd700 time 2023-11-02T15:52:25.213408+0000
/home/ceph-build/ceph-build/Ubuntu/WORKDIR/ceph-16.2.13-133-ga5a112d/src/librbd/Journal.cc: 1019: FAILED ceph_assert(it != m_events.end())
 ceph version 16.2.13-133-ga5a112d (a5a112dc153167aadfd9bda69fde6cb6531dfb63) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f9f37650d85]
 2: /usr/lib/ceph/libceph-common.so.2(+0x267f8d) [0x7f9f37650f8d]

librbd::Journal<librbd::ImageCtx>::wait_event(std::mutex&, unsigned long, Context*)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/Journal.cc:1019 (discriminator 1)

librbd::Journal<librbd::ImageCtx>::wait_event(unsigned long, Context*)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/Journal.cc:1009 (discriminator 1)

librbd::journal::ObjectDispatch<librbd::ImageCtx>::wait_or_flush_event(unsigned long, int, Context*)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/journal/ObjectDispatch.cc:253

librbd::journal::ObjectDispatch<librbd::ImageCtx>::discard(unsigned long, unsigned long, unsigned long, std::shared_ptr<neorados::IOContext>, int, ZTracer::Trace const&, int*, unsigned long*, librbd::io::DispatchResult*, Context**, Context*)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/journal/ObjectDispatch.cc:105

librbd::io::ObjectDispatcher<librbd::ImageCtx>::SendVisitor::operator()(librbd::io::ObjectDispatchSpec::DiscardRequest&) const
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/ObjectDispatcher.cc:58

librbd::io::Dispatcher<librbd::ImageCtx, librbd::io::ObjectDispatcherInterface>::send(librbd::io::ObjectDispatchSpec*)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/Dispatcher.h:130

librbd::io::ObjectDispatchSpec::C_Dispatcher::complete(int)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/ObjectDispatchSpec.cc:21

librbd::cache::WriteAroundObjectDispatch<librbd::ImageCtx>::handle_in_flight_io_complete(int, unsigned long, unsigned long, unsigned long, unsigned long)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/cache/WriteAroundObjectDispatch.cc:416

Context::complete(int)
./obj-x86_64-linux-gnu/src/librbd/./src/include/Context.h:100

librbd::journal::(anonymous namespace)::C_CommitIOEvent<librbd::ImageCtx>::finish(int)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/journal/ObjectDispatch.cc:66

Context::complete(int)
./obj-x86_64-linux-gnu/src/librbd/./src/include/Context.h:100

Context::complete(int)
./obj-x86_64-linux-gnu/src/librbd/./src/include/Context.h:100

librbd::io::SimpleSchedulerObjectDispatch<librbd::ImageCtx>::register_in_flight_request(unsigned long, utime_t const&, Context**)::{lambda(int)#1}::operator()(int) const
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/SimpleSchedulerObjectDispatch.cc:445

Context::complete(int)
./obj-x86_64-linux-gnu/src/librbd/./src/include/Context.h:100

librbd::io::ObjectDispatchSpec::C_Dispatcher::finish(int)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/ObjectDispatchSpec.cc:34

librbd::io::ObjectDispatchSpec::C_Dispatcher::complete(int)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/ObjectDispatchSpec.cc:30

librbd::io::ObjectRequest<librbd::ImageCtx>::finish(int)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/ObjectRequest.cc:195

librbd::io::AbstractObjectWriteRequest<librbd::ImageCtx>::post_write_object_map_update()
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/ObjectRequest.cc:634

librbd::io::AbstractObjectWriteRequest<librbd::ImageCtx>::handle_write_object(int)
./obj-x86_64-linux-gnu/src/librbd/./src/librbd/io/ObjectRequest.cc:554

_ZN5boost4asio6detail11executor_opIN4ceph5async17ForwardingHandlerINS4_17CompletionHandlerIZN6librbd4asio4util20get_callback_adapterIZNS7_2io26AbstractObjectWriteRequestINS7_8ImageCtxEE12write_objectEvEUliE2_EEDaOT_EUlNS_6system10error_codeEDpOT_E_St5tupleIJSJ_EEEEEESaINS4_6detail14CompletionImplINS0_10io_context13executor_typeESN_vJSJ_EEEENS1_19scheduler_operationEE11do_completeEPvPSY_RKSJ_m
./obj-x86_64-linux-gnu/src/librbd/./obj-x86_64-linux-gnu/boost/include/boost/asio/detail/executor_op.hpp:73

This is a custom build of 16.2.13 with some minor patches from us and some backports from 16.2.14. We don't have much logging enabled VM-side so nothing of interest to report from there.


Files

librbd.log (14.9 KB) - Joshua Baergen, 11/03/2023 08:35 PM

Related issues 3 (0 open, 3 closed)

Copied to rbd - Backport #63745: pacific: librbd crash in journal discard wait_event (Resolved, Joshua Baergen)
Copied to rbd - Backport #63746: reef: librbd crash in journal discard wait_event (Resolved, Joshua Baergen)
Copied to rbd - Backport #63747: quincy: librbd crash in journal discard wait_event (Resolved, Joshua Baergen)
Actions #1

Updated by Joshua Baergen 6 months ago

I was able to reproduce this by attaching an rbd image to a VM, then doing the following simultaneously (totally possible that this is overkill, but I was trying to mimic the user workload plus introduce some 4M misalignment):

# while :; do blkdiscard -o 4096 -p 8388608 /dev/sda; done
# while :; do sync /dev/sda; done
# dd if=/dev/urandom of=/dev/sda bs=16384

Attaching a debug_rbd=20 log.

Actions #2

Updated by Joshua Baergen 6 months ago

(to be clear - in that log, the problem happens on the second wait_event for tid=132078)

Actions #3

Updated by Joshua Baergen 6 months ago

Looking at the code, I'm not sure how this was supposed to work. In AbstractImageWriteRequest<I>::send_request(), if there are multiple object extents to submit, we append journal entries for them, but only the last journal tid is used for all of the submitted extents. Shouldn't each extent be associated with the correct journal tid?
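
Roughly, the pattern I'm describing looks like this (a simplified sketch, not the actual send_request() code; append_journal_event() and send_object_request() are illustrative stand-ins for the real calls):

// Simplified sketch of the pattern described above; not the actual librbd code.
#include <cstdint>
#include <vector>

struct ObjectExtent { uint64_t offset, length; };

uint64_t append_journal_event(const ObjectExtent&);                   // stand-in
void send_object_request(const ObjectExtent&, uint64_t journal_tid);  // stand-in

void send_request_sketch(const std::vector<ObjectExtent>& object_extents) {
  uint64_t journal_tid = 0;
  for (const auto& extent : object_extents) {
    // a journal entry is appended per extent, but only the last tid is kept...
    journal_tid = append_journal_event(extent);
  }
  for (const auto& extent : object_extents) {
    // ...and every object request ends up associated with that last tid
    send_object_request(extent, journal_tid);
  }
}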

Actions #4

Updated by Ilya Dryomov 6 months ago

Joshua Baergen wrote:

Looking at the code, I'm not sure how this was supposed to work. In AbstractImageWriteRequest<I>::send_request(), if there are multiple object extents to submit, we append journal entries for them, but only the last journal tid is used for all of the submitted extents. Shouldn't each extent be associated with the correct journal tid?

Mykola might be able to answer this question.

Actions #5

Updated by Joshua Baergen 6 months ago

OK, looking forward to more info! I started to look into modifications along the lines of my thinking in #3, but quickly realized that the journal events deal in image_extents whereas the rest of this path deals in object_extents, which makes things a bit more difficult.

Actions #6

Updated by Mykola Golub 6 months ago

Joshua Baergen wrote:

Looking at the code, I'm not sure how this was supposed to work. In AbstractImageWriteRequest<I>::send_request(), if there are multiple object extents to submit, we append journal entries for them, but only the last journal tid is used for all of the submitted extents. Shouldn't each extent be associated with the correct journal tid?

In AbstractImageWriteRequest<I>::send_request() we submit requests to write to the journal, and then in journal::ObjectDispatch we just wait for them to complete. We only need to know the latest tid: when it completes, all previous ones are expected to be complete too.

Actions #7

Updated by Joshua Baergen 6 months ago

Thanks Mykola - that confirms what I suspected was the intent here. I've posted a proposed fix for the issue reported here at https://github.com/ceph/ceph/pull/54362; it includes a somewhat lengthy description of what is going wrong and the solution that I chose. Could folks please take a look, confirm that my understanding of the problem is correct, and comment on my solution? I originally wrote this patch against 16.2.13, and there it passes the reproducer I posted earlier in this ticket.

Actions #8

Updated by Joshua Baergen 6 months ago

Note that the solution as currently posted is incorrect: between the two days I worked on it I forgot that I was working in object offsets rather than image offsets, and on top of that I was not previously aware of RBD striping. Other options I'm exploring, on which I would love some comments (and I'm open to more suggestions!):
  1. Can we submit a single journal event for all image extents and have it know about how many object requests will be issued?
  2. Do we need to commit journal events via the dispatcher path? Can they automatically clean themselves up, and then all the dispatcher path needs to worry about is waiting on the event if it still exists?
Actions #9

Updated by Joshua Baergen 6 months ago

Digging into journal events further, there already is a mechanism for keeping an event around until all parts have been committed: pending extents. This extent interval gets updated as each part of the journal event gets committed, and only once the interval is empty does the event get cleaned up.

The issue here of course is that all of the object requests get associated with the last journal event, which in turn has only a subset of the overall interval, and thus some of the object requests will not be represented in the event.
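
A rough sketch of that bookkeeping as I understand it (illustrative only; types and names are simplified, and the real code keeps an interval set on the event):

// Illustrative sketch of the pending-extents mechanism described above; not
// the actual librbd code. The event's pending set shrinks as each part of the
// event commits, and the event is only removed once nothing is left pending.
#include <cstdint>
#include <map>
#include <set>
#include <utility>

struct Event {
  std::set<std::pair<uint64_t, uint64_t>> pending_extents;  // {offset, length}
};

void commit_io_event_extent(std::map<uint64_t, Event>& events,
                            uint64_t tid, uint64_t offset, uint64_t length) {
  auto it = events.find(tid);
  if (it == events.end()) {
    return;  // event already fully committed and cleaned up
  }
  it->second.pending_extents.erase({offset, length});  // simplified: exact match
  if (it->second.pending_extents.empty()) {
    events.erase(it);  // all parts committed; the event can go away
  }
}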

Here's what I wrote in the PR:

In the case where an image request is split across multiple object extents and journaling is enabled, multiple journal events are appended. Prior to this change, all object requests would wait for the last journal event to complete, since journal events complete in order and thus the last one completing implies that all prior journal events were safe at that point.

While this is conceptually correct, journal events don't have a mechanism for knowing how many requests might be associated with them (though, as noted above, the extent interval is supposed to do this). Thus, it's entirely possible for the following sequence to occur:

  1. An image request gets split into two object requests. Journal events are appended, one object request gets submitted and starts waiting on the last journal event (which also causes a C_CommitIOEvent to be instantiated against that journal event), but the other object request gets delayed due to an overlap.
  2. Journaling completes, and the C_CommitIOEvent fires, which causes the event to be cleaned up.
  3. The second object request from above is allowed to make progress; it tries to wait for the journal event that was just cleaned up which causes the assert in wait_event() to fire.

It might seem that a simple solution exists: allow wait_event() to return early when the event no longer exists. Indeed, the code worked this way when originally written and was later changed to enforce that the event exists. However, there appears to be another problem here, which is that no C_CommitIOEvent is instantiated for the earlier journal events that were appended, which means they are actually leaked and hang around forever. As far as I can tell, this happens today whenever multiple journal events are appended.
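
For reference, the check that fires is essentially this lookup (paraphrased from the ceph_assert at Journal.cc:1019 shown in the backtrace; the rest of wait_event() is omitted):

// Paraphrased fragment of the failing check; not the full wait_event() code.
auto it = m_events.find(tid);       // look up the journal event being waited on
ceph_assert(it != m_events.end());  // fires when the event was already cleaned up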

Actions #10

Updated by Joshua Baergen 6 months ago

Here's another attempt at fixing this that I think better honours the original intent of the relationship between image requests and journal events: https://github.com/ceph/ceph/pull/54377

Please let me know if you think I'm on the right track this time! What's left that I know of:
  • Fix WriteSame in the same way as Write and Discard
  • Refactor CompareAndWrite so that I can clean up the duplicated code
  • Update some tests, probably
Actions #11

Updated by Joshua Baergen 6 months ago

While trying to write a unit test covering this, I've come to realize that this condition is probably only possible on the discard path, and only recently. Up until 21a26a752843295ff946d1543c2f5f9fac764593 (librbd: Fix local rbd mirror journals growing forever), unless I'm missing something, m_image_extents always contained a single extent for all I/O types. That commit made it possible for discard to carry multiple extents, since it translates object requests back to image extents without trying to merge them. It was fixing 7ca1bab (librbd: align discard requests to default bluestore allocation size), which first became available in Nautilus.

It makes me wonder if there is any utility in allowing multiple image extents for an image request (and whether I'm just missing the place where this is actually useful).

Actions #12

Updated by Joshua Baergen 6 months ago

As a consequence of the above, a very simple workaround for this crash appears to be to set rbd_skip_partial_discard=false.
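
For example, one way to apply that on the client side (adjust to how you manage client configuration; I'm assuming a release with the MON config store here):

# set it centrally for all librbd clients
ceph config set client rbd_skip_partial_discard false

# or set it in ceph.conf on the client host
[client]
rbd_skip_partial_discard = false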

Actions #13

Updated by Mykola Golub 6 months ago

  • Status changed from New to Fix Under Review
  • Backport set to reef, quincy, pacific
  • Pull request ID set to 54377
Actions #14

Updated by Ilya Dryomov 5 months ago

  • Assignee set to Joshua Baergen
Actions #15

Updated by Ilya Dryomov 5 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from reef, quincy, pacific to reef,quincy,pacific
Actions #16

Updated by Backport Bot 5 months ago

  • Copied to Backport #63745: pacific: librbd crash in journal discard wait_event added
Actions #17

Updated by Backport Bot 5 months ago

  • Copied to Backport #63746: reef: librbd crash in journal discard wait_event added
Actions #18

Updated by Backport Bot 5 months ago

  • Copied to Backport #63747: quincy: librbd crash in journal discard wait_event added
Actions #19

Updated by Backport Bot 5 months ago

  • Tags set to backport_processed
Actions #20

Updated by Backport Bot 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #21

Updated by Raimund Sacherer about 1 month ago

Joshua Baergen wrote:

As a consequence of the above, a very simple workaround for this crash appears to be to set rbd_skip_partial_discard=false.

Hello Joshua,

In this comment, do you mean to set rbd_skip_partial_discard to false, or to true? It seems it already defaults to false.

Thank you

Actions #22

Updated by Joshua Baergen about 1 month ago

Hey Raimund, as far as I can tell, the default is true in all recent releases. Which release do you see this in? What does "ceph config help rbd_skip_partial_discard" say the default is?

Actions #23

Updated by Raimund Sacherer about 1 month ago

Raimund Sacherer wrote:

Joshua Baergen wrote:

As a consequence of the above, a very simple workaround for this crash appears to be to set rbd_skip_partial_discard=false.

Hello Joshua,

In this comment, do you mean to set rbd_skip_partial_discard to false, or to true? It seems it already defaults to false.

Thank you

Please ignore my comment. Unfortunately I got confused because the documentation (wrongly) stated that this defaults to false. It turns out the default is actually true, and setting it to false fixed the issue.

Best regards
Raimund
