Sporadic failure in TestImageReplayer.StartReplayAndWrite
Analyzing the log, the following appears to happen in the test:
1) ImageReplayer::process_entry is called -> Replay::process -> handle_event (AIO write event) -> create_aio_modify_completion, which stores on_safe(1) in m_aio_modify_unsafe_contexts.
2) ImageReplayer::flush is called -> Replay::create_aio_flush_completion, which moves m_aio_modify_unsafe_contexts (with on_safe(1)) to C_AioFlushComplete::on_safe_ctxs.
3) Replay::handle_aio_flush_complete (for the flush in 2) is called, and in the "strip out previously failed on_safe contexts" block on_safe(1) is removed from on_safe_ctxs, because it is not found in m_aio_modify_safe_contexts (handle_aio_modify_complete has not yet been called to store on_safe(1) in this list).
4) Replay::handle_aio_modify_complete is called: on_safe(1) is stored (forever) in m_aio_modify_safe_contexts.
The AioCompletion needs to be "started" before the librbd::journal::Replay lock is released within create_aio_modify_completion. This would prevent an out-of-band flush event from racing with the start of the op and corrupting the internal state.
This issue would only affect the unit tests and the asok flush command.
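
The race described above can be illustrated with a small toy model. This is only a sketch, not the real librbd::journal::Replay class: ReplayModel, Outcome, replay_sequence and the use of plain ints in place of Context pointers are all hypothetical names invented for illustration. It assumes that a flush completion silently drops any on_safe context it cannot find in the "safe" set, as described in step 3. The "modify completes first" ordering stands in for the effect of the proposed change (the flush cannot overtake the op) and does not model the mechanism itself.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <set>

// Toy stand-in for the bookkeeping in librbd::journal::Replay (hypothetical).
struct ReplayModel {
  std::list<int> unsafe_ctxs;  // models m_aio_modify_unsafe_contexts
  std::set<int> safe_ctxs;     // models m_aio_modify_safe_contexts
  std::list<int> fired;        // on_safe contexts that were actually completed

  // Step 1: a write event is replayed; its on_safe context waits for a flush.
  void create_aio_modify_completion(int on_safe) {
    unsafe_ctxs.push_back(on_safe);
  }

  // Step 2: a flush takes ownership of every pending on_safe context.
  std::list<int> create_aio_flush_completion() {
    std::list<int> ctxs;
    ctxs.swap(unsafe_ctxs);
    return ctxs;
  }

  // The modify finished: remember its context until the flush arrives.
  void handle_aio_modify_complete(int on_safe) {
    safe_ctxs.insert(on_safe);
  }

  // The flush finished: contexts missing from safe_ctxs are assumed to have
  // failed earlier and are dropped; the rest are fired.
  void handle_aio_flush_complete(const std::list<int>& ctxs) {
    for (int ctx : ctxs) {
      if (safe_ctxs.erase(ctx) > 0) {
        fired.push_back(ctx);
      }
    }
  }
};

struct Outcome {
  std::size_t fired;   // contexts completed by the flush
  std::size_t leaked;  // contexts left behind in safe_ctxs
};

// Replays one write followed by one flush, with the two completions
// delivered in the requested order.
Outcome replay_sequence(bool flush_completes_first) {
  ReplayModel replay;
  replay.create_aio_modify_completion(1);
  std::list<int> flush_ctxs = replay.create_aio_flush_completion();
  assert(flush_ctxs.size() == 1);  // the flush now owns on_safe(1)

  if (flush_completes_first) {
    // Racy order from the log: steps 3 and 4.
    replay.handle_aio_flush_complete(flush_ctxs);
    replay.handle_aio_modify_complete(1);
  } else {
    // Order expected once the flush can no longer overtake the modify.
    replay.handle_aio_modify_complete(1);
    replay.handle_aio_flush_complete(flush_ctxs);
  }
  return {replay.fired.size(), replay.safe_ctxs.size()};
}
```

In this model, replay_sequence(true) fires no context and leaves on_safe(1) stranded in safe_ctxs, while replay_sequence(false) fires it and leaves nothing behind.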