Bug #16708

Sporadic failure in TestImageReplayer.StartReplayAndWrite

Added by Jason Dillaman almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Target version: -
Start date: 07/18/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

From Mykola:

Analyzing the log, it looks like the following happens in the test:

1) ImageReplayer::process_entry is called -> Replay::process ->
handle_event (AIO write event) -> create_aio_modify_completion,
which stores on_safe(1) in m_aio_modify_unsafe_contexts.

2) ImageReplayer::flush is called -> Replay::create_aio_flush_completion,
which moves m_aio_modify_unsafe_contexts (containing on_safe(1)) to
C_AioFlushComplete::on_safe_ctxs.

3) Replay::handle_aio_flush_complete (for the flush in step 2) is called.
In the "strip out previously failed on_safe contexts" block, on_safe(1)
is removed from on_safe_ctxs because it is not found in
m_aio_modify_safe_contexts (handle_aio_modify_complete has not yet
been called to store on_safe(1) in that list).

4) Replay::handle_aio_modify_complete is called, and on_safe(1) is
stored (forever) in m_aio_modify_safe_contexts.
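
The four steps above can be replayed in a small single-threaded model. This is only a sketch: `ReplayModel`, the integer context ids, and `run_reported_interleaving` are hypothetical stand-ins, and the containers only mirror the member names mentioned in the analysis; it is not the librbd::journal::Replay code.

```cpp
#include <cassert>
#include <list>
#include <set>
#include <vector>

// Hypothetical, simplified model of the bookkeeping described in steps 1-4.
// An on_safe context is modeled as a plain integer id.
struct ReplayModel {
  using Ctx = int;
  std::list<Ctx> unsafe_contexts;  // stands in for m_aio_modify_unsafe_contexts
  std::set<Ctx> safe_contexts;     // stands in for m_aio_modify_safe_contexts
  std::vector<Ctx> completed;      // on_safe contexts that were actually fired

  // Step 1: a write event registers its on_safe context as "unsafe".
  void create_aio_modify_completion(Ctx on_safe) {
    unsafe_contexts.push_back(on_safe);
  }

  // Step 2: a flush takes over every currently pending unsafe context.
  std::list<Ctx> create_aio_flush_completion() {
    std::list<Ctx> on_safe_ctxs;
    on_safe_ctxs.swap(unsafe_contexts);
    return on_safe_ctxs;
  }

  // Step 3: the flush completes; contexts that are not (yet) recorded as
  // safe are stripped out and never fired by this flush.
  void handle_aio_flush_complete(const std::list<Ctx>& on_safe_ctxs) {
    for (Ctx ctx : on_safe_ctxs) {
      if (safe_contexts.erase(ctx) > 0) {
        completed.push_back(ctx);
      }
    }
  }

  // Step 4: the modify completes; its context now waits for a later flush.
  void handle_aio_modify_complete(Ctx on_safe) {
    safe_contexts.insert(on_safe);
  }
};

// Replays the reported ordering: the flush completes before the modify.
ReplayModel run_reported_interleaving() {
  ReplayModel replay;
  replay.create_aio_modify_completion(1);                 // step 1
  std::list<int> ctxs = replay.create_aio_flush_completion();  // step 2
  replay.handle_aio_flush_complete(ctxs);                 // step 3
  replay.handle_aio_modify_complete(1);                   // step 4
  return replay;
}
```

In this model, on_safe(1) is never fired and stays in `safe_contexts`, because no remaining flush holds a reference to it.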

The AioCompletion needs to be "started" before the librbd::journal::Replay lock is released within create_aio_modify_completion. This would prevent an out-of-band flush event from racing with the start of the op and corrupting the internal state.
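
One way to picture the proposed ordering is the sketch below. It assumes, purely for illustration, that a flush cannot complete until every op that was started before it has completed; `OrderedReplayModel`, `started_ops`, `try_handle_aio_flush_complete`, and `run_ordered_interleaving` are hypothetical names, and the model is single-threaded, with comments marking where the Replay lock would be held.

```cpp
#include <cassert>
#include <list>
#include <set>
#include <vector>

// Hypothetical model: the op is marked "started" in the same critical
// section that registers its on_safe context. The rule that a flush waits
// for previously started ops is an assumption of this sketch.
struct OrderedReplayModel {
  using Ctx = int;

  struct Flush {
    std::list<Ctx> on_safe_ctxs;  // contexts taken over by this flush
    std::set<Ctx> waits_for;      // ops started before the flush was created
  };

  std::list<Ctx> unsafe_contexts;
  std::set<Ctx> safe_contexts;
  std::set<Ctx> started_ops;
  std::vector<Ctx> completed;
  bool flush_deferred_once = false;

  void create_aio_modify_completion(Ctx on_safe) {
    // (lock held) register the context ...
    unsafe_contexts.push_back(on_safe);
    // ... and start the op before the lock would be released.
    started_ops.insert(on_safe);
  }

  Flush create_aio_flush_completion() {
    // (lock held) the flush sees the op as already started.
    Flush flush;
    flush.on_safe_ctxs.swap(unsafe_contexts);
    flush.waits_for = started_ops;
    return flush;
  }

  void handle_aio_modify_complete(Ctx on_safe) {
    started_ops.erase(on_safe);
    safe_contexts.insert(on_safe);
  }

  // Returns false while an op started before the flush is still in flight.
  bool try_handle_aio_flush_complete(const Flush& flush) {
    for (Ctx op : flush.waits_for) {
      if (started_ops.count(op) > 0) {
        return false;
      }
    }
    for (Ctx ctx : flush.on_safe_ctxs) {
      if (safe_contexts.erase(ctx) > 0) {
        completed.push_back(ctx);
      }
    }
    return true;
  }
};

// Same event order as the report, but the early flush completion is deferred.
OrderedReplayModel run_ordered_interleaving() {
  OrderedReplayModel replay;
  replay.create_aio_modify_completion(1);
  OrderedReplayModel::Flush flush = replay.create_aio_flush_completion();
  bool early = replay.try_handle_aio_flush_complete(flush);
  replay.handle_aio_modify_complete(1);
  bool late = replay.try_handle_aio_flush_complete(flush);
  replay.flush_deferred_once = !early && late;
  return replay;
}
```

Under that assumption, the flush is held back until the modify completes, so on_safe(1) is fired and nothing is left behind in `safe_contexts`.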

This issue would only affect the unit tests and the asok flush command.


Related issues

Copied to rbd - Backport #17088: jewel: Sporadic failure in TestImageReplayer.StartReplayAndWrite Resolved

History

#2 Updated by Jason Dillaman almost 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman

#4 Updated by Jason Dillaman almost 3 years ago

  • Status changed from In Progress to Need Review
  • Backport set to jewel

#5 Updated by Mykola Golub almost 3 years ago

  • Status changed from Need Review to Pending Backport

#6 Updated by Loic Dachary almost 3 years ago

  • Copied to Backport #17088: jewel: Sporadic failure in TestImageReplayer.StartReplayAndWrite added

#7 Updated by Loic Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved
