Bug #57396


[rbd-mirror] local image replayer stops and fails to update status

Added by Josef Johansson over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: pacific,quincy
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rbd
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm trying out two-way mirroring.

Everything seems to be running fine, except that sometimes the replay stops at the primary clusters.

This means that VMs will not start properly until all journal entries are replayed, but also that the journal grows over time.

I am trying to find out why this occurs, and where to look for more information.

I am currently using `rbd --pool <pool> --image <image> journal status` to see
if the clients are in sync or not.

Example output when things went sideways

minimum_set: 0
active_set: 2
registered clients:
[id=, commit_position=[positions=[[object_number=0, tag_tid=1, entry_tid=4592],
[object_number=3, tag_tid=1, entry_tid=4591], [object_number=2, tag_tid=1,
entry_tid=4590], [object_number=1, tag_tid=1, entry_tid=4589]]], state=connected]
[id=bdde9b90-df26-4e3d-84b3-66605dc45608, commit_position=[positions=[[object_number=5,
tag_tid=1, entry_tid=19913], [object_number=4, tag_tid=1, entry_tid=19912],
[object_number=7, tag_tid=1, entry_tid=19911], [object_number=6, tag_tid=1,
entry_tid=19910]]], state=disconnected]


Files

issue.log (57.5 KB) issue.log Josef Johansson, 12/21/2022 04:05 PM
issue.sh (6.04 KB) issue.sh Josef Johansson, 12/21/2022 04:05 PM

Related issues 2 (0 open, 2 closed)

Copied to rbd - Backport #58765: quincy: [rbd-mirror] local image replayer stops and fails to update status (Resolved, Ilya Dryomov)
Copied to rbd - Backport #58766: pacific: [rbd-mirror] local image replayer stops and fails to update status (Resolved, Ilya Dryomov)
Actions #1

Updated by Josef Johansson over 1 year ago

As it turns out this problem is not that easy.
1) Running CloudLinux 3.10.0-962.3.2.lve1.5.73.el7.x86_64 with virtio-scsi-single disks and discard does not work. However, switching to virtio-blk, while keeping discard enabled all the way to the mounts, works.
2) Running CloudLinux 4.18.0-372.26.1.lve.1.el8.x86_64 with virtio-scsi-single disks and discard does not work, and neither does virtio-blk. Turning off discard in the guest solves both cases, though.
3) Running Debian 5.10.0-19-amd64 with virtio-scsi-single disks and discard does not work either, in some cases, as a reference.

It's clear that something goes wrong between a discard and the commit of a journal entry.

I found out how to make the qemu process log to a separate file, but even with all the debug flags I could set, there was no error, just that it stopped logging that the commit_position was updated. The logs were quite huge; any hints on what would be interesting?

Actions #2

Updated by Josef Johansson over 1 year ago

$ rpm -q kernel-3.10.0-962.3.2.lve1.5.73.el7.x86_64 --changelog | grep 'virtio_blk: add discard'

$ rpm -q kernel-4.18.0-372.26.1.lve.1.el8.x86_64 --changelog | grep 'virtio_blk: add discard'
[block] virtio_blk: add discard and write zeroes support (Stefano Garzarella) [1692939]

Which makes a ton of sense. When discards are issued, librbd in the qemu-kvm process stops committing to disk.

When running 3.10 with virtio-scsi-single there is discard support, which then also makes the process stop committing to disk.

Actions #3

Updated by Josef Johansson over 1 year ago

This will trigger the problem.

`echo 1>a;sync;rm a;sync;fstrim --all`

Actions #4

Updated by Josef Johansson over 1 year ago

I managed to reproduce it with rbd-nbd as well.

A simple `echo 1>a;sync;rm a;sync` will not break it, but `fstrim /mnt` will.

Trying to parse the logs:

It prints `committed tid=`, but not `updated commit position:`.
This means that either `!m_pending_commit_tids.empty()` is false, or `!commit_entry.committed` is true for the oldest pending entry, so the loop below breaks before the commit position is updated.

```
void JournalMetadata::committed(uint64_t commit_tid,
                                const CreateContext &create_context) {
  ldout(m_cct, 20) << "committed tid=" << commit_tid << dendl;

  ObjectSetPosition commit_position;
  Context *stale_ctx = nullptr;
  {
    std::scoped_lock locker{*m_timer_lock, m_lock};
    ceph_assert(commit_tid > m_commit_position_tid);

    if (!m_commit_position.object_positions.empty()) {
      // in-flight commit position update
      commit_position = m_commit_position;
    } else {
      // safe commit position
      commit_position = m_client.commit_position;
    }

    CommitTids::iterator it = m_pending_commit_tids.find(commit_tid);
    ceph_assert(it != m_pending_commit_tids.end());

    CommitEntry &commit_entry = it->second;
    commit_entry.committed = true;

    bool update_commit_position = false;
    while (!m_pending_commit_tids.empty()) {
      CommitTids::iterator it = m_pending_commit_tids.begin();
      CommitEntry &commit_entry = it->second;
      if (!commit_entry.committed) {
        break;
      }

      commit_position.object_positions.emplace_front(
        commit_entry.object_num, commit_entry.tag_tid,
        commit_entry.entry_tid);
      m_pending_commit_tids.erase(it);
      update_commit_position = true;
    }

    if (!update_commit_position) {
      return;
    }

    // prune the position to have one position per splay offset
    std::set<uint8_t> in_use_splay_offsets;
    ObjectPositions::iterator ob_it = commit_position.object_positions.begin();
    while (ob_it != commit_position.object_positions.end()) {
      uint8_t splay_offset = ob_it->object_number % m_splay_width;
      if (!in_use_splay_offsets.insert(splay_offset).second) {
        ob_it = commit_position.object_positions.erase(ob_it);
      } else {
        ++ob_it;
      }
    }

    stale_ctx = m_commit_position_ctx;
    m_commit_position_ctx = create_context();
    m_commit_position = commit_position;
    m_commit_position_tid = commit_tid;

    ldout(m_cct, 20) << "updated commit position: " << commit_position << ", "
                     << "on_safe=" << m_commit_position_ctx << dendl;
    schedule_commit_task();
  }

  if (stale_ctx != nullptr) {
    ldout(m_cct, 20) << "canceling stale commit: on_safe=" << stale_ctx
                     << dendl;
    stale_ctx->complete(-ESTALE);
  }
}
```
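
To make the failure mode concrete, here is a minimal standalone sketch (not Ceph code, just an illustration of the ordering rule the loop above enforces): the commit position only advances across a contiguous run of committed tids, so if one tid is never marked committed (standing in for the discard that never completes), every later commit stays queued and the "updated commit position" branch is never reached.

```
#include <cstdint>
#include <iostream>
#include <map>

int main() {
  std::map<uint64_t, bool> pending_commit_tids;  // tid -> committed flag
  for (uint64_t tid = 1; tid <= 5; ++tid) {
    pending_commit_tids[tid] = false;
  }

  auto committed = [&](uint64_t tid) {
    pending_commit_tids[tid] = true;

    bool update_commit_position = false;
    uint64_t commit_position = 0;
    while (!pending_commit_tids.empty()) {
      auto it = pending_commit_tids.begin();
      if (!it->second) {
        break;  // oldest pending tid is not committed yet, everything later is stuck
      }
      commit_position = it->first;
      pending_commit_tids.erase(it);
      update_commit_position = true;
    }

    if (update_commit_position) {
      std::cout << "updated commit position: " << commit_position << "\n";
    } else {
      std::cout << "committed tid=" << tid
                << " but commit position not updated (blocked)\n";
    }
  };

  committed(1);  // advances the position to tid 1
  committed(3);  // blocked: tid 2 (the stuck discard) is still pending
  committed(4);  // still blocked
  committed(5);  // still blocked, mirroring the stalled journal replay
  return 0;
}
```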
Actions #5

Updated by Josef Johansson over 1 year ago

AFAIK this case is not covered in the test suite.

https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/test/librbd/journal/test_Replay.cc

Since this happens when JournalPlayer is in STATE_READY (from handler_flushing_ready), it seems to be in another state loop (since JournalPlayer handles AioDiscard perfectly fine otherwise). I think the test cases need to check that the events run fine in all states that handle replay.

Not sure how to tag people or get any feedback on issues here, but hopefully someone hears the echoes :)

To conclude, whenever any AioDiscard happens while JournalPlayer is in STATE_READY, nothing will be replayed from the journal.

This bug is twofold: first, there is the bug that stops the process; second, there is no heartbeat that picks up that the player is not committing to disk.

Actions #6

Updated by Josef Johansson over 1 year ago

I have attached a test along with log output of it stopping when running fstrim.

I made it so that it creates an rbd-replay binary, which is easier to replay than using nbd.

Actions #7

Updated by Josef Johansson over 1 year ago

For some reason, the following will cause the journal to never be committed to the rbd image.

Given

rbd-nbd --device ${device} map ${image} &

blkdiscard  -o $((1996*4096+1024)) -l $((4096*104)) /dev/nbd0

Now nothing will be committed to the rbd image (the "updated commit position" message in JournalMetadata never appears).

And

blkdiscard  -o $((1996*4096)) -l $((4096*104)) /dev/nbd0

works properly
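
For reference, here is a small standalone sketch of the alignment difference between the two commands; the 4096-byte granularity is an assumption for illustration, not necessarily the value librbd uses. The failing discard starts 1024 bytes into a block, so only a shrunken, aligned sub-range could be dropped without zeroing the partial edges, which is presumably why the `rbd skip partial discard` setting below makes a difference.

```
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t granularity = 4096;  // assumed discard granularity, for illustration

  auto describe = [&](uint64_t offset, uint64_t length) {
    uint64_t aligned_start = (offset + granularity - 1) / granularity * granularity;
    uint64_t aligned_end = (offset + length) / granularity * granularity;
    std::cout << "discard [" << offset << ", " << offset + length << "): "
              << (offset % granularity == 0 ? "aligned" : "unaligned") << " start, "
              << "aligned sub-range [" << aligned_start << ", " << aligned_end << ")\n";
  };

  describe(1996 * 4096 + 1024, 4096 * 104);  // the failing blkdiscard
  describe(1996 * 4096,        4096 * 104);  // the working blkdiscard
  return 0;
}
```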

Setting

rbd skip partial discard = false

makes the problem go away.

How do I make the test in src/test/librbd/journal/test_Replay.cc actually test this?
I would like to test both cases; my problem is that open_image does not seem to enable handle_flushing_replay.
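
What I have in mind is roughly the following, modeled loosely on the existing AioDiscardEvent test in test_Replay.cc. The fixture helpers (open_image, when_acquired_lock, inject_into_journal, get_journal_commit_position, close_image), the discard_granularity_bytes field and the AioDiscardEvent constructor arguments are assumptions on my part and may differ between releases.

```
TEST_F(TestJournalReplay, UnalignedAioDiscardEvent) {
  REQUIRE_FEATURE(RBD_FEATURE_JOURNALING);

  // Inject an unaligned discard into the journal, mirroring the failing
  // blkdiscard (offset 1996*4096+1024), then re-open the image so the
  // event gets replayed and check that the commit position moves.
  librbd::ImageCtx *ictx;
  ASSERT_EQ(0, open_image(m_image_name, &ictx));
  ASSERT_EQ(0, when_acquired_lock(ictx));

  int64_t initial_tag;
  int64_t initial_entry;
  get_journal_commit_position(ictx, &initial_tag, &initial_entry);

  inject_into_journal(
    ictx, librbd::journal::AioDiscardEvent(1996 * 4096 + 1024, 4096 * 104,
                                           ictx->discard_granularity_bytes));
  close_image(ictx);

  // Re-opening the image replays the injected event.
  ASSERT_EQ(0, open_image(m_image_name, &ictx));
  ASSERT_EQ(0, when_acquired_lock(ictx));

  int64_t current_tag;
  int64_t current_entry;
  get_journal_commit_position(ictx, &current_tag, &current_entry);

  // The exact expected position depends on how many events were injected
  // and whether a new tag was allocated; the key point is that it must
  // have advanced rather than staying stuck at the initial value.
  ASSERT_TRUE(current_tag > initial_tag ||
              (current_tag == initial_tag && current_entry > initial_entry));
}
```

The aligned variant (offset 1996*4096) would be a second test with the same structure.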

Actions #8

Updated by Josef Johansson over 1 year ago

Pull request to fix the original problem: https://github.com/ceph/ceph/pull/49614

Actions #9

Updated by Ilya Dryomov over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49614
Actions #10

Updated by Ilya Dryomov over 1 year ago

  • Assignee set to Josef Johansson
  • Backport set to pacific,quincy
Actions #11

Updated by Ilya Dryomov about 1 year ago

  • Priority changed from Normal to High
Actions #12

Updated by Ilya Dryomov about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Backport Bot about 1 year ago

  • Copied to Backport #58765: quincy: [rbd-mirror] local image replayer stops and fails to update status added
Actions #14

Updated by Backport Bot about 1 year ago

  • Copied to Backport #58766: pacific: [rbd-mirror] local image replayer stops and fails to update status added
Actions #15

Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed
Actions #16

Updated by Backport Bot about 1 year ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
