Bug #62869 (open): rbd-mirror: non-primary images not deleted when mirror is disabled too quickly

Added by Daniel R 8 months ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When mirroring is disabled on an image too quickly after the primary image is promoted, the image on the previous (source) cluster enters an up+error state.

On Pacific, the message is "error bootstrapping replay"
On Nautilus, the message is "remote image does not exist"

Here is how to re-create the bug (after peering two clusters):
```
1. Enable mirroring on an image (journal mode)
2. Wait for replaying (also wait for entries_behind_primary = 0)
3. Demote the image on the source cluster
4. A few seconds later, promote the image on the new primary cluster
5. Quickly, disable mirroring on the image from the new primary cluster
```
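
For illustration, these steps could be scripted roughly as follows. This is a minimal sketch (not the attached cantry.py reproducer): the cluster names site-a and site-b, and the pool/image spec, are placeholders that would need to match an actual peered setup.

```python
import subprocess
import time

POOL_IMAGE = "rbd/test-image"  # placeholder pool/image

def rbd(cluster, *args):
    # Run an rbd command against one of the two peered clusters.
    cmd = ["rbd", "--cluster", cluster, *args]
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout

# 1. Enable journal-based mirroring on the source cluster.
rbd("site-a", "mirror", "image", "enable", POOL_IMAGE, "journal")

# 2. Wait here for the peer to report up+replaying and
#    entries_behind_primary = 0 (omitted for brevity).

# 3. Demote the image on the source cluster.
rbd("site-a", "mirror", "image", "demote", POOL_IMAGE)

# 4. Promote it on the other cluster a few seconds later.
time.sleep(5)
rbd("site-b", "mirror", "image", "promote", POOL_IMAGE)

# 5. Disable mirroring right away; doing this too quickly leaves the
#    image on site-a stuck in up+error.
rbd("site-b", "mirror", "image", "disable", POOL_IMAGE)
```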

On the source cluster, the image should now be in an up+error state. The only way to clean it up is to force-disable mirroring and then `rbd rm`.
The issue can be avoided by waiting for the source cluster to enter an `up+replaying` state before disabling mirroring, which can sometimes take 10 or 20+ seconds. However, even then it is still possible to end up in the error state if the mirror is disabled by extremely fast tooling.
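
A sketch of such a wait is below; it polls `rbd mirror image status --format json` and assumes the combined state (e.g. "up+replaying") is exposed in a top-level "state" field, which may differ between releases.

```python
import json
import subprocess
import time

def mirror_state(cluster, image):
    # Assumes the JSON status output carries the combined state
    # (e.g. "up+replaying") in a top-level "state" field.
    out = subprocess.run(
        ["rbd", "--cluster", cluster, "mirror", "image", "status",
         image, "--format", "json"],
        check=True, capture_output=True, text=True).stdout
    return json.loads(out).get("state", "")

def wait_for_state(cluster, image, wanted="up+replaying", timeout=60):
    # Poll until the image reports the wanted state or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if mirror_state(cluster, image) == wanted:
            return True
        time.sleep(1)
    return False
```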

Recent patches & backports to rbd-mirror (from 16.2.14) did not address this issue.


Files

cantry.py (5.13 KB) - Bug Reproducer Script (race condition) - Daniel R, 10/04/2023 10:43 PM
Actions #1

Updated by Ilya Dryomov 8 months ago

Hi Daniel,

I think this is pretty much expected, unfortunately. If you don't wait for the old primary to "learn" that it is now secondary (and that cleanup is therefore in order when the image on the new primary is deleted), no cleanup occurs.

If the image is still considered to be in a demoted state (i.e. if the promotion hasn't propagated yet), it doesn't get cleaned up because it continues to be eligible for a promotion of its own. It's basically the difference between the demoted and secondary (non-primary) states.

Actions #2

Updated by Joshua Baergen 8 months ago

Hi Ilya,

Is there some state that can be watched to make our automation more reliable during the cleanup phase? Up until Pacific it seemed that waiting for 'replaying' was sufficient, but as noted above that no longer appears to be enough.

Actions #3

Updated by Ilya Dryomov 8 months ago

Hi Joshua,

I'm not aware of any changes in Pacific that would have (intentionally) taken that away. If the new secondary is reporting up+replaying, I would expect it to be cleaned up when the new primary is deleted.

Could you please post a reproducer (e.g. a script assuming site-a.conf and site-b.conf config files and using "rbd --cluster site-a ..." and "rbd --cluster site-b ..." style commands) for what used to be reliable in Octopus and broke in Pacific?

Actions #4

Updated by Joshua Baergen 8 months ago

Hi Ilya,

We can look into a reproducer at some point. I was looking at our existing automation and I noticed that while we are waiting for the "replaying" status on the now-secondary mirror, we aren't checking the "up" state before proceeding with teardown. Is it possible for a mirror to be "replaying" but not yet "up" after the primary transition, and could that be a source of the issues we're now seeing under Pacific?

Actions #5

Updated by Ilya Dryomov 8 months ago

> Is it possible for a mirror to be "replaying" but not yet "up" after the primary transition, and could that be a source of the issues we're now seeing under Pacific?

I don't think so, since you seem to be doing an orderly promotion and no issues with the rbd-mirror daemon itself were mentioned.

However, keep in mind that if the first part is "down" (i.e. not "up"), then none of the states (the second part) can be trusted, because the rbd-mirror daemon -- the entity responsible for managing and updating these states -- isn't functioning properly.
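
In other words, automation probably wants to gate on both halves of the reported status rather than on the replay state alone. A minimal sketch, assuming the status string has the "up+replaying" / "down+..." shape shown above:

```python
def daemon_up_and_replaying(status):
    # Split e.g. "up+replaying" into the daemon health ("up"/"down")
    # and the mirroring state, and require both before tearing down.
    daemon, _, state = status.partition("+")
    return daemon == "up" and state == "replaying"
```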

Actions #6

Updated by Daniel R 8 months ago

Hi Ilya,

I am attaching a reproducer script that exposes the race condition we've been seeing. It's in Python and uses librados.

Unlike the original bug description, this script waits for the secondary to report up+replaying before disabling the mirror. When done quickly enough and with minimal network latency, this still results in the secondary entering an error state even though it's not supposed to.

IMO, I wouldn't describe this bug as a regression or claim it was reliable in Octopus. IIRC, I was able to re-create this race on Nautilus as well at one point.

Let me know if you have any questions, and thank you for your help thus far! :)
