Bug #51031 (Closed)
rbd-mirror: metadata of mirrored image are not properly cleaned up after image deletion
Description
Hello,
I have seen an issue where some "ghost" images remain when I try to remove an RBD image from the cluster in a replication scenario.
When I delete an image from the main cluster, the image is deleted on both clusters, but I start to see the following logs from the rbd-mirror daemon on the remote cluster:
2021-05-31T17:26:57.106+0200 7f194a535700 0 rbd::mirror::ImageReplayer: 0x55b3e1a3ab60 [13/343206b9-5618-41f5-b394-c627b4d2d920] handle_shut_down: remote image no longer exists: scheduling deletion
2021-05-31T17:27:00.762+0200 7f194a535700 0 rbd::mirror::ImageReplayer: 0x55b3e1a3ab60 [13/343206b9-5618-41f5-b394-c627b4d2d920] handle_shut_down: mirror image no longer exists
2021-05-31T17:27:00.762+0200 7f194a535700 0 rbd::mirror::ImageReplayer: 0x55b3e1a3ab60 [13/343206b9-5618-41f5-b394-c627b4d2d920] handle_shut_down: mirror image no longer exists
2021-05-31T17:27:00.763+0200 7f193ed1e700 0 rbd::mirror::ImageReplayer: 0x55b3e1a3ab60 [13/343206b9-5618-41f5-b394-c627b4d2d920] handle_shut_down: mirror image no longer exists
2021-05-31T17:27:00.763+0200 7f1937d10700 0 rbd::mirror::ImageReplayer: 0x55b3e1a3ab60 [13/343206b9-5618-41f5-b394-c627b4d2d920] handle_shut_down: mirror image no longer exists
I can also see an "unknown" image (my test images are named testX, which is clearly not the case here) in the rbd mirror status reported by the daemon:
"image_replayers": [
{
"name": "test2/343206b9-5618-41f5-b394-c627b4d2d920",
"state": "Stopped"
}
],
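For context, this output comes from the rbd-mirror daemon admin socket; a sketch of how to query it, assuming a socket path of /var/run/ceph/ceph-client.rbd-mirror.a.asok (the actual path depends on the deployment):

# query the rbd-mirror daemon status over its admin socket
ceph --admin-daemon /var/run/ceph/ceph-client.rbd-mirror.a.asok rbd mirror status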
Some OMAP keys (status_global_* on rbd_mirroring) also remain on the main cluster, while the ones on the remote cluster are immediately cleaned up. After a minute or so the OMAP keys start to reappear on the remote cluster as well, with an error in them ("error bootstrapping replay"). If I remove the OMAP keys by hand and restart the rbd-mirror daemons, an OMAP key reappears on both clusters.
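As a sketch, the leftover keys can be inspected and removed by hand with the rados CLI; the pool name test2 is taken from the status output above, and the key suffix, assumed here to be the image's global id, may differ:

# list all OMAP keys on the mirroring metadata object
rados -p test2 listomapkeys rbd_mirroring

# dump the value of one leftover status key
rados -p test2 getomapval rbd_mirroring status_global_343206b9-5618-41f5-b394-c627b4d2d920

# remove it by hand (as noted above, the daemons may recreate it)
rados -p test2 rmomapkey rbd_mirroring status_global_343206b9-5618-41f5-b394-c627b4d2d920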
Steps to reproduce (a command sketch follows the list):
- Have two clusters and configure rbd mirroring between them
- Create a pool with mirroring enabled (with the image mode in my case, but it probably doesn't matter)
- Create an RBD image and enable mirroring with the journal or snapshot mode
- Confirm that the image is replicated on your other peer
- Delete the image on your first cluster
- Confirm the deletion on both sides with rbd ls
- Confirm that there is a ghost image by checking the rbd-mirror log, the OMAP values or the rbd-mirror daemon socket
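A minimal command sketch of these steps, assuming a pool named test2 (from the status output above), a hypothetical image name test1, and a peer bootstrap that is already configured:

# on the main cluster: enable per-image mirroring on the pool
rbd mirror pool enable test2 image

# create an image and enable journal-based mirroring on it
rbd create test2/test1 --size 1G
rbd feature enable test2/test1 journaling
rbd mirror image enable test2/test1 journal

# on the remote cluster: confirm the image is replicated
rbd ls test2
rbd mirror image status test2/test1

# on the main cluster: delete the image, then confirm on both sides
rbd rm test2/test1
rbd ls test2

# look for the ghost image in the mirroring metadata
rados -p test2 listomapkeys rbd_mirroring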
I attached a file describing the remote_status_global_* keys for the whole scenario presented here (with a different image from the one shown in the logs posted above).
Updated by Arthur Outhenin-Chalandre almost 3 years ago
Hello,
I have investigated this issue a bit lately, and from what I see, the MirroringWatcher never picks up the locally removed image, and as a result the image_map_ key is never removed. I fixed this by calling ImageRemoveRequest instead of invoking mirror_image_remove directly in the following PR: https://github.com/ceph/ceph/pull/41696.
It is still marked as WIP because this only solves the cleanup of the image_map_ OMAP keys; there are still some remote_status_global_ OMAP keys hanging around. I will check those next week.
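As a sketch, the two kinds of leftover keys can be checked separately with rados when verifying the fix; the pool name test2 is from this report, and the object holding the image_map_ keys is assumed to be rbd_mirror_leader:

# image_map_ keys (assumed to live on the rbd_mirror_leader object)
rados -p test2 listomapkeys rbd_mirror_leader | grep '^image_map_'

# remote_status_global_ keys on the rbd_mirroring object
rados -p test2 listomapkeys rbd_mirroring | grep '^remote_status_global_'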
Updated by Arthur Outhenin-Chalandre over 2 years ago
- Status changed from New to Fix Under Review
- Assignee changed from Deepika Upadhyay to Arthur Outhenin-Chalandre
- Backport set to pacific, octopus
- Pull request ID set to 41696
Updated by Mykola Golub over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 2 years ago
- Copied to Backport #53031: octopus: rbd-mirror: metadata of mirrored image are not properly cleaned up after image deletion added
Updated by Backport Bot over 2 years ago
- Copied to Backport #53032: pacific: rbd-mirror: metadata of mirrored image are not properly cleaned up after image deletion added
Updated by Loïc Dachary about 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".