Potential race when removing two-way mirroring image
Let's we have cluster1 and cluster2 and run rbd-mirror daemons in both clusters, so cluster1 rbd-mirror replays images promoted in cluster2, and cluster2 rbd-mirror replays images promoted in cluster1.
1) Create a mirroring (promoted) image in cluster2
2) Wait for replay is started by cluster1 rbd-mirror
3) Remove the image in cluster1
Cluster1 rbd-mirror detects the remote image is removed and initiate removal of the local image. Before removing, mirroring is disabled for the image. When disabling mirroring for a demoted image, it is promoted first to avoid the journal left in an inconsistent ownership state (see #17588). At that moment (after the image is promoted but not removed yet) the cluster1 rbd-mirror has a chance to start replay for the promoted image.
The issue is some times reproduced when running qa/workunits/rbd/rbd_mirror_stress.sh: in such a case it fails in "delete images during bootstrap" test when creating an image and it fails with "already exists" error, because the previously deleted image with this name was recreated by the rbd-mirror:
+ rbd --cluster cluster2 -p mirror create --size 512M --image-feature layering,exclusive-lock,journaling image_6 2017-01-07 06:36:06.541592 7f8d9afce000 -1 librbd: rbd image image_6 already exists rbd: create error: (17) File exists
Some log entries for cluster1 rbd-mirror image deleter triggered after the image was deleted in cluster2:
2017-01-07 06:29:04.166474 7f977d7fa700 20 rbd::mirror::ImageDeleter: 0x7f97aff96410 process_image_delete: processing deletion of snapshot image_6@snap 2017-01-07 06:29:05.820029 7f977d7fa700 10 rbd::mirror::ImageDeleter: 0x7f97aff96410 process_image_delete: snapshot image_6@snap was deleted 2017-01-07 06:29:06.030410 7f977d7fa700 20 librbd: remove 0x7f977d7f7360 101d628c895d 2017-01-07 06:29:06.170824 7f977d7fa700 10 librbd::mirror::DisableRequest: 0x7f977001f910 send_get_mirror_image 2017-01-07 06:29:06.188326 7f9761ffb700 10 librbd::mirror::DisableRequest: 0x7f977001f910 send_promote_image 2017-01-07 06:29:06.254878 7f96f1ffb700 10 librbd::mirror::DisableRequest: 0x7f977001f910 handle_promote_image: r=0 2017-01-07 06:29:07.011857 7f977d7fa700 10 rbd::mirror::ImageDeleter: 0x7f97aff96410 process_image_delete: Successfully deleted image: image_6
And at the same time cluster2 rbd-mirror bootstrap is started and succeeds:
2017-01-07 06:29:06.152189 7f47727fc700 20 rbd::mirror::Replayer: 0x7f4816a49a90 set_sources: starting image replayer for d906b6e8-0fb0-440b-aef4-4a24319cadb3 2017-01-07 06:29:06.152189 7f47727fc700 20 rbd::mirror::Replayer: 0x7f4816a49a90 start_image_replayer: global_image_id=d906b6e8-0fb0-440b-aef4-4a24319cadb3 2017-01-07 06:29:06.271231 7f4771ffb700 20 rbd::mirror::image_replayer::BootstrapRequest: 0x7f47f4200600 handle_open_remote_image: r=0 2017-01-07 06:29:06.273750 7f4771ffb700 20 rbd::mirror::image_replayer::BootstrapRequest: 0x7f47f4200600 open_local_image 2017-01-07 06:29:06.274616 7f47fa7fc700 20 rbd::mirror::image_replayer::BootstrapRequest: 0x7f47f4200600 handle_open_local_image: r=-2 2017-01-07 06:29:06.274617 7f47fa7fc700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x7f47f4200600 handle_open_local_image: local image missing 2017-01-07 06:29:06.274618 7f47fa7fc700 20 rbd::mirror::image_replayer::BootstrapRequest: 0x7f47f4200600 create_local_image 2017-01-07 06:29:06.274639 7f47fa7fc700 20 librbd::image::CreateRequest: name=image_6, id=1031436c6125, size=134217728, features=69, order=22, stripe_unit=4194304, stripe_count=1, journal_order=24, journal_splay_width=4, journal_pool=, data_pool=
Note, the image promotion time 06:29:06.254878 is less than open_remote_image time 06:29:06.271231 (when the bootstrap checks if the image is promoted).