Bug #53477
open[rbd-mirror] mirror demon crash
0%
Description
looks like sepia lab hiccup more than real issue, but adding just in case someone also hits this
2021-12-02T15:15:22.521 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ '[' '!' -e /var/run/ceph/rbd-mirror.cluster1-client.mirror.0.pid ']' 2021-12-02T15:15:22.521 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ echo 'cluster1 rbd-mirror not running or unknown' '(/var/run/ceph/rbd-mirror.cluster1-client.mirror.0.pid not exist)' 2021-12-02T15:15:22.521 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stdout:cluster1 rbd-mirror not running or unknown (/var/run/ceph/rbd-mirror.cluster1-client.mirror.0.pid not exist) 2021-12-02T15:15:22.524 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ continue 2021-12-02T15:15:22.524 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ for instance in `seq 0 ${LAST_MIRROR_INSTANCE}` 2021-12-02T15:15:22.524 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ daemon_pid_file cluster1:1 2021-12-02T15:15:22.524 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ local cluster=cluster1:1 2021-12-02T15:15:22.525 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ local instance 2021-12-02T15:15:22.525 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ set_cluster_instance cluster1:1 cluster instance 2021-12-02T15:15:22.525 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ local val=cluster1:1 2021-12-02T15:15:22.525 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ local cluster_var_name=cluster 2021-12-02T15:15:22.525 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ local instance_var_name=instance 2021-12-02T15:15:22.526 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ cluster=cluster1 2021-12-02T15:15:22.526 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ instance=1 2021-12-02T15:15:22.526 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ '[' 1 = cluster1:1 ']' 2021-12-02T15:15:22.526 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ eval cluster=cluster1 2021-12-02T15:15:22.527 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+++ cluster=cluster1 2021-12-02T15:15:22.527 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ eval instance=1 2021-12-02T15:15:22.527 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+++ instance=1 2021-12-02T15:15:22.527 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+++ ceph-conf --cluster cluster1 --name client.mirror.1 'pid file' 2021-12-02T15:15:22.527 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:++ echo /var/run/ceph/rbd-mirror.cluster1-client.mirror.1.pid 2021-12-02T15:15:22.528 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ local pid_file=/var/run/ceph/rbd-mirror.cluster1-client.mirror.1.pid 2021-12-02T15:15:22.528 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ '[' '!' -e /var/run/ceph/rbd-mirror.cluster1-client.mirror.1.pid ']' 2021-12-02T15:15:22.528 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ echo 'cluster1 rbd-mirror not running or unknown' '(/var/run/ceph/rbd-mirror.cluster1-client.mirror.1.pid not exist)' 2021-12-02T15:15:22.528 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stdout:cluster1 rbd-mirror not running or unknown (/var/run/ceph/rbd-mirror.cluster1-client.mirror.1.pid not exist) 2021-12-02T15:15:22.531 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ continue
Updated by Mykola Golub over 2 years ago
The messages reported in the description is not an issue actually. This is generated on cleanup by "status" command, after the test already failed, when it tires to collect all information possible, and errors like this are expected. There was no rbd-mirror crash actually.
You should look for the real failure before "cleanup 1" line (or some other non-zero code), and in this case it is:
2021-12-02T15:12:56.304 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.296+0000 7fc91d3a5700 -1 rbd::mirror::image_replayer::journal::PrepareReplayRequest: 0x56463a7b63c0 send: split-brain detected: local_image_id=11c951359ef6, registered local_image_id=11fc1fd3afc0 2021-12-02T15:12:56.304 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.296+0000 7fc91d3a5700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x56463db7f860 handle_prepare_replay: failed to prepare local replay: (17) File exists 2021-12-02T15:12:56.314 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.306+0000 7fc9233b1700 -1 rbd::mirror::ImageReplayer: 0x56463761cc80 [3/aa8f8f8e-730b-4750-8e87-b78d71ba23ce] operator(): start failed: (17) File exists 2021-12-02T15:12:56.331 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.323+0000 7fc91d3a5700 -1 rbd::mirror::image_replayer::journal::PrepareReplayRequest: 0x56463a7b63c0 send: split-brain detected: local_image_id=10c8afcf8425, registered local_image_id=10f2fc17b45d 2021-12-02T15:12:56.331 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.323+0000 7fc91d3a5700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x56463a6d2ea0 handle_prepare_replay: failed to prepare local replay: (17) File exists 2021-12-02T15:12:56.364 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.356+0000 7fc923bb2700 -1 rbd::mirror::ImageReplayer: 0x56463cb39400 [3/22d9c1dc-d4e2-435c-936c-0b99d987564d] operator(): start failed: (17) File exists 2021-12-02T15:12:56.398 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.389+0000 7fc8f5b56700 -1 rbd::mirror::image_replayer::CreateImageRequest: 0x56463fa285a0 handle_clone_image: failed to clone image mirror/10cafcd5cf14 to image_cluster1.client.mirror.3-clone8 2021-12-02T15:12:56.421 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.413+0000 7fc91d3a5700 -1 rbd::mirror::image_replayer::journal::PrepareReplayRequest: 0x56463a7b63c0 send: split-brain detected: local_image_id=10c8b0d9d29e, registered local_image_id=10f291e7b59d 2021-12-02T15:12:56.421 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.413+0000 7fc91d3a5700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x564639b589c0 handle_prepare_replay: failed to prepare local replay: (17) File exists 2021-12-02T15:12:56.428 INFO:tasks.rbd_mirror.cluster2.client.mirror.1.smithi099.stderr:2021-12-02T15:12:56.420+0000 7fc9233b1700 -1 rbd::mirror::ImageReplayer: 0x56463761db80 [3/b4388d95-84a1-4178-80e3-1950717ad031] operator(): start failed: (17) File exists 2021-12-02T15:12:57.284 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ rbd --cluster cluster2 info mirror/image_cluster1.client.mirror.0-clone21@78fa04b2-94eb-4362-9340-4bb0af92de96 2021-12-02T15:12:57.349 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:error setting snapshot context: (2) No such file or directory 2021-12-02T15:12:57.354 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ continue 2021-12-02T15:12:57.354 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ return 1 2021-12-02T15:12:57.355 INFO:tasks.workunit.cluster1.client.mirror.smithi099.stderr:+ cleanup 1
So actual failure is `rbd --cluster cluster2 info mirror/image_cluster1.client.mirror.0-clone21@78fa04b2-94eb-4362-9340-4bb0af92de96` because the snapshot was not found. And it looks similar to [1]. Though in this case before this rbd error we see the rbd-mirror split-brain error and I am not sure I saw this in [1] case (and the logs seems already expired).