You are right. The second set of log lines is from when I tried to remove the image from my target and could not. I finally decided to remove a lock with 'rbd lock rm', but now I am in a strange situation: the image is removed from the target, yet if I try to detach it inside gwcli it says the image is still mapped to the target. Closing gwcli and reopening it makes the image appear inside my iSCSI target again.
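For reference, the lock removal went roughly like this (rbd/myimage is a placeholder for my actual pool/image; the lock id and locker come from the 'lock ls' output):

    rbd lock ls rbd/myimage
    rbd lock rm rbd/myimage <lock-id> <locker>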
Management operations on the image are still hanging (e.g. feature enable/disable and image resizing).
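Concretely, commands like these just hang and never return (placeholder names again):

    rbd resize --size 2T rbd/myimage
    rbd feature disable rbd/myimage object-map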
Which logs should I look into and what should I search for?
'ceph -s' says HEALTH_OK and:
osd: 30 osds: 30 up (since 4d), 30 in (since 5w)
It seems some OSDs went down around the time my Windows server froze.
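To find these events I grepped the cluster log on the mon host, something like:

    grep -E 'wrongly marked me down|boot' /var/log/ceph/ceph.log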
Here is some of what turned up:
2020-09-18 00:23:23.935155 mon.ceph01 (mon.0) 8624 : cluster [INF] osd.1 [v2:10.18.22.81:6833/3392,v1:10.18.22.81:6836/3392] boot
2020-09-18 00:23:23.935214 mon.ceph01 (mon.0) 8625 : cluster [INF] osd.10 [v2:10.18.22.81:6812/3406,v1:10.18.22.81:6815/3406] boot
2020-09-18 00:23:23.935266 mon.ceph01 (mon.0) 8626 : cluster [DBG] osdmap e67839: 30 total, 29 up, 30 in
2020-09-18 00:23:24.799475 mon.ceph01 (mon.0) 8627 : cluster [INF] Health check cleared: SLOW_OPS (was: 5 slow ops, oldest one blocked for 30 sec, osd.8 has slow ops)
2020-09-18 00:23:24.820213 mon.ceph01 (mon.0) 8628 : cluster [DBG] osdmap e67840: 30 total, 29 up, 30 in
2020-09-18 00:23:25.846554 mon.ceph01 (mon.0) 8629 : cluster [DBG] osdmap e67841: 30 total, 29 up, 30 in
2020-09-18 00:23:17.952518 osd.23 (osd.23) 246 : cluster [DBG] 29.1e scrub starts
2020-09-18 00:23:26.852741 mon.ceph01 (mon.0) 8630 : cluster [DBG] mds.? [v2:10.18.22.82:6800/2678136627,v1:10.18.22.82:6801/2678136627] up:boot
2020-09-18 00:23:26.852834 mon.ceph01 (mon.0) 8631 : cluster [DBG] mds.? [v2:10.18.22.81:6838/613848769,v1:10.18.22.81:6839/613848769] up:boot
2020-09-18 00:23:26.852888 mon.ceph01 (mon.0) 8632 : cluster [DBG] fsmap 3 up:standby
2020-09-18 00:23:27.400352 mon.ceph01 (mon.0) 8633 : cluster [WRN] Health check failed: 8 slow ops, oldest one blocked for 47 sec, mon.ceph01 has slow ops (SLOW_OPS)
2020-09-18 00:23:27.415445 mon.ceph01 (mon.0) 8634 : cluster [DBG] osdmap e67842: 30 total, 29 up, 30 in
2020-09-18 00:23:18.950937 osd.5 (osd.5) 237 : cluster [WRN] Monitor daemon marked osd.5 down, but it is still running
2020-09-18 00:23:18.950961 osd.5 (osd.5) 238 : cluster [DBG] map e67837 wrongly marked me down at e67833
2020-09-18 00:23:20.401182 osd.1 (osd.1) 226 : cluster [WRN] Monitor daemon marked osd.1 down, but it is still running
2020-09-18 00:23:20.401197 osd.1 (osd.1) 227 : cluster [DBG] map e67837 wrongly marked me down at e67837
2020-09-18 00:23:28.853573 mon.ceph01 (mon.0) 8635 : cluster [WRN] Health check update: Degraded data redundancy: 220871/9417759 objects degraded (2.345%), 10 pgs degraded (PG_DEGRADED)
2020-09-18 00:23:26.934449 osd.16 (osd.16) 324 : cluster [DBG] 29.52 scrub starts
There was also some scrubbing going on at the time.
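Given the SLOW_OPS messages, I suppose I could also inspect in-flight ops directly on the affected daemons via the admin socket, e.g. on the node hosting osd.8:

    ceph daemon osd.8 dump_ops_in_flight
    ceph daemon osd.8 dump_blocked_ops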
My main concern here is how to get my image unstuck without losing data (which has already happened twice).
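Before trying anything more drastic I would at least like to see who still holds a watch on the image (placeholder name again):

    rbd status rbd/myimage    # lists remaining watchers (client address, cookie)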