rbd - Bug #47562: RBD image is stuck after iSCSI disconnection
https://tracker.ceph.com/issues/47562

Updated by Brad Hubbard <bhubbard@redhat.com> - 2020-09-21 23:47:10 UTC (journal_id=175512)
* Project changed from Ceph to rbd

Updated by Fabio Durieux Lopes - 2020-09-22 21:23:57 UTC (journal_id=175658)
Tried to remove the image from my iSCSI target again, and it seems a tcmu-runner thread is stuck on a lock:
<pre>
2020-09-22 18:13:49.469 195034 [DEBUG] handle_netlink:203: cmd 2. Got header version 2. Supported 2.
2020-09-22 18:13:49.508 195034 [DEBUG] tcmu_cancel_lock_thread:248 rbd/rbd.scm_03: waiting for lock thread to exit
</pre>
<pre>
[root@ceph01 ~]# systemctl restart tcmu-runner.service
[root@ceph01 ~]# vi /var/log/tcmu-runner.log
</pre>
<pre>
2020-09-22 18:13:49.469 195034 [DEBUG] handle_netlink:203: cmd 2. Got header version 2. Supported 2.
2020-09-22 18:13:49.508 195034 [DEBUG] tcmu_cancel_lock_thread:248 rbd/rbd.scm_03: waiting for lock thread to exit
2020-09-22 18:15:54.511 376724 [DEBUG] load_our_module:538: Module 'target_core_user' is already loaded
2020-09-22 18:15:54.512 376724 [DEBUG] main:1291: handler path: /usr/lib64/tcmu-runner
2020-09-22 18:15:54.512 376724 [DEBUG] main:1297: blocking netlink
2020-09-22 18:15:54.512 376724 [DEBUG] main:1300: blocking netlink done
2020-09-22 18:15:54.512 376724 [DEBUG] main:1308: resetting netlink
2020-09-22 18:15:54.512 376724 [DEBUG] main:1310: reset netlink done
2020-09-22 18:15:54.545 376724 [DEBUG] main:1318: 3 runner handlers found
2020-09-22 18:15:54.546 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:329 rbd/rbd.scm_t4: executing action block_dev
2020-09-22 18:15:54.546 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:331 rbd/rbd.scm_t4: action block_dev done
2020-09-22 18:15:54.546 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:329 rbd/rbd.scm_t4: executing action reset_ring
2020-09-22 18:15:54.546 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:331 rbd/rbd.scm_t4: action reset_ring done
2020-09-22 18:15:54.546 376724 [DEBUG] parse_tcmu_runner_args:924 rbd/rbd.scm_t4: Updated cfgstring: rbd/rbd/scm_t4;osd_op_timeout=30.
2020-09-22 18:15:54.547 376724 [DEBUG] dev_added:985 rbd/rbd.scm_t4: Got block_size 512, size in bytes 1099511627776
2020-09-22 18:15:54.547 376724 [DEBUG] tcmu_rbd_open:857 rbd/rbd.scm_t4: tcmu_rbd_open config rbd/rbd/scm_t4;osd_op_timeout=30 block size 512 num lbas 2147483648.
2020-09-22 18:15:54.575 376724 [DEBUG] timer_check_and_set_def:391 rbd/rbd.scm_t4: The cluster's default osd op timeout(0.000000), osd heartbeat grace(20) interval(6)
2020-09-22 18:15:54.680 376724 [DEBUG] tcmu_rbd_detect_device_class:334 rbd/rbd.scm_t4: Pool rbd using crush rule "replicated_rule"
2020-09-22 18:15:54.680 376724 [DEBUG] tcmu_rbd_match_device_class:265 rbd/rbd.scm_t4: ssd not a registered device class.
2020-09-22 18:15:54.681 376724 [DEBUG] tcmu_rbd_match_device_class:265 rbd/rbd.scm_t4: nvme not a registered device class.
2020-09-22 18:15:54.736 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:329 rbd/rbd.scm_t4: executing action block_dev
2020-09-22 18:15:54.736 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:331 rbd/rbd.scm_t4: action block_dev done
2020-09-22 18:15:54.737 376724 [DEBUG] dyn_config_start:427: Inotify is watching "/etc/tcmu", wd: 1, mask: IN_MODIFY
2020-09-22 18:15:54.746 376724 [DEBUG] dbus_bus_acquired:451: bus org.kernel.TCMUService1 acquired
2020-09-22 18:15:54.747 376724 [DEBUG] dbus_name_acquired:467: name org.kernel.TCMUService1 acquired
</pre>
gwcli remains frozen after I tried to delete the images from my target.
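(For reference, one way to check whether the gateway daemons themselves are wedged is to look at the ceph-iscsi services and their logs; a minimal sketch, assuming the standard ceph-iscsi service names and the tcmu-runner log path shown above:)

<pre>
# Are the gateway daemons still responsive?
systemctl status tcmu-runner rbd-target-api rbd-target-gw

# Recent rbd-target-api activity (journal), plus the tcmu-runner log
journalctl -u rbd-target-api -n 100
tail -n 100 /var/log/tcmu-runner.log
</pre>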
Updated by Xiubo Li <xiubli@redhat.com> - 2020-09-23 04:37:01 UTC (journal_id=175672)

<pre>
2020-09-18 00:23:03.900 195034 [DEBUG_SCSI_CMD] tcmu_cdb_print_info:353 rbd/rbd.scm_03: 8a 0 0 0 0 4 2 49 ae c0 0 0 2 0 0 0
2020-09-18 00:23:14.610 195034 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.scm_03: Timing out cmd.
2020-09-18 00:23:14.678 195034 [ERROR] tcmu_notify_conn_lost:201 rbd/rbd.scm_03: Handler connection lost (lock state 1)
2020-09-18 00:23:14.908 195034 [DEBUG] tgt_port_grp_recovery_thread_fn:223: Disabling iscsi/iqn.2001-07.com.ceph:1597702582812/tpgt_1.
2020-09-18 00:23:14.958 195034 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.scm_03: Timing out cmd.
2020-09-18 00:23:14.958 195034 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.scm_03: Timing out cmd.
</pre>
From the above logs, the OSDs (or something else along the I/O path) got stuck, and the timeout was reported back to tcmu-runner as ETIMEDOUT.
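(A minimal sketch of how that could be confirmed from the Ceph side; it assumes access to the OSD admin sockets, and osd.8 is only an example ID taken from the SLOW_OPS warning quoted later in this thread:)

<pre>
# Any currently failing health checks (SLOW_OPS, OSD_DOWN, ...)
ceph health detail

# On the node hosting a suspect OSD: ops currently in flight and
# recently completed slow ops, with the time spent in each stage
ceph daemon osd.8 dump_ops_in_flight
ceph daemon osd.8 dump_historic_slow_ops
</pre>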
<pre>
2020-09-22 18:13:49.469 195034 [DEBUG] handle_netlink:203: cmd 2. Got header version 2. Supported 2.
2020-09-22 18:13:49.508 195034 [DEBUG] tcmu_cancel_lock_thread:248 rbd/rbd.scm_03: waiting for lock thread to exit
</pre>
From these logs, you were trying to remove the device, but the removal got stuck waiting for the lock thread to finish. This can also happen if the OSDs are stuck and the lock acquisition hangs, as above.
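(One way to see whether a client is still holding the exclusive lock or a watch on the image; a sketch assuming the image from the logs above is rbd/scm_03:)

<pre>
# Current lockers of the image (lock ID, locker, client address)
rbd lock ls rbd/scm_03

# Watchers registered on the image header object
rbd status rbd/scm_03
</pre>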
I didn't see any other suspicious errors in tcmu-runner. Have you checked the Ceph logs?

Updated by Fabio Durieux Lopes - 2020-09-23 19:40:40 UTC (journal_id=175863)
You are right. The second set of log lines is from when I tried to remove the image from my target and couldn't. I finally decided to remove a lock with 'rbd lock rm', but now I have a weird situation: I removed the image from the target, but if I try to detach the image inside gwcli it says it is still mapped to the target. Closing and reopening gwcli makes the image appear inside my iSCSI target again.
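(Side note: 'rbd lock rm' only drops the lock entry; it does not evict the client that held it, so a dead or stuck gateway may still hold a watch on the image and keep things wedged. A hedged sketch of blacklisting that client, with a purely illustrative address that would have to be copied from the 'rbd lock ls' / 'rbd status' output:)

<pre>
# Blacklist the stale lock/watch holder so the OSDs stop honouring it
# (the address below is illustrative only)
ceph osd blacklist add 10.18.22.81:0/1234567890

# Confirm the entry; it expires on its own (default 1 hour)
ceph osd blacklist ls
</pre>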
I/O operations on the image are still hanging (e.g. feature enable/disable and image resizing).
Which logs should I look into and what should I search for?
'ceph -s' says HEALTH_OK and:

<pre>
osd: 30 osds: 30 up (since 4d), 30 in (since 5w)
</pre>
It seems some OSDs were down at least close to the time my Windows server froze.
I see entries like this in the logs:
<pre>
2020-09-18 00:23:23.935155 mon.ceph01 (mon.0) 8624 : cluster [INF] osd.1 [v2:10.18.22.81:6833/3392,v1:10.18.22.81:6836/3392] boot
2020-09-18 00:23:23.935214 mon.ceph01 (mon.0) 8625 : cluster [INF] osd.10 [v2:10.18.22.81:6812/3406,v1:10.18.22.81:6815/3406] boot
2020-09-18 00:23:23.935266 mon.ceph01 (mon.0) 8626 : cluster [DBG] osdmap e67839: 30 total, 29 up, 30 in
2020-09-18 00:23:24.799475 mon.ceph01 (mon.0) 8627 : cluster [INF] Health check cleared: SLOW_OPS (was: 5 slow ops, oldest one blocked for 30 sec, osd.8 has slow ops)
2020-09-18 00:23:24.820213 mon.ceph01 (mon.0) 8628 : cluster [DBG] osdmap e67840: 30 total, 29 up, 30 in
2020-09-18 00:23:25.846554 mon.ceph01 (mon.0) 8629 : cluster [DBG] osdmap e67841: 30 total, 29 up, 30 in
2020-09-18 00:23:17.952518 osd.23 (osd.23) 246 : cluster [DBG] 29.1e scrub starts
2020-09-18 00:23:26.852741 mon.ceph01 (mon.0) 8630 : cluster [DBG] mds.? [v2:10.18.22.82:6800/2678136627,v1:10.18.22.82:6801/2678136627] up:boot
2020-09-18 00:23:26.852834 mon.ceph01 (mon.0) 8631 : cluster [DBG] mds.? [v2:10.18.22.81:6838/613848769,v1:10.18.22.81:6839/613848769] up:boot
2020-09-18 00:23:26.852888 mon.ceph01 (mon.0) 8632 : cluster [DBG] fsmap 3 up:standby
2020-09-18 00:23:27.400352 mon.ceph01 (mon.0) 8633 : cluster [WRN] Health check failed: 8 slow ops, oldest one blocked for 47 sec, mon.ceph01 has slow ops (SLOW_OPS)
2020-09-18 00:23:27.415445 mon.ceph01 (mon.0) 8634 : cluster [DBG] osdmap e67842: 30 total, 29 up, 30 in
2020-09-18 00:23:18.950937 osd.5 (osd.5) 237 : cluster [WRN] Monitor daemon marked osd.5 down, but it is still running
2020-09-18 00:23:18.950961 osd.5 (osd.5) 238 : cluster [DBG] map e67837 wrongly marked me down at e67833
2020-09-18 00:23:20.401182 osd.1 (osd.1) 226 : cluster [WRN] Monitor daemon marked osd.1 down, but it is still running
2020-09-18 00:23:20.401197 osd.1 (osd.1) 227 : cluster [DBG] map e67837 wrongly marked me down at e67837
2020-09-18 00:23:28.853573 mon.ceph01 (mon.0) 8635 : cluster [WRN] Health check update: Degraded data redundancy: 220871/9417759 objects degraded (2.345%), 10 pgs degraded (PG_DEGRADED)
2020-09-18 00:23:26.934449 osd.16 (osd.16) 324 : cluster [DBG] 29.52 scrub starts
</pre>
Some scrubbing going on.
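(For the record, the lines above can be pulled out of the cluster log on the mon host with something like the following, assuming the default log path:)

<pre>
# OSD flaps and slow ops around the time the initiator froze
grep -E 'wrongly marked me down|SLOW_OPS|slow ops' /var/log/ceph/ceph.log

# Narrow to the relevant time window
grep '2020-09-18 00:2' /var/log/ceph/ceph.log
</pre>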
My main concern here is how to "unstick" my image without losing data (which has already happened twice).

Updated by Fabio Durieux Lopes - 2020-09-23 19:59:17 UTC (journal_id=175865)
* File ceph.log-20200918.gz added (/attachments/download/5155/ceph.log-20200918.gz)
* File ceph.log-20200917.gz added (/attachments/download/5156/ceph.log-20200917.gz)

Ceph's logs.

Updated by Fabio Durieux Lopes - 2020-09-28 21:16:27 UTC (journal_id=176154)
My image is still stuck and I can't get it back up. Trying to remove and re-add it failed:
<pre>
2020-09-22 18:26:28,270 ERROR [rbd-target-api:2574:call_api()] - _targetlun change on ceph02 failed with 400
2020-09-22 18:26:28,270 DEBUG [rbd-target-api:2596:call_api()] - failed on ceph02. Failed to remove the LUN - Delete from LIO/backstores failed - [Errno 2] No such file or directory: '/sys/kernel/config/target/core/user_0/rbd.scm_03/alua'
2020-09-28 18:02:37,390 ERROR [rbd-target-api:1756:_update_client()] - client update failed on iqn.1991-05.com.microsoft:svmnt-w12up-01 : Non-existent images [u'rbd.scm_03'] requested for iqn.1991-05.com.microsoft:svmnt-w12up-01
2020-09-28 18:02:37,398 ERROR [rbd-target-api:2574:call_api()] - _clientlun change on localhost failed with 500

[root@ceph01 ~]# ls -l /sys/kernel/config/target/core/user_0/rbd.scm_03/alua
ls: cannot access /sys/kernel/config/target/core/user_0/rbd.scm_03/alua: No such file or directory
</pre>
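(The missing alua directory suggests the kernel-side backstore for rbd.scm_03 is gone or only half torn down while the gateway config may still reference it. A sketch of how to compare the two views, assuming the standard configfs mount and that targetcli is installed:)

<pre>
# What the kernel target (LIO) still has configured for tcmu backstores
ls /sys/kernel/config/target/core/ | grep '^user'
ls /sys/kernel/config/target/core/user_0/

# Full LIO tree, independent of gwcli's own config object
targetcli ls
</pre>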
Will try to reboot the servers.

Updated by Fabio Durieux Lopes - 2020-09-28 21:24:45 UTC (journal_id=176155)
A soft reboot (shutdown -r now) did not succeed, as the server did not shut down and was hanging. Had to do a hard reset.