Bug #47562

RBD image is stuck after iSCSI disconnection

Added by Fabio Durieux Lopes about 1 month ago. Updated 26 days ago.

Status: New
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I have a problem whose cause I'm having a hard time tracking down. I have an RBD image being used by a Windows Server 2019 host. In production, after connecting and writing successfully, this image SOMETIMES gets stuck right after that server disconnects from the iSCSI interface. By stuck I mean the image cannot be used inside Ceph at all: I can't remove it from my iSCSI target, I can't resize it, I can't enable/disable features, nor map it even on a Linux box using rbd.

Three times I left a robocopy syncing some data (writing to Ceph) that, after some days, caused my Windows server to freeze and forced me to reboot it.

I was not able to reproduce this using an "identical" test VM (also Windows Server 2019), but I managed to reproduce it multiple times on my production server.

"ceph -s" says HEALTH_OK. Nothing much is happening in my cluster, as it is still in the testing phase, so I have only one client at a time.

Linux is CentOS 7, Linux ceph02 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Ceph is:
ceph -v
ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)

I have some tcmu-runner log here:
https://pastebin.com/ShLmDdE2
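For anyone hitting the same symptom, a first check with the standard rbd CLI is whether a stale iSCSI client still holds a watch or the exclusive lock on the image (a sketch only; pool/image names are the ones from this report):

```shell
# List watchers on the image; a stale gateway or initiator that never
# cleanly disconnected will still show up here:
rbd status rbd/scm_03

# List advisory locks; with the exclusive-lock feature, a leftover lock
# from the disconnected client appears with its client id and address:
rbd lock ls rbd/scm_03
```

If `rbd lock ls` shows a locker whose address matches the disconnected gateway, that is consistent with other clients hanging on any operation that needs the lock.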

ceph.log-20200918.gz (891 KB) Fabio Durieux Lopes, 09/23/2020 07:58 PM

ceph.log-20200917.gz (808 KB) Fabio Durieux Lopes, 09/23/2020 07:58 PM

History

#1 Updated by Brad Hubbard about 1 month ago

  • Project changed from Ceph to rbd

#2 Updated by Fabio Durieux Lopes about 1 month ago

Tried to remove the image from my iSCSI target again, and it seems a tcmu-runner thread is stuck on a lock:

2020-09-22 18:13:49.469 195034 [DEBUG] handle_netlink:203: cmd 2. Got header version 2. Supported 2.
2020-09-22 18:13:49.508 195034 [DEBUG] tcmu_cancel_lock_thread:248 rbd/rbd.scm_03: waiting for lock thread to exit

[root@ceph01 ~]# systemctl restart tcmu-runner.service
[root@ceph01 ~]# vi /var/log/tcmu-runner.log

2020-09-22 18:13:49.469 195034 [DEBUG] handle_netlink:203: cmd 2. Got header version 2. Supported 2.
2020-09-22 18:13:49.508 195034 [DEBUG] tcmu_cancel_lock_thread:248 rbd/rbd.scm_03: waiting for lock thread to exit
2020-09-22 18:15:54.511 376724 [DEBUG] load_our_module:538: Module 'target_core_user' is already loaded
2020-09-22 18:15:54.512 376724 [DEBUG] main:1291: handler path: /usr/lib64/tcmu-runner
2020-09-22 18:15:54.512 376724 [DEBUG] main:1297: blocking netlink
2020-09-22 18:15:54.512 376724 [DEBUG] main:1300: blocking netlink done
2020-09-22 18:15:54.512 376724 [DEBUG] main:1308: resetting netlink
2020-09-22 18:15:54.512 376724 [DEBUG] main:1310: reset netlink done
2020-09-22 18:15:54.545 376724 [DEBUG] main:1318: 3 runner handlers found
2020-09-22 18:15:54.546 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:329 rbd/rbd.scm_t4: executing action block_dev
2020-09-22 18:15:54.546 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:331 rbd/rbd.scm_t4: action block_dev done
2020-09-22 18:15:54.546 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:329 rbd/rbd.scm_t4: executing action reset_ring
2020-09-22 18:15:54.546 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:331 rbd/rbd.scm_t4: action reset_ring done
2020-09-22 18:15:54.546 376724 [DEBUG] parse_tcmu_runner_args:924 rbd/rbd.scm_t4: Updated cfgstring: rbd/rbd/scm_t4;osd_op_timeout=30.
2020-09-22 18:15:54.547 376724 [DEBUG] dev_added:985 rbd/rbd.scm_t4: Got block_size 512, size in bytes 1099511627776
2020-09-22 18:15:54.547 376724 [DEBUG] tcmu_rbd_open:857 rbd/rbd.scm_t4: tcmu_rbd_open config rbd/rbd/scm_t4;osd_op_timeout=30 block size 512 num lbas 2147483648.
2020-09-22 18:15:54.575 376724 [DEBUG] timer_check_and_set_def:391 rbd/rbd.scm_t4: The cluster's default osd op timeout(0.000000), osd heartbeat grace(20) interval(6)
2020-09-22 18:15:54.680 376724 [DEBUG] tcmu_rbd_detect_device_class:334 rbd/rbd.scm_t4: Pool rbd using crush rule "replicated_rule"
2020-09-22 18:15:54.680 376724 [DEBUG] tcmu_rbd_match_device_class:265 rbd/rbd.scm_t4: ssd not a registered device class.
2020-09-22 18:15:54.681 376724 [DEBUG] tcmu_rbd_match_device_class:265 rbd/rbd.scm_t4: nvme not a registered device class.
2020-09-22 18:15:54.736 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:329 rbd/rbd.scm_t4: executing action block_dev
2020-09-22 18:15:54.736 376724 [DEBUG] tcmu_cfgfs_dev_exec_action:331 rbd/rbd.scm_t4: action block_dev done
2020-09-22 18:15:54.737 376724 [DEBUG] dyn_config_start:427: Inotify is watching "/etc/tcmu", wd: 1, mask: IN_MODIFY
2020-09-22 18:15:54.746 376724 [DEBUG] dbus_bus_acquired:451: bus org.kernel.TCMUService1 acquired
2020-09-22 18:15:54.747 376724 [DEBUG] dbus_name_acquired:467: name org.kernel.TCMUService1 acquired

gwcli remains frozen after I tried to delete the images from my target

#3 Updated by Xiubo Li about 1 month ago

2020-09-18 00:23:03.900 195034 [DEBUG_SCSI_CMD] tcmu_cdb_print_info:353 rbd/rbd.scm_03: 8a 0 0 0 0 4 2 49 ae c0 0 0 2 0 0 0
2020-09-18 00:23:14.610 195034 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.scm_03: Timing out cmd.
2020-09-18 00:23:14.678 195034 [ERROR] tcmu_notify_conn_lost:201 rbd/rbd.scm_03: Handler connection lost (lock state 1)
2020-09-18 00:23:14.908 195034 [DEBUG] tgt_port_grp_recovery_thread_fn:223: Disabling iscsi/iqn.2001-07.com.ceph:1597702582812/tpgt_1.
2020-09-18 00:23:14.958 195034 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.scm_03: Timing out cmd.
2020-09-18 00:23:14.958 195034 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.scm_03: Timing out cmd.

From the above logs, the OSDs were stuck (or something similar happened), and this was reported back to tcmu-runner as an ETIMEDOUT error.

2020-09-22 18:13:49.469 195034 [DEBUG] handle_netlink:203: cmd 2. Got header version 2. Supported 2.
2020-09-22 18:13:49.508 195034 [DEBUG] tcmu_cancel_lock_thread:248 rbd/rbd.scm_03: waiting for lock thread to exit

From these logs, you were trying to remove the device, but it got stuck waiting for the lock thread to finish.
This could also happen if the OSDs were stuck and the lock acquisition stalled, as above.

I didn't see any other suspicious errors in tcmu-runner. Have you checked the Ceph logs?
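For reference, the cluster-wide log on the active monitor host is the place to look. A hedged sketch (paths are the CentOS/Nautilus defaults; adjust to your deployment):

```shell
# Cluster log on the mon host: look for slow ops and OSDs flapping
# around the time the image got stuck.
grep -E 'SLOW_OPS|slow ops|wrongly marked me down' /var/log/ceph/ceph.log

# Per-daemon logs, if a specific OSD (e.g. osd.8 from the slow-ops
# health warning) needs a closer look:
less /var/log/ceph/ceph-osd.8.log
```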

#4 Updated by Fabio Durieux Lopes about 1 month ago

You are right. The second set of log lines was from when I tried to remove the image from my target and was not able to. I finally decided to remove a lock with 'rbd lock rm', but now I have a weird situation: I removed the image from the target, but if I try to detach the image inside gwcli it says it is still mapped to the target. Closing and reopening gwcli makes the image reappear inside my iSCSI target.
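For the record, breaking the lock by hand usually requires blacklisting the stale client first, otherwise librbd can simply re-acquire it. A hedged sketch using the standard Nautilus-era commands (image name is from this report; the client id, lock id, and address below are placeholders, not values from my cluster):

```shell
# Find the current locker; output columns are locker, lock id, address.
rbd lock ls rbd/scm_03
# e.g.:  client.123456  auto 94558764345872  10.18.22.81:0/1234567890

# Blacklist the stale client address so it cannot re-take the lock:
ceph osd blacklist add 10.18.22.81:0/1234567890

# Then remove the lock (arguments: image, lock id, locker):
rbd lock rm rbd/scm_03 "auto 94558764345872" client.123456
```

Without the blacklist step, removing the lock alone can leave the old session able to grab it back, which may explain the half-detached state gwcli is showing.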

I/O operations on the image are still hanging (e.g. feature enable/disable and image resizing).

Which logs should I look into and what should I search for?

'ceph -s' says HEALTH_OK and:
osd: 30 osds: 30 up (since 4d), 30 in (since 5w)

It seems some OSDs were down at least around the time my Windows server froze.

I have some of this on the logs:

2020-09-18 00:23:23.935155 mon.ceph01 (mon.0) 8624 : cluster [INF] osd.1 [v2:10.18.22.81:6833/3392,v1:10.18.22.81:6836/3392] boot
2020-09-18 00:23:23.935214 mon.ceph01 (mon.0) 8625 : cluster [INF] osd.10 [v2:10.18.22.81:6812/3406,v1:10.18.22.81:6815/3406] boot
2020-09-18 00:23:23.935266 mon.ceph01 (mon.0) 8626 : cluster [DBG] osdmap e67839: 30 total, 29 up, 30 in
2020-09-18 00:23:24.799475 mon.ceph01 (mon.0) 8627 : cluster [INF] Health check cleared: SLOW_OPS (was: 5 slow ops, oldest one blocked for 30 sec, osd.8 has slow ops)
2020-09-18 00:23:24.820213 mon.ceph01 (mon.0) 8628 : cluster [DBG] osdmap e67840: 30 total, 29 up, 30 in
2020-09-18 00:23:25.846554 mon.ceph01 (mon.0) 8629 : cluster [DBG] osdmap e67841: 30 total, 29 up, 30 in
2020-09-18 00:23:17.952518 osd.23 (osd.23) 246 : cluster [DBG] 29.1e scrub starts
2020-09-18 00:23:26.852741 mon.ceph01 (mon.0) 8630 : cluster [DBG] mds.? [v2:10.18.22.82:6800/2678136627,v1:10.18.22.82:6801/2678136627] up:boot
2020-09-18 00:23:26.852834 mon.ceph01 (mon.0) 8631 : cluster [DBG] mds.? [v2:10.18.22.81:6838/613848769,v1:10.18.22.81:6839/613848769] up:boot
2020-09-18 00:23:26.852888 mon.ceph01 (mon.0) 8632 : cluster [DBG] fsmap 3 up:standby
2020-09-18 00:23:27.400352 mon.ceph01 (mon.0) 8633 : cluster [WRN] Health check failed: 8 slow ops, oldest one blocked for 47 sec, mon.ceph01 has slow ops (SLOW_OPS)
2020-09-18 00:23:27.415445 mon.ceph01 (mon.0) 8634 : cluster [DBG] osdmap e67842: 30 total, 29 up, 30 in
2020-09-18 00:23:18.950937 osd.5 (osd.5) 237 : cluster [WRN] Monitor daemon marked osd.5 down, but it is still running
2020-09-18 00:23:18.950961 osd.5 (osd.5) 238 : cluster [DBG] map e67837 wrongly marked me down at e67833
2020-09-18 00:23:20.401182 osd.1 (osd.1) 226 : cluster [WRN] Monitor daemon marked osd.1 down, but it is still running
2020-09-18 00:23:20.401197 osd.1 (osd.1) 227 : cluster [DBG] map e67837 wrongly marked me down at e67837
2020-09-18 00:23:28.853573 mon.ceph01 (mon.0) 8635 : cluster [WRN] Health check update: Degraded data redundancy: 220871/9417759 objects degraded (2.345%), 10 pgs degraded (PG_DEGRADED)
2020-09-18 00:23:26.934449 osd.16 (osd.16) 324 : cluster [DBG] 29.52 scrub starts

Some scrubbing going on.

My main concern here is how to get my image unstuck without losing data (which has already happened twice).

#5 Updated by Fabio Durieux Lopes about 1 month ago

Ceph's logs

#6 Updated by Fabio Durieux Lopes 26 days ago

My image is still stuck and I can't get it back up. Trying to remove and re-add it failed:

2020-09-22 18:26:28,270 ERROR [rbd-target-api:2574:call_api()] - _targetlun change on ceph02 failed with 400
2020-09-22 18:26:28,270 DEBUG [rbd-target-api:2596:call_api()] - failed on ceph02. Failed to remove the LUN - Delete from LIO/backstores failed - [Errno 2] No such file or directory: '/sys/kernel/config/target/core/user_0/rbd.scm_03/alua'
2020-09-28 18:02:37,390 ERROR [rbd-target-api:1756:_update_client()] - client update failed on iqn.1991-05.com.microsoft:svmnt-w12up-01 : Non-existent images [u'rbd.scm_03'] requested for iqn.1991-05.com.microsoft:svmnt-w12up-01
2020-09-28 18:02:37,398 ERROR [rbd-target-api:2574:call_api()] - _clientlun change on localhost failed with 500
[root@ceph01 ~]# ls -l /sys/kernel/config/target/core/user_0/rbd.scm_03/alua
ls: cannot access /sys/kernel/config/target/core/user_0/rbd.scm_03/alua: No such file or directory

Will try to reboot the servers.

#7 Updated by Fabio Durieux Lopes 26 days ago

A soft reboot (shutdown -r now) did not succeed, as the server hung instead of shutting down. I had to do a hard reset.
