Bug #57960

open

iscsi - rbd-target-api unkillable on container exit, daemon enters error state

Added by David Heap over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: orchestrator
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We have seen many iSCSI container restarts caused by https://tracker.ceph.com/issues/57897, and during some of them the daemon enters an error state.

On investigation, it looks like the rbd-target-api process is left behind when the container exits: it is stuck in uninterruptible sleep (state D) on a write to /sys/kernel/config/target/core/user_0/<pool_name>.<disk_name>/enable, so neither SIGTERM nor SIGKILL has any effect.

At present, the only way we have found to clear the stuck process is to reboot the host.

We have seen this issue across our production and test clusters.
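
For anyone trying to confirm the same symptom on their own hosts, here is a minimal sketch (not part of ceph-iscsi or cephadm) that scans /proc for processes in uninterruptible sleep. The function names parse_stat and find_uninterruptible are made up for illustration; the only assumption is the standard Linux /proc/<pid>/stat layout.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while scanning, or unexpected format
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

On an affected host, calling find_uninterruptible() should list the stuck rbd-target-api PID alongside any other D-state tasks. A process that is only briefly in state D is normal, so it is worth sampling more than once before drawing conclusions.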


$ sudo ps aux | grep [r]bd
root      503587  0.0  0.0   1012     4 ?        Ss   10:34   0:00 /dev/init -- /usr/bin/rbd-target-api
root      503604  0.6  3.2 2995952 129844 ?      Dl   10:34   0:04 /usr/bin/python3.6 -s /usr/bin/rbd-target-api

$ sudo kill 503604

$ sudo ps aux | grep [r]bd
root      503587  0.0  0.0   1012     4 ?        Ss   10:34   0:00 /dev/init -- /usr/bin/rbd-target-api
root      503604  0.6  3.2 2995952 129844 ?      Dl   10:34   0:04 /usr/bin/python3.6 -s /usr/bin/rbd-target-api

$ sudo kill -9 503604

$ sudo ps aux | grep [r]bd
root      503587  0.0  0.0   1012     4 ?        Ss   10:34   0:00 /dev/init -- /usr/bin/rbd-target-api
root      503604  0.6  3.3 2995952 133700 ?      D    10:34   0:04 /usr/bin/python3.6 -s /usr/bin/rbd-target-api

$ sudo kill -9 503587

$ sudo ps aux | grep [r]bd
root      503604  0.6  3.3 2995952 133700 ?      D    10:34   0:04 /usr/bin/python3.6 -s /usr/bin/rbd-target-api

$ sudo kill -9 503604

$ sudo ps aux | grep [r]bd
root      503604  0.6  3.3 2995952 133700 ?      D    10:34   0:04 /usr/bin/python3.6 -s /usr/bin/rbd-target-api

$ sudo tail -f /var/log/kern.log
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334248] INFO: task rbd-target-api:503604 blocked for more than 845 seconds.
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334351]       Not tainted 5.10.0-17-amd64 #1 Debian 5.10.136-1
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334537] task:rbd-target-api  state:D stack:    0 pid:503604 ppid:503587 flags:0x00004004
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334541] Call Trace:
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334565]  __schedule+0x282/0x880
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334571]  ? __wake_up_common_lock+0x8a/0xc0
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334574]  ? usleep_range+0x90/0x90
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334576]  schedule+0x46/0xb0
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334578]  schedule_timeout+0x107/0x150
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334581]  ? __prepare_to_swait+0x4f/0x70
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334583]  __wait_for_common+0xae/0x160
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334599]  tcmu_netlink_event_send+0x188/0x2d0 [target_core_user]
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334603]  tcmu_configure_device+0x26e/0x3a0 [target_core_user]
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334630]  target_configure_device+0x76/0x250 [target_core_mod]
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334641]  target_dev_enable_store+0x32/0x50 [target_core_mod]
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334648]  configfs_write_file+0xe3/0x150 [configfs]
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334653]  vfs_write+0xc4/0x260
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334655]  ksys_write+0x5f/0xe0
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334659]  do_syscall_64+0x33/0x80
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334662]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334665] RIP: 0033:0x7f17925d8a17
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334666] RSP: 002b:00007fff714e57f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334669] RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f17925d8a17
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334670] RDX: 0000000000000002 RSI: 00005579c12874d0 RDI: 000000000000004b
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334671] RBP: 00005579c12874d0 R08: 0000000000000000 R09: 00007f17929fa0ed
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334672] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000002
Oct 19 10:49:17 dh-ceph02-test kernel: [165048.334673] R13: 000000000000004b R14: 00005579c12874d0 R15: 00005579bf208110

$ sudo strace -p 503604
strace: Process 503604 attached
^C
$ sudo lsof -p 503604 | grep [k]ernel
rbd-targe 503604 root   75w      REG    0,30        0 2781916 /sys/kernel/config/target/core/user_0/rbd.disk_1/enable
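
To spot this condition across several hosts without reading kern.log by hand, the hung-task message above can be matched with a small regex. This is only a sketch: the name parse_hung_task is hypothetical, and the pattern is based solely on the "blocked for more than N seconds" line shown in this report, so other kernel versions may word it differently.

```python
import re

# Pattern taken from the hung-task line in the kern.log excerpt above.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def parse_hung_task(line):
    """Return (comm, pid, seconds) for a hung-task log line, else None."""
    match = HUNG_TASK_RE.search(line)
    if match is None:
        return None
    return match.group("comm"), int(match.group("pid")), int(match.group("secs"))
```

Feeding each line of kern.log through parse_hung_task and keeping the non-None results gives the task name, PID and blocked duration, which can then be compared against the PIDs reported by ps.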

Actions #1

Updated by David Heap over 1 year ago

The systemd service shows the following:


Nov 02 10:46:46 dh-ceph01-test systemd[1]: Starting Ceph iscsi.iscsi.dh-ceph01-test.jrdndc for b274fb37-d18a-45ee-a088-cf175459601b...
Nov 02 10:46:46 dh-ceph01-test systemd[1]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Found left-over process 82862 (conmon) in control group while starting unit. Ignoring.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Found left-over process 82873 (init) in control group while starting unit. Ignoring.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Found left-over process 82907 (rbd-target-api) in control group while starting unit. Ignoring.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Nov 02 10:46:46 dh-ceph01-test systemd[626448]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Failed to attach to cgroup /system.slice/system-ceph\x2db274fb37\x2dd18a\x2d45ee\x2da088\x2dcf175459601b.slice/ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Device or resource busy
Nov 02 10:46:46 dh-ceph01-test systemd[626448]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Failed at step CGROUP spawning /bin/bash: Device or resource busy
Nov 02 10:46:46 dh-ceph01-test systemd[1]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Control process exited, code=exited, status=219/CGROUP
Nov 02 10:46:46 dh-ceph01-test bash[626450]: Error: no container with name or ID ceph-b274fb37-d18a-45ee-a088-cf175459601b-iscsi-iscsi-dh-ceph01-test-jrdndc-tcmu found: no such container
Nov 02 10:46:46 dh-ceph01-test systemd[1]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Failed with result 'exit-code'.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Unit process 82862 (conmon) remains running after unit stopped.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Unit process 82873 (init) remains running after unit stopped.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: ceph-b274fb37-d18a-45ee-a088-cf175459601b@iscsi.iscsi.dh-ceph01-test.jrdndc.service: Unit process 82907 (rbd-target-api) remains running after unit stopped.
Nov 02 10:46:46 dh-ceph01-test systemd[1]: Failed to start Ceph iscsi.iscsi.dh-ceph01-test.jrdndc for b274fb37-d18a-45ee-a088-cf175459601b.
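
The left-over PIDs in the journal output above are the ones that keep the cgroup busy. As a rough sketch for collecting them from saved journal text, assuming the "Found left-over process" wording shown above (the helper name leftover_processes is invented for this example):

```python
import re

# Matches systemd's "Found left-over process <pid> (<comm>) in control group".
LEFTOVER_RE = re.compile(
    r"Found left-over process (?P<pid>\d+) \((?P<comm>[^)]+)\) in control group"
)


def leftover_processes(journal_lines):
    """Map PID -> process name for every left-over process systemd reported."""
    found = {}
    for line in journal_lines:
        match = LEFTOVER_RE.search(line)
        if match:
            found[int(match.group("pid"))] = match.group("comm")
    return found
```

The resulting PIDs can be cross-checked against ps to see which of them is the D-state rbd-target-api process and which are merely waiting on it.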