Bug #36747
RBD Unmap Hang (Closed)
Description
In the last month, we've seen a small subset of our hosts with RBD clients experience hangs with this command:
rbd unmap /dev/rbdX
The command hangs indefinitely; we have seen hangs last as long as 3 weeks. Rebooting the system is our only recourse for removing the mapped RBDs.
Tracing and debugging have shown repeated attempts to write the RBD device number to /sys/bus/rbd/remove_single_major.
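For reference, the unmap path boils down to a single sysfs write; here is a minimal sketch of the equivalent operation (device id 0 is an arbitrary illustration, not one of our devices):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of the sysfs write behind "rbd unmap /dev/rbd0"; the real
 * rbd tool resolves the device id from the device node first. */
int main(void)
{
    int fd = open("/sys/bus/rbd/remove_single_major", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    ssize_t n = write(fd, "0", 1);  /* device id, as ASCII */
    printf("write returned %zd\n", n);
    close(fd);
    return 0;
}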
Here is a trace an SRE captured after noticing high CPU utilization on a system (they had not yet realized it was related to the RBD unmap hang).
kernel: [5878242.026440] sysrq: SysRq : Show backtrace of all active CPUs
kernel: [5878242.032600] NMI backtrace for cpu 7
kernel: [5878242.032602] CPU: 7 PID: 35062 Comm: bash Tainted: G O 4.14.61-vanilla-ams-7 #1
kernel: [5878242.032603] Hardware name: Quanta Cloud Technology Inc. L08A-AMS/L08A-AMS-MB, BIOS S1Q_3A08.NEO03 10/18/2016
kernel: [5878242.032604] Call Trace:
kernel: [5878242.032611] dump_stack+0x5c/0x78
kernel: [5878242.032614] nmi_cpu_backtrace+0xb3/0xc0
kernel: [5878242.032618] ? irq_force_complete_move+0xe0/0xe0
kernel: [5878242.032620] nmi_trigger_cpumask_backtrace+0x8d/0xc0
kernel: [5878242.032624] __handle_sysrq+0x84/0x130
kernel: [5878242.032626] write_sysrq_trigger+0x2b/0x30
kernel: [5878242.032629] proc_reg_write+0x40/0x70
kernel: [5878242.032632] __vfs_write+0x26/0x190
kernel: [5878242.032634] vfs_write+0xb1/0x1a0
kernel: [5878242.032636] SyS_write+0x42/0xa0
kernel: [5878242.032639] do_syscall_64+0x67/0x120
kernel: [5878242.032642] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
kernel: [5878242.032644] RIP: 0033:0x7f8ece5c5bb8
kernel: [5878242.032645] RSP: 002b:00007ffc1e7471f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: [5878242.032647] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f8ece5c5bb8
kernel: [5878242.032647] RDX: 0000000000000002 RSI: 0000558131c0d1c0 RDI: 0000000000000001
kernel: [5878242.032648] RBP: 0000558131c0d1c0 R08: 000000000000000a R09: 0000558131d0cc70
kernel: [5878242.032649] R10: 000000000000009b R11: 0000000000000246 R12: 0000000000000002
kernel: [5878242.032650] R13: 0000000000000001 R14: 00007f8ece898740 R15: 0000000000000002
kernel: [5878242.032652] Sending NMI from CPU 7 to CPUs 0-6,8-11:
kernel: [5878242.032657] NMI backtrace for cpu 0
kernel: [5878242.032659] CPU: 0 PID: 184436 Comm: kworker/0:3 Tainted: G O 4.14.61-vanilla-ams-7 #1
kernel: [5878242.032660] Hardware name: Quanta Cloud Technology Inc. L08A-AMS/L08A-AMS-MB, BIOS S1Q_3A08.NEO03 10/18/2016
kernel: [5878242.032664] Workqueue: events_power_efficient process_srcu
kernel: [5878242.032665] task: ffff880287a8d100 task.stack: ffffc9002d8a8000
kernel: [5878242.032669] RIP: 0010:set_work_pool_and_clear_pending+0x19/0x20
kernel: [5878242.032669] RSP: 0018:ffffc9002d8abe98 EFLAGS: 00000002
kernel: [5878242.032670] RAX: ffff88085f224405 RBX: ffffffff822b78c0 RCX: ffff88085f220320
kernel: [5878242.032671] RDX: ffff88085f220320 RSI: 0000000000000000 RDI: ffffffff822b78c0
kernel: [5878242.032672] RBP: ffff88085f224400 R08: 0000000000000001 R09: ffff88085f220300
kernel: [5878242.032672] R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000000
kernel: [5878242.032673] R13: 0000000000000000 R14: ffff88074f71c300 R15: ffff88085f220300
kernel: [5878242.032674] FS: 0000000000000000(0000) GS:ffff88085f200000(0000) knlGS:0000000000000000
kernel: [5878242.032675] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [5878242.032676] CR2: 00007f32dc9f6638 CR3: 000000000220a006 CR4: 00000000003606f0
kernel: [5878242.032676] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [5878242.032677] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: [5878242.032677] Call Trace:
kernel: [5878242.032681] process_one_work+0x19a/0x3f0
kernel: [5878242.032684] worker_thread+0x2b/0x3d0
kernel: [5878242.032686] ? process_one_work+0x3f0/0x3f0
kernel: [5878242.032687] kthread+0x119/0x130
kernel: [5878242.032689] ? kthread_create_on_node+0x40/0x40
kernel: [5878242.032691] ret_from_fork+0x35/0x40
kernel: [5878242.032692] Code: 8b 43 20 e8 9a 17 b4 00 48 89 43 30 5b c3 0f 1f 40 00 0f 1f 44 00 00 48 8b 07 48 63 f6 48 c1 e6 05 a8 01 74 07 48 89 37 0f ae f0 <c3> 0f 0b eb f5 66 90 0f 1f 44 00 00 83 ff 20 75 07 65 8b 3d 87
kernel: [5878242.032757] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.032858] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.032959] NMI backtrace for cpu 3 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033069] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033093] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033168] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033268] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033368] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033497] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033591] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff818864f5
There are around 40 RBDs on a system. In the situations we've examined closely, the first unmap command to hang is at least minutes away from the nearest preceding or following unmap command. The Ceph cluster appears to be healthy at the time of the first unmap hang, but our health monitoring is not granular enough to be certain.
# lsof /sys/bus/rbd/remove_single_major
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rbd 497 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 2030 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 3705 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 12766 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 17817 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 22987 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 24851 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 26582 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 28278 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 29340 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 32297 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 33880 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 38617 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 155721 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 167310 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
bash 168901 root 1w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 170086 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 196590 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 203939 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 206025 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 211412 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 219181 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 222116 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 224584 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 229789 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 231042 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 232167 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 232645 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 239826 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 241864 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 243141 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 246922 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 248469 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 249450 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 250723 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 253133 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 253557 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
rbd 254352 root 7w REG 0,19 4096 53892 /sys/bus/rbd/remove_single_major
Guidance on debugging, insights into anything else to look at, and suggestions for recovery methods are appreciated.
Updated by Jason Dillaman over 5 years ago
- Project changed from rbd to Linux kernel client
Updated by Ilya Dryomov over 5 years ago
- Category set to rbd
- Assignee set to Ilya Dryomov
A bunch of blocked "rbd unmap" processes is unlikely to be the cause of high CPU utilization. They appear to be blocked on something in the kernel.
Try dumping the kernel stack with "cat /proc/<rbd unmap pid>/stack" or "echo w >/proc/sysrq-trigger".
Updated by Cliff Pajaro over 5 years ago
I have dumped it and it is repeatedly attempting to write the RBD device number to /sys/bus/rbd/remove_single_major.
Updated by Ilya Dryomov over 5 years ago
What do you mean by repeatedly? Do you see the same "rbd unmap" process calling write(2) multiple times as opposed to being blocked after calling write(2) just once?
Can you paste "cat /proc/<rbd unmap pid>/stack" output?
Updated by Cliff Pajaro over 5 years ago
I see the write occurring multiple times a second. I do not have a system in this state to get the output.
Updated by Ilya Dryomov over 5 years ago
Cliff Pajaro wrote:
I see the write occurring multiple times a second.
Just clarifying, did you see the same "rbd unmap" process attempting that write multiple times a second? If so, how did you observe it?
Updated by Cliff Pajaro over 5 years ago
I attached gdb to an unmap process and it repeatedly showed that operation (apologies, I can't find the logs I recorded of this). All unmap processes were stuck doing this; we have a maximum of ~40 RBDs on a single host, and by this point all of them were unmapping.
Updated by Ilya Dryomov over 5 years ago
gdb shows the userspace stack, so you never saw the kernel stack. If it happens again, do "cat /proc/<rbd unmap pid>/stack" and "echo w >/proc/sysrq-trigger" and paste the output (the latter outputs to dmesg).
Updated by Cliff Pajaro over 5 years ago
Clarification: I ran
strace -o strace_out.log -fF -s2048 <pid> >>strace_out.log
which showed the repeated write attempts. I will get that output the next time I see this happen.
Updated by Cliff Pajaro over 5 years ago
From a recent occurrence of the rbd unmap hang:
/proc/206355/stat
206355 (rbd) R 206346 206346 201944 34817 206346 4210944 1371 0 0 0 8478 89583 0 0 20 0 4 0 229360826 311939072 5155 18446744073709551615 94189238824960 94189240738466 140729136559584 0 0 0 4096 0 1098908904 0 0 0 17 9 0 0 0 0 0 94189242836800 94189242861856 94189251248128 140729136564190 140729136564210 140729136564210 140729136566251 0
This was the strace output, with the write constantly repeating:
206357 restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
206356 futex(0x55aa24eb2394, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
206358 restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
206355 write(7, "0", 1) = 0
206355 write(7, "0", 1) = 0
206355 write(7, "0", 1) = 0
206355 write(7, "0", 1) = 0
206355 write(7, "0", 1) = 0
206355 write(7, "0", 1) = 0
206355 write(7, "0", 1) = 0
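For context on why this shows up as a busy loop: write(2) returning 0 means nothing was written, and a typical userspace retry wrapper will simply try again. Here is a minimal sketch of that common pattern (not the actual rbd source), which spins forever if the kernel keeps returning 0:

#include <unistd.h>

/* Hypothetical "retry on short write" wrapper. If the kernel handler
 * returns 0 on every call, off never advances and this loop never
 * terminates -- matching the strace output above. */
ssize_t write_full(int fd, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n < 0)
            return -1;  /* real error: give up */
        off += n;       /* n == 0 makes no progress */
    }
    return (ssize_t)off;
}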
Updated by Ilya Dryomov over 5 years ago
I think I know what's causing the write loop. I'll prepare a patch next week.
Updated by Cliff Pajaro over 5 years ago
Thanks, Ilya, for the reply. It seems like you have something to address the repeated writes. Will this address why the sysfs file is not writable in the first place, or is this only addressing the repeated write attempts?
Also, I realized you asked for the stack and dmesg output.
stack:
[<ffffffff811062a8>] __synchronize_srcu+0x98/0xa0
[<ffffffff811ce512>] bdi_unregister+0xd2/0x130
[<ffffffff813b2d22>] del_gendisk+0x152/0x220
[<ffffffffa019fc5c>] do_rbd_remove.isra.17+0xfc/0x250 [rbd]
[<ffffffff8129628f>] kernfs_fop_write+0x10f/0x190
[<ffffffff81225dd6>] __vfs_write+0x26/0x190
[<ffffffff812260e1>] vfs_write+0xb1/0x1a0
[<ffffffff812262f2>] SyS_write+0x42/0xa0
[<ffffffff81001bd7>] do_syscall_64+0x67/0x120
[<ffffffff81a00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
dmesg:
2019-01-03T16:17:54.956205+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510284] lshw D 0 200344 1 0x00000004
2019-01-03T16:17:54.956220+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510285] Call Trace:
2019-01-03T16:17:54.956222+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510287] ? __schedule+0x1ee/0x7b0
2019-01-03T16:17:54.956224+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510288] schedule+0x28/0x80
2019-01-03T16:17:54.956225+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510289] schedule_preempt_disabled+0xa/0x10
2019-01-03T16:17:54.956225+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510290] __mutex_lock.isra.0+0x269/0x4b0
2019-01-03T16:17:54.956226+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510292] ? usb_device_read+0x71/0x140
2019-01-03T16:17:54.956227+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510293] usb_device_read+0x71/0x140
2019-01-03T16:17:54.956228+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510294] ? _copy_to_user+0x22/0x30
2019-01-03T16:17:54.956229+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510296] full_proxy_read+0x54/0x90
2019-01-03T16:17:54.956230+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510298] __vfs_read+0x26/0x190
2019-01-03T16:17:54.956231+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510300] vfs_read+0x91/0x130
2019-01-03T16:17:54.956232+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510301] SyS_read+0x42/0xa0
2019-01-03T16:17:54.956233+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510303] do_syscall_64+0x67/0x120
2019-01-03T16:17:54.956234+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510304] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The above was repeated several times.
Updated by Ilya Dryomov over 5 years ago
[PATCH] rbd: don't return 0 on unmap if RBD_DEV_FLAG_REMOVING is set
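Going by the patch title, the change presumably makes the unmap handler return an error instead of 0 when the REMOVING flag is already set. A sketch of that shape (names and surrounding code are approximations, not the verbatim diff):

/* Sketch only: approximates the logic implied by the patch title. */
static ssize_t do_rbd_remove(struct bus_type *bus, const char *buf,
                             size_t count)
{
    struct rbd_device *rbd_dev;  /* looked up from the written id */
    bool force;                  /* parsed from buf in the real code */
    int ret = 0;

    spin_lock_irq(&rbd_dev->lock);
    if (rbd_dev->open_count && !force)
        ret = -EBUSY;
    else if (test_and_set_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags))
        ret = -EINPROGRESS;  /* old code effectively returned 0 here */
    spin_unlock_irq(&rbd_dev->lock);
    if (ret)
        return ret;

    /* ... actual teardown: del_gendisk(), bdi teardown, etc. ... */
    return count;  /* success: report the whole write as consumed */
}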
Updated by Ilya Dryomov over 5 years ago
Cliff Pajaro wrote:
Thanks, Ilya, for the reply. It seems like you have something to address the repeated writes. Will this address why the sysfs file is not writable in the first place, or is this only addressing the repeated write attempts?
Only the repeated write attempts. What I assume is happening is the initial "rbd unmap" takes a while (for reasons we have yet to identify) and then a repeated "rbd unmap" for the same device launched by either a human or some automation starts repeating writes instead of returning "Operation now in progress" right away.
Also, I realized you asked for the stack and dmesg output.
stack:
[<ffffffff811062a8>] __synchronize_srcu+0x98/0xa0
[<ffffffff811ce512>] bdi_unregister+0xd2/0x130
[<ffffffff813b2d22>] del_gendisk+0x152/0x220
[<ffffffffa019fc5c>] do_rbd_remove.isra.17+0xfc/0x250 [rbd]
[<ffffffff8129628f>] kernfs_fop_write+0x10f/0x190
[<ffffffff81225dd6>] __vfs_write+0x26/0x190
[<ffffffff812260e1>] vfs_write+0xb1/0x1a0
[<ffffffff812262f2>] SyS_write+0x42/0xa0
[<ffffffff81001bd7>] do_syscall_64+0x67/0x120
[<ffffffff81a00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
What do you mean by not-writable? This trace shows that the write did its job -- the handler was invoked and the unmap was initiated. Now, it didn't return from the write because as part of tearing down the bdi it seems to be waiting for the RCU grace period. Are you seeing it sit in this state for days?
Updated by Cliff Pajaro over 5 years ago
Yes, I still have an unmap from last year:
cpajaro@ams3 ~ $ sudo cat /proc/43072/stack
[<ffffffff811062a8>] __synchronize_srcu+0x98/0xa0
[<ffffffff811ce512>] bdi_unregister+0xd2/0x130
[<ffffffff813b2d22>] del_gendisk+0x152/0x220
[<ffffffffa0162c5c>] do_rbd_remove.isra.17+0xfc/0x250 [rbd]
[<ffffffff8129628f>] kernfs_fop_write+0x10f/0x190
[<ffffffff81225dd6>] __vfs_write+0x26/0x190
[<ffffffff812260e1>] vfs_write+0xb1/0x1a0
[<ffffffff812262f2>] SyS_write+0x42/0xa0
[<ffffffff81001bd7>] do_syscall_64+0x67/0x120
[<ffffffff81a00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
cpajaro@ams3 ~ $ ps -aux | grep 43072
root 43072 0.0 0.0 304628 20572 ? D 2018 0:04 /usr/bin/rbd unmap /dev/rbd1
Updated by Ilya Dryomov over 5 years ago
Hrm, there must be something holding up the SRCU read-side critical section. Can you dump all blocked tasks with "echo w >/proc/sysrq-trigger" on that node?
Updated by Cliff Pajaro over 5 years ago
I posted this above in #13, but I've now attached a file with the entirety of it.
Updated by Ilya Dryomov over 5 years ago
I think you are hitting a variant of https://lore.kernel.org/patchwork/patch/840444/, except what is being removed is an rbd device instead of a usb device. dmesg shows a lot of "lshw" processes stuck on a mutex in usb_device_read() while holding up SRCU. All "rbd unmap" attempts are waiting for SRCU grace period to complete in __synchronize_srcu().
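Schematically, the deadlock looks like this (a simplified sketch of the 4.14 code paths; exact lock names are approximations):

/*
 * lshw, reading a USB debugfs file through the debugfs proxy:
 *   full_proxy_read()
 *     srcu_read_lock(&debugfs_srcu);      // enters the SRCU read side
 *     usb_device_read()
 *       mutex_lock(...);                  // sleeps here indefinitely
 *     srcu_read_unlock(&debugfs_srcu);    // never reached
 *
 * rbd unmap, tearing down the device:
 *   do_rbd_remove() -> del_gendisk() -> bdi_unregister()
 *     -> debugfs file removal
 *        synchronize_srcu(&debugfs_srcu); // must wait for every reader,
 *                                         // including the stuck lshw
 */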
It looks like this deadlock was fixed in e9117a5a4bf6 ("debugfs: implement per-file removal protection") and subsequent commits in 4.15 -- the global debugfs_srcu is gone. Can you try upgrading your kernel and see if that helps?
Updated by Ilya Dryomov over 5 years ago
The repeated write attempts issue is now fixed: