Bug #36747

closed

RBD Unmap Hang

Added by Cliff Pajaro over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: rbd
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

In the last month, we've seen a small subset of our hosts with RBD clients experience hangs with this command:
rbd unmap /dev/rbdX

The command hangs indefinitely; we have seen hangs last as long as 3 weeks. Rebooting the system is our only recourse for removing the mapped RBDs.

Tracing and debugging has shown repeated attempts to write the RBD device number to /sys/bus/rbd/remove_single_major.

Here is a trace an SRE captured when they noticed high CPU utilization on a system (at that point they had not yet realized there was an RBD unmap hang issue).

kernel: [5878242.026440] sysrq: SysRq : Show backtrace of all active CPUs
kernel: [5878242.032600] NMI backtrace for cpu 7
kernel: [5878242.032602] CPU: 7 PID: 35062 Comm: bash Tainted: G           O    4.14.61-vanilla-ams-7 #1
kernel: [5878242.032603] Hardware name: Quanta Cloud Technology Inc. L08A-AMS/L08A-AMS-MB, BIOS S1Q_3A08.NEO03 10/18/2016
kernel: [5878242.032604] Call Trace:
kernel: [5878242.032611]  dump_stack+0x5c/0x78
kernel: [5878242.032614]  nmi_cpu_backtrace+0xb3/0xc0
kernel: [5878242.032618]  ? irq_force_complete_move+0xe0/0xe0
kernel: [5878242.032620]  nmi_trigger_cpumask_backtrace+0x8d/0xc0
kernel: [5878242.032624]  __handle_sysrq+0x84/0x130
kernel: [5878242.032626]  write_sysrq_trigger+0x2b/0x30
kernel: [5878242.032629]  proc_reg_write+0x40/0x70
kernel: [5878242.032632]  __vfs_write+0x26/0x190
kernel: [5878242.032634]  vfs_write+0xb1/0x1a0
kernel: [5878242.032636]  SyS_write+0x42/0xa0
kernel: [5878242.032639]  do_syscall_64+0x67/0x120
kernel: [5878242.032642]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
kernel: [5878242.032644] RIP: 0033:0x7f8ece5c5bb8
kernel: [5878242.032645] RSP: 002b:00007ffc1e7471f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: [5878242.032647] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f8ece5c5bb8
kernel: [5878242.032647] RDX: 0000000000000002 RSI: 0000558131c0d1c0 RDI: 0000000000000001
kernel: [5878242.032648] RBP: 0000558131c0d1c0 R08: 000000000000000a R09: 0000558131d0cc70
kernel: [5878242.032649] R10: 000000000000009b R11: 0000000000000246 R12: 0000000000000002
kernel: [5878242.032650] R13: 0000000000000001 R14: 00007f8ece898740 R15: 0000000000000002
kernel: [5878242.032652] Sending NMI from CPU 7 to CPUs 0-6,8-11:
kernel: [5878242.032657] NMI backtrace for cpu 0
kernel: [5878242.032659] CPU: 0 PID: 184436 Comm: kworker/0:3 Tainted: G           O    4.14.61-vanilla-ams-7 #1
kernel: [5878242.032660] Hardware name: Quanta Cloud Technology Inc. L08A-AMS/L08A-AMS-MB, BIOS S1Q_3A08.NEO03 10/18/2016
kernel: [5878242.032664] Workqueue: events_power_efficient process_srcu
kernel: [5878242.032665] task: ffff880287a8d100 task.stack: ffffc9002d8a8000
kernel: [5878242.032669] RIP: 0010:set_work_pool_and_clear_pending+0x19/0x20
kernel: [5878242.032669] RSP: 0018:ffffc9002d8abe98 EFLAGS: 00000002
kernel: [5878242.032670] RAX: ffff88085f224405 RBX: ffffffff822b78c0 RCX: ffff88085f220320
kernel: [5878242.032671] RDX: ffff88085f220320 RSI: 0000000000000000 RDI: ffffffff822b78c0
kernel: [5878242.032672] RBP: ffff88085f224400 R08: 0000000000000001 R09: ffff88085f220300
kernel: [5878242.032672] R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000000
kernel: [5878242.032673] R13: 0000000000000000 R14: ffff88074f71c300 R15: ffff88085f220300
kernel: [5878242.032674] FS:  0000000000000000(0000) GS:ffff88085f200000(0000) knlGS:0000000000000000
kernel: [5878242.032675] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [5878242.032676] CR2: 00007f32dc9f6638 CR3: 000000000220a006 CR4: 00000000003606f0
kernel: [5878242.032676] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [5878242.032677] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: [5878242.032677] Call Trace:
kernel: [5878242.032681]  process_one_work+0x19a/0x3f0
kernel: [5878242.032684]  worker_thread+0x2b/0x3d0
kernel: [5878242.032686]  ? process_one_work+0x3f0/0x3f0
kernel: [5878242.032687]  kthread+0x119/0x130
kernel: [5878242.032689]  ? kthread_create_on_node+0x40/0x40
kernel: [5878242.032691]  ret_from_fork+0x35/0x40
kernel: [5878242.032692] Code: 8b 43 20 e8 9a 17 b4 00 48 89 43 30 5b c3 0f 1f 40 00 0f 1f 44 00 00 48 8b 07 48 63 f6 48 c1 e6 05 a8 01 74 07 48 89 37 0f ae f0 <c3> 0f 0b eb f5 66 90 0f 1f 44 00 00 83 ff 20 75 07 65 8b 3d 87 
kernel: [5878242.032757] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.032858] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.032959] NMI backtrace for cpu 3 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033069] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033093] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033168] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033268] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033368] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033497] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033591] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff818864f5

There are around 40 RBDs on a system. In the situations we've examined closely, the first unmap command to hang is at least several minutes away from the nearest preceding or following unmap command. The Ceph cluster appears to be healthy at the time of the first unmap hang, but our health monitoring is not granular enough to say for certain.

# lsof /sys/bus/rbd/remove_single_major
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
rbd        497 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd       2030 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd       3705 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      12766 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      17817 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      22987 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      24851 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      26582 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      28278 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      29340 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      32297 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      33880 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      38617 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     155721 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     167310 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
bash    168901 root    1w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     170086 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     196590 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     203939 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     206025 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     211412 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     219181 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     222116 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     224584 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     229789 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     231042 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     232167 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     232645 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     239826 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     241864 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     243141 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     246922 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     248469 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     249450 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     250723 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     253133 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     253557 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     254352 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major

Guidance on debugging, insights into anything else to look at, and suggestions for recovery methods would be appreciated.


Files

dmesg.txt (256 KB) - Cliff Pajaro, 01/10/2019 05:19 PM
Actions #1

Updated by Josh Durgin over 5 years ago

  • Project changed from RADOS to rbd
Actions #2

Updated by Jason Dillaman over 5 years ago

  • Project changed from rbd to Linux kernel client
Actions #3

Updated by Ilya Dryomov over 5 years ago

  • Category set to rbd
  • Assignee set to Ilya Dryomov

A bunch of blocked "rbd unmap" processes is unlikely to be the cause of high CPU utilization. They appear to be blocked on something in the kernel.

Try dumping the kernel stack with "cat /proc/<rbd unmap pid>/stack" or "echo w >/proc/sysrq-trigger".

Actions #4

Updated by Cliff Pajaro over 5 years ago

I have dumped it and it is repeatedly attempting to write the device block number to /sys/bus/rbd/remove_single_major.

Actions #5

Updated by Ilya Dryomov over 5 years ago

What do you mean by repeatedly? Do you see the same "rbd unmap" process calling write(2) multiple times as opposed to being blocked after calling write(2) just once?

Can you paste "cat /proc/<rbd unmap pid>/stack" output?

Actions #6

Updated by Cliff Pajaro over 5 years ago

I see the write occurring multiple times a second. I do not have a system in this state to get the output.

Actions #7

Updated by Ilya Dryomov over 5 years ago

Cliff Pajaro wrote:

I see the write occurring multiple times a second.

Just clarifying, did you see the same "rbd unmap" process attempting that write multiple times a second? If so, how did you observe it?

Actions #8

Updated by Cliff Pajaro over 5 years ago

I attached gdb to an unmap process and it showed that write operation being repeated (apologies, I can't find the logs I recorded of this). All unmap processes were stuck doing this; we have a maximum of ~40 RBDs on a single host, and by that point all of them were unmapping.

Actions #9

Updated by Ilya Dryomov over 5 years ago

gdb shows the userspace stack, so you never saw the kernel stack. If it happens again, do "cat /proc/<rbd unmap pid>/stack" and "echo w >/proc/sysrq-trigger" and paste the output (the latter outputs to dmesg).

Actions #10

Updated by Cliff Pajaro over 5 years ago

Clarification: I ran "strace -o strace_out.log -fF -s2048 <pid> >>strace_out.log", which showed the repeated write attempts. I will capture that output the next time I see this happen.

Actions #11

Updated by Cliff Pajaro over 5 years ago

From a recent occurrence of the rbd unmap hang:
/proc/206355/stat

206355 (rbd) R 206346 206346 201944 34817 206346 4210944 1371 0 0 0 8478 89583 0 0 20 0 4 0 229360826 311939072 5155 18446744073709551615 94189238824960 94189240738466 140729136559584 0 0 0 4096 0 1098908904 0 0 0 17 9 0 0 0 0 0 94189242836800 94189242861856 94189251248128 140729136564190 140729136564210 140729136564210 140729136566251 0

This was the strace with the write constantly repeating:

206357 restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
206356 futex(0x55aa24eb2394, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
206358 restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
206355 write(7, "0", 1)                 = 0
206355 write(7, "0", 1)                 = 0
206355 write(7, "0", 1)                 = 0
206355 write(7, "0", 1)                 = 0
206355 write(7, "0", 1)                 = 0
206355 write(7, "0", 1)                 = 0
206355 write(7, "0", 1)                 = 0

Actions #12

Updated by Ilya Dryomov over 5 years ago

I think I know what's causing the write loop. I'll prepare a patch next week.

Actions #13

Updated by Cliff Pajaro over 5 years ago

Thanks, Ilya, for the reply. It seems like you have something to address the repeated writes. Will this address why the sysfs file is not-writable in the first place or is this only addressing the repeated write attempts?

Also, I realized you asked for the stack and dmesg output.

stack:

[<ffffffff811062a8>] __synchronize_srcu+0x98/0xa0
[<ffffffff811ce512>] bdi_unregister+0xd2/0x130
[<ffffffff813b2d22>] del_gendisk+0x152/0x220
[<ffffffffa019fc5c>] do_rbd_remove.isra.17+0xfc/0x250 [rbd]
[<ffffffff8129628f>] kernfs_fop_write+0x10f/0x190
[<ffffffff81225dd6>] __vfs_write+0x26/0x190
[<ffffffff812260e1>] vfs_write+0xb1/0x1a0
[<ffffffff812262f2>] SyS_write+0x42/0xa0
[<ffffffff81001bd7>] do_syscall_64+0x67/0x120
[<ffffffff81a00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff

dmesg:

2019-01-03T16:17:54.956205+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510284] lshw            D    0 200344      1 0x00000004
2019-01-03T16:17:54.956220+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510285] Call Trace:
2019-01-03T16:17:54.956222+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510287]  ? __schedule+0x1ee/0x7b0
2019-01-03T16:17:54.956224+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510288]  schedule+0x28/0x80
2019-01-03T16:17:54.956225+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510289]  schedule_preempt_disabled+0xa/0x10
2019-01-03T16:17:54.956225+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510290]  __mutex_lock.isra.0+0x269/0x4b0
2019-01-03T16:17:54.956226+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510292]  ? usb_device_read+0x71/0x140
2019-01-03T16:17:54.956227+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510293]  usb_device_read+0x71/0x140
2019-01-03T16:17:54.956228+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510294]  ? _copy_to_user+0x22/0x30
2019-01-03T16:17:54.956229+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510296]  full_proxy_read+0x54/0x90
2019-01-03T16:17:54.956230+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510298]  __vfs_read+0x26/0x190
2019-01-03T16:17:54.956231+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510300]  vfs_read+0x91/0x130
2019-01-03T16:17:54.956232+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510301]  SyS_read+0x42/0xa0
2019-01-03T16:17:54.956233+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510303]  do_syscall_64+0x67/0x120
2019-01-03T16:17:54.956234+00:00 ams3.r6c.04.atla.gaikai.net kernel: [2440017.510304]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The above was repeated several times.

Actions #14

Updated by Ilya Dryomov over 5 years ago

[PATCH] rbd: don't return 0 on unmap if RBD_DEV_FLAG_REMOVING is set
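
For illustration only -- a sketch of what the patch title implies, not the actual drivers/block/rbd.c change. A sysfs/bus "store" handler that returns 0 tells the VFS that zero bytes were consumed, so the writing process keeps retrying the same write(2). Returning an error such as -EINPROGRESS ("Operation now in progress") instead lets a second "rbd unmap" of the same device fail immediately:

/* Hedged sketch; names and locking are simplified. */
static ssize_t do_rbd_remove_sketch(struct rbd_device *rbd_dev,
                                    const char *buf, size_t count)
{
        if (test_and_set_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags))
                return -EINPROGRESS;    /* before the fix this path returned 0 */

        rbd_dev_remove(rbd_dev);        /* hypothetical teardown helper; this is
                                           where del_gendisk()/bdi_unregister()
                                           block in the stacks above */
        return count;                   /* report the whole write as consumed */
}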

Actions #15

Updated by Ilya Dryomov over 5 years ago

Cliff Pajaro wrote:

Thanks, Ilya, for the reply. It seems like you have something to address the repeated writes. Will this address why the sysfs file is not-writable in the first place or is this only addressing the repeated write attempts?

Only the repeated write attempts. What I assume is happening is that the initial "rbd unmap" takes a while (for reasons we have yet to identify), and then a repeated "rbd unmap" for the same device, launched by either a human or some automation, starts repeating writes instead of returning "Operation now in progress" right away.
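
To make the "repeating writes" part concrete, here is a minimal, self-contained model (not the actual rbd CLI code) of a standard "write everything" loop. It only stops on an error or once every byte is accounted for, so a sysfs write that keeps reporting 0 bytes consumed gets re-issued forever -- exactly the write(7, "0", 1) = 0 pattern in the strace in #11:

#include <unistd.h>

/* Typical full-write helper (illustrative). */
static int write_all(int fd, const char *buf, size_t len)
{
        size_t done = 0;

        while (done < len) {
                ssize_t n = write(fd, buf + done, len - done);
                if (n < 0)
                        return -1;      /* a real error ends the loop */
                done += (size_t)n;      /* n == 0: no progress, so the same
                                           1-byte write is issued again */
        }
        return 0;
}

int main(void)
{
        /* "rbd unmap /dev/rbd0" effectively boils down to writing the device
           id ("0" here) to /sys/bus/rbd/remove_single_major. */
        return write_all(STDOUT_FILENO, "0", 1);
}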

Also, I realized you asked for the stack and dmesg output.

stack:

[<ffffffff811062a8>] __synchronize_srcu+0x98/0xa0
[<ffffffff811ce512>] bdi_unregister+0xd2/0x130
[<ffffffff813b2d22>] del_gendisk+0x152/0x220
[<ffffffffa019fc5c>] do_rbd_remove.isra.17+0xfc/0x250 [rbd]
[<ffffffff8129628f>] kernfs_fop_write+0x10f/0x190
[<ffffffff81225dd6>] __vfs_write+0x26/0x190
[<ffffffff812260e1>] vfs_write+0xb1/0x1a0
[<ffffffff812262f2>] SyS_write+0x42/0xa0
[<ffffffff81001bd7>] do_syscall_64+0x67/0x120
[<ffffffff81a00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff

What do you mean by not-writable? This trace shows that the write did its job -- the handler was invoked and the unmap was initiated. Now, it didn't return from the write because, as part of tearing down the bdi, it seems to be waiting for the RCU grace period. Are you seeing it sit in this state for days?

Actions #16

Updated by Cliff Pajaro over 5 years ago

Yes, I still have an unmap from last year:

cpajaro@ams3 ~ $ sudo cat /proc/43072/stack
[<ffffffff811062a8>] __synchronize_srcu+0x98/0xa0
[<ffffffff811ce512>] bdi_unregister+0xd2/0x130
[<ffffffff813b2d22>] del_gendisk+0x152/0x220
[<ffffffffa0162c5c>] do_rbd_remove.isra.17+0xfc/0x250 [rbd]
[<ffffffff8129628f>] kernfs_fop_write+0x10f/0x190
[<ffffffff81225dd6>] __vfs_write+0x26/0x190
[<ffffffff812260e1>] vfs_write+0xb1/0x1a0
[<ffffffff812262f2>] SyS_write+0x42/0xa0
[<ffffffff81001bd7>] do_syscall_64+0x67/0x120
[<ffffffff81a00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
cpajaro@ams3 ~ $ ps -aux | grep 43072      
root      43072  0.0  0.0 304628 20572 ?        D     2018   0:04 /usr/bin/rbd unmap /dev/rbd1

Actions #17

Updated by Ilya Dryomov over 5 years ago

Hrm, there must be something holding up the SRCU read-side critical section. Can you dump all blocked tasks with "echo w >/proc/sysrq-trigger" on that node?

Actions #18

Updated by Cliff Pajaro over 5 years ago

I posted an excerpt of this above in #13, but I've just now attached a file with the entirety of it.

Actions #19

Updated by Ilya Dryomov over 5 years ago

I think you are hitting a variant of https://lore.kernel.org/patchwork/patch/840444/, except what is being removed is an rbd device instead of a usb device. The dmesg shows a lot of "lshw" processes stuck on a mutex in usb_device_read() while holding up SRCU. All "rbd unmap" attempts are waiting for the SRCU grace period to complete in __synchronize_srcu().

It looks like this deadlock was fixed in e9117a5a4bf6 ("debugfs: implement per-file removal protection") and subsequent commits in 4.15 -- the global debugfs_srcu is gone. Can you try upgrading your kernel and see if that helps?
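
A rough sketch of the deadlock shape (illustrative only; simplified from the pre-4.15 debugfs code, with a placeholder mutex standing in for whatever usb_device_read() blocks on). SRCU grace periods are per srcu_struct, so one reader stuck inside its read-side critical section stalls every synchronize_srcu() caller on that same srcu_struct, however unrelated:

/* Not the actual debugfs/usb/rbd code. */
DEFINE_STATIC_SRCU(debugfs_srcu);               /* pre-4.15: one global srcu_struct
                                                   shared by all debugfs files */
static DEFINE_MUTEX(placeholder_mutex);

static void lshw_reader(void)                   /* the stuck "lshw" processes */
{
        int idx = srcu_read_lock(&debugfs_srcu);

        mutex_lock(&placeholder_mutex);         /* blocks indefinitely inside
                                                   usb_device_read(), keeping the
                                                   SRCU read-side section open */
        /* ... */
        mutex_unlock(&placeholder_mutex);
        srcu_read_unlock(&debugfs_srcu, idx);
}

static void rbd_unmap_path(void)                /* del_gendisk() -> bdi_unregister() */
{
        /* The debugfs removal under bdi_unregister() must wait for ALL readers
           of debugfs_srcu to finish -- including the stuck lshw readers above,
           so "rbd unmap" hangs for as long as they do. */
        synchronize_srcu(&debugfs_srcu);
}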

Actions #21

Updated by Ilya Dryomov about 5 years ago

  • Status changed from New to Resolved