Project

General

Profile

Actions

Bug #36747

closed

RBD Unmap Hang

Added by Cliff Pajaro over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
rbd
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

In the last month, we've seen a small subset of our hosts with RBD clients experience hangs with this command:
rbd unmap /dev/rbdX

The command hangs indefinitely and have seen as long as 3 weeks. Rebooting the system is our only recourse at removing the mapped RBDs.

Tracing and debugging has shown repeated attempts to write the RBD device number to /sys/bus/rbd/remove_single_major.

Here is a trace an SRE captured when they noticed high CPU utilization on a system (they did not yet realize the RBD unmap hang issue).

kernel: [5878242.026440] sysrq: SysRq : Show backtrace of all active CPUs
kernel: [5878242.032600] NMI backtrace for cpu 7
kernel: [5878242.032602] CPU: 7 PID: 35062 Comm: bash Tainted: G           O    4.14.61-vanilla-ams-7 #1
kernel: [5878242.032603] Hardware name: Quanta Cloud Technology Inc. L08A-AMS/L08A-AMS-MB, BIOS S1Q_3A08.NEO03 10/18/2016
kernel: [5878242.032604] Call Trace:
kernel: [5878242.032611]  dump_stack+0x5c/0x78
kernel: [5878242.032614]  nmi_cpu_backtrace+0xb3/0xc0
kernel: [5878242.032618]  ? irq_force_complete_move+0xe0/0xe0
kernel: [5878242.032620]  nmi_trigger_cpumask_backtrace+0x8d/0xc0
kernel: [5878242.032624]  __handle_sysrq+0x84/0x130
kernel: [5878242.032626]  write_sysrq_trigger+0x2b/0x30
kernel: [5878242.032629]  proc_reg_write+0x40/0x70
kernel: [5878242.032632]  __vfs_write+0x26/0x190
kernel: [5878242.032634]  vfs_write+0xb1/0x1a0
kernel: [5878242.032636]  SyS_write+0x42/0xa0
kernel: [5878242.032639]  do_syscall_64+0x67/0x120
kernel: [5878242.032642]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
kernel: [5878242.032644] RIP: 0033:0x7f8ece5c5bb8
kernel: [5878242.032645] RSP: 002b:00007ffc1e7471f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: [5878242.032647] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f8ece5c5bb8
kernel: [5878242.032647] RDX: 0000000000000002 RSI: 0000558131c0d1c0 RDI: 0000000000000001
kernel: [5878242.032648] RBP: 0000558131c0d1c0 R08: 000000000000000a R09: 0000558131d0cc70
kernel: [5878242.032649] R10: 000000000000009b R11: 0000000000000246 R12: 0000000000000002
kernel: [5878242.032650] R13: 0000000000000001 R14: 00007f8ece898740 R15: 0000000000000002
kernel: [5878242.032652] Sending NMI from CPU 7 to CPUs 0-6,8-11:
kernel: [5878242.032657] NMI backtrace for cpu 0
kernel: [5878242.032659] CPU: 0 PID: 184436 Comm: kworker/0:3 Tainted: G           O    4.14.61-vanilla-ams-7 #1
kernel: [5878242.032660] Hardware name: Quanta Cloud Technology Inc. L08A-AMS/L08A-AMS-MB, BIOS S1Q_3A08.NEO03 10/18/2016
kernel: [5878242.032664] Workqueue: events_power_efficient process_srcu
kernel: [5878242.032665] task: ffff880287a8d100 task.stack: ffffc9002d8a8000
kernel: [5878242.032669] RIP: 0010:set_work_pool_and_clear_pending+0x19/0x20
kernel: [5878242.032669] RSP: 0018:ffffc9002d8abe98 EFLAGS: 00000002
kernel: [5878242.032670] RAX: ffff88085f224405 RBX: ffffffff822b78c0 RCX: ffff88085f220320
kernel: [5878242.032671] RDX: ffff88085f220320 RSI: 0000000000000000 RDI: ffffffff822b78c0
kernel: [5878242.032672] RBP: ffff88085f224400 R08: 0000000000000001 R09: ffff88085f220300
kernel: [5878242.032672] R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000000
kernel: [5878242.032673] R13: 0000000000000000 R14: ffff88074f71c300 R15: ffff88085f220300
kernel: [5878242.032674] FS:  0000000000000000(0000) GS:ffff88085f200000(0000) knlGS:0000000000000000
kernel: [5878242.032675] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [5878242.032676] CR2: 00007f32dc9f6638 CR3: 000000000220a006 CR4: 00000000003606f0
kernel: [5878242.032676] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [5878242.032677] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: [5878242.032677] Call Trace:
kernel: [5878242.032681]  process_one_work+0x19a/0x3f0
kernel: [5878242.032684]  worker_thread+0x2b/0x3d0
kernel: [5878242.032686]  ? process_one_work+0x3f0/0x3f0
kernel: [5878242.032687]  kthread+0x119/0x130
kernel: [5878242.032689]  ? kthread_create_on_node+0x40/0x40
kernel: [5878242.032691]  ret_from_fork+0x35/0x40
kernel: [5878242.032692] Code: 8b 43 20 e8 9a 17 b4 00 48 89 43 30 5b c3 0f 1f 40 00 0f 1f 44 00 00 48 8b 07 48 63 f6 48 c1 e6 05 a8 01 74 07 48 89 37 0f ae f0 <c3> 0f 0b eb f5 66 90 0f 1f 44 00 00 83 ff 20 75 07 65 8b 3d 87 
kernel: [5878242.032757] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.032858] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.032959] NMI backtrace for cpu 3 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033069] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033093] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033168] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033268] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033368] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033497] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff818864f5
kernel: [5878242.033591] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff818864f5

There are around 40 RBDs on a system. In the situations we've examined closely, the first unmap command to hang is at least minutes away from the previous/next nearest unmap command. Ceph cluster seems to be in a healthy state at the time of the first unmap hang but health granularity is not that fine.

# lsof /sys/bus/rbd/remove_single_major
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
rbd        497 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd       2030 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd       3705 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      12766 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      17817 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      22987 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      24851 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      26582 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      28278 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      29340 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      32297 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      33880 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd      38617 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     155721 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     167310 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
bash    168901 root    1w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     170086 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     196590 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     203939 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     206025 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     211412 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     219181 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     222116 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     224584 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     229789 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     231042 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     232167 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     232645 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     239826 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     241864 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     243141 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     246922 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     248469 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     249450 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     250723 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     253133 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     253557 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major
rbd     254352 root    7w   REG   0,19     4096 53892 /sys/bus/rbd/remove_single_major

Guidance to debug, insights into anything else to look at, and suggestions for recovery methods are appreciated.


Files

dmesg.txt (256 KB) dmesg.txt Cliff Pajaro, 01/10/2019 05:19 PM
Actions

Also available in: Atom PDF