Bug #51136
Random hanging issues with rbd after network issues
Description
Hello,
We are experiencing random hang issues with our LXC containers, which use mapped RBD images.
We're using version 15.2.11 (from Proxmox, though I'm not sure that makes a difference). We upgraded from 14 around the time the issue started.
We're having random network glitches that cause OSD flapping. When that happens, the logs contain entries like:
Jun 8 00:19:24 srv005-de kernel: [1570086.718933] INFO: task filebeat:38133 blocked for more than 120 seconds.
Jun 8 00:19:24 srv005-de kernel: [1570086.719857] Tainted: P O 5.4.114-1-pve #1
Jun 8 00:19:24 srv005-de kernel: [1570086.720846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 8 00:19:24 srv005-de kernel: [1570086.721957] filebeat D 0 38133 33317 0x00000320
Jun 8 00:19:24 srv005-de kernel: [1570086.723136] Call Trace:
Jun 8 00:19:24 srv005-de kernel: [1570086.724312] __schedule+0x2e6/0x700
Jun 8 00:19:24 srv005-de kernel: [1570086.725517] ? bit_wait_timeout+0xa0/0xa0
Jun 8 00:19:24 srv005-de kernel: [1570086.726778] schedule+0x33/0xa0
Jun 8 00:19:24 srv005-de kernel: [1570086.728051] io_schedule+0x16/0x40
Jun 8 00:19:24 srv005-de kernel: [1570086.729350] bit_wait_io+0x11/0x50
Jun 8 00:19:24 srv005-de kernel: [1570086.730673] __wait_on_bit+0x33/0xa0
Jun 8 00:19:24 srv005-de kernel: [1570086.732014] out_of_line_wait_on_bit+0x90/0xb0
Jun 8 00:19:24 srv005-de kernel: [1570086.733417] ? var_wake_function+0x30/0x30
Jun 8 00:19:24 srv005-de kernel: [1570086.734848] __wait_on_buffer+0x32/0x40
Jun 8 00:19:24 srv005-de kernel: [1570086.736293] __ext4_find_entry+0x30c/0x450
Jun 8 00:19:24 srv005-de kernel: [1570086.737773] ? ext4_fname_prepare_lookup+0xcd/0x120
Jun 8 00:19:24 srv005-de kernel: [1570086.739315] ext4_lookup+0xf4/0x260
Jun 8 00:19:24 srv005-de kernel: [1570086.740849] path_openat+0x983/0x16f0
Jun 8 00:19:24 srv005-de kernel: [1570086.742408] do_filp_open+0x93/0x100
Jun 8 00:19:24 srv005-de kernel: [1570086.743989] ? __alloc_fd+0x46/0x150
Jun 8 00:19:24 srv005-de kernel: [1570086.745587] do_sys_open+0x177/0x280
Jun 8 00:19:24 srv005-de kernel: [1570086.747202] __x64_sys_openat+0x20/0x30
Jun 8 00:19:24 srv005-de kernel: [1570086.748835] do_syscall_64+0x57/0x190
Jun 8 00:19:24 srv005-de kernel: [1570086.750503] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 8 00:19:24 srv005-de kernel: [1570086.752215] RIP: 0033:0x47523a
Jun 8 00:19:24 srv005-de kernel: [1570086.753950] Code: Bad RIP value.
Jun 8 00:19:24 srv005-de kernel: [1570086.755720] RSP: 002b:000000c420f12bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
Jun 8 00:19:24 srv005-de kernel: [1570086.757618] RAX: ffffffffffffffda RBX: 000000c42001750c RCX: 000000000047523a
Jun 8 00:19:24 srv005-de kernel: [1570086.759599] RDX: 0000000000181242 RSI: 000000c4204e09e0 RDI: ffffffffffffff9c
Jun 8 00:19:24 srv005-de kernel: [1570086.761614] RBP: 000000c420f12c48 R08: 0000000000000000 R09: 0000000000000000
Jun 8 00:19:24 srv005-de kernel: [1570086.763695] R10: 0000000000000180 R11: 0000000000000202 R12: 0000000000000020
Jun 8 00:19:24 srv005-de kernel: [1570086.765773] R13: 00000000000009e0 R14: 0000000000000054 R15: 0000000000000000
Jun 8 00:19:24 srv005-de kernel: [1570086.772692] INFO: task Grafana Syncer :174259 blocked for more than 120 seconds.
Jun 8 00:19:24 srv005-de kernel: [1570086.774912] Tainted: P O 5.4.114-1-pve #1
Jun 8 00:19:24 srv005-de kernel: [1570086.777188] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 8 00:19:24 srv005-de kernel: [1570086.779603] Grafana Syncer D 0 174259 156131 0x00000320
Jun 8 00:19:24 srv005-de kernel: [1570086.782030] Call Trace:
Jun 8 00:19:24 srv005-de kernel: [1570086.784377] __schedule+0x2e6/0x700
Jun 8 00:19:24 srv005-de kernel: [1570086.786700] schedule+0x33/0xa0
Jun 8 00:19:24 srv005-de kernel: [1570086.788950] io_schedule+0x16/0x40
Jun 8 00:19:24 srv005-de kernel: [1570086.791134] wait_on_page_bit+0x141/0x210
Jun 8 00:19:24 srv005-de kernel: [1570086.793307] ? file_fdatawait_range+0x30/0x30
Jun 8 00:19:24 srv005-de kernel: [1570086.795499] wait_on_page_writeback+0x43/0x90
Jun 8 00:19:24 srv005-de kernel: [1570086.797682] wait_for_stable_page+0x45/0x60
Jun 8 00:19:24 srv005-de kernel: [1570086.799855] grab_cache_page_write_begin+0x30/0x40
Jun 8 00:19:24 srv005-de kernel: [1570086.802237] ext4_da_write_begin+0x11a/0x460
Jun 8 00:19:24 srv005-de kernel: [1570086.804467] generic_perform_write+0xf2/0x1b0
Jun 8 00:19:24 srv005-de kernel: [1570086.806714] ? file_update_time+0xed/0x130
Jun 8 00:19:24 srv005-de kernel: [1570086.808929] __generic_file_write_iter+0x101/0x1f0
Jun 8 00:19:24 srv005-de kernel: [1570086.811185] ext4_file_write_iter+0xb9/0x360
Jun 8 00:19:24 srv005-de kernel: [1570086.813422] new_sync_write+0x125/0x1c0
Jun 8 00:19:24 srv005-de kernel: [1570086.815662] __vfs_write+0x29/0x40
Jun 8 00:19:24 srv005-de kernel: [1570086.817877] vfs_write+0xab/0x1b0
Jun 8 00:19:24 srv005-de kernel: [1570086.820081] ksys_write+0x61/0xe0
Jun 8 00:19:24 srv005-de kernel: [1570086.822281] __x64_sys_write+0x1a/0x20
Jun 8 00:19:24 srv005-de kernel: [1570086.824496] do_syscall_64+0x57/0x190
Jun 8 00:19:24 srv005-de kernel: [1570086.826722] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 8 00:19:24 srv005-de kernel: [1570086.828989] RIP: 0033:0x7f88fb01f2cf
Jun 8 00:19:24 srv005-de kernel: [1570086.831259] Code: Bad RIP value.
Jun 8 00:19:24 srv005-de kernel: [1570086.833518] RSP: 002b:00007f8855fbc020 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Jun 8 00:19:24 srv005-de kernel: [1570086.835904] RAX: ffffffffffffffda RBX: 000000000000005a RCX: 00007f88fb01f2cf
Jun 8 00:19:24 srv005-de kernel: [1570086.838284] RDX: 000000000000005a RSI: 00007f8868608430 RDI: 00000000000002cf
Jun 8 00:19:24 srv005-de kernel: [1570086.840699] RBP: 00007f8868608430 R08: 0000000000000000 R09: 0000000000000058
Jun 8 00:19:24 srv005-de kernel: [1570086.843070] R10: 00007f889d1f2162 R11: 0000000000000293 R12: 00007f88680b4118
Jun 8 00:19:24 srv005-de kernel: [1570086.845390] R13: 00007f88fb01cbb0 R14: 00007f88680b40d0 R15: 00007f88680b4118
Jun 8 00:19:25 srv005-de kernel: [1570087.592647] libceph: osd7 down
Jun 8 00:19:26 srv005-de kernel: [1570088.592125] libceph: osd9 down
Jun 8 00:19:26 srv005-de kernel: [1570088.594368] libceph: osd10 down
Jun 8 00:19:27 srv005-de kernel: [1570089.598545] libceph: osd17 down
Jun 8 00:19:27 srv005-de kernel: [1570089.600711] libceph: osd9 up
Jun 8 00:19:27 srv005-de kernel: [1570089.602817] libceph: osd10 up
Jun 8 00:19:28 srv005-de kernel: [1570090.600495] libceph: osd17 up
Jun 8 00:19:30 srv005-de kernel: [1570092.624549] libceph: osd7 up
Jun 8 00:19:35 srv005-de kernel: [1570097.648839] libceph: osd14 down
Jun 8 00:19:37 srv005-de kernel: [1570099.666524] libceph: osd14 up
Jun 8 00:19:38 srv005-de kernel: [1570100.669367] libceph: osd4 down
Jun 8 00:19:38 srv005-de kernel: [1570100.671171] libceph: osd18 down
Jun 8 00:19:39 srv005-de kernel: [1570101.672803] libceph: osd18 up
Jun 8 00:19:41 srv005-de kernel: [1570103.805108] libceph: osd15 down
Jun 8 00:19:42 srv005-de kernel: [1570104.811184] libceph: osd4 up
Jun 8 00:19:44 srv005-de kernel: [1570106.828069] libceph: osd16 down
Jun 8 00:19:44 srv005-de kernel: [1570106.829714] libceph: osd15 up
Jun 8 00:19:45 srv005-de kernel: [1570107.830437] libceph: osd16 up
Jun 8 00:19:51 srv005-de kernel: [1570113.872510] libceph: osd17 down
Jun 8 00:19:53 srv005-de kernel: [1570115.885757] libceph: osd17 up
Jun 8 00:19:54 srv005-de kernel: [1570116.891014] libceph: osd10 down
Jun 8 00:19:54 srv005-de kernel: [1570116.892497] libceph: osd13 down
Jun 8 00:19:55 srv005-de kernel: [1570117.894696] libceph: osd10 up
Jun 8 00:19:55 srv005-de kernel: [1570117.896132] libceph: osd13 up
Jun 8 00:20:01 srv005-de kernel: [1570123.928539] libceph: osd3 down
Jun 8 00:20:01 srv005-de kernel: [1570123.929883] libceph: osd9 down
Jun 8 00:20:02 srv005-de kernel: [1570124.935188] libceph: osd6 down
Jun 8 00:20:02 srv005-de kernel: [1570124.936471] libceph: osd3 up
Jun 8 00:20:03 srv005-de kernel: [1570125.951571] libceph: osd6 up
Jun 8 00:20:05 srv005-de kernel: [1570127.812051] libceph: osd7 down
Jun 8 00:20:05 srv005-de kernel: [1570127.813229] libceph: osd9 up
Jun 8 00:20:07 srv005-de kernel: [1570129.967049] libceph: osd18 down
Jun 8 00:20:09 srv005-de kernel: [1570131.981916] libceph: osd7 up
Jun 8 00:20:11 srv005-de kernel: [1570133.993845] libceph: osd18 up
Jun 8 00:21:00 srv005-de kernel: [1570182.974037] libceph: osd0 (1)10.6.64.1:6806 socket closed (con state CONNECTING)
Jun 8 03:49:44 srv005-de kernel: [1582706.900767] libceph: osd7 down
Jun 8 03:49:44 srv005-de kernel: [1582706.901784] libceph: osd9 down
Jun 8 03:49:44 srv005-de kernel: [1582707.398780] libceph: osd1 down
Jun 8 03:49:44 srv005-de kernel: [1582707.399778] libceph: osd2 down
Jun 8 03:49:44 srv005-de kernel: [1582707.400695] libceph: osd10 down
Jun 8 03:49:45 srv005-de kernel: [1582708.401249] libceph: osd8 down
Jun 8 03:49:45 srv005-de kernel: [1582708.402242] libceph: osd1 up
Jun 8 03:49:45 srv005-de kernel: [1582708.403182] libceph: osd2 up
Jun 8 03:49:45 srv005-de kernel: [1582708.404122] libceph: osd7 up
Jun 8 03:49:45 srv005-de kernel: [1582708.405076] libceph: osd10 up
Jun 8 03:49:46 srv005-de kernel: [1582709.400748] libceph: osd11 down
Jun 8 03:49:46 srv005-de kernel: [1582709.401659] libceph: osd8 up
Jun 8 03:49:47 srv005-de kernel: [1582710.407863] libceph: osd9 up
Jun 8 03:49:47 srv005-de kernel: [1582710.408796] libceph: osd11 up
Jun 8 03:50:14 srv005-de kernel: [1582737.151699] libceph: osd0 (1)10.6.64.1:6806 socket closed (con state OPEN)
Jun 8 03:52:06 srv005-de kernel: [1582849.481279] libceph: osd9 down
Jun 8 03:52:06 srv005-de kernel: [1582849.482320] libceph: osd15 down
Jun 8 03:52:08 srv005-de kernel: [1582851.377054] rbd: rbd11: encountered watch error: -107
Jun 8 03:52:09 srv005-de kernel: [1582852.056647] libceph: osd17 down
Jun 8 03:52:10 srv005-de kernel: [1582853.063856] libceph: osd9 up
Jun 8 03:52:11 srv005-de kernel: [1582854.068605] libceph: osd15 up
Jun 8 03:52:13 srv005-de kernel: [1582856.088847] libceph: osd17 up
Jun 8 04:16:09 srv005-de kernel: [1584292.114088] libceph: osd3 down
Jun 8 04:16:09 srv005-de kernel: [1584292.115238] libceph: osd16 down
Jun 8 04:16:09 srv005-de kernel: [1584292.116408] libceph: osd18 down
Jun 8 04:16:10 srv005-de kernel: [1584293.118181] libceph: osd3 up
Jun 8 04:16:10 srv005-de kernel: [1584293.119319] libceph: osd18 up
Jun 8 04:16:12 srv005-de kernel: [1584295.129276] libceph: osd16 up
Jun 8 04:16:12 srv005-de kernel: [1584295.205633] rbd: rbd11: encountered watch error: -107
Jun 8 04:16:16 srv005-de kernel: [1584299.287492] libceph: osd4 down
Jun 8 04:16:16 srv005-de kernel: [1584299.288751] libceph: osd8 down
Jun 8 04:16:16 srv005-de kernel: [1584299.289993] libceph: osd11 down
Jun 8 04:16:17 srv005-de kernel: [1584300.288661] libceph: osd4 up
Jun 8 04:16:18 srv005-de kernel: [1584301.289716] libceph: osd11 up
Jun 8 04:16:19 srv005-de kernel: [1584302.294871] libceph: osd13 down
Jun 8 04:16:20 srv005-de kernel: [1584303.007898] libceph: osd6 down
Jun 8 04:16:20 srv005-de kernel: [1584303.009079] libceph: osd17 down
Jun 8 04:16:21 srv005-de kernel: [1584304.009817] libceph: osd8 up
Jun 8 04:16:21 srv005-de kernel: [1584304.010939] libceph: osd13 up
Jun 8 04:16:21 srv005-de kernel: [1584304.012068] libceph: osd17 up
Jun 8 04:16:26 srv005-de kernel: [1584309.010836] libceph: osd6 up
Jun 8 04:16:34 srv005-de kernel: [1584317.410469] libceph: osd1 down
Jun 8 04:16:39 srv005-de kernel: [1584322.447281] libceph: osd1 up
Jun 8 04:16:41 srv005-de kernel: [1584324.460433] libceph: osd18 down
Jun 8 04:16:42 srv005-de kernel: [1584325.470049] libceph: osd5 down
Jun 8 04:16:42 srv005-de kernel: [1584325.471185] libceph: osd18 up
Jun 8 04:16:43 srv005-de kernel: [1584326.472689] libceph: osd0 down
Jun 8 04:16:43 srv005-de kernel: [1584326.473869] libceph: osd9 down
Jun 8 04:16:44 srv005-de kernel: [1584327.475587] libceph: osd0 up
Jun 8 04:16:44 srv005-de kernel: [1584327.476736] libceph: osd9 up
Jun 8 04:16:45 srv005-de kernel: [1584328.016516] libceph: osd7 down
Jun 8 04:16:45 srv005-de kernel: [1584328.017590] libceph: osd10 down
Jun 8 04:16:45 srv005-de kernel: [1584328.018592] libceph: osd11 down
Jun 8 04:16:46 srv005-de kernel: [1584329.018743] libceph: osd7 up
Jun 8 04:16:46 srv005-de kernel: [1584329.019695] libceph: osd10 up
Jun 8 04:16:46 srv005-de kernel: [1584329.020577] libceph: osd11 up
Jun 8 04:16:48 srv005-de kernel: [1584331.064864] libceph: osd5 up
Jun 8 04:17:00 srv005-de kernel: [1584342.573831] libceph: osd2 down
Jun 8 04:17:01 srv005-de kernel: [1584343.588498] rbd: rbd3: encountered watch error: -107
Jun 8 04:17:02 srv005-de kernel: [1584344.605140] libceph: osd2 up
Jun 8 04:17:13 srv005-de kernel: [1584355.867136] libceph: osd14 down
Jun 8 04:17:13 srv005-de kernel: [1584355.868152] libceph: osd16 down
Jun 8 04:17:13 srv005-de kernel: [1584355.869137] libceph: osd17 down
Jun 8 04:17:14 srv005-de kernel: [1584356.878931] libceph: osd16 up
Jun 8 04:17:14 srv005-de kernel: [1584356.879936] libceph: osd17 up
Jun 8 04:17:18 srv005-de kernel: [1584360.902204] libceph: osd14 up
Jun 8 04:17:25 srv005-de kernel: [1584368.001714] libceph: osd1 down
Jun 8 04:17:25 srv005-de kernel: [1584368.002603] libceph: osd7 down
Jun 8 04:17:26 srv005-de kernel: [1584369.006402] libceph: osd0 down
Jun 8 04:17:26 srv005-de kernel: [1584369.007283] libceph: osd10 down
Jun 8 04:17:26 srv005-de kernel: [1584369.008154] libceph: osd11 down
Jun 8 04:17:27 srv005-de kernel: [1584370.015210] libceph: osd0 up
Jun 8 04:17:27 srv005-de kernel: [1584370.016074] libceph: osd10 up
Jun 8 04:17:28 srv005-de kernel: [1584371.024987] libceph: osd18 down
Jun 8 04:17:28 srv005-de kernel: [1584371.025863] libceph: osd1 up
Jun 8 04:17:28 srv005-de kernel: [1584371.026770] libceph: osd7 up
Jun 8 04:17:28 srv005-de kernel: [1584371.027695] libceph: osd11 up
Jun 8 04:17:29 srv005-de kernel: [1584372.029459] libceph: osd18 up
Jun 8 04:17:30 srv005-de kernel: [1584373.044629] libceph: osd6 down
Jun 8 04:17:34 srv005-de kernel: [1584377.075503] libceph: osd6 up
Jun 8 04:17:45 srv005-de kernel: [1584388.040526] libceph: osd17 down
Jun 8 04:17:46 srv005-de kernel: [1584389.043745] libceph: osd2 down
Jun 8 04:17:46 srv005-de kernel: [1584389.044608] libceph: osd17 up
Jun 8 04:17:47 srv005-de kernel: [1584390.071395] libceph: osd2 up
Jun 8 04:17:52 srv005-de kernel: [1584395.173328] libceph: osd0 down
Jun 8 04:17:52 srv005-de kernel: [1584395.174195] libceph: osd15 down
Jun 8 04:17:53 srv005-de kernel: [1584396.026473] libceph: osd9 (1)10.6.64.2:6812 socket closed (con state OPEN)
And on the monitor side (cluster log):
2021-06-08T04:29:59.994+0200 7fcfd751b700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN Slow OSD heartbeats on back (longest 23895.853ms); Slow OSD heartbeats on front (longest 23809.919ms)
2021-06-08T04:39:59.994+0200 7fcfd751b700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN Slow OSD heartbeats on back (longest 4444.081ms); Slow OSD heartbeats on front (longest 4910.412ms)
2021-06-08T04:59:59.997+0200 7fcfd751b700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN Slow OSD heartbeats on back (longest 6693.196ms); Slow OSD heartbeats on front (longest 6691.759ms); Degraded data redundancy: 2/9160088 objects degraded (0.000%), 1 pg degraded; 5 slow ops, oldest one blocked for 39 sec, daemons [osd.11,osd.15,osd.16] have slow ops
Most of the time it's fine; however, it can happen that the filesystem stays stuck and we cannot recover.
After the network issues, the cluster is green again:
ceph -s
  cluster:
    id:     6e38eb45-751a-4166-bc96-9ab0dcebd122
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum srv001-de,srv002-de,srv003-de (age 24m)
    mgr: srv003-de(active, since 23m), standbys: srv001-de, srv002-de
    osd: 19 osds: 19 up (since 87m), 19 in (since 3w)

  data:
    pools:   4 pools, 353 pgs
    objects: 3.72M objects, 13 TiB
    usage:   29 TiB used, 57 TiB / 87 TiB avail
    pgs:     353 active+clean

  io:
    client:  390 KiB/s rd, 66 MiB/s wr, 5 op/s rd, 554 op/s wr
We don't have any stuck requests:
cat /sys/kernel/debug/ceph/6e38eb45-751a-4166-bc96-9ab0dcebd122.client314122254/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
18446462598732841289 osd0 1.df17e71c 1.1c [0,7,15]/0 [0,7,15]/0 e8366 rbd_header.9aca2733bb1726 0x20 84 WC/0
18446462598732840965 osd3 2.4c2e7d31 2.31 [3,5]/3 [3,5]/3 e8366 rbd_header.eeb9e532ef6c04 0x20 47 WC/0
18446462598732840989 osd3 2.572947c5 2.5 [3,4]/3 [3,4]/3 e8366 rbd_header.9c8938c8c745ef 0x20 48 WC/0
18446462598732840993 osd3 2.e356688a 2.a [3,5]/3 [3,5]/3 e8366 rbd_header.3e1e505ca5255 0x20 47 WC/0
18446462598732841013 osd3 2.f7878bc8 2.8 [3,5]/3 [3,5]/3 e8366 rbd_header.7af65c8befd4db 0x20 47 WC/0
18446462598732841061 osd3 2.7088e5bf 2.3f [3,6]/3 [3,6]/3 e8366 rbd_header.f76f5fdebf064d 0x20 64 WC/0
18446462598732841069 osd3 2.854bc1c0 2.0 [3,5]/3 [3,5]/3 e8366 rbd_header.b08720e368a4fa 0x20 47 WC/0
18446462598732841073 osd3 2.7ffed0e7 2.27 [3,4]/3 [3,4]/3 e8366 rbd_header.03901d61a1ef71 0x20 48 WC/0
18446462598732841081 osd3 2.74e1c3ca 2.a [3,5]/3 [3,5]/3 e8366 rbd_header.dabcc04715dc15 0x20 47 WC/0
18446462598732841229 osd3 2.ea24701a 2.1a [3,4]/3 [3,4]/3 e8366 rbd_header.ef54431c65f606 0x20 36 WC/0
18446462598732841230 osd3 2.956de11a 2.1a [3,4]/3 [3,4]/3 e8366 rbd_header.047c9247926b56 0x20 36 WC/0
18446462598732840973 osd4 2.b345de9f 2.1f [4,3]/4 [4,3]/4 e8366 rbd_header.f0b737c68d5761 0x20 48 WC/0
18446462598732840985 osd4 2.5a6955c6 2.6 [4,6]/4 [4,6]/4 e8366 rbd_header.35481ba2a4f2bd 0x20 46 WC/0
18446462598732841029 osd4 2.9a058f0f 2.f [4,5]/4 [4,5]/4 e8366 rbd_header.7ae5e8bf566ba3 0x20 33 WC/0
18446462598732841057 osd4 2.7f98cf46 2.6 [4,6]/4 [4,6]/4 e8366 rbd_header.f64d3464c52dee 0x20 46 WC/0
18446462598732841085 osd4 2.ddef4fe3 2.23 [4,5]/4 [4,5]/4 e8366 rbd_header.e10165191b5751 0x20 31 WC/0
18446462598732841159 osd4 2.56c7d661 2.21 [4,6]/4 [4,6]/4 e8366 rbd_header.ecbea723350bf8 0x20 39 WC/0
18446462598732840969 osd5 2.fe007f03 2.3 [5,4]/5 [5,4]/5 e8366 rbd_header.ef55d3d2053d90 0x20 31 WC/0
18446462598732840981 osd5 2.bf8e4d8 2.18 [5,4]/5 [5,4]/5 e8366 rbd_header.3547ac1414c2e4 0x20 33 WC/0
18446462598732841041 osd5 2.77ea8166 2.26 [5,3]/5 [5,3]/5 e8366 rbd_header.cc25fc4a584119 0x20 47 WC/0
18446462598732841233 osd5 2.ab08c702 2.2 [5,6]/5 [5,6]/5 e8366 rbd_header.7ae1da6b66cd8c 0x20 35 WC/0
18446462598732841234 osd5 2.9a2e0342 2.2 [5,6]/5 [5,6]/5 e8366 rbd_header.cc4c69e81ca1d0 0x20 35 WC/0
18446462598732841009 osd6 2.4b70a3a2 2.22 [6,3]/6 [6,3]/6 e8366 rbd_header.049fe7b9fbd3a7 0x20 64 WC/0
18446462598732841017 osd6 2.a44341ad 2.2d [6,3]/6 [6,3]/6 e8366 rbd_header.7ae0a5a8eea3d4 0x20 66 WC/0
18446462598732841033 osd6 2.19b88f4c 2.c [6,3]/6 [6,3]/6 e8366 rbd_header.94955e932a1217 0x20 64 WC/0
18446462598732841037 osd6 2.62834262 2.22 [6,3]/6 [6,3]/6 e8366 rbd_header.cc0da871a9854f 0x20 64 WC/0
18446462598732841053 osd6 2.64d9c217 2.17 [6,5]/6 [6,5]/6 e8366 rbd_header.cc5c53e1921ce7 0x20 47 WC/0
18446462598732841065 osd6 2.5bde4e11 2.11 [6,3]/6 [6,3]/6 e8366 rbd_header.42550f5d5c4839 0x20 64 WC/0
18446462598732841077 osd6 2.8021af53 2.13 [6,4]/6 [6,4]/6 e8366 rbd_header.abd30ae91a4a78 0x20 46 WC/0
18446462598732841149 osd6 2.b25a16d4 2.14 [6,4]/6 [6,4]/6 e8366 rbd_header.aff3db6bd2f145 0x20 39 WC/0
18446462598732841161 osd6 2.b0ef1b55 2.15 [6,4]/6 [6,4]/6 e8366 rbd_header.0495b22d0214b0 0x20 39 WC/0
18446462598732841162 osd6 2.79baf6fd 2.3d [6,4]/6 [6,4]/6 e8366 rbd_header.7ae4b05860e330 0x20 39 WC/0
18446462598732841083 osd7 1.ecc14944 1.44 [7,16,13]/7 [7,16,13]/7 e8366 rbd_header.ac1b0ac2e0919e 0x20 106 WC/-107
18446462598732841147 osd7 1.ee4ca8d9 1.d9 [7,16,12]/7 [7,16,12]/7 e8366 rbd_header.22c77167a0475f 0x20 80 WC/0
18446462598732841252 osd7 1.7150a564 1.64 [7,2,11]/7 [7,2,11]/7 e8366 rbd_header.048ee6f29c7895 0x20 98 WC/0
18446462598732841314 osd8 1.b285d4b5 1.b5 [8,13,2]/8 [8,13,2]/8 e8366 rbd_header.352d0013d366b8 0x20 34 WC/0
18446462598732841251 osd10 1.4a91ad87 1.87 [10,0,17]/10 [10,0,17]/10 e8366 rbd_header.9490452bd85736 0x20 96 WC/0
18446462598732841330 osd10 1.d683a525 1.25 [10,13,18]/10 [10,13,18]/10 e8366 rbd_header.f1a32eef01a846 0x20 29 WC/-107
18446462598732841327 osd11 1.afcb3a0f 1.f [11,18,13]/11 [11,18,13]/11 e8366 rbd_header.430dd75ed065a4 0x20 24 WC/-107
18446462598732841328 osd11 1.a10c0ce4 1.e4 [11,14,9]/11 [11,14,9]/11 e8366 rbd_header.cc46f9ec2cc23e 0x20 23 WC/0
18446462598732841295 osd12 1.d62ec5e3 1.e3 [12,16,9]/12 [12,16,9]/12 e8366 rbd_header.fe1109b1007ae7 0x20 49 WC/0
18446462598732841296 osd12 1.862b85f2 1.f2 [12,9,16]/12 [12,9,16]/12 e8366 rbd_header.7a9251b2d5a2bc 0x20 49 WC/0
18446462598732840995 osd13 1.fceb6847 1.47 [13,18,10]/13 [13,18,10]/13 e8366 rbd_header.eca953ed5e66e5 0x20 105 WC/0
18446462598732841011 osd13 1.ad68c683 1.83 [13,18,10]/13 [13,18,10]/13 e8366 rbd_header.7a6a5bbb78dcc4 0x20 106 WC/0
18446462598732841099 osd13 1.cd55d909 1.9 [13,7,2]/13 [13,7,2]/13 e8366 rbd_header.eea281be6a0422 0x20 141 WC/-107
18446462598732841253 osd13 1.a7e91eaf 1.af [13,18,1]/13 [13,18,1]/13 e8366 rbd_header.3cea48bd6569bc 0x20 102 WC/0
18446462598732841271 osd13 1.ed2f5400 1.0 [13,0,8]/13 [13,0,8]/13 e8366 rbd_header.0499d29d46e831 0x20 81 WC/0
18446462598732841321 osd13 1.39ed34f0 1.f0 [13,16,9]/13 [13,16,9]/13 e8366 rbd_header.b086d2bd8416a4 0x20 26 WC/0
18446462598732841322 osd13 1.e922455c 1.5c [13,16,12]/13 [13,16,12]/13 e8366 rbd_header.038fa89e0b2ee0 0x20 18 WC/-107
18446462598732840987 osd14 1.e770e95e 1.5e [14,16,10]/14 [14,16,10]/14 e8366 rbd_header.3cbd08eb704503 0x20 101 WC/0
18446462598732841035 osd14 1.4f87490c 1.c [14,17,11]/14 [14,17,11]/14 e8366 rbd_header.9ab06e56b0b8e1 0x20 97 WC/0
18446462598732841218 osd15 1.3b7cbb4a 1.4a [15,2,17]/15 [15,2,17]/15 e8366 rbd_header.6c454a7176efcd 0x20 113 WC/0
18446462598732841303 osd15 1.76a964b6 1.b6 [15,9,2]/15 [15,9,2]/15 e8366 rbd_header.ef44c0f28b72f4 0x20 46 WC/0
18446462598732841332 osd15 1.7a4f4d88 1.88 [15,10,17]/15 [15,10,17]/15 e8366 rbd_header.22dfc24c7e12df 0x20 28 WC/-107
18446462598732841333 osd15 1.bda64b75 1.75 [15,12,18]/15 [15,12,18]/15 e8366 rbd_header.9b3310be1cd52d 0x20 27 WC/0
18446462598732840971 osd16 1.41a55a1d 1.1d [16,13,10]/16 [16,13,10]/16 e8366 rbd_header.f098ec4b8f73f3 0x20 103 WC/0
18446462598732841239 osd16 1.3164bd20 1.20 [16,13,2]/16 [16,13,2]/16 e8366 rbd_header.9b14c4f7eb570f 0x20 96 WC/0
18446462598732841329 osd16 1.76f6ea7d 1.7d [16,11,14]/16 [16,11,14]/16 e8366 rbd_header.3511765d43da89 0x20 22 WC/0
18446462598732841331 osd16 1.a76b8b0d 1.d [16,15,1]/16 [16,15,1]/16 e8366 rbd_header.4254a0ed122a0c 0x20 19 WC/0
18446462598732841079 osd17 1.5986cb7 1.b7 [17,7,14]/17 [17,7,14]/17 e8366 rbd_header.dabc691b3e9146 0x20 104 WC/0
18446462598732841267 osd17 1.caef016c 1.6c [17,0,9]/17 [17,0,9]/17 e8366 rbd_header.22c9f32dd768ec 0x20 87 WC/0
18446462598732841268 osd17 1.613812c 1.2c [17,10,0]/17 [17,10,0]/17 e8366 rbd_header.7aa082ef1f66e9 0x20 90 WC/0
18446462598732841309 osd18 1.917562ff 1.ff [18,15,2]/18 [18,15,2]/18 e8366 rbd_header.7aac8296618390 0x20 36 WC/0
BACKOFFS
Restarting all OSDs, monitors, and managers doesn't help.
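For reference, roughly what we restarted (a sketch; systemd unit names as on our Proxmox nodes, with illustrative IDs/hostnames):

systemctl restart ceph-osd@7.service          # repeated for each of the 19 OSDs
systemctl restart ceph-mon@srv001-de.service
systemctl restart ceph-mgr@srv003-de.service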
It's also impossible to rbd unmap the device, even with the force option.
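Roughly the unmap attempts (a sketch; assuming the stuck mapping is /dev/rbd33 as below — both commands just hang):

rbd unmap /dev/rbd33
rbd unmap -o force /dev/rbd33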
Here is the list of all stuck processes:
ps aux | grep ' D '
root    17324   0.0 0.0 0       0     ?      D  09:22 0:00 [kworker/17:11+rbd]
100000  29186   0.0 0.0 4516    104   ?      D  09:24 0:00 run-parts --report /etc/cron.daily
100000  29647   0.0 0.0 35220   9236  ?      D  09:24 0:00 python3 /opt/XXX_list_scopes.py
root    32508   0.0 0.0 0       0     ?      D  May20 0:10 [kmmpd-rbd33]
root    32510   0.0 0.0 0       0     ?      D  May20 1:00 [jbd2/rbd33-8]
100000  100166  0.0 0.0 1235088 54908 ?      D  09:34 0:00 /usr/bin/python3 /usr/bin/salt-minion
100000  211110  0.0 0.0 4624    92    ?      D  09:50 0:00 /bin/sh -c borg info --json
100000  304787  0.0 0.0 1235328 54780 ?      D  10:02 0:00 /usr/bin/python3 /usr/bin/salt-minion
100000  304874  0.0 0.0 35220   9220  ?      D  10:02 0:00 python3 /opt/XXX_list_scopes.py
root    750175  0.0 0.0 6140    2328  pts/36 S+ 10:45 0:00 grep D
root    1798106 0.0 0.0 0       0     ?      D  03:31 0:02 [kworker/29:4+ceph-msgr]
root    2126115 0.0 0.0 0       0     ?      D  04:21 0:01 [kworker/16:1+rbd]
root    2348498 0.0 0.0 0       0     ?      D  04:55 0:00 [kworker/19:2+ceph-msgr]
root    2348682 0.0 0.0 0       0     ?      D  04:55 0:00 [kworker/17:2+rbd]
root    2358880 0.0 0.0 0       0     ?      D  04:57 0:00 [kworker/u65:13+flush-252:528]
root    2365405 0.0 0.0 0       0     ?      D  04:58 0:00 [kworker/u64:0+rbd33-tasks]
root    2380900 0.0 0.0 0       0     ?      D  05:00 0:00 [kworker/u64:8+ceph-watch-notify]
root    2668082 0.0 0.0 0       0     ?      D  05:46 0:00 [kworker/15:0+rbd]
root    2839118 0.0 0.0 0       0     ?      D  06:12 0:00 [kworker/18:0+rbd]
root    2998532 0.0 0.0 0       0     ?      D  06:36 0:00 [kworker/17:7+rbd]
root    2998536 0.0 0.0 0       0     ?      D  06:36 0:00 [kworker/16:8+rbd]
root    3075865 0.0 0.0 0       0     ?      D  06:47 0:00 [kworker/18:7+rbd]
root    3129962 0.0 0.0 0       0     ?      D  06:55 0:00 [kworker/15:1+rbd]
root    3135993 0.0 0.0 0       0     ?      D  06:55 0:00 [kworker/16:2+rbd]
root    3165701 0.0 0.0 0       0     ?      D  07:00 0:00 [kworker/17:1+rbd]
root    3182766 0.0 0.0 0       0     ?      D  07:02 0:00 [kworker/18:1+rbd]
root    3182772 0.0 0.0 0       0     ?      D  07:02 0:00 [kworker/18:2+rbd]
root    3189400 0.0 0.0 0       0     ?      D  07:03 0:00 [kworker/16:3+rbd]
root    3195772 0.0 0.0 0       0     ?      D  07:04 0:00 [kworker/15:3+rbd]
root    3195774 0.0 0.0 0       0     ?      D  07:04 0:00 [kworker/16:4+rbd]
root    3195775 0.0 0.0 0       0     ?      D  07:04 0:00 [kworker/18:3+rbd]
root    3202588 0.0 0.0 0       0     ?      D  07:05 0:00 [kworker/16:5+rbd]
root    3202936 0.0 0.0 0       0     ?      D  07:05 0:00 [kworker/16:6+rbd]
root    3202939 0.0 0.0 0       0     ?      D  07:05 0:01 [kworker/18:4+rbd]
root    3219420 0.0 0.0 0       0     ?      D  07:08 0:00 [kworker/15:6+rbd]
root    3231624 0.0 0.0 0       0     ?      D  07:09 0:00 [kworker/16:9+rbd]
root    3246545 0.0 0.0 0       0     ?      D  07:11 0:00 [kworker/17:3+rbd]
root    3270615 0.0 0.0 0       0     ?      D  07:14 0:00 [kworker/17:5+rbd]
root    3416476 0.0 0.0 0       0     ?      D  07:35 0:00 [kworker/18:5+rbd]
root    3416503 0.0 0.0 0       0     ?      D  07:35 0:00 [kworker/18:6+rbd]
root    3416504 0.0 0.0 0       0     ?      D  07:35 0:00 [kworker/18:8+rbd]
root    3416539 0.0 0.0 0       0     ?      D  07:35 0:02 [kworker/18:9+rbd]
root    3428688 0.0 0.0 0       0     ?      D  07:36 0:01 [kworker/17:8+rbd]
root    3582209 0.0 0.0 0       0     ?      D  07:58 0:00 [kworker/18:10+rbd]
root    3728375 0.0 0.0 0       0     ?      D  08:17 0:00 [kworker/17:9+rbd]
root    3731842 0.0 0.0 0       0     ?      D  08:17 0:00 [kworker/15:9+rbd]
root    3731843 0.0 0.0 0       0     ?      D  08:17 0:00 [kworker/15:10+rbd]
root    3795684 0.0 0.0 0       0     ?      D  08:26 0:00 [kworker/18:11+rbd]
root    3803165 0.0 0.0 0       0     ?      D  08:27 0:00 [kworker/15:2+rbd]
root    3845962 0.0 0.0 0       0     ?      D  08:33 0:00 [kworker/15:4+rbd]
100000  3916481 0.0 0.0 1234968 54908 ?      D  08:43 0:00 /usr/bin/python3 /usr/bin/salt-minion
root    3930582 0.0 0.0 0       0     ?      D  08:44 0:00 [kworker/18:14+rbd]
root    3960557 0.0 0.0 0       0     ?      D  08:48 0:00 [kworker/18:16+rbd]
root    4069220 0.0 0.0 0       0     ?      D  09:03 0:01 [kworker/15:7+rbd]
root    4086146 0.0 0.0 0       0     ?      D  09:05 0:00 [kworker/17:16+rbd]
In that case, it was the mapping rbd33 that was stuck.
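For reference, the stuck device can be matched back to its pool/image with:

rbd showmapped    # lists each krbd mapping and its backing image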
Here is /proc/<pid>/stack for all stuck processes:
17324
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

32508
[<0>] __wait_on_buffer+0x32/0x40
[<0>] write_mmp_block+0x104/0x120
[<0>] kmmpd+0x19a/0x3c0
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

32510
[<0>] wait_on_page_bit+0x141/0x210
[<0>] wait_on_page_writeback+0x43/0x90
[<0>] __filemap_fdatawait_range+0xae/0x120
[<0>] filemap_fdatawait_range_keep_errors+0x12/0x40
[<0>] jbd2_journal_commit_transaction+0xba2/0x1750
[<0>] kjournald2+0xc8/0x270
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

1798106
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_obj_handle_request+0x34/0x40 [rbd]
[<0>] rbd_osd_req_callback+0x45/0x80 [rbd]
[<0>] __complete_request+0x26/0x80 [libceph]
[<0>] handle_reply+0x813/0x930 [libceph]
[<0>] dispatch+0x167/0xb70 [libceph]
[<0>] ceph_con_workfn+0xd9d/0x24d0 [libceph]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2126115
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2348498
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_obj_handle_request+0x34/0x40 [rbd]
[<0>] rbd_osd_req_callback+0x45/0x80 [rbd]
[<0>] __complete_request+0x26/0x80 [libceph]
[<0>] handle_reply+0x813/0x930 [libceph]
[<0>] dispatch+0x167/0xb70 [libceph]
[<0>] ceph_con_workfn+0xd9d/0x24d0 [libceph]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2348682
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2358880
[<0>] __lock_page+0x122/0x220
[<0>] mpage_prepare_extent_to_map+0x291/0x2d0
[<0>] ext4_writepages+0x458/0xeb0
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x40/0x310
[<0>] writeback_sb_inodes+0x209/0x4a0
[<0>] __writeback_inodes_wb+0x66/0xd0
[<0>] wb_writeback+0x25b/0x2f0
[<0>] wb_workfn+0x33e/0x490
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2365405
[<0>] rbd_quiesce_lock+0xa1/0xe0 [rbd]
[<0>] rbd_reregister_watch+0x102/0x1b0 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2380900
[<0>] rwsem_down_write_slowpath+0x2ed/0x4a0
[<0>] rbd_watch_errcb+0x2a/0x92 [rbd]
[<0>] do_watch_error+0x40/0xb0 [libceph]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2668082
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2839118
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2998532
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

2998536
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3075865
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3129962
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3135993
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3165701
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3182766
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3182772
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3189400
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3195772
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3195774
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3195775
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3202588
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3202936
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3202939
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3219420
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3231624
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3246545
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3270615
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3416476
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3416503
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3416504
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3416539
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3428688
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3582209
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3728375
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3731842
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3731843
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3795684
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3803165
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3845962
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3930582
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

3960557
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

4069220
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40

4086146
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
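For reference, a listing like the one above can be produced with a small helper along these lines (a sketch; needs root):

# Print the kernel stack of every task currently in uninterruptible sleep (D state)
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "$pid"
    cat "/proc/$pid/stack"
done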
The only solution we have found so far is to reboot the machine.
The network issues should not happen in the first place, but I would expect rbd to recover from them rather than stay stuck like this. The hangs also occur randomly, and we don't have a way to reproduce them.
If needed, we can of course collect more data the next time we hit the issue :)
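Concretely, on the next occurrence we could capture something like this (suggestions welcome):

ceph -s                            # overall cluster state
ceph health detail                 # slow ops / degraded PG details
dmesg | tail -n 500                # recent libceph/rbd kernel messages
cat /sys/kernel/debug/ceph/*/osdc  # in-flight and linger requests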
Thanks a lot for your help,
Related issues
- Duplicates Bug #42757: deadlock on lock_rwsem: rbd_quiesce_lock() vs watch errors
History
#1 Updated by Ilya Dryomov over 2 years ago
- Project changed from rbd to Linux kernel client
#2 Updated by Ilya Dryomov over 2 years ago
- Category set to rbd
#3 Updated by Ilya Dryomov over 2 years ago
I'm pretty sure this is https://tracker.ceph.com/issues/42757.
I'm working on a fix.
#4 Updated by Ilya Dryomov over 2 years ago
- Assignee set to Ilya Dryomov
#5 Updated by Ilya Dryomov over 2 years ago
- Duplicates Bug #42757: deadlock on lock_rwsem: rbd_quiesce_lock() vs watch errors added
#6 Updated by Ilya Dryomov over 2 years ago
- Status changed from New to Duplicate