Bug #51136

Random hanging issues with rbd after network issues

Added by Maximilien Cuony over 1 year ago. Updated over 1 year ago.

Status: Duplicate
Priority: Normal
Assignee:
Category: rbd
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

We are experiencing random hangs with our LXC VMs that use a mapped RBD image.

We're using version 15.2.11 (from Proxmox, though I'm not sure it makes a difference). We upgraded from 14 when we started to have the issue.

We're having random network glitches, which cause some OSDs to flap. The logs contain entries like these:

Jun  8 00:19:24 srv005-de kernel: [1570086.718933] INFO: task filebeat:38133 blocked for more than 120 seconds.
Jun  8 00:19:24 srv005-de kernel: [1570086.719857]       Tainted: P           O      5.4.114-1-pve #1
Jun  8 00:19:24 srv005-de kernel: [1570086.720846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  8 00:19:24 srv005-de kernel: [1570086.721957] filebeat        D    0 38133  33317 0x00000320
Jun  8 00:19:24 srv005-de kernel: [1570086.723136] Call Trace:
Jun  8 00:19:24 srv005-de kernel: [1570086.724312]  __schedule+0x2e6/0x700
Jun  8 00:19:24 srv005-de kernel: [1570086.725517]  ? bit_wait_timeout+0xa0/0xa0
Jun  8 00:19:24 srv005-de kernel: [1570086.726778]  schedule+0x33/0xa0
Jun  8 00:19:24 srv005-de kernel: [1570086.728051]  io_schedule+0x16/0x40
Jun  8 00:19:24 srv005-de kernel: [1570086.729350]  bit_wait_io+0x11/0x50
Jun  8 00:19:24 srv005-de kernel: [1570086.730673]  __wait_on_bit+0x33/0xa0
Jun  8 00:19:24 srv005-de kernel: [1570086.732014]  out_of_line_wait_on_bit+0x90/0xb0
Jun  8 00:19:24 srv005-de kernel: [1570086.733417]  ? var_wake_function+0x30/0x30
Jun  8 00:19:24 srv005-de kernel: [1570086.734848]  __wait_on_buffer+0x32/0x40
Jun  8 00:19:24 srv005-de kernel: [1570086.736293]  __ext4_find_entry+0x30c/0x450
Jun  8 00:19:24 srv005-de kernel: [1570086.737773]  ? ext4_fname_prepare_lookup+0xcd/0x120
Jun  8 00:19:24 srv005-de kernel: [1570086.739315]  ext4_lookup+0xf4/0x260
Jun  8 00:19:24 srv005-de kernel: [1570086.740849]  path_openat+0x983/0x16f0
Jun  8 00:19:24 srv005-de kernel: [1570086.742408]  do_filp_open+0x93/0x100
Jun  8 00:19:24 srv005-de kernel: [1570086.743989]  ? __alloc_fd+0x46/0x150
Jun  8 00:19:24 srv005-de kernel: [1570086.745587]  do_sys_open+0x177/0x280
Jun  8 00:19:24 srv005-de kernel: [1570086.747202]  __x64_sys_openat+0x20/0x30
Jun  8 00:19:24 srv005-de kernel: [1570086.748835]  do_syscall_64+0x57/0x190
Jun  8 00:19:24 srv005-de kernel: [1570086.750503]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun  8 00:19:24 srv005-de kernel: [1570086.752215] RIP: 0033:0x47523a
Jun  8 00:19:24 srv005-de kernel: [1570086.753950] Code: Bad RIP value.
Jun  8 00:19:24 srv005-de kernel: [1570086.755720] RSP: 002b:000000c420f12bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
Jun  8 00:19:24 srv005-de kernel: [1570086.757618] RAX: ffffffffffffffda RBX: 000000c42001750c RCX: 000000000047523a
Jun  8 00:19:24 srv005-de kernel: [1570086.759599] RDX: 0000000000181242 RSI: 000000c4204e09e0 RDI: ffffffffffffff9c
Jun  8 00:19:24 srv005-de kernel: [1570086.761614] RBP: 000000c420f12c48 R08: 0000000000000000 R09: 0000000000000000
Jun  8 00:19:24 srv005-de kernel: [1570086.763695] R10: 0000000000000180 R11: 0000000000000202 R12: 0000000000000020
Jun  8 00:19:24 srv005-de kernel: [1570086.765773] R13: 00000000000009e0 R14: 0000000000000054 R15: 0000000000000000
Jun  8 00:19:24 srv005-de kernel: [1570086.772692] INFO: task Grafana Syncer :174259 blocked for more than 120 seconds.
Jun  8 00:19:24 srv005-de kernel: [1570086.774912]       Tainted: P           O      5.4.114-1-pve #1
Jun  8 00:19:24 srv005-de kernel: [1570086.777188] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  8 00:19:24 srv005-de kernel: [1570086.779603] Grafana Syncer  D    0 174259 156131 0x00000320
Jun  8 00:19:24 srv005-de kernel: [1570086.782030] Call Trace:
Jun  8 00:19:24 srv005-de kernel: [1570086.784377]  __schedule+0x2e6/0x700
Jun  8 00:19:24 srv005-de kernel: [1570086.786700]  schedule+0x33/0xa0
Jun  8 00:19:24 srv005-de kernel: [1570086.788950]  io_schedule+0x16/0x40
Jun  8 00:19:24 srv005-de kernel: [1570086.791134]  wait_on_page_bit+0x141/0x210
Jun  8 00:19:24 srv005-de kernel: [1570086.793307]  ? file_fdatawait_range+0x30/0x30
Jun  8 00:19:24 srv005-de kernel: [1570086.795499]  wait_on_page_writeback+0x43/0x90
Jun  8 00:19:24 srv005-de kernel: [1570086.797682]  wait_for_stable_page+0x45/0x60
Jun  8 00:19:24 srv005-de kernel: [1570086.799855]  grab_cache_page_write_begin+0x30/0x40
Jun  8 00:19:24 srv005-de kernel: [1570086.802237]  ext4_da_write_begin+0x11a/0x460
Jun  8 00:19:24 srv005-de kernel: [1570086.804467]  generic_perform_write+0xf2/0x1b0
Jun  8 00:19:24 srv005-de kernel: [1570086.806714]  ? file_update_time+0xed/0x130
Jun  8 00:19:24 srv005-de kernel: [1570086.808929]  __generic_file_write_iter+0x101/0x1f0
Jun  8 00:19:24 srv005-de kernel: [1570086.811185]  ext4_file_write_iter+0xb9/0x360
Jun  8 00:19:24 srv005-de kernel: [1570086.813422]  new_sync_write+0x125/0x1c0
Jun  8 00:19:24 srv005-de kernel: [1570086.815662]  __vfs_write+0x29/0x40
Jun  8 00:19:24 srv005-de kernel: [1570086.817877]  vfs_write+0xab/0x1b0
Jun  8 00:19:24 srv005-de kernel: [1570086.820081]  ksys_write+0x61/0xe0
Jun  8 00:19:24 srv005-de kernel: [1570086.822281]  __x64_sys_write+0x1a/0x20
Jun  8 00:19:24 srv005-de kernel: [1570086.824496]  do_syscall_64+0x57/0x190
Jun  8 00:19:24 srv005-de kernel: [1570086.826722]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun  8 00:19:24 srv005-de kernel: [1570086.828989] RIP: 0033:0x7f88fb01f2cf
Jun  8 00:19:24 srv005-de kernel: [1570086.831259] Code: Bad RIP value.
Jun  8 00:19:24 srv005-de kernel: [1570086.833518] RSP: 002b:00007f8855fbc020 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Jun  8 00:19:24 srv005-de kernel: [1570086.835904] RAX: ffffffffffffffda RBX: 000000000000005a RCX: 00007f88fb01f2cf
Jun  8 00:19:24 srv005-de kernel: [1570086.838284] RDX: 000000000000005a RSI: 00007f8868608430 RDI: 00000000000002cf
Jun  8 00:19:24 srv005-de kernel: [1570086.840699] RBP: 00007f8868608430 R08: 0000000000000000 R09: 0000000000000058
Jun  8 00:19:24 srv005-de kernel: [1570086.843070] R10: 00007f889d1f2162 R11: 0000000000000293 R12: 00007f88680b4118
Jun  8 00:19:24 srv005-de kernel: [1570086.845390] R13: 00007f88fb01cbb0 R14: 00007f88680b40d0 R15: 00007f88680b4118
Jun  8 00:19:25 srv005-de kernel: [1570087.592647] libceph: osd7 down
Jun  8 00:19:26 srv005-de kernel: [1570088.592125] libceph: osd9 down
Jun  8 00:19:26 srv005-de kernel: [1570088.594368] libceph: osd10 down
Jun  8 00:19:27 srv005-de kernel: [1570089.598545] libceph: osd17 down
Jun  8 00:19:27 srv005-de kernel: [1570089.600711] libceph: osd9 up
Jun  8 00:19:27 srv005-de kernel: [1570089.602817] libceph: osd10 up
Jun  8 00:19:28 srv005-de kernel: [1570090.600495] libceph: osd17 up
Jun  8 00:19:30 srv005-de kernel: [1570092.624549] libceph: osd7 up
Jun  8 00:19:35 srv005-de kernel: [1570097.648839] libceph: osd14 down
Jun  8 00:19:37 srv005-de kernel: [1570099.666524] libceph: osd14 up
Jun  8 00:19:38 srv005-de kernel: [1570100.669367] libceph: osd4 down
Jun  8 00:19:38 srv005-de kernel: [1570100.671171] libceph: osd18 down
Jun  8 00:19:39 srv005-de kernel: [1570101.672803] libceph: osd18 up
Jun  8 00:19:41 srv005-de kernel: [1570103.805108] libceph: osd15 down
Jun  8 00:19:42 srv005-de kernel: [1570104.811184] libceph: osd4 up
Jun  8 00:19:44 srv005-de kernel: [1570106.828069] libceph: osd16 down
Jun  8 00:19:44 srv005-de kernel: [1570106.829714] libceph: osd15 up
Jun  8 00:19:45 srv005-de kernel: [1570107.830437] libceph: osd16 up
Jun  8 00:19:51 srv005-de kernel: [1570113.872510] libceph: osd17 down
Jun  8 00:19:53 srv005-de kernel: [1570115.885757] libceph: osd17 up
Jun  8 00:19:54 srv005-de kernel: [1570116.891014] libceph: osd10 down
Jun  8 00:19:54 srv005-de kernel: [1570116.892497] libceph: osd13 down
Jun  8 00:19:55 srv005-de kernel: [1570117.894696] libceph: osd10 up
Jun  8 00:19:55 srv005-de kernel: [1570117.896132] libceph: osd13 up
Jun  8 00:20:01 srv005-de kernel: [1570123.928539] libceph: osd3 down
Jun  8 00:20:01 srv005-de kernel: [1570123.929883] libceph: osd9 down
Jun  8 00:20:02 srv005-de kernel: [1570124.935188] libceph: osd6 down
Jun  8 00:20:02 srv005-de kernel: [1570124.936471] libceph: osd3 up
Jun  8 00:20:03 srv005-de kernel: [1570125.951571] libceph: osd6 up
Jun  8 00:20:05 srv005-de kernel: [1570127.812051] libceph: osd7 down
Jun  8 00:20:05 srv005-de kernel: [1570127.813229] libceph: osd9 up
Jun  8 00:20:07 srv005-de kernel: [1570129.967049] libceph: osd18 down
Jun  8 00:20:09 srv005-de kernel: [1570131.981916] libceph: osd7 up
Jun  8 00:20:11 srv005-de kernel: [1570133.993845] libceph: osd18 up
Jun  8 00:21:00 srv005-de kernel: [1570182.974037] libceph: osd0 (1)10.6.64.1:6806 socket closed (con state CONNECTING)
Jun  8 03:49:44 srv005-de kernel: [1582706.900767] libceph: osd7 down
Jun  8 03:49:44 srv005-de kernel: [1582706.901784] libceph: osd9 down
Jun  8 03:49:44 srv005-de kernel: [1582707.398780] libceph: osd1 down
Jun  8 03:49:44 srv005-de kernel: [1582707.399778] libceph: osd2 down
Jun  8 03:49:44 srv005-de kernel: [1582707.400695] libceph: osd10 down
Jun  8 03:49:45 srv005-de kernel: [1582708.401249] libceph: osd8 down
Jun  8 03:49:45 srv005-de kernel: [1582708.402242] libceph: osd1 up
Jun  8 03:49:45 srv005-de kernel: [1582708.403182] libceph: osd2 up
Jun  8 03:49:45 srv005-de kernel: [1582708.404122] libceph: osd7 up
Jun  8 03:49:45 srv005-de kernel: [1582708.405076] libceph: osd10 up
Jun  8 03:49:46 srv005-de kernel: [1582709.400748] libceph: osd11 down
Jun  8 03:49:46 srv005-de kernel: [1582709.401659] libceph: osd8 up
Jun  8 03:49:47 srv005-de kernel: [1582710.407863] libceph: osd9 up
Jun  8 03:49:47 srv005-de kernel: [1582710.408796] libceph: osd11 up
Jun  8 03:50:14 srv005-de kernel: [1582737.151699] libceph: osd0 (1)10.6.64.1:6806 socket closed (con state OPEN)
Jun  8 03:52:06 srv005-de kernel: [1582849.481279] libceph: osd9 down
Jun  8 03:52:06 srv005-de kernel: [1582849.482320] libceph: osd15 down
Jun  8 03:52:08 srv005-de kernel: [1582851.377054] rbd: rbd11: encountered watch error: -107
Jun  8 03:52:09 srv005-de kernel: [1582852.056647] libceph: osd17 down
Jun  8 03:52:10 srv005-de kernel: [1582853.063856] libceph: osd9 up
Jun  8 03:52:11 srv005-de kernel: [1582854.068605] libceph: osd15 up
Jun  8 03:52:13 srv005-de kernel: [1582856.088847] libceph: osd17 up
Jun  8 04:16:09 srv005-de kernel: [1584292.114088] libceph: osd3 down
Jun  8 04:16:09 srv005-de kernel: [1584292.115238] libceph: osd16 down
Jun  8 04:16:09 srv005-de kernel: [1584292.116408] libceph: osd18 down
Jun  8 04:16:10 srv005-de kernel: [1584293.118181] libceph: osd3 up
Jun  8 04:16:10 srv005-de kernel: [1584293.119319] libceph: osd18 up
Jun  8 04:16:12 srv005-de kernel: [1584295.129276] libceph: osd16 up
Jun  8 04:16:12 srv005-de kernel: [1584295.205633] rbd: rbd11: encountered watch error: -107
Jun  8 04:16:16 srv005-de kernel: [1584299.287492] libceph: osd4 down
Jun  8 04:16:16 srv005-de kernel: [1584299.288751] libceph: osd8 down
Jun  8 04:16:16 srv005-de kernel: [1584299.289993] libceph: osd11 down
Jun  8 04:16:17 srv005-de kernel: [1584300.288661] libceph: osd4 up
Jun  8 04:16:18 srv005-de kernel: [1584301.289716] libceph: osd11 up
Jun  8 04:16:19 srv005-de kernel: [1584302.294871] libceph: osd13 down
Jun  8 04:16:20 srv005-de kernel: [1584303.007898] libceph: osd6 down
Jun  8 04:16:20 srv005-de kernel: [1584303.009079] libceph: osd17 down
Jun  8 04:16:21 srv005-de kernel: [1584304.009817] libceph: osd8 up
Jun  8 04:16:21 srv005-de kernel: [1584304.010939] libceph: osd13 up
Jun  8 04:16:21 srv005-de kernel: [1584304.012068] libceph: osd17 up
Jun  8 04:16:26 srv005-de kernel: [1584309.010836] libceph: osd6 up
Jun  8 04:16:34 srv005-de kernel: [1584317.410469] libceph: osd1 down
Jun  8 04:16:39 srv005-de kernel: [1584322.447281] libceph: osd1 up
Jun  8 04:16:41 srv005-de kernel: [1584324.460433] libceph: osd18 down
Jun  8 04:16:42 srv005-de kernel: [1584325.470049] libceph: osd5 down
Jun  8 04:16:42 srv005-de kernel: [1584325.471185] libceph: osd18 up
Jun  8 04:16:43 srv005-de kernel: [1584326.472689] libceph: osd0 down
Jun  8 04:16:43 srv005-de kernel: [1584326.473869] libceph: osd9 down
Jun  8 04:16:44 srv005-de kernel: [1584327.475587] libceph: osd0 up
Jun  8 04:16:44 srv005-de kernel: [1584327.476736] libceph: osd9 up
Jun  8 04:16:45 srv005-de kernel: [1584328.016516] libceph: osd7 down
Jun  8 04:16:45 srv005-de kernel: [1584328.017590] libceph: osd10 down
Jun  8 04:16:45 srv005-de kernel: [1584328.018592] libceph: osd11 down
Jun  8 04:16:46 srv005-de kernel: [1584329.018743] libceph: osd7 up
Jun  8 04:16:46 srv005-de kernel: [1584329.019695] libceph: osd10 up
Jun  8 04:16:46 srv005-de kernel: [1584329.020577] libceph: osd11 up
Jun  8 04:16:48 srv005-de kernel: [1584331.064864] libceph: osd5 up
Jun  8 04:17:00 srv005-de kernel: [1584342.573831] libceph: osd2 down
Jun  8 04:17:01 srv005-de kernel: [1584343.588498] rbd: rbd3: encountered watch error: -107
Jun  8 04:17:02 srv005-de kernel: [1584344.605140] libceph: osd2 up
Jun  8 04:17:13 srv005-de kernel: [1584355.867136] libceph: osd14 down
Jun  8 04:17:13 srv005-de kernel: [1584355.868152] libceph: osd16 down
Jun  8 04:17:13 srv005-de kernel: [1584355.869137] libceph: osd17 down
Jun  8 04:17:14 srv005-de kernel: [1584356.878931] libceph: osd16 up
Jun  8 04:17:14 srv005-de kernel: [1584356.879936] libceph: osd17 up
Jun  8 04:17:18 srv005-de kernel: [1584360.902204] libceph: osd14 up
Jun  8 04:17:25 srv005-de kernel: [1584368.001714] libceph: osd1 down
Jun  8 04:17:25 srv005-de kernel: [1584368.002603] libceph: osd7 down
Jun  8 04:17:26 srv005-de kernel: [1584369.006402] libceph: osd0 down
Jun  8 04:17:26 srv005-de kernel: [1584369.007283] libceph: osd10 down
Jun  8 04:17:26 srv005-de kernel: [1584369.008154] libceph: osd11 down
Jun  8 04:17:27 srv005-de kernel: [1584370.015210] libceph: osd0 up
Jun  8 04:17:27 srv005-de kernel: [1584370.016074] libceph: osd10 up
Jun  8 04:17:28 srv005-de kernel: [1584371.024987] libceph: osd18 down
Jun  8 04:17:28 srv005-de kernel: [1584371.025863] libceph: osd1 up
Jun  8 04:17:28 srv005-de kernel: [1584371.026770] libceph: osd7 up
Jun  8 04:17:28 srv005-de kernel: [1584371.027695] libceph: osd11 up
Jun  8 04:17:29 srv005-de kernel: [1584372.029459] libceph: osd18 up
Jun  8 04:17:30 srv005-de kernel: [1584373.044629] libceph: osd6 down
Jun  8 04:17:34 srv005-de kernel: [1584377.075503] libceph: osd6 up
Jun  8 04:17:45 srv005-de kernel: [1584388.040526] libceph: osd17 down
Jun  8 04:17:46 srv005-de kernel: [1584389.043745] libceph: osd2 down
Jun  8 04:17:46 srv005-de kernel: [1584389.044608] libceph: osd17 up
Jun  8 04:17:47 srv005-de kernel: [1584390.071395] libceph: osd2 up
Jun  8 04:17:52 srv005-de kernel: [1584395.173328] libceph: osd0 down
Jun  8 04:17:52 srv005-de kernel: [1584395.174195] libceph: osd15 down
Jun  8 04:17:53 srv005-de kernel: [1584396.026473] libceph: osd9 (1)10.6.64.2:6812 socket closed (con state OPEN)
2021-06-08T04:29:59.994+0200 7fcfd751b700  0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN Slow OSD heartbeats on back (longest 23895.853ms); Slow OSD heartbeats on front (longest 23809.919ms)
2021-06-08T04:39:59.994+0200 7fcfd751b700  0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN Slow OSD heartbeats on back (longest 4444.081ms); Slow OSD heartbeats on front (longest 4910.412ms)
2021-06-08T04:59:59.997+0200 7fcfd751b700  0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN Slow OSD heartbeats on back (longest 6693.196ms); Slow OSD heartbeats on front (longest 6691.759ms); Degraded data redundancy: 2/9160088 objects degraded (0.000%), 1 pg degraded; 5 slow ops, oldest one blocked for 39 sec, daemons [osd.11,osd.15,osd.16] have slow ops
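
To get a quick picture of how long each OSD stayed down during a flapping episode, the "libceph: osdN down" / "libceph: osdN up" lines can be paired up. The following is only a rough sketch: the function name flap_durations and the regular expression are assumptions made for illustration, and it relies on the kernel uptime stamp in square brackets being present on every line.

```python
import re
from collections import defaultdict

# Matches kernel log lines such as:
#   "... kernel: [1570087.592647] libceph: osd7 down"
# Lines like "libceph: osd0 (1)10.6.64.1:6806 socket closed ..." are ignored.
EVENT_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+libceph: osd(\d+) (down|up)\s*$")


def flap_durations(lines):
    """Return {osd_id: [seconds_down, ...]}, pairing each 'down' with the next 'up'."""
    down_since = {}
    durations = defaultdict(list)
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue
        ts = float(match.group(1))
        osd = int(match.group(2))
        state = match.group(3)
        if state == "down":
            # Keep the first 'down' if several are logged before an 'up'.
            down_since.setdefault(osd, ts)
        elif osd in down_since:
            durations[osd].append(round(ts - down_since.pop(osd), 3))
    return dict(durations)


sample = [
    "Jun  8 00:19:25 srv005-de kernel: [1570087.592647] libceph: osd7 down",
    "Jun  8 00:19:26 srv005-de kernel: [1570088.592125] libceph: osd9 down",
    "Jun  8 00:19:27 srv005-de kernel: [1570089.600711] libceph: osd9 up",
    "Jun  8 00:19:30 srv005-de kernel: [1570092.624549] libceph: osd7 up",
]
print(flap_durations(sample))
```

An OSD that goes down and never comes back up within the captured window is simply left out of the result, so a missing entry does not mean the OSD was healthy.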

Most of the time it's fine. However, the filesystem sometimes stays stuck and we cannot recover it.

After the network issues, the cluster is green again:

ceph -s
  cluster:
    id:     6e38eb45-751a-4166-bc96-9ab0dcebd122
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum srv001-de,srv002-de,srv003-de (age 24m)
    mgr: srv003-de(active, since 23m), standbys: srv001-de, srv002-de
    osd: 19 osds: 19 up (since 87m), 19 in (since 3w)

  data:
    pools:   4 pools, 353 pgs
    objects: 3.72M objects, 13 TiB
    usage:   29 TiB used, 57 TiB / 87 TiB avail
    pgs:     353 active+clean

  io:
    client:   390 KiB/s rd, 66 MiB/s wr, 5 op/s rd, 554 op/s wr

We don't have any stuck requests:

cat /sys/kernel/debug/ceph/6e38eb45-751a-4166-bc96-9ab0dcebd122.client314122254/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
18446462598732841289    osd0    1.df17e71c      1.1c    [0,7,15]/0      [0,7,15]/0      e8366   rbd_header.9aca2733bb1726       0x20    84      WC/0
18446462598732840965    osd3    2.4c2e7d31      2.31    [3,5]/3 [3,5]/3 e8366   rbd_header.eeb9e532ef6c04       0x20    47      WC/0
18446462598732840989    osd3    2.572947c5      2.5     [3,4]/3 [3,4]/3 e8366   rbd_header.9c8938c8c745ef       0x20    48      WC/0
18446462598732840993    osd3    2.e356688a      2.a     [3,5]/3 [3,5]/3 e8366   rbd_header.3e1e505ca5255        0x20    47      WC/0
18446462598732841013    osd3    2.f7878bc8      2.8     [3,5]/3 [3,5]/3 e8366   rbd_header.7af65c8befd4db       0x20    47      WC/0
18446462598732841061    osd3    2.7088e5bf      2.3f    [3,6]/3 [3,6]/3 e8366   rbd_header.f76f5fdebf064d       0x20    64      WC/0
18446462598732841069    osd3    2.854bc1c0      2.0     [3,5]/3 [3,5]/3 e8366   rbd_header.b08720e368a4fa       0x20    47      WC/0
18446462598732841073    osd3    2.7ffed0e7      2.27    [3,4]/3 [3,4]/3 e8366   rbd_header.03901d61a1ef71       0x20    48      WC/0
18446462598732841081    osd3    2.74e1c3ca      2.a     [3,5]/3 [3,5]/3 e8366   rbd_header.dabcc04715dc15       0x20    47      WC/0
18446462598732841229    osd3    2.ea24701a      2.1a    [3,4]/3 [3,4]/3 e8366   rbd_header.ef54431c65f606       0x20    36      WC/0
18446462598732841230    osd3    2.956de11a      2.1a    [3,4]/3 [3,4]/3 e8366   rbd_header.047c9247926b56       0x20    36      WC/0
18446462598732840973    osd4    2.b345de9f      2.1f    [4,3]/4 [4,3]/4 e8366   rbd_header.f0b737c68d5761       0x20    48      WC/0
18446462598732840985    osd4    2.5a6955c6      2.6     [4,6]/4 [4,6]/4 e8366   rbd_header.35481ba2a4f2bd       0x20    46      WC/0
18446462598732841029    osd4    2.9a058f0f      2.f     [4,5]/4 [4,5]/4 e8366   rbd_header.7ae5e8bf566ba3       0x20    33      WC/0
18446462598732841057    osd4    2.7f98cf46      2.6     [4,6]/4 [4,6]/4 e8366   rbd_header.f64d3464c52dee       0x20    46      WC/0
18446462598732841085    osd4    2.ddef4fe3      2.23    [4,5]/4 [4,5]/4 e8366   rbd_header.e10165191b5751       0x20    31      WC/0
18446462598732841159    osd4    2.56c7d661      2.21    [4,6]/4 [4,6]/4 e8366   rbd_header.ecbea723350bf8       0x20    39      WC/0
18446462598732840969    osd5    2.fe007f03      2.3     [5,4]/5 [5,4]/5 e8366   rbd_header.ef55d3d2053d90       0x20    31      WC/0
18446462598732840981    osd5    2.bf8e4d8       2.18    [5,4]/5 [5,4]/5 e8366   rbd_header.3547ac1414c2e4       0x20    33      WC/0
18446462598732841041    osd5    2.77ea8166      2.26    [5,3]/5 [5,3]/5 e8366   rbd_header.cc25fc4a584119       0x20    47      WC/0
18446462598732841233    osd5    2.ab08c702      2.2     [5,6]/5 [5,6]/5 e8366   rbd_header.7ae1da6b66cd8c       0x20    35      WC/0
18446462598732841234    osd5    2.9a2e0342      2.2     [5,6]/5 [5,6]/5 e8366   rbd_header.cc4c69e81ca1d0       0x20    35      WC/0
18446462598732841009    osd6    2.4b70a3a2      2.22    [6,3]/6 [6,3]/6 e8366   rbd_header.049fe7b9fbd3a7       0x20    64      WC/0
18446462598732841017    osd6    2.a44341ad      2.2d    [6,3]/6 [6,3]/6 e8366   rbd_header.7ae0a5a8eea3d4       0x20    66      WC/0
18446462598732841033    osd6    2.19b88f4c      2.c     [6,3]/6 [6,3]/6 e8366   rbd_header.94955e932a1217       0x20    64      WC/0
18446462598732841037    osd6    2.62834262      2.22    [6,3]/6 [6,3]/6 e8366   rbd_header.cc0da871a9854f       0x20    64      WC/0
18446462598732841053    osd6    2.64d9c217      2.17    [6,5]/6 [6,5]/6 e8366   rbd_header.cc5c53e1921ce7       0x20    47      WC/0
18446462598732841065    osd6    2.5bde4e11      2.11    [6,3]/6 [6,3]/6 e8366   rbd_header.42550f5d5c4839       0x20    64      WC/0
18446462598732841077    osd6    2.8021af53      2.13    [6,4]/6 [6,4]/6 e8366   rbd_header.abd30ae91a4a78       0x20    46      WC/0
18446462598732841149    osd6    2.b25a16d4      2.14    [6,4]/6 [6,4]/6 e8366   rbd_header.aff3db6bd2f145       0x20    39      WC/0
18446462598732841161    osd6    2.b0ef1b55      2.15    [6,4]/6 [6,4]/6 e8366   rbd_header.0495b22d0214b0       0x20    39      WC/0
18446462598732841162    osd6    2.79baf6fd      2.3d    [6,4]/6 [6,4]/6 e8366   rbd_header.7ae4b05860e330       0x20    39      WC/0
18446462598732841083    osd7    1.ecc14944      1.44    [7,16,13]/7     [7,16,13]/7     e8366   rbd_header.ac1b0ac2e0919e       0x20    106     WC/-107
18446462598732841147    osd7    1.ee4ca8d9      1.d9    [7,16,12]/7     [7,16,12]/7     e8366   rbd_header.22c77167a0475f       0x20    80      WC/0
18446462598732841252    osd7    1.7150a564      1.64    [7,2,11]/7      [7,2,11]/7      e8366   rbd_header.048ee6f29c7895       0x20    98      WC/0
18446462598732841314    osd8    1.b285d4b5      1.b5    [8,13,2]/8      [8,13,2]/8      e8366   rbd_header.352d0013d366b8       0x20    34      WC/0
18446462598732841251    osd10   1.4a91ad87      1.87    [10,0,17]/10    [10,0,17]/10    e8366   rbd_header.9490452bd85736       0x20    96      WC/0
18446462598732841330    osd10   1.d683a525      1.25    [10,13,18]/10   [10,13,18]/10   e8366   rbd_header.f1a32eef01a846       0x20    29      WC/-107
18446462598732841327    osd11   1.afcb3a0f      1.f     [11,18,13]/11   [11,18,13]/11   e8366   rbd_header.430dd75ed065a4       0x20    24      WC/-107
18446462598732841328    osd11   1.a10c0ce4      1.e4    [11,14,9]/11    [11,14,9]/11    e8366   rbd_header.cc46f9ec2cc23e       0x20    23      WC/0
18446462598732841295    osd12   1.d62ec5e3      1.e3    [12,16,9]/12    [12,16,9]/12    e8366   rbd_header.fe1109b1007ae7       0x20    49      WC/0
18446462598732841296    osd12   1.862b85f2      1.f2    [12,9,16]/12    [12,9,16]/12    e8366   rbd_header.7a9251b2d5a2bc       0x20    49      WC/0
18446462598732840995    osd13   1.fceb6847      1.47    [13,18,10]/13   [13,18,10]/13   e8366   rbd_header.eca953ed5e66e5       0x20    105     WC/0
18446462598732841011    osd13   1.ad68c683      1.83    [13,18,10]/13   [13,18,10]/13   e8366   rbd_header.7a6a5bbb78dcc4       0x20    106     WC/0
18446462598732841099    osd13   1.cd55d909      1.9     [13,7,2]/13     [13,7,2]/13     e8366   rbd_header.eea281be6a0422       0x20    141     WC/-107
18446462598732841253    osd13   1.a7e91eaf      1.af    [13,18,1]/13    [13,18,1]/13    e8366   rbd_header.3cea48bd6569bc       0x20    102     WC/0
18446462598732841271    osd13   1.ed2f5400      1.0     [13,0,8]/13     [13,0,8]/13     e8366   rbd_header.0499d29d46e831       0x20    81      WC/0
18446462598732841321    osd13   1.39ed34f0      1.f0    [13,16,9]/13    [13,16,9]/13    e8366   rbd_header.b086d2bd8416a4       0x20    26      WC/0
18446462598732841322    osd13   1.e922455c      1.5c    [13,16,12]/13   [13,16,12]/13   e8366   rbd_header.038fa89e0b2ee0       0x20    18      WC/-107
18446462598732840987    osd14   1.e770e95e      1.5e    [14,16,10]/14   [14,16,10]/14   e8366   rbd_header.3cbd08eb704503       0x20    101     WC/0
18446462598732841035    osd14   1.4f87490c      1.c     [14,17,11]/14   [14,17,11]/14   e8366   rbd_header.9ab06e56b0b8e1       0x20    97      WC/0
18446462598732841218    osd15   1.3b7cbb4a      1.4a    [15,2,17]/15    [15,2,17]/15    e8366   rbd_header.6c454a7176efcd       0x20    113     WC/0
18446462598732841303    osd15   1.76a964b6      1.b6    [15,9,2]/15     [15,9,2]/15     e8366   rbd_header.ef44c0f28b72f4       0x20    46      WC/0
18446462598732841332    osd15   1.7a4f4d88      1.88    [15,10,17]/15   [15,10,17]/15   e8366   rbd_header.22dfc24c7e12df       0x20    28      WC/-107
18446462598732841333    osd15   1.bda64b75      1.75    [15,12,18]/15   [15,12,18]/15   e8366   rbd_header.9b3310be1cd52d       0x20    27      WC/0
18446462598732840971    osd16   1.41a55a1d      1.1d    [16,13,10]/16   [16,13,10]/16   e8366   rbd_header.f098ec4b8f73f3       0x20    103     WC/0
18446462598732841239    osd16   1.3164bd20      1.20    [16,13,2]/16    [16,13,2]/16    e8366   rbd_header.9b14c4f7eb570f       0x20    96      WC/0
18446462598732841329    osd16   1.76f6ea7d      1.7d    [16,11,14]/16   [16,11,14]/16   e8366   rbd_header.3511765d43da89       0x20    22      WC/0
18446462598732841331    osd16   1.a76b8b0d      1.d     [16,15,1]/16    [16,15,1]/16    e8366   rbd_header.4254a0ed122a0c       0x20    19      WC/0
18446462598732841079    osd17   1.5986cb7       1.b7    [17,7,14]/17    [17,7,14]/17    e8366   rbd_header.dabc691b3e9146       0x20    104     WC/0
18446462598732841267    osd17   1.caef016c      1.6c    [17,0,9]/17     [17,0,9]/17     e8366   rbd_header.22c9f32dd768ec       0x20    87      WC/0
18446462598732841268    osd17   1.613812c       1.2c    [17,10,0]/17    [17,10,0]/17    e8366   rbd_header.7aa082ef1f66e9       0x20    90      WC/0
18446462598732841309    osd18   1.917562ff      1.ff    [18,15,2]/18    [18,15,2]/18    e8366   rbd_header.7aac8296618390       0x20    36      WC/0
BACKOFFS
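
Several of the linger (watch) entries above end in WC/-107 rather than WC/0. On Linux, errno 107 is ENOTCONN, the same code as in the "encountered watch error: -107" kernel messages. The sketch below picks those entries out of an osdc dump. It assumes the last column has the form flags/last-error and that the object name is the eighth whitespace-separated field, which matches the output pasted here but is not a documented interface; failed_lingers is a hypothetical helper name.

```python
def failed_lingers(osdc_text):
    """Return (osd, object_name, last_error) for linger entries with a non-zero error."""
    failed = []
    in_linger = False
    for line in osdc_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[:2] == ["LINGER", "REQUESTS"]:
            in_linger = True
            continue
        if fields[0] in ("REQUESTS", "BACKOFFS"):
            in_linger = False
            continue
        if not in_linger or len(fields) < 11:
            continue
        # Last column looks like "WC/0" or "WC/-107" (assumed: flags/last error).
        _flags, _sep, err = fields[-1].partition("/")
        if err.lstrip("-").isdigit() and int(err) != 0:
            failed.append((fields[1], fields[7], int(err)))
    return failed


sample = """\
REQUESTS 0 homeless 0
LINGER REQUESTS
18446462598732841083    osd7    1.ecc14944      1.44    [7,16,13]/7     [7,16,13]/7     e8366   rbd_header.ac1b0ac2e0919e       0x20    106     WC/-107
18446462598732841147    osd7    1.ee4ca8d9      1.d9    [7,16,12]/7     [7,16,12]/7     e8366   rbd_header.22c77167a0475f       0x20    80      WC/0
BACKOFFS
"""
for osd, obj, err in failed_lingers(sample):
    print(osd, obj, err)
```

On a live host the input would be the contents of the osdc file under /sys/kernel/debug/ceph/ for the client in question, read as root.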

Restarting all OSDs, monitors, or managers doesn't help.

It's also impossible to unmap the image with rbd unmap, even with the force option.

Here is the list of all stuck processes:

ps aux | grep ' D '
root       17324  0.0  0.0      0     0 ?        D    09:22   0:00 [kworker/17:11+rbd]
100000     29186  0.0  0.0   4516   104 ?        D    09:24   0:00 run-parts --report /etc/cron.daily
100000     29647  0.0  0.0  35220  9236 ?        D    09:24   0:00 python3 /opt/XXX_list_scopes.py
root       32508  0.0  0.0      0     0 ?        D    May20   0:10 [kmmpd-rbd33]
root       32510  0.0  0.0      0     0 ?        D    May20   1:00 [jbd2/rbd33-8]
100000    100166  0.0  0.0 1235088 54908 ?       D    09:34   0:00 /usr/bin/python3 /usr/bin/salt-minion
100000    211110  0.0  0.0   4624    92 ?        D    09:50   0:00 /bin/sh -c borg info --json
100000    304787  0.0  0.0 1235328 54780 ?       D    10:02   0:00 /usr/bin/python3 /usr/bin/salt-minion
100000    304874  0.0  0.0  35220  9220 ?        D    10:02   0:00 python3 /opt/XXX_list_scopes.py
root      750175  0.0  0.0   6140  2328 pts/36   S+   10:45   0:00 grep  D 
root     1798106  0.0  0.0      0     0 ?        D    03:31   0:02 [kworker/29:4+ceph-msgr]
root     2126115  0.0  0.0      0     0 ?        D    04:21   0:01 [kworker/16:1+rbd]
root     2348498  0.0  0.0      0     0 ?        D    04:55   0:00 [kworker/19:2+ceph-msgr]
root     2348682  0.0  0.0      0     0 ?        D    04:55   0:00 [kworker/17:2+rbd]
root     2358880  0.0  0.0      0     0 ?        D    04:57   0:00 [kworker/u65:13+flush-252:528]
root     2365405  0.0  0.0      0     0 ?        D    04:58   0:00 [kworker/u64:0+rbd33-tasks]
root     2380900  0.0  0.0      0     0 ?        D    05:00   0:00 [kworker/u64:8+ceph-watch-notify]
root     2668082  0.0  0.0      0     0 ?        D    05:46   0:00 [kworker/15:0+rbd]
root     2839118  0.0  0.0      0     0 ?        D    06:12   0:00 [kworker/18:0+rbd]
root     2998532  0.0  0.0      0     0 ?        D    06:36   0:00 [kworker/17:7+rbd]
root     2998536  0.0  0.0      0     0 ?        D    06:36   0:00 [kworker/16:8+rbd]
root     3075865  0.0  0.0      0     0 ?        D    06:47   0:00 [kworker/18:7+rbd]
root     3129962  0.0  0.0      0     0 ?        D    06:55   0:00 [kworker/15:1+rbd]
root     3135993  0.0  0.0      0     0 ?        D    06:55   0:00 [kworker/16:2+rbd]
root     3165701  0.0  0.0      0     0 ?        D    07:00   0:00 [kworker/17:1+rbd]
root     3182766  0.0  0.0      0     0 ?        D    07:02   0:00 [kworker/18:1+rbd]
root     3182772  0.0  0.0      0     0 ?        D    07:02   0:00 [kworker/18:2+rbd]
root     3189400  0.0  0.0      0     0 ?        D    07:03   0:00 [kworker/16:3+rbd]
root     3195772  0.0  0.0      0     0 ?        D    07:04   0:00 [kworker/15:3+rbd]
root     3195774  0.0  0.0      0     0 ?        D    07:04   0:00 [kworker/16:4+rbd]
root     3195775  0.0  0.0      0     0 ?        D    07:04   0:00 [kworker/18:3+rbd]
root     3202588  0.0  0.0      0     0 ?        D    07:05   0:00 [kworker/16:5+rbd]
root     3202936  0.0  0.0      0     0 ?        D    07:05   0:00 [kworker/16:6+rbd]
root     3202939  0.0  0.0      0     0 ?        D    07:05   0:01 [kworker/18:4+rbd]
root     3219420  0.0  0.0      0     0 ?        D    07:08   0:00 [kworker/15:6+rbd]
root     3231624  0.0  0.0      0     0 ?        D    07:09   0:00 [kworker/16:9+rbd]
root     3246545  0.0  0.0      0     0 ?        D    07:11   0:00 [kworker/17:3+rbd]
root     3270615  0.0  0.0      0     0 ?        D    07:14   0:00 [kworker/17:5+rbd]
root     3416476  0.0  0.0      0     0 ?        D    07:35   0:00 [kworker/18:5+rbd]
root     3416503  0.0  0.0      0     0 ?        D    07:35   0:00 [kworker/18:6+rbd]
root     3416504  0.0  0.0      0     0 ?        D    07:35   0:00 [kworker/18:8+rbd]
root     3416539  0.0  0.0      0     0 ?        D    07:35   0:02 [kworker/18:9+rbd]
root     3428688  0.0  0.0      0     0 ?        D    07:36   0:01 [kworker/17:8+rbd]
root     3582209  0.0  0.0      0     0 ?        D    07:58   0:00 [kworker/18:10+rbd]
root     3728375  0.0  0.0      0     0 ?        D    08:17   0:00 [kworker/17:9+rbd]
root     3731842  0.0  0.0      0     0 ?        D    08:17   0:00 [kworker/15:9+rbd]
root     3731843  0.0  0.0      0     0 ?        D    08:17   0:00 [kworker/15:10+rbd]
root     3795684  0.0  0.0      0     0 ?        D    08:26   0:00 [kworker/18:11+rbd]
root     3803165  0.0  0.0      0     0 ?        D    08:27   0:00 [kworker/15:2+rbd]
root     3845962  0.0  0.0      0     0 ?        D    08:33   0:00 [kworker/15:4+rbd]
100000   3916481  0.0  0.0 1234968 54908 ?       D    08:43   0:00 /usr/bin/python3 /usr/bin/salt-minion
root     3930582  0.0  0.0      0     0 ?        D    08:44   0:00 [kworker/18:14+rbd]
root     3960557  0.0  0.0      0     0 ?        D    08:48   0:00 [kworker/18:16+rbd]
root     4069220  0.0  0.0      0     0 ?        D    09:03   0:01 [kworker/15:7+rbd]
root     4086146  0.0  0.0      0     0 ?        D    09:05   0:00 [kworker/17:16+rbd]

In this case, the stuck mapping was rbd33.
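
Filtering ps output with grep ' D ' also matches the grep process itself, as the listing shows. A slightly more robust way to collect the same information is to read the state field from /proc/<pid>/stat directly and then dump /proc/<pid>/stack for each match. This is a minimal sketch with hypothetical helper names (d_state_pids, kernel_stack); reading the stack file normally requires root.

```python
import os


def d_state_pids(proc_root="/proc"):
    """Yield (pid, comm) for every process whose state in <pid>/stat is 'D'."""
    pids = sorted((e for e in os.listdir(proc_root) if e.isdigit()), key=int)
    for entry in pids:
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # comm is wrapped in parentheses and may itself contain spaces or ')',
        # so locate the first '(' and the last ')'.
        start, end = stat.index("("), stat.rindex(")")
        comm = stat[start + 1:end]
        rest = stat[end + 1:].split()
        if rest and rest[0] == "D":
            yield int(entry), comm


def kernel_stack(pid, proc_root="/proc"):
    """Return the kernel stack of pid, or a placeholder if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "stack")) as f:
            return f.read()
    except OSError as exc:
        return f"<unavailable: {exc.strerror}>\n"


if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, comm in d_state_pids():
        print(pid, comm)
        print(kernel_stack(pid), end="")
```

The proc_root parameter exists only so the functions can be pointed at a copy of /proc for testing; on a real host the defaults are enough.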

Here is /proc/<pid>/stack for all stuck processes:

17324
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
32508
[<0>] __wait_on_buffer+0x32/0x40
[<0>] write_mmp_block+0x104/0x120
[<0>] kmmpd+0x19a/0x3c0
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
32510
[<0>] wait_on_page_bit+0x141/0x210
[<0>] wait_on_page_writeback+0x43/0x90
[<0>] __filemap_fdatawait_range+0xae/0x120
[<0>] filemap_fdatawait_range_keep_errors+0x12/0x40
[<0>] jbd2_journal_commit_transaction+0xba2/0x1750
[<0>] kjournald2+0xc8/0x270
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
1798106
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_obj_handle_request+0x34/0x40 [rbd]
[<0>] rbd_osd_req_callback+0x45/0x80 [rbd]
[<0>] __complete_request+0x26/0x80 [libceph]
[<0>] handle_reply+0x813/0x930 [libceph]
[<0>] dispatch+0x167/0xb70 [libceph]
[<0>] ceph_con_workfn+0xd9d/0x24d0 [libceph]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2126115
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2348498
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_obj_handle_request+0x34/0x40 [rbd]
[<0>] rbd_osd_req_callback+0x45/0x80 [rbd]
[<0>] __complete_request+0x26/0x80 [libceph]
[<0>] handle_reply+0x813/0x930 [libceph]
[<0>] dispatch+0x167/0xb70 [libceph]
[<0>] ceph_con_workfn+0xd9d/0x24d0 [libceph]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2348682
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2358880
[<0>] __lock_page+0x122/0x220
[<0>] mpage_prepare_extent_to_map+0x291/0x2d0
[<0>] ext4_writepages+0x458/0xeb0
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x40/0x310
[<0>] writeback_sb_inodes+0x209/0x4a0
[<0>] __writeback_inodes_wb+0x66/0xd0
[<0>] wb_writeback+0x25b/0x2f0
[<0>] wb_workfn+0x33e/0x490
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2365405
[<0>] rbd_quiesce_lock+0xa1/0xe0 [rbd]
[<0>] rbd_reregister_watch+0x102/0x1b0 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2380900
[<0>] rwsem_down_write_slowpath+0x2ed/0x4a0
[<0>] rbd_watch_errcb+0x2a/0x92 [rbd]
[<0>] do_watch_error+0x40/0xb0 [libceph]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2668082
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2839118
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2998532
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
2998536
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3075865
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3129962
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3135993
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3165701
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3182766
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3182772
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3189400
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3195772
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3195774
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3195775
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3202588
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3202936
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3202939
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3219420
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3231624
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3246545
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3270615
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3416476
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3416503
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3416504
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3416539
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3428688
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3582209
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3728375
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3731842
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3731843
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3795684
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3803165
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3845962
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3930582
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
3960557
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
4069220
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
4086146
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
[<0>] process_one_work+0x20f/0x3d0
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x22/0x40
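
Most of the stacks above are identical, so grouping them by call chain makes the few outliers easier to spot. The following is a minimal Python sketch and not part of the original report. It assumes the dump layout shown above (a PID on its own line, followed by "[<0>] symbol+offset/size" frames). The group_stacks helper and the abbreviated SAMPLE text are hypothetical names introduced only for illustration; SAMPLE keeps just the first two frames of four of the stacks listed above.

```python
import re
from collections import defaultdict

# Matches one frame line such as "[<0>] rbd_queue_workfn+0x225/0x360 [rbd]"
# and captures only the symbol name.
FRAME_RE = re.compile(r"^\[<0>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+")


def group_stacks(dump):
    """Map each distinct call chain (tuple of symbol names) to its PIDs."""
    groups = defaultdict(list)
    pid, frames = None, []
    for raw in dump.splitlines():
        line = raw.strip()
        if line.isdigit():
            # A bare number starts a new stack; flush the previous one.
            if pid is not None:
                groups[tuple(frames)].append(pid)
            pid, frames = int(line), []
            continue
        match = FRAME_RE.match(line)
        if match and pid is not None:
            frames.append(match.group(1))
    if pid is not None:
        groups[tuple(frames)].append(pid)
    return dict(groups)


# Abbreviated excerpt of the dump above (assumption: first two frames only).
SAMPLE = """\
17324
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
2365405
[<0>] rbd_quiesce_lock+0xa1/0xe0 [rbd]
[<0>] rbd_reregister_watch+0x102/0x1b0 [rbd]
2380900
[<0>] rwsem_down_write_slowpath+0x2ed/0x4a0
[<0>] rbd_watch_errcb+0x2a/0x92 [rbd]
2126115
[<0>] rbd_img_handle_request+0x3f/0x1a0 [rbd]
[<0>] rbd_queue_workfn+0x225/0x360 [rbd]
"""

if __name__ == "__main__":
    # Print the rarest call chains first, since those are the interesting ones.
    for chain, pids in sorted(group_stacks(SAMPLE).items(), key=lambda kv: len(kv[1])):
        print(len(pids), " <- ".join(chain), pids)
```

Run against the full dump, this kind of grouping would separate the many workers waiting in rbd_img_handle_request from the single tasks in rbd_quiesce_lock and rbd_watch_errcb.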

The only solution we have found so far is to reboot the machine.

The network issues should not happen, but I would expect rbd to be able to recover from them rather than getting stuck like this. The hang also occurs randomly, and we don't have a way to reproduce it.

If needed, we can of course collect more data the next time we hit the issue :)
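
For the next occurrence, the stacks of all tasks in uninterruptible sleep (state D) could be gathered in one pass. This is a minimal sketch under stated assumptions, not the method used for the dump above: it relies on the standard procfs files /proc/<pid>/stat and /proc/<pid>/stack, it must run as root to read the stack files, and the function names stuck_tasks, read_stack and dump_stuck are hypothetical. The proc parameter exists only so that the directory can be swapped out for testing.

```python
import os


def stuck_tasks(proc="/proc"):
    """Yield (pid, comm) for each task whose state in <proc>/<pid>/stat is D."""
    try:
        entries = os.listdir(proc)
    except OSError:
        return  # no procfs at this path (for example, not a Linux host)
    for entry in sorted(filter(str.isdigit, entries), key=int):
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # task exited while we were scanning
        # Layout is "pid (comm) state ..."; comm may contain spaces or ')',
        # so split on the first '(' and the last ')'.
        start, end = stat.find("("), stat.rfind(")")
        if start < 0 or end < 0:
            continue
        fields = stat[end + 1:].split()
        if fields and fields[0] == "D":
            yield int(entry), stat[start + 1:end]


def read_stack(pid, proc="/proc"):
    """Return the kernel stack text of a task, or a marker if it is unreadable."""
    try:
        with open(os.path.join(proc, str(pid), "stack")) as f:
            return f.read()
    except OSError as exc:
        return f"<unreadable: {exc.strerror}>\n"


def dump_stuck(proc="/proc"):
    """Build a PID-then-frames report, with the task name added after the PID."""
    return "".join(
        f"{pid} ({comm})\n{read_stack(pid, proc)}"
        for pid, comm in stuck_tasks(proc)
    )


if __name__ == "__main__":
    print(dump_stuck(), end="")
```

Redirecting the output to a file while the hang is in progress would preserve the state of the stuck tasks before the machine is rebooted.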

Thanks a lot for your help,


Related issues

Duplicates: Linux kernel client - Bug #42757: deadlock on lock_rwsem: rbd_quiesce_lock() vs watch errors (Resolved)

History

#1 Updated by Ilya Dryomov over 1 year ago

  • Project changed from rbd to Linux kernel client

#2 Updated by Ilya Dryomov over 1 year ago

  • Category set to rbd

#3 Updated by Ilya Dryomov over 1 year ago

I'm pretty sure this is https://tracker.ceph.com/issues/42757.

I'm working on a fix.

#4 Updated by Ilya Dryomov over 1 year ago

  • Assignee set to Ilya Dryomov

#5 Updated by Ilya Dryomov over 1 year ago

  • Duplicates Bug #42757: deadlock on lock_rwsem: rbd_quiesce_lock() vs watch errors added

#6 Updated by Ilya Dryomov over 1 year ago

  • Status changed from New to Duplicate
