Bug #62863 (closed)

Slowness or deadlock in ceph-fuse causes teuthology job to hang and fail

Added by Rishabh Dave 8 months ago. Updated 7 months ago.

Status: Can't reproduce
Priority: Low
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, ceph-fuse
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/rishabh-2023-09-12_12:12:15-fs-wip-rishabh-2023sep12-b2-testing-default-smithi/7394785/

It seems that, due to a deadlock in ceph-fuse, read-write I/O on CephFS (which was mounted using ceph-fuse) was stuck. Because of this, the job hung for almost 3 hours, after which it terminated itself.

From teuthology.log -

2023-09-13T06:19:38.633 INFO:tasks.workunit.client.0.smithi159.stdout:  CC      kernel/irq/migration.o
2023-09-13T09:10:31.862 DEBUG:teuthology.orchestra.run:got remote process result: 124
2023-09-13T09:11:31.897 DEBUG:teuthology.orchestra.run:timed out waiting; will kill: <Greenlet at 0x7f213277b480: copy_file_to(<paramiko.ChannelFile from <paramiko.Channel 104 (, <Logger tasks.workunit.client.0.smithi159.stderr (, None, False)>
2023-09-13T09:12:31.899 DEBUG:teuthology.orchestra.run:timed out waiting; will kill: <Greenlet at 0x7f213277b6a0: copy_file_to(<paramiko.ChannelFile from <paramiko.Channel 104 (, <Logger tasks.workunit.client.0.smithi159.stdout (, None, False)>
2023-09-13T09:12:31.899 INFO:tasks.workunit:Stopping ['kernel_untar_build.sh'] on client.0...
2023-09-13T09:12:31.900 DEBUG:teuthology.orchestra.run.smithi159:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2023-09-13T09:12:32.787 ERROR:teuthology.run_tasks:Saw exception from tasks.

Note the time difference between the following 2 entries -

2023-09-13T06:19:38.633 INFO:tasks.workunit.client.0.smithi159.stdout:  CC      kernel/irq/migration.o
2023-09-13T09:10:31.862 DEBUG:teuthology.orchestra.run:got remote process result: 124

From /rishabh-2023-09-12_12:12:15-fs-wip-rishabh-2023sep12-b2-testing-default-smithi/7394785/console_logs/smithi159.log -

[ 2214.285332] Showing all locks held in the system:^M
[ 2214.291512] 1 lock held by rcu_tasks_kthre/12:^M
[ 2214.295954]  #0: ffffffff8255dd50 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x30/0x420^M
[ 2214.305517] 1 lock held by rcu_tasks_rude_/13:^M
[ 2214.309960]  #0: ffffffff8255dad0 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x30/0x420^M
[ 2214.319953] 1 lock held by khungtaskd/56:^M
[ 2214.323965]  #0: ffffffff8255e7c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x12/0x1d0^M
[ 2214.332942] 1 lock held by ceph-fuse/128555:^M
[ 2214.337208]  #0: ffff888101ebdc68 (&pipe->mutex/1){+.+.}-{3:3}, at: splice_file_to_pipe+0x2a/0x80^M
[ 2214.346091] 1 lock held by ceph-fuse/128557:^M
[ 2214.350357]  #0: ffff888107c4ca68 (&pipe->mutex/1){+.+.}-{3:3}, at: splice_file_to_pipe+0x2a/0x80^M
[ 2214.359242] 1 lock held by ceph-fuse/128560:^M
[ 2214.363530]  #0: ffff88810d542268 (&pipe->mutex/1){+.+.}-{3:3}, at: splice_file_to_pipe+0x2a/0x80^M
[ 2214.372396] 1 lock held by ceph-fuse/128571:^M
[ 2214.376667]  #0: ffff888160d11a68 (&pipe->mutex/1){+.+.}-{3:3}, at: splice_file_to_pipe+0x2a/0x80^M
[ 2214.385535] 1 lock held by ceph-fuse/128577:^M
[ 2214.389807]  #0: ffff888101ebe268 (&pipe->mutex/1){+.+.}-{3:3}, at: splice_file_to_pipe+0x2a/0x80^M
[ 2214.398674] 1 lock held by ceph-fuse/128581:^M
[ 2214.402956]  #0: ffff888160d11468 (&pipe->mutex/1){+.+.}-{3:3}, at: splice_file_to_pipe+0x2a/0x80^M
[ 2214.411820] 1 lock held by ceph-fuse/128593:^M
[ 2214.416084]  #0: ffff888101ebea68 (&pipe->mutex/1){+.+.}-{3:3}, at: splice_file_to_pipe+0x2a/0x80^M
[ 2214.424970] 1 lock held by ceph-fuse/128596:^M
[ 2214.429241]  #0: ffff88810d540868 (&pipe->mutex/1){+.+.}-{3:3}, at: splice_file_to_pipe+0x2a/0x80^M
[ 2214.438111] 2 locks held by cc1/128215:^M
[ 2214.441947]  #0: ffff8883324771d0 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M
[ 2214.451159]  #1: ffff888332477628 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]^M
[ 2214.459965] 2 locks held by cc1/128296:^M
[ 2214.463796]  #0: ffff888336080150 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M
[ 2214.473009]  #1: ffff8883360805a8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]^M
[ 2214.481788] 2 locks held by cc1/128348:^M
[ 2214.485619]  #0: ffff888336080150 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M
[ 2214.494833]  #1: ffff8883360805a8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]^M
[ 2214.503611] 1 lock held by cc1/128369:^M
[ 2214.507356]  #0: ffff888336080150 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M
[ 2214.516569] 2 locks held by cc1/128419:^M
[ 2214.520400]  #0: ffff888265700790 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M
[ 2214.529613]  #1: ffff888265700be8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]^M
[ 2214.538412] 2 locks held by cc1/128466:^M
[ 2214.542248]  #0: ffff888265700790 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M
[ 2214.551462]  #1: ffff888265700be8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]^M
[ 2214.560251] 2 locks held by cc1/128595:^M
[ 2214.564088]  #0: ffff888346f89410 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: walk_component+0x74/0x160^M
[ 2214.573487]  #1: ffff888346f89868 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]^M

Actions #1

Updated by Rishabh Dave 8 months ago

  • Subject changed from "ceph-fuse stuck causes teuthology job to fail" to "deadlock in ceph-fuse causes teuthology job to hang and fail"
Actions #2

Updated by Venky Shankar 8 months ago

  • Category set to Correctness/Safety
  • Assignee set to Leonid Usov
  • Target version set to v19.0.0
  • Component(FS) Client, ceph-fuse added
Actions #3

Updated by Venky Shankar 8 months ago

  • Status changed from New to Triaged
Actions #4

Updated by Patrick Donnelly 8 months ago

  • Related to Bug #62682: mon: no mdsmap broadcast after "fs set joinable" is set to true added
Actions #5

Updated by Venky Shankar 8 months ago

I don't think this is related to #62682.

Actions #6

Updated by Leonid Usov 8 months ago

I agree with Venky, this doesn't seem to be related to the linked issue. @Patrick, would you mind clarifying why you think these are related?

From what I've learned so far, it does seem like a deadlock. The system hung while executing the qa/workunits/kernel_untar_build.sh work unit, during the build step:

2023-09-13T06:10:41.659 INFO:tasks.workunit.client.0.smithi159.stderr:+ mkdir t
2023-09-13T06:10:41.665 INFO:tasks.workunit.client.0.smithi159.stderr:+ cd t
2023-09-13T06:10:41.667 INFO:tasks.workunit.client.0.smithi159.stderr:+ tar xzf ../linux.tar.gz
2023-09-13T06:15:51.129 INFO:tasks.workunit.client.0.smithi159.stderr:+ cd linux-5.4
2023-09-13T06:15:51.133 INFO:tasks.workunit.client.0.smithi159.stderr:+ make defconfig
...
2023-09-13T06:16:11.936 INFO:tasks.workunit.client.0.smithi159.stderr:++ grep -c processor /proc/cpuinfo
2023-09-13T06:16:11.939 INFO:tasks.workunit.client.0.smithi159.stderr:+ make -j8
...
2023-09-13T06:19:36.099 INFO:tasks.workunit.client.0.smithi159.stdout:  CC      arch/x86/ia32/sys_ia32.o
2023-09-13T06:19:38.633 INFO:tasks.workunit.client.0.smithi159.stdout:  CC      kernel/irq/migration.o
2023-09-13T09:10:31.862 DEBUG:teuthology.orchestra.run:got remote process result: 124
2023-09-13T09:11:31.897 DEBUG:teuthology.orchestra.run:timed out waiting; will kill: <Greenlet at 0x7f213277b480: copy_file_to(<paramiko.ChannelFile from <paramiko.Channel 104 (, <Logger tasks.workunit.client.0.smithi159.stderr (, None, False)>
2023-09-13T09:12:31.899 DEBUG:teuthology.orchestra.run:timed out waiting; will kill: <Greenlet at 0x7f213277b6a0: copy_file_to(<paramiko.ChannelFile from <paramiko.Channel 104 (, <Logger tasks.workunit.client.0.smithi159.stdout (, None, False)>
2023-09-13T09:12:31.899 INFO:tasks.workunit:Stopping ['kernel_untar_build.sh'] on client.0...
2023-09-13T09:12:31.900 DEBUG:teuthology.orchestra.run.smithi159:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2023-09-13T09:12:32.787 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
...
  File "/home/teuthworker/src/git.ceph.com_teuthology_54e62bcbac4e53d9685e08328b790d3b20d71cae/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed (workunit test kernel_untar_build.sh) on smithi159 with status 124:

Note the timestamps.

Actions #7

Updated by Leonid Usov 8 months ago

There are 3 compiler processes that seem to be in a deadlock:

[ 1845.449431] INFO: task cc1:128296 blocked for more than 122 seconds.
[ 1845.455854]       Tainted: G S                 6.5.0-rc7-gfef64f93da2b #1
[ 1845.462646] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1845.470467] task:cc1             state:D stack:0     pid:128296 ppid:128293 flags:0x00004000
[ 1845.478899] Call Trace:
[ 1845.481352]  <TASK>
[ 1845.483458]  __schedule+0x37d/0xe80
[ 1845.486950]  ? find_held_lock+0x2d/0x90
[ 1845.490788]  ? __mutex_lock+0x7fc/0xc80
[ 1845.494629]  schedule+0x54/0xc0
[ 1845.497774]  schedule_preempt_disabled+0x15/0x20
[ 1845.502393]  __mutex_lock+0x801/0xc80
[ 1845.506061]  ? fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.510612]  ? fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.515144]  fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.519519]  fuse_lookup+0x6e/0x140 [fuse]
[ 1845.523619]  ? lock_release+0xc5/0x260
[ 1845.527373]  ? d_alloc_parallel+0x350/0x970
[ 1845.531560]  fuse_atomic_open+0xcc/0x130 [fuse]
[ 1845.536100]  lookup_open+0x37b/0x630
[ 1845.539681]  ? verify_cpu+0x30/0x100
[ 1845.543261]  ? verify_cpu+0x30/0x100
[ 1845.546838]  path_openat+0x25a/0x9d0
[ 1845.550422]  do_filp_open+0xb8/0x120
[ 1845.630231] INFO: task cc1:128369 blocked for more than 123 seconds.
[ 1845.636593]       Tainted: G S                 6.5.0-rc7-gfef64f93da2b #1
[ 1845.643381] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1845.651220] task:cc1             state:D stack:0     pid:128369 ppid:128368 flags:0x00004000
[ 1845.659651] Call Trace:
[ 1845.662096]  <TASK>
[ 1845.664196]  __schedule+0x37d/0xe80
[ 1845.667687]  ? d_alloc_parallel+0x645/0x970
[ 1845.671875]  schedule+0x54/0xc0
[ 1845.675019]  d_alloc_parallel+0x64a/0x970
[ 1845.679032]  ? wake_up_state+0x10/0x10
[ 1845.682786]  lookup_open+0x189/0x630
[ 1845.686367]  ? verify_cpu+0x30/0x100
[ 1845.689962]  ? verify_cpu+0x30/0x100
[ 1845.693561]  path_openat+0x25a/0x9d0
[ 1845.697156]  do_filp_open+0xb8/0x120
[ 1845.776969] INFO: task cc1:128419 blocked for more than 123 seconds.
[ 1845.783336]       Tainted: G S                 6.5.0-rc7-gfef64f93da2b #1
[ 1845.790123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1845.797940] task:cc1             state:D stack:0     pid:128419 ppid:128416 flags:0x00004000
[ 1845.806391] Call Trace:
[ 1845.808836]  <TASK>
[ 1845.810941]  __schedule+0x37d/0xe80
[ 1845.814433]  ? find_held_lock+0x2d/0x90
[ 1845.818273]  ? __mutex_lock+0x7fc/0xc80
[ 1845.822116]  schedule+0x54/0xc0
[ 1845.825279]  schedule_preempt_disabled+0x15/0x20
[ 1845.829913]  __mutex_lock+0x801/0xc80
[ 1845.833603]  ? fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.838142]  ? fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.842679]  fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.847039]  fuse_lookup+0x6e/0x140 [fuse]
[ 1845.851139]  ? lock_release+0xc5/0x260
[ 1845.854893]  ? d_alloc_parallel+0x350/0x970
[ 1845.859078]  fuse_atomic_open+0xcc/0x130 [fuse]
[ 1845.863623]  lookup_open+0x37b/0x630
[ 1845.867218]  ? verify_cpu+0x30/0x100
[ 1845.870797]  ? verify_cpu+0x30/0x100
[ 1845.874376]  path_openat+0x25a/0x9d0
[ 1845.877958]  do_filp_open+0xb8/0x120

Then the same call stack report repeats:

[ 1968.331724] INFO: task cc1:128296 blocked for more than 245 seconds.
[ 1968.512704] INFO: task cc1:128369 blocked for more than 245 seconds.
[ 1968.659391] INFO: task cc1:128419 blocked for more than 246 seconds.

[ 2091.213996] INFO: task cc1:128296 blocked for more than 368 seconds.
[ 2091.394662] INFO: task cc1:128369 blocked for more than 368 seconds.
[ 2091.541412] INFO: task cc1:128419 blocked for more than 368 seconds.

[ 2214.096338] INFO: task cc1:128296 blocked for more than 491 seconds.
Actions #8

Updated by Patrick Donnelly 8 months ago

I don't think it's related either. I was probably trying to link a different ticket but I don't recall which.

Actions #9

Updated by Patrick Donnelly 8 months ago

  • Related to deleted (Bug #62682: mon: no mdsmap broadcast after "fs set joinable" is set to true)
Actions #10

Updated by Leonid Usov 7 months ago

http://pulpito.front.sepia.ceph.com/leonidus-2023-10-03_17:11:08-fs-wip-lusov-wip-rishabh-2023sep12-b2-distro-default-smithi/ had the same test pass 50 times. The command used to schedule it was:

$ teuthology-suite -s fs -c wip-lusov-wip-rishabh-2023sep12-b2 --filter-all "bluestore-bitmap,centos_8,pacific,kernel_untar_build" --seed 2531 -N 50 -p 150
Actions #11

Updated by Leonid Usov 7 months ago

A modified version of the test, which runs make with -j20 instead of -j8, has passed 50 times (49 in the link below, plus 1 run with the same config when I validated that my change had applied):

http://pulpito.front.sepia.ceph.com/leonidus-2023-10-04_11:36:27-fs-wip-lusov-wip-rishabh-2023sep12-b2-distro-default-smithi/

Actions #12

Updated by Leonid Usov 7 months ago

  • Status changed from Triaged to Can't reproduce
  • Priority changed from Normal to Low

Following a talk with Venky, we decided that, given the unsuccessful reproduction attempts (see the links above) and the absence of logs from the fuse client, we will mark the issue as "Can't Reproduce". We'll be looking for other instances of fuse client deadlocks, but for now there's nothing else that can be done about this issue.

leonidus@teuthology:/a/rishabh-2023-09-12_12:12:15-fs-wip-rishabh-2023sep12-b2-testing-default-smithi/7394785$ find . -name "*.log*" 
./ansible.log
./console_logs/smithi141.log
./console_logs/smithi159.log
./console_logs/smithi141_reimage.log
./console_logs/smithi159_reimage.log
./teuthology.log
./supervisor.7394785.log

As a final point, analyzing the traces of the deadlocked processes once again, we see that they are all stuck in kernel code, not on the fuse client side. So it could be a bug in the fuse kernel module rather than in the ceph-fuse client daemon.

We can see that the problematic flow is the `lookup_open` kernel function, which was entered by three different processes, presumably for the same directory (?)


[ 2214.459965] 2 locks held by cc1/128296:^M
[ 2214.463796]  #0: ffff888336080150 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M
[ 2214.473009]  #1: ffff8883360805a8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]^M

[ 1845.515144]  fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.519519]  fuse_lookup+0x6e/0x140 [fuse]
[ 1845.523619]  ? lock_release+0xc5/0x260
[ 1845.527373]  ? d_alloc_parallel+0x350/0x970
[ 1845.531560]  fuse_atomic_open+0xcc/0x130 [fuse]
[ 1845.536100]  lookup_open+0x37b/0x630

--------------------------------------
[ 2214.503611] 1 lock held by cc1/128369:^M
[ 2214.507356]  #0: ffff888336080150 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M

[ 1845.671875]  schedule+0x54/0xc0
[ 1845.675019]  d_alloc_parallel+0x64a/0x970
[ 1845.679032]  ? wake_up_state+0x10/0x10
[ 1845.682786]  lookup_open+0x189/0x630

--------------------------------------
[ 2214.516569] 2 locks held by cc1/128419:^M
[ 2214.520400]  #0: ffff888265700790 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0^M
[ 2214.529613]  #1: ffff888265700be8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]^M

[ 1845.515144]  fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.519519]  fuse_lookup+0x6e/0x140 [fuse]
[ 1845.523619]  ? lock_release+0xc5/0x260
[ 1845.527373]  ? d_alloc_parallel+0x350/0x970
[ 1845.531560]  fuse_atomic_open+0xcc/0x130 [fuse]
[ 1845.536100]  lookup_open+0x37b/0x630

An internet search for similar deadlocks didn't turn up any matches.

Actions #13

Updated by Leonid Usov 7 months ago

Thanks to challenges from Shachar Sharon, I continued to look into it. I focused on the lock list because it's rather confusing.

First of all, here is the relevant code that prints the locks held: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/kernel/locking/lockdep.c#L749

We can tell from this function that the locks we see listed (see the paraphrased sketch after this list)

  • are listed per kernel task, i.e. per thread, even though the id printed is that of the process the task belongs to
  • are locks that are held, i.e. have been acquired by that task
  • are in the historical order of acquisition

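For illustration, here is a paraphrased sketch of how that report is produced (based on the lockdep source linked above, not verbatim): lockdep keeps a per-task array of held locks, and debug_show_all_locks() walks every task and prints that array in acquisition order.

/* Paraphrased sketch of the lockdep reporting path (kernel/locking/lockdep.c);
 * not verbatim source, simplified for illustration. */
void debug_show_all_locks(void)
{
    struct task_struct *g, *p;

    pr_warn("\nShowing all locks held in the system:\n");

    rcu_read_lock();
    for_each_process_thread(g, p) {
        /* Only tasks that currently hold at least one lock are printed;
         * lockdep_depth counts the entries in p->held_locks[]. */
        if (!p->lockdep_depth)
            continue;
        /* Prints "N locks held by <comm>/<pid>:" followed by one line per
         * held_locks[] entry, in the order the locks were acquired. */
        lockdep_print_held_locks(p);
    }
    rcu_read_unlock();
}
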
We can immediately spot some peculiar things related to two of the three hanging processes:

Hanging process 128296 is holding the same locks as another process, 128348:

[ 1846.145932] 2 locks held by cc1/128296:
[ 1846.149764]  #0: ffff888336080150 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0
[ 1846.158992]  #1: ffff8883360805a8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]
[ 1846.167772] 2 locks held by cc1/128348:
[ 1846.171603]  #0: ffff888336080150 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0
[ 1846.180844]  #1: ffff8883360805a8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]

Hanging process 128419 is holding the same two locks as another process, 128466:

[ 1846.202613] 2 locks held by cc1/128419:
[ 1846.206444]  #0: ffff888265700790 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0
[ 1846.215659]  #1: ffff888265700be8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]
[ 1846.224447] 2 locks held by cc1/128466:
[ 1846.228284]  #0: ffff888265700790 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0
[ 1846.237499]  #1: ffff888265700be8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]

How is it possible that the same mutex is held by different tasks? Let's look at the lock types. &type->i_mutex_dir_key#7 is actually an rwsem, a read-write semaphore, which explains how multiple processes can hold the same instance - apparently, for read:

https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/fs/inode.c#L965

            lockdep_set_class(&inode->i_rwsem,
                      &type->i_mutex_dir_key);

https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/fs/namei.c?ref_type=heads#L3497

It seems like the flow is inlined into the path_openat function via its call into do_last: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/fs/namei.c?ref_type=heads#L3718
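
For reference, that part of the flow looks roughly like this (my paraphrase of the linked fs/namei.c, not verbatim): the directory's i_rwsem is taken shared for a plain open and exclusively only for O_CREAT, which is why several readers can legitimately hold &type->i_mutex_dir_key#7 at the same time.

/* Paraphrased sketch of the open path in fs/namei.c (not verbatim).
 * dir->d_inode->i_rwsem is the lock reported as &type->i_mutex_dir_key#7. */
if (open_flag & O_CREAT)
    inode_lock(dir->d_inode);           /* exclusive (write) */
else
    inode_lock_shared(dir->d_inode);    /* shared (read): multiple holders are fine */

error = lookup_open(nd, &path, file, op, got_write);

if (open_flag & O_CREAT)
    inode_unlock(dir->d_inode);
else
    inode_unlock_shared(dir->d_inode);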

However, the &fi->mutex lock type is really a mutex (https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/fs/fuse/inode.c?ref_type=heads#L87)

    mutex_init(&fi->mutex);

This poses a big question: how can the same instance of this type be held by several threads? While that on its own isn't a deadlock, it may cause one, because clients assume it can't happen. That is, unless I am missing something...
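
For context, the fuse_lock_inode() frame seen in all of the traces does roughly the following (my paraphrase of the fuse sources, not verbatim; the exact signature differs between kernel versions): it serializes directory operations on an inode via fi->mutex, but only when the userspace filesystem did not negotiate parallel directory operations.

/* Paraphrased sketch of fuse_lock_inode() (fs/fuse); not verbatim source.
 * fi->mutex is the per-inode mutex reported as &fi->mutex above. */
bool fuse_lock_inode(struct inode *inode)
{
    bool locked = false;

    /* Serialize lookups/readdir on this inode only if the userspace
     * filesystem did not advertise FUSE_PARALLEL_DIROPS. */
    if (!get_fuse_conn(inode)->parallel_dirops) {
        mutex_lock(&get_fuse_inode(inode)->mutex);
        locked = true;
    }
    return locked;
}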

For the curious, there is another piece of info in the logs which, however, isn't helpful in our case as far as I can tell. In the curly braces we see encoded lock usage history bits, namely (&type->i_mutex_dir_key#7){++++} and (&fi->mutex){+.+.}.

See this flow here https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/kernel/locking/lockdep.c#L708. There's also some documentation available about these flags: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/Documentation/locking/lockdep-design.rst?ref_type=heads

In short, these bits show whether the lock was ever taken for write/read in the hard/soft IRQ environment and whether IRQs were disabled or enabled at that point. The four chars mean (hard IRQ w, hard IRQ r, soft IRQ w, soft IRQ r). The default mode is ., which is further evidence that (&fi->mutex) is never locked for reading, only exclusively, as opposed to (&type->i_mutex_dir_key#7).

What we see is that these locks were only ever taken outside of an IRQ context, whether soft or hard, and while IRQs were enabled, and the lock validator never spotted any issues with how these locks were used. This isn't interesting for us because it doesn't tell us what the problem is.

Actions #14

Updated by Leonid Usov 7 months ago

Good news, I stand corrected!
The lockdep_depth is increased in the function __lock_acquire, which is called before the lock is actually acquired. From the docs:

      __acquire
          |
         lock _____
          |        \
          |    __contended
          |         |
          |       <wait>
          | _______/
          |/
          |
     __acquired
          |
          .
        <hold>
          .
          |
     __release
          |
       unlock

lock, unlock  - the regular lock functions
__*           - the hooks
<>            - states

So, the last lock in the list may actually be the one that's currently being waited for, not one that is already held.
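
In other words, the lockdep bookkeeping behind the "locks held by" report is updated at the __acquire hook, before the task may block. A minimal sketch of the hook ordering inside the mutex slow path (my paraphrase, assuming the general shape of __mutex_lock_common(); not verbatim source):

/* Paraphrased sketch of the mutex slow path with lockdep/lockstat hooks;
 * not verbatim kernel source. The acquire hook records the lock in the
 * task's held-locks list *before* the task may block, so a task parked in
 * schedule_preempt_disabled() already shows up as "holding" that mutex. */
static void sketch_mutex_lock(struct mutex *lock, unsigned long ip)
{
    mutex_acquire(&lock->dep_map, 0, 0, ip);    /* __acquire: lockdep_depth++ */

    if (!__mutex_trylock(lock)) {
        lock_contended(&lock->dep_map, ip);     /* __contended */
        /* ...queue as a waiter and sleep in schedule_preempt_disabled()
         * until the current owner releases the mutex (<wait>)... */
    }

    lock_acquired(&lock->dep_map, ip);          /* __acquired: now really held */
}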

Actions #15

Updated by Leonid Usov 7 months ago

OK, so the task

[ 1968.512704] INFO: task cc1:128369 blocked for more than 245 seconds.
[ 1968.519057]       Tainted: G S                 6.5.0-rc7-gfef64f93da2b #1
[ 1968.525860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1968.533697] task:cc1             state:D stack:0     pid:128369 ppid:128368 flags:0x00004000
[ 1968.542145] Call Trace:
[ 1968.544598]  <TASK>
[ 1968.546704]  __schedule+0x37d/0xe80
[ 1968.550213]  ? d_alloc_parallel+0x645/0x970
[ 1968.554402]  schedule+0x54/0xc0
[ 1968.557545]  d_alloc_parallel+0x64a/0x970
[ 1968.561561]  ? wake_up_state+0x10/0x10

is sleeping in schedule(), which it entered while waiting for other lookups on the same directory to finish:

https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/fs/dcache.c?ref_type=heads#L2607

        /*
         * somebody is likely to be still doing lookup for it;
         * wait for them to finish
         */
        spin_lock(&dentry->d_lock);
        d_wait_lookup(dentry);

https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-8/-/blob/c8s/fs/dcache.c?ref_type=heads#L2517

static void d_wait_lookup(struct dentry *dentry)
{
    if (d_in_lookup(dentry)) {
        DECLARE_WAITQUEUE(wait, current);
        add_wait_queue(dentry->d_wait, &wait);
        do {
            set_current_state(TASK_UNINTERRUPTIBLE);
            spin_unlock(&dentry->d_lock);
            schedule();
            spin_lock(&dentry->d_lock);
        } while (d_in_lookup(dentry));
    }
}

So this task is a victim of others that keep performing lookups on the dir. One less suspect.

Actions #16

Updated by Leonid Usov 7 months ago

The two other hung tasks are both apparently sleeping on their corresponding mutexes, trying to acquire them:

[ 1845.478899] Call Trace:
[ 1845.481352]  <TASK>
[ 1845.483458]  __schedule+0x37d/0xe80
[ 1845.486950]  ? find_held_lock+0x2d/0x90
[ 1845.490788]  ? __mutex_lock+0x7fc/0xc80
[ 1845.494629]  schedule+0x54/0xc0
[ 1845.497774]  schedule_preempt_disabled+0x15/0x20
[ 1845.502393]  __mutex_lock+0x801/0xc80
[ 1845.506061]  ? fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.510612]  ? fuse_lock_inode+0x31/0x40 [fuse]
[ 1845.515144]  fuse_lock_inode+0x31/0x40 [fuse]

which means they can't be the ones to blame. That makes their peer processes, which own said mutexes, responsible for not releasing them for that long:

[ 1846.167772] 2 locks held by cc1/128348:
[ 1846.171603]  #0: ffff888336080150 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0
[ 1846.180844]  #1: ffff8883360805a8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]

[ 1846.224447] 2 locks held by cc1/128466:
[ 1846.228284]  #0: ffff888265700790 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x5ca/0x9d0
[ 1846.237499]  #1: ffff888265700be8 (&fi->mutex){+.+.}-{3:3}, at: fuse_lock_inode+0x31/0x40 [fuse]

These processes, however, aren't reported as hung, meaning that they must be waking up regularly but not finishing their work in a timely fashion. At this point, without fuse client logs, it's hard to say what those processes might be up to, but Shachar's interpretation of this issue seems the most relevant: we observe some slowness in the fuse client which results in these stack traces. It's not clear whether the lookup flow is to blame, as we don't have stack traces for the pids in question. They are holding a generic inode lock, so it could be anything.

Actions #17

Updated by Leonid Usov 7 months ago

  • Subject changed from "deadlock in ceph-fuse causes teuthology job to hang and fail" to "Slowness or deadlock in ceph-fuse causes teuthology job to hang and fail"