Bug #36348

luminous(?): blogbench I/O with two kernel clients; one stalls

Added by Patrick Donnelly over 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Component(FS): kceph

Description

2018-10-06T01:41:36.157 INFO:tasks.workunit.client.0.smithi186.stdout:        57        45766           253         29521           214         19909           557
2018-10-06T01:41:36.158 INFO:tasks.workunit.client.1.smithi186.stdout:        74       282056           227        177601           192        113251           512
2018-10-06T01:41:46.158 INFO:tasks.workunit.client.0.smithi186.stdout:        61       177403           175        123272            72         70865           284
2018-10-06T01:41:46.158 INFO:tasks.workunit.client.1.smithi186.stdout:        75       134875           150         84465            79         55533           340
2018-10-06T01:41:56.158 INFO:tasks.workunit.client.0.smithi186.stdout:        63       246237           113        168618           263        104781           280
2018-10-06T01:41:56.158 INFO:tasks.workunit.client.1.smithi186.stdout:        77       155939            66        101268            56         67137           242
2018-10-06T01:42:06.158 INFO:tasks.workunit.client.0.smithi186.stdout:        67       192669            70        133553           156         79305           340
2018-10-06T01:42:06.158 INFO:tasks.workunit.client.1.smithi186.stdout:        77       195926           128        125944            70         81732           192
2018-10-06T01:44:51.724 INFO:tasks.workunit.client.0.smithi186.stdout:
2018-10-06T01:44:51.724 INFO:tasks.workunit.client.0.smithi186.stdout:Final score for writes:            67
2018-10-06T01:44:51.724 INFO:tasks.workunit.client.0.smithi186.stdout:Final score for reads :         66275
2018-10-06T01:44:51.724 INFO:tasks.workunit.client.0.smithi186.stdout:
2018-10-06T01:44:51.725 INFO:teuthology.orchestra.run:Running command with timeout 900
2018-10-06T01:44:51.726 INFO:teuthology.orchestra.run.smithi186:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0/tmp'
2018-10-06T01:45:39.523 INFO:tasks.workunit:Stopping ['suites/blogbench.sh'] on client.0...
2018-10-06T01:45:39.524 INFO:teuthology.orchestra.run.smithi186:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2018-10-06T01:45:39.818 DEBUG:teuthology.parallel:result is None
2018-10-06T04:36:42.547 DEBUG:teuthology.orchestra.run:got remote process result: 124
2018-10-06T04:36:42.583 INFO:tasks.workunit:Stopping ['suites/blogbench.sh'] on client.1...
2018-10-06T04:36:42.583 INFO:teuthology.orchestra.run.smithi186:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.1 /home/ubuntu/cephtest/clone.client.1'
2018-10-06T04:36:42.883 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri4-testing-2018-10-05-2015-luminous/qa/tasks/workunit.py", line 206, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri4-testing-2018-10-05-2015-luminous/qa/tasks/workunit.py", line 356, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri4-testing-2018-10-05-2015-luminous/qa/tasks/workunit.py", line 479, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 429, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test suites/blogbench.sh) on smithi186 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=wip-yuri4-testing-2018-10-05-2015-luminous TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/suites/blogbench.sh'

From: /ceph/teuthology-archive/yuriw-2018-10-05_22:19:38-multimds-wip-yuri4-testing-2018-10-05-2015-luminous-testing-basic-smithi/3105596/teuthology.log

Last client_request for the stalled client was:

2018-10-06 01:43:21.571868 7f8f4d47f700  1 -- 172.21.15.195:6816/1919957959 <== client.4259 172.21.15.186:0/2801504243 124457 ==== client_request(client.4259:315376 create #0x30000001b59/comment-1.xml.tmp 2018-10-06 01:43:21.569321 caller_uid=1000, caller_gid=1190{}) v2 ==== 139+0+0 (1101040945 0 0) 0x55eef482a3c0 con 0x55eef4804800

The only interesting thing I saw later was that rank 0 imports caps for that client:

2018-10-06 01:43:21.615523 7f8f47473700 10 mds.0.server finish_force_open_sessions: final v 5650
2018-10-06 01:43:21.615525 7f8f47473700 10 mds.0.migrator finish_import_inode_caps for client.4259 on [inode 0x30000000afd [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-34/picture-5.jpg auth{2=1} v833 dirtyparent s=58231 n(v0 rc2018-10-06 01:39:28.420308 b58231 1=1+0) (iversion lock) | importingcaps=1 dirtyparent=1 replicated=1 dirty=1 0x55eef4c5e000]
2018-10-06 01:43:21.615532 7f8f47473700 10 mds.0.cache.ino(0x30000000afd) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55eef4606a00)
2018-10-06 01:43:21.615537 7f8f47473700 10 mds.0.migrator finish_import_inode_caps for client.4259 on [inode 0x30000000957 [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-34/article-57.xml auth{2=1} v961 dirtyparent s=2759 n(v0 rc2018-10-06 01:39:17.216308 b2759 1=1+0) (iversion lock) | importingcaps=1 dirtyparent=1 replicated=1 dirty=1 0x55eef4c5e700]
2018-10-06 01:43:21.615541 7f8f47473700 10 mds.0.cache.ino(0x30000000957) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55eef4606a00)
2018-10-06 01:43:21.615543 7f8f47473700 10 mds.0.migrator finish_import_inode_caps for client.4259 on [inode 0x300000013cb [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-34/article-58.xml auth{2=1} v963 dirtyparent s=3511 n(v0 rc2018-10-06 01:40:12.436307 b3511 1=1+0) (iversion lock) | importingcaps=1 dirtyparent=1 replicated=1 dirty=1 0x55eef4c5ee00]
2018-10-06 01:43:21.615546 7f8f47473700 10 mds.0.cache.ino(0x300000013cb) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55eef4606a00)
2018-10-06 01:43:21.615548 7f8f47473700 10 mds.0.migrator finish_import_inode_caps for client.4259 on [inode 0x3000000096e [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-34/article-53.xml auth{2=1} v951 dirtyparent s=16116 n(v0 rc2018-10-06 01:39:18.128308 b16116 1=1+0) (iversion lock) | importingcaps=1 dirtyparent=1 replicated=1 dirty=1 0x55eef4c5f500]
2018-10-06 01:43:21.615552 7f8f47473700 10 mds.0.cache.ino(0x3000000096e) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55eef4606a00)

Files

kern.log.gz (5.91 KB) - Zheng Yan, 10/31/2019 02:39 PM

Related issues: 1 (0 open, 1 closed)

Has duplicate: CephFS - Bug #36366: luminous: qa: blogbench hang with two kclients and 3 active mds (Duplicate, Zheng Yan, 10/09/2018)

Actions #1

Updated by Zheng Yan over 5 years ago

  • Status changed from New to Duplicate
Actions #2

Updated by Patrick Donnelly over 5 years ago

  • Has duplicate Bug #36366: luminous: qa: blogbench hang with two kclients and 3 active mds added
Actions #3

Updated by Patrick Donnelly over 5 years ago

  • Status changed from Duplicate to New

Reversing duplicate since this is older. Can't believe I forgot I already opened an issue for this.

Actions #4

Updated by Patrick Donnelly over 5 years ago

  • Status changed from New to Need More Info

Need logs.

Actions #5

Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0
Actions #8

Updated by Zheng Yan over 4 years ago

Venky Shankar wrote:

saw this with luminous: http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-24_18:14:12-multimds-wip-yuri8-testing-2019-10-24-1513-luminous-testing-basic-smithi/4441245/teuthology.log

blogbench.sh times out with errcode 124.

This one is because a kclient failed to respond to a cap revoke:

2019-10-25 09:23:58.934329 7f3673310700  7 mds.1.locker issue_caps allowed=pAsXsFr, xlocker allowed=pAsXsFr on [inode 0x30000000048 [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-2/comment-10.xml rep@2.1 v1710 s=7139 n(v0 rc2019-10-25 09:23:54.459171 b7139 1=1+0)/n(v0 1=1+0) (ilink sync->lock) (ifile sync->mix) (iversion lock) cr={4240=0-4194304@1} caps={4240=pAsLsXsFr/pAsLsXsFscr/pFscr@82} | request=0 lock=0 caps=1 waiter=0 0x561cb17c7c00]
2019-10-25 09:23:58.934338 7f3673310700 20 mds.1.locker  client.4240 pending pAsLsXsFr allowed pAsXsFr wanted pFscr
2019-10-25 09:23:58.934340 7f3673310700  7 mds.1.locker    sending MClientCaps to client.4240 seq 83 new pending pAsXsFr was pAsLsXsFr
...

2019-10-25 12:23:00.952457 7f3673310700  7 mds.1.locker handle_client_caps async on 0x30000000048 tid 0 follows 0 op update
2019-10-25 12:23:00.952472 7f3673310700 20 mds.1.8 get_session have 0x561cb0a29600 client.4240 172.21.15.201:0/1300959356 state open
2019-10-25 12:23:00.952485 7f3673310700 10 mds.1.locker  head inode [inode 0x30000000048 [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-2/comment-10.xml rep@2.1 v1710 s=7139 n(v0 rc2019-10-25 09:23:54.459171 b7139 1=1+0)/n(v0 1=1+0) (ilink sync->lock) (ifile sync->mix) (iversion lock) cr={4240=0-4194304@1} caps={4240=pAsXsFr/pAsLsXsFscr/pFscr@83} | request=95 lock=0 caps=1 waiter=0 0x561cb17c7c00]

The kernel client has the following warning:

2019-10-25T09:27:39.272223+00:00 smithi201 kernel: [  727.248399] INFO: task blogbench:15991 blocked for more than 120 seconds.
2019-10-25T09:27:39.279084+00:00 smithi201 kernel: [  727.255266]       Not tainted 5.4.0-rc3-ceph-g2cdfe458e23d #1
2019-10-25T09:27:39.284884+00:00 smithi201 kernel: [  727.261089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-10-25T09:27:39.292862+00:00 smithi201 kernel: [  727.269015] blogbench       D    0 15991  10299 0x00000000
2019-10-25T09:27:39.292880+00:00 smithi201 kernel: [  727.269017] Call Trace:
2019-10-25T09:27:39.292882+00:00 smithi201 kernel: [  727.269020]  ? __schedule+0x378/0x850
2019-10-25T09:27:39.292883+00:00 smithi201 kernel: [  727.269024]  ? rwsem_down_read_slowpath+0x2d9/0x600
2019-10-25T09:27:39.292888+00:00 smithi201 kernel: [  727.269026]  schedule+0x36/0xc0
2019-10-25T09:27:39.292890+00:00 smithi201 kernel: [  727.269028]  rwsem_down_read_slowpath+0x474/0x600
2019-10-25T09:27:39.292891+00:00 smithi201 kernel: [  727.269034]  ? down_read+0xc3/0x140
2019-10-25T09:27:39.292892+00:00 smithi201 kernel: [  727.269036]  down_read+0xc3/0x140
2019-10-25T09:27:39.292893+00:00 smithi201 kernel: [  727.269055]  ceph_start_io_read+0x19/0x50 [ceph]
2019-10-25T09:27:39.292894+00:00 smithi201 kernel: [  727.269064]  ceph_read_iter+0x90d/0xdf0 [ceph]
2019-10-25T09:27:39.292895+00:00 smithi201 kernel: [  727.269068]  ? find_held_lock+0x2d/0x90
2019-10-25T09:27:39.292896+00:00 smithi201 kernel: [  727.269073]  ? aa_file_perm+0x106/0x4c0
2019-10-25T09:27:39.292898+00:00 smithi201 kernel: [  727.269080]  ? __fdget_pos+0x44/0x50
2019-10-25T09:27:39.292899+00:00 smithi201 kernel: [  727.269085]  ? new_sync_read+0x10e/0x190
2019-10-25T09:27:39.292900+00:00 smithi201 kernel: [  727.269088]  new_sync_read+0x10e/0x190
2019-10-25T09:27:39.292924+00:00 smithi201 kernel: [  727.269096]  vfs_read+0x9c/0x170
2019-10-25T09:27:39.292926+00:00 smithi201 kernel: [  727.269098]  ksys_read+0x5c/0xd0
2019-10-25T09:27:39.292927+00:00 smithi201 kernel: [  727.269105]  do_syscall_64+0x54/0x210
2019-10-25T09:27:39.292928+00:00 smithi201 kernel: [  727.269108]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The process hung while holding Fc caps, which prevents the kclient from releasing LsFsc.

Actions #9

Updated by Zheng Yan over 4 years ago

Actions #10

Updated by Zheng Yan over 4 years ago

  • Assignee changed from Zheng Yan to Jeff Layton
Actions #11

Updated by Jeff Layton over 4 years ago

The kernel stack trace just looks like the client is hung waiting for the inode's i_rwsem to become free, which means that another task is probably holding it and not letting it go. We'll probably need a coredump to figure out what's holding it (or maybe a machine in this state that I can log into and do some analysis of the live kernel).
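
Purely as an illustration of that failure mode (a userspace analogy, not kernel code): the trace above shows ceph_start_io_read() ending up in down_read()/rwsem_down_read_slowpath(), i.e. a shared acquisition of the inode's i_rwsem, so any task that takes the rwsem for write and cannot let go will pin every later reader, which is exactly what the hung-task report looks like:

/*
 * Minimal userspace analogy of the hang (NOT kernel code): a pthread rwlock
 * stands in for inode->i_rwsem. The "writer" holds the lock and cannot
 * release it, so the "reader" (playing the role of ceph_start_io_read ->
 * down_read) blocks for good, like blogbench in rwsem_down_read_slowpath.
 * Build: cc -pthread -o rwsem-analogy rwsem-analogy.c   (file name made up)
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t i_rwsem = PTHREAD_RWLOCK_INITIALIZER;

static void *reader(void *arg)
{
        (void)arg;
        printf("reader: waiting for i_rwsem\n");
        pthread_rwlock_rdlock(&i_rwsem);        /* never gets past this point */
        pthread_rwlock_unlock(&i_rwsem);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_rwlock_wrlock(&i_rwsem);        /* "another task" holds it for write */
        pthread_create(&t, NULL, reader, NULL);
        sleep(2);
        printf("writer never let go; reader is still stuck\n");
        return 0;
}

The open question is which task is sitting on the write side of that lock; the next comments dig into that.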

Actions #12

Updated by Jeff Layton over 4 years ago

I set up 2 kclients and kicked off a blogbench run on each with both pointed at the same directory on cephfs. They both spewed hung task warnings:

[  492.531890] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  492.533346] blogbench       D26904  3813   1807 0x00000080
[  492.534472] Call Trace:
[  492.535158]  ? __schedule+0x82b/0x1630
[  492.535982]  ? firmware_map_remove+0x170/0x170
[  492.536888]  ? mark_held_locks+0xc8/0x140
[  492.537750]  schedule+0xbf/0x240
[  492.538499]  rwsem_down_write_slowpath+0x830/0xfa0
[  492.539409]  ? sched_clock+0x5/0x10
[  492.540070]  ? sched_clock_cpu+0x18/0x1e0
[  492.540789]  ? downgrade_write+0x3a0/0x3a0
[  492.541454]  ? lock_rename+0x1e3/0x240
[  492.542182]  ? kvm_sched_clock_read+0x14/0x30
[  492.542962]  ? sched_clock+0x5/0x10
[  492.543605]  ? sched_clock_cpu+0x18/0x1e0
[  492.544359]  ? find_held_lock+0x3a/0x1c0
[  492.545169]  ? lock_rename+0x1e3/0x240
[  492.545910]  ? debug_show_held_locks+0x50/0x50
[  492.546795]  ? down_write_nested+0x29e/0x3c0
[  492.547784]  down_write_nested+0x29e/0x3c0
[  492.548762]  ? _down_write_nest_lock+0x3c0/0x3c0
[  492.549583]  ? rcu_read_lock_sched_held+0x107/0x120
[  492.550365]  ? rcu_sync_lockdep_assert+0xa3/0x110
[  492.551145]  ? __sb_start_write+0x1b7/0x260
[  492.551772]  lock_rename+0x1e3/0x240
[  492.552434]  do_renameat2+0x3e8/0x960
[  492.553154]  ? trace_hardirqs_on_caller+0x39a/0x580
[  492.554008]  ? user_path_create+0x30/0x30
[  492.554704]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[  492.555739]  ? kvm_sched_clock_read+0x14/0x30
[  492.556727]  ? sched_clock+0x5/0x10
[  492.557469]  ? sched_clock_cpu+0x18/0x1e0
[  492.560197]  ? blkcg_exit_queue+0x20/0x20
[  492.562929]  ? _raw_spin_unlock_irq+0x29/0x40
[  492.565384]  ? trace_hardirqs_on_caller+0x39a/0x580
[  492.567615]  ? entry_SYSCALL_64_after_hwframe+0x7a/0xdf
[  492.570035]  __x64_sys_rename+0x5a/0x80
[  492.572257]  do_syscall_64+0xa5/0x4d0
[  492.574279]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  492.576839] RIP: 0033:0x7fa601fbdc2b
[  492.579407] Code: e8 2a 03 08 00 85 c0 0f 95 c0 0f b6 c0 f7 d8 5b c3 66 0f 1f 44 00 00 b8 ff ff ff ff 5b c3 90 f3 0f 1e fa b8 52 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 29 f2 34 00 f7 d8
[  492.585147] RSP: 002b:00007fa600748ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[  492.587881] RAX: ffffffffffffffda RBX: 0000000000007c00 RCX: 00007fa601fbdc2b
[  492.590687] RDX: 0000000000000000 RSI: 00007fa600749ed0 RDI: 00007fa600748eb0
[  492.593664] RBP: 00007fa600749ed0 R08: 0000000000000000 R09: 00007fa60230d260
[  492.596482] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000400
[  492.598991] R13: 00007ffca359eaef R14: 0000000000000000 R15: 00007fa60074afc0
[  492.601470] INFO: task blogbench:3814 blocked for more than 120 seconds.
[  492.603991]       Not tainted 4.18.0-148.el8.jlayton.005.x86_64+debug #1

...as well as several stack traces for openat syscalls that are also stuck on the i_rwsem.

Looking at the mds, I see this:

2019-11-12T11:04:48.605-0500 7f1019371700  0 log_channel(cluster) log [WRN] : client.4248 isn't responding to mclientcaps(revoke), ino 0x1000000e3aa pending pAsXsFc issued pAsLsXsFscr, sent 3844.761046 seconds ago

...and I see that cap in the caps debugfs file on the centos8 client:

0x1000000e3aa      pAsXsFc          pAsLsXsFscr      

My guess is that one of the stuck syscalls is holding references to these caps and that's why they haven't been released, but that needs more investigation to confirm.

Actions #13

Updated by Jeff Layton over 4 years ago

  • Status changed from Need More Info to In Progress

Ran crash on the live (stuck) kernel. Most of the "blogbench" threads are stuck trying to acquire inode->i_rwsem for various inodes, but this one is issuing a rename to the MDS and hasn't gotten the reply:

PID: 3816   TASK: ffff88821124b480  CPU: 2   COMMAND: "blogbench" 
 #0 [ffff8882106b77a8] __schedule at ffffffff9d951d2b
 #1 [ffff8882106b78a8] schedule at ffffffff9d952bef
 #2 [ffff8882106b78d0] schedule_timeout at ffffffff9d961b42
 #3 [ffff8882106b7a58] wait_for_completion_killable_timeout at ffffffff9d9551f1
 #4 [ffff8882106b7b20] ceph_mdsc_do_request at ffffffffc18a2f37 [ceph]
 #5 [ffff8882106b7b48] ceph_rename at ffffffffc18339b2 [ceph]
 #6 [ffff8882106b7b88] vfs_rename at ffffffff9c516bf0
 #7 [ffff8882106b7c98] do_renameat2 at ffffffff9c5280fc
 #8 [ffff8882106b7f10] __x64_sys_rename at ffffffff9c5289ca
 #9 [ffff8882106b7f28] do_syscall_64 at ffffffff9bc09875
#10 [ffff8882106b7f50] entry_SYSCALL_64_after_hwframe at ffffffff9da000b2
    RIP: 00007fa601fbdc2b  RSP: 00007fa5fef45e98  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000002000  RCX: 00007fa601fbdc2b
    RDX: 0000000000000000  RSI: 00007fa5fef46ec0  RDI: 00007fa5fef45ea0
    RBP: 00007fa5fef46ec0   R8: 0000000000000000   R9: 00007fa60230d260
    R10: 0000000000000180  R11: 0000000000000246  R12: 0000000000000800
    R13: 00007ffca359eaef  R14: 0000000000000000  R15: 00007fa5fef47fc0
    ORIG_RAX: 0000000000000052  CS: 0033  SS: 002b

My guess is that the rename is why Ls is being revoked, but why isn't the client releasing it?

Actions #14

Updated by Jeff Layton over 4 years ago

I got lucky and reproduced it once, but haven't been able to do so since.

Still, I think I may understand what's happening. I think we may have a lock-inversion problem of sorts that involves both the inode->i_rwsem and cap references.

Most of the time, the vfs takes the inode lock before calling into the filesystem, and only later do we take cap references and do the work. Commit 321fe13c939876 reverses that order to handle the buffered/direct exclusionary locking -- we get cap references first and then try to take the rwsem for the exclusionary locking.

In some cases, I think we can get into a situation where a rename (for example) is holding the inode lock, and issues a call to the MDS. A reader or writer then takes cap references and tries to get the inode lock, but the rename is squatting on it. The MDS then tries to revoke caps but that can't proceed because there are outstanding references. So, we deadlock at that point.
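
A userspace sketch of that suspected cycle (purely illustrative: pthread primitives stand in for i_rwsem, the cap reference count and the MDS revoke, and every name in it is made up for the example):

/*
 * Illustrative userspace model of the suspected deadlock (NOT kernel code).
 * Three actors:
 *   - "rename": holds i_rwsem for write while waiting for an MDS reply
 *   - "reader": takes a cap reference first, then blocks on i_rwsem
 *               (the post-321fe13c939876 ordering)
 *   - "mds":    cannot finish the cap revoke (and thus answer the rename)
 *               until the cap reference count drops to zero
 * Build: cc -pthread -o cap-rwsem-cycle cap-rwsem-cycle.c   (file name made up)
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t i_rwsem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int cap_refs;            /* outstanding cap references on the inode */
static int mds_replied;         /* set once the revoke has completed */

static void *rename_task(void *arg)
{
        (void)arg;
        pthread_rwlock_wrlock(&i_rwsem);        /* vfs: lock_rename()          */
        pthread_mutex_lock(&lock);
        while (!mds_replied)                    /* wait for the MDS reply      */
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        pthread_rwlock_unlock(&i_rwsem);
        return NULL;
}

static void *reader_task(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        cap_refs++;                             /* get cap references first... */
        pthread_mutex_unlock(&lock);
        pthread_rwlock_rdlock(&i_rwsem);        /* ...then block on i_rwsem    */
        pthread_rwlock_unlock(&i_rwsem);
        pthread_mutex_lock(&lock);
        cap_refs--;                             /* refs only dropped after I/O */
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void *mds_task(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (cap_refs != 0) {                 /* revoke waits for refs to drop */
                pthread_mutex_unlock(&lock);
                usleep(100 * 1000);
                pthread_mutex_lock(&lock);
        }
        mds_replied = 1;                        /* revoke done -> answer rename */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b, c;

        pthread_create(&a, NULL, rename_task, NULL);
        sleep(1);                               /* rename now owns the rwsem           */
        pthread_create(&b, NULL, reader_task, NULL);
        sleep(1);                               /* reader holds a cap ref, is blocked  */
        pthread_create(&c, NULL, mds_task, NULL);
        sleep(2);
        printf("cap_refs=%d mds_replied=%d -> nobody can make progress\n",
               cap_refs, mds_replied);
        return 0;
}

Run it and none of the three actors can make progress: the rename waits on the MDS, the MDS waits for the cap reference to drop, and the reader holding that reference waits on the rwsem the rename owns, which lines up with the "isn't responding to mclientcaps(revoke)" warning above.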

I think we either need to move the buffered/exclusionary locking to use its own rwsem, or have the read/write code take the appropriate locks earlier, before getting cap references. I'm playing with a dedicated rwsem per inode first, but I may switch to just taking the existing locks earlier.
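
For comparison, within that same toy model, the "take the locks earlier" option is just a reordering of the reader so it acquires the rwsem before bumping the reference count; it then blocks while holding nothing the MDS is waiting for, the revoke completes, the rename gets its reply and releases the rwsem, and everyone proceeds. A drop-in replacement for reader_task in the sketch above:

/* Reordered reader: take the (stand-in) i_rwsem first, cap references second. */
static void *reader_task_fixed(void *arg)
{
        (void)arg;
        pthread_rwlock_rdlock(&i_rwsem);        /* block here holding no cap refs */
        pthread_mutex_lock(&lock);
        cap_refs++;                             /* refs taken only once we own it */
        pthread_mutex_unlock(&lock);
        /* ... do the buffered read ... */
        pthread_mutex_lock(&lock);
        cap_refs--;
        pthread_mutex_unlock(&lock);
        pthread_rwlock_unlock(&i_rwsem);
        return NULL;
}

(In the real client the details are subtler, since cap references also interact with the buffered/direct exclusion that 321fe13c939876 introduced, so treat this strictly as a sketch of the ordering, not of the eventual patch.)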

This is probably a regression in 321fe13c939876, but the original bug was opened a year or so ago, so the original problem may be unrelated to this.

Actions #15

Updated by Jeff Layton over 4 years ago

Ok, I posted a couple of patches to the mailing list this morning. The first one addresses this problem, and the second fixes a different problem I noticed by inspection while working on this one.

https://marc.info/?l=ceph-devel&m=157366193806815&w=2
https://marc.info/?l=ceph-devel&m=157366193806814&w=2

I've pushed both to the ceph-client/testing branch, so hopefully that will take care of this in teuthology runs. We should get this to Linus within the next few days.

I ran blogbench on two clients vs. the same cephfs for a couple of hours and didn't see the deadlock, so I feel pretty good about the fix. Please do let me know if you see more deadlocks of this nature and I'll take a closer look.

Actions #16

Updated by Jeff Layton over 4 years ago

  • Status changed from In Progress to Resolved

The patches were merged into the -rc7 kernel, so this should be resolved now.
