Bug #47998

closed

cephfs kernel client hung

Added by geng jichao over 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Files

messages.0 (25.3 KB) - kernel log - geng jichao, 10/26/2020 12:19 PM
stacks_dump (101 KB) - stacks info - geng jichao, 10/30/2020 10:19 AM
osdc (756 KB) - geng jichao, 11/03/2020 07:17 AM
diff (3.24 KB) - geng jichao, 11/03/2020 01:00 PM
0001-ceph-hold-a-reference-to-an-inode-for-readahead-requ.patch (2.3 KB) - RFC patch - Jeff Layton, 11/03/2020 01:51 PM
iput_diff (784 Bytes) - geng jichao, 11/11/2020 04:03 AM
Actions #1

Updated by geng jichao over 3 years ago

When the kernel client is used to mount CephFS, it hangs roughly once a day. The kernel stacks when this happens are as follows:

Oct 22 12:45:15 JXQ-240-226-210 kernel: INFO: task nfsd:1786139 blocked for more than 120 seconds.
Oct 22 12:45:15 JXQ-240-226-210 kernel:      Not tainted 4.19.15+ #1
Oct 22 12:45:15 JXQ-240-226-210 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 12:45:15 JXQ-240-226-210 kernel: nfsd            D    0 1786139      2 0x80000080
Oct 22 12:45:15 JXQ-240-226-210 kernel: Call Trace:
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __schedule+0x2ab/0x880
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? ceph_ioctl+0x4d0/0x4d0 [ceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: schedule+0x32/0x80
Oct 22 12:45:15 JXQ-240-226-210 kernel: __wait_on_freeing_inode+0xaa/0xe0
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? init_wait_var_entry+0x40/0x40
Oct 22 12:45:15 JXQ-240-226-210 kernel: find_inode+0x7a/0xb0
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? ceph_ioctl+0x4d0/0x4d0 [ceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ilookup5_nowait+0x6c/0xa0
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? ceph_ioctl+0x4d0/0x4d0 [ceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ilookup5+0x2a/0x90
Oct 22 12:45:15 JXQ-240-226-210 kernel: __fh_to_dentry+0x51/0x150 [ceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? nfsd_proc_read+0x190/0x190 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: exportfs_decode_fh+0x63/0x310
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? exp_find_key+0x79/0xc0 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? exp_find+0x57/0xb0 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? _cond_resched+0x15/0x30
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? kmem_cache_alloc+0x170/0x1c0
Oct 22 12:45:15 JXQ-240-226-210 kernel: fh_verify+0x299/0x5f0 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: nfsd_open+0x3c/0x1e0 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: nfsd_read+0x5b/0x1d0 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: nfsd3_proc_read+0xb0/0x150 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: nfsd_dispatch+0xb7/0x250 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: svc_process_common+0x399/0x750 [sunrpc]
Oct 22 12:45:15 JXQ-240-226-210 kernel: svc_process+0xeb/0x100 [sunrpc]
Oct 22 12:45:15 JXQ-240-226-210 kernel: nfsd+0xe3/0x150 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: kthread+0xf8/0x130
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? nfsd_destroy+0x60/0x60 [nfsd]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? kthread_bind+0x10/0x10
Oct 22 12:45:15 JXQ-240-226-210 kernel: ret_from_fork+0x35/0x40
Oct 22 12:45:15 JXQ-240-226-210 kernel: INFO: task kworker/51:2:3089805 blocked for more than 120 seconds.
Oct 22 12:45:15 JXQ-240-226-210 kernel:      Not tainted 4.19.15+ #1
Oct 22 12:45:15 JXQ-240-226-210 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 12:45:15 JXQ-240-226-210 kernel: kworker/51:2    D    0 3089805      2 0x80000080
Oct 22 12:45:15 JXQ-240-226-210 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: Call Trace:
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __schedule+0x2ab/0x880
Oct 22 12:45:15 JXQ-240-226-210 kernel: schedule+0x32/0x80
Oct 22 12:45:15 JXQ-240-226-210 kernel: schedule_preempt_disabled+0xa/0x10
Oct 22 12:45:15 JXQ-240-226-210 kernel: __mutex_lock.isra.8+0x19e/0x4d0
Oct 22 12:45:15 JXQ-240-226-210 kernel: ceph_check_caps+0x4d8/0xb50 [ceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? _cond_resched+0x15/0x30
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __pagevec_release+0x2b/0x30
Oct 22 12:45:15 JXQ-240-226-210 kernel: ceph_put_wrbuffer_cap_refs+0x1da/0x2d0 [ceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: writepages_finish+0x292/0x430 [ceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: __complete_request+0x26/0x80 [libceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: dispatch+0x354/0xbc0 [libceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? read_partial_message+0x265/0x810 [libceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: try_read+0x801/0x11f0 [libceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x34/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x40/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x34/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x40/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x34/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x40/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x34/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x40/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x40/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x34/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to+0x8c/0x480
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? __switch_to_asm+0x34/0x70
Oct 22 12:45:15 JXQ-240-226-210 kernel: ceph_con_workfn+0xa8/0x5c0 [libceph]
Oct 22 12:45:15 JXQ-240-226-210 kernel: process_one_work+0x171/0x370
Oct 22 12:45:15 JXQ-240-226-210 kernel: worker_thread+0x49/0x3f0
Oct 22 12:45:15 JXQ-240-226-210 kernel: kthread+0xf8/0x130
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? max_active_store+0x80/0x80
Oct 22 12:45:15 JXQ-240-226-210 kernel: ? kthread_bind+0x10/0x10
Oct 22 12:45:15 JXQ-240-226-210 kernel: ret_from_fork+0x35/0x40

Kernel version 4.19.15+

ceph: 12.2.12

Actions #2

Updated by geng jichao over 3 years ago

Actions #3

Updated by Matt Benjamin over 3 years ago

  • Project changed from rgw to CephFS
Actions #4

Updated by Patrick Donnelly over 3 years ago

  • Target version deleted (v12.2.14)
  • Tags deleted (cephfs)
  • Component(FS) kceph added
Actions #5

Updated by Jeff Layton over 3 years ago

  • Assignee set to Jeff Layton

Note that exporting kcephfs via knfsd is not a well-tested configuration. Most folks that want to export cephfs via NFS are using nfs-ganesha. In addition, v4.19 is 2 years old at this point, so you might be best served testing a newer kernel (preferably v5.9 or later) to see if it helps.

If you do want to pursue tracking this down though:

Looking at your stack traces, they don't seem to be directly related, but they could ultimately have the same cause. I can't tell from that info though:

The first one shows the task hung while searching for an inode in order to issue a read against it. The inode in question is probably in the process of being evicted from the cache at the time. Why that's not moving along is not clear.
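For context, a minimal sketch of that lookup path (simplified and hypothetical; this is not the actual fs/ceph/export.c code, and ino_test/lookup_by_ino are illustrative names). It shows why a filehandle lookup can sleep when it races with inode eviction:

#include <linux/fs.h>

/* If ilookup5() finds a matching inode that is currently being evicted
 * (I_FREEING/I_WILL_FREE), it sleeps in __wait_on_freeing_inode() until
 * the eviction finishes. */
static int ino_test(struct inode *inode, void *data)
{
        return inode->i_ino == *(unsigned long *)data;
}

static struct inode *lookup_by_ino(struct super_block *sb, unsigned long ino)
{
        /* may block while evict() is still truncating the inode's page cache */
        return ilookup5(sb, ino, ino_test, &ino);
}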

The second stack shows the task working on a reply to an OSD write request. It's releasing cap references, and that looks like it's waiting on a mutex -- probably the session->s_mutex, but could be a different one. In any case, it's not clear why the mutex isn't being released.

To figure out the cause of these problems, you'll probably need to crash the box, collect a core dump and analyze it, or run the debugger on the live kernel.

Actions #6

Updated by geng jichao over 3 years ago

Thank you for your reply. Unfortunately, while I was reproducing this problem, another one occurred: the kernel oopsed directly. As you suggested, I'll upgrade the client; is the latest kernel (5.9-rc5) compatible with cluster version 12.2.12?

The oops stack:
Oct 28 15:01:33 YZ-25-60-4 kernel: list_del corruption. prev->next should be ffff8eba12fbb2f8, but was ffff8eba12fbb2a8
Oct 28 15:01:33 YZ-25-60-4 kernel: WARNING: CPU: 31 PID: 84043 at lib/list_debug.c:53 __list_del_entry_valid+0x7c/0xa0
Oct 28 15:01:33 YZ-25-60-4 kernel: Modules linked in: rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ceph libceph libcrc32c dns_resolver bonding dm_mirror dm_region_hash dm_log dm_mod intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd iTCO_wdt glue_helper dcdbas iTCO_vendor_support mxm_wmi ipmi_si ipmi_devintf sg pcspkr ipmi_msghandler mei_me mei lpc_ich acpi_power_meter wmi tcp_dctcp ip_tables ext4 mbcache jbd2 sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe igb ahci libahci libata crc32c_intel megaraid_sas i2c_algo_bit i2c_core mdio dca
Oct 28 15:01:33 YZ-25-60-4 kernel: CPU: 31 PID: 84043 Comm: kworker/31:1 Tainted: G W 4.19.15+ #1
Oct 28 15:01:33 YZ-25-60-4 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.3.4 11/08/2016
Oct 28 15:01:33 YZ-25-60-4 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: RIP: 0010:__list_del_entry_valid+0x7c/0xa0
Oct 28 15:01:33 YZ-25-60-4 kernel: Code: 31 c0 c3 48 89 fe 31 c0 48 c7 c7 48 c4 ca a9 e8 0a ec cc ff 0f 0b 31 c0 c3 48 89 fe 31 c0 48 c7 c7 08 c4 ca a9 e8 f4 eb cc ff <0f> 0b 31 c0 c3 48 89 fe 31 c0 48 c7 c7 d0 c3 ca a9 e8 de eb cc ff
Oct 28 15:01:33 YZ-25-60-4 kernel: RSP: 0018:ffffb8e8cbb07b60 EFLAGS: 00010286
Oct 28 15:01:33 YZ-25-60-4 kernel: RAX: 0000000000000000 RBX: ffff8eba12fbb2d0 RCX: 0000000000000006
Oct 28 15:01:33 YZ-25-60-4 kernel: RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff8eba3fbd68f0
Oct 28 15:01:33 YZ-25-60-4 kernel: RBP: ffff8eaa3ac8b61c R08: 0000000000000000 R09: 000000000000c702
Oct 28 15:01:33 YZ-25-60-4 kernel: R10: 0000000000000004 R11: 0000000000000000 R12: ffff8eaa3ac8b400
Oct 28 15:01:33 YZ-25-60-4 kernel: R13: ffff8eba12fbb2f8 R14: ffff8eba12ff93b0 R15: ffff8eba3e99a000
Oct 28 15:01:33 YZ-25-60-4 kernel: FS: 0000000000000000(0000) GS:ffff8eba3fbc0000(0000) knlGS:0000000000000000
Oct 28 15:01:33 YZ-25-60-4 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 28 15:01:33 YZ-25-60-4 kernel: CR2: 00007f898ab83000 CR3: 0000000a8e00a006 CR4: 00000000001606e0
Oct 28 15:01:33 YZ-25-60-4 kernel: Call Trace:
Oct 28 15:01:33 YZ-25-60-4 kernel: ceph_dentry_lru_del+0x42/0xa0 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: ceph_d_release+0x21/0x80 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: __dentry_kill+0xf9/0x180
Oct 28 15:01:33 YZ-25-60-4 kernel: d_prune_aliases+0x47/0xd0
Oct 28 15:01:33 YZ-25-60-4 kernel: trim_caps_cb+0x205/0x2e0 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: ? __ceph_destroy_xattrs+0x1d/0x100 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: ? ceph_destroy_inode+0x128/0x230 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: iterate_session_caps+0xab/0x250 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: ? __choose_mds+0x4c0/0x4c0 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: ceph_trim_caps+0x3f/0xb0 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: dispatch+0x4d5/0xab0 [ceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: try_read+0x801/0x11f0 [libceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: ? __switch_to_asm+0x34/0x70
Oct 28 15:01:33 YZ-25-60-4 kernel: ? __switch_to_asm+0x40/0x70
Oct 28 15:01:33 YZ-25-60-4 kernel: ? __switch_to_asm+0x34/0x70
Oct 28 15:01:33 YZ-25-60-4 kernel: ceph_con_workfn+0xa8/0x5c0 [libceph]
Oct 28 15:01:33 YZ-25-60-4 kernel: process_one_work+0x171/0x370
Oct 28 15:01:33 YZ-25-60-4 kernel: worker_thread+0x49/0x3f0
Oct 28 15:01:33 YZ-25-60-4 kernel: kthread+0xf8/0x130
Oct 28 15:01:33 YZ-25-60-4 kernel: ? max_active_store+0x80/0x80
Oct 28 15:01:33 YZ-25-60-4 kernel: ? kthread_bind+0x10/0x10
Oct 28 15:01:33 YZ-25-60-4 kernel: ret_from_fork+0x35/0x40
Oct 28 15:01:33 YZ-25-60-4 kernel: ---[ end trace 34e7f9cbaf2c2143 ]---
Oct 28 15:01:33 YZ-25-60-4 kernel: ------------[ cut here ]------------

Actions #7

Updated by geng jichao over 3 years ago

Through further reproduction and analysis, I have made new discoveries. It is similar to a problem that was solved before (https://github.com/ceph/ceph-client/commit/3e1d0452edceebb903d23db53201013c940bf000), but it's not exactly the same.

The main stack information is as follows:

A thread's stack:
[<0>] io_schedule+0x12/0x40
[<0>] __lock_page+0x105/0x150
[<0>] truncate_inode_pages_range+0x4ac/0x860
[<0>] evict+0x183/0x1a0
[<0>] __dentry_kill+0xd7/0x180
[<0>] dentry_kill+0x4d/0x110
[<0>] dput+0xb5/0xd0
[<0>] __fput+0x12b/0x210
[<0>] delayed_fput+0x21/0x30
[<0>] process_one_work+0x171/0x370
[<0>] worker_thread+0x49/0x3f0
[<0>] kthread+0xf8/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

The kernel is actively releasing the inode, and it does not hold s_mutex. At this point the inode state is freeing, and it is waiting for an OSD reply.

B thread stack:
[<0>] __wait_on_freeing_inode+0xaa/0xe0
[<0>] find_inode+0x7a/0xb0
[<0>] ilookup5_nowait+0x6c/0xa0
[<0>] ilookup5+0x2a/0x90
[<0>] iget5_locked+0x26/0x80
[<0>] ceph_get_inode+0x2d/0xc0 [ceph]
[<0>] ceph_fill_trace+0x28a/0x950 [ceph]
[<0>] handle_reply+0x4c6/0xc90 [ceph]
[<0>] dispatch+0x134/0xab0 [ceph]
[<0>] try_read+0x801/0x11f0 [libceph]
[<0>] ceph_con_workfn+0xa8/0x5c0 [libceph]
[<0>] process_one_work+0x171/0x370
[<0>] worker_thread+0x49/0x3f0
[<0>] kthread+0xf8/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

It holds s_mutex and mdsc->snap_rwsem, and is blocked in __wait_on_freeing_inode.

C thread stack:
[<0>] ceph_check_caps+0x4d8/0xb50 [ceph]
[<0>] ceph_put_wrbuffer_cap_refs+0x1da/0x2d0 [ceph]
[<0>] writepages_finish+0x292/0x430 [ceph]
[<0>] __complete_request+0x26/0x80 [libceph]
[<0>] dispatch+0x354/0xbc0 [libceph]
[<0>] try_read+0x801/0x11f0 [libceph]
[<0>] ceph_con_workfn+0xa8/0x5c0 [libceph]
[<0>] process_one_work+0x171/0x370
[<0>] worker_thread+0x49/0x3f0
[<0>] kthread+0xf8/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

But because it can't get the s_mutex, the OSD reply cannot be processed.
In other words, B is waiting for A, A is waiting for C, and C is waiting for B, which results in a deadlock.
Even calling iput_final() in a workqueue cannot solve this problem. The s_mutex and snap_rwsem should be released before ceph_find_inode is called.

Actions #8

Updated by Jeff Layton over 3 years ago

Thanks for the analysis -- that looks like a real problem. I don't immediately see a way to fix it, however. We could try to offload ceph_check_caps to a workqueue or something, but that's a bit of a hack. I'll have to take a harder look and see if I can come up with an approach.

Actions #9

Updated by Jeff Layton over 3 years ago

Also, to answer your earlier question: Yes, later kernels should work with earlier cluster versions. I'd still like to see this reproduced on a newer kernel.

I'm now wondering how it got into this state, as it should not be calling iput_final on an inode that still has outstanding references, and the dirty pages should hold a reference to the inode at this point. Something seems off with the inode refcounting here.

Actions #10

Updated by geng jichao over 3 years ago

I guess this is the case:
Stacks A and B are processing the same inode, but stack C is a different one. The inode corresponding to stack A may not have dirty data, but it is waiting for readahead pages that can't be handled because all the workqueues are blocked. In addition, I found some read requests that had not been processed in the /sys/kernel/debug/xxx/osdc file.

Actions #11

Updated by Jeff Layton over 3 years ago

geng jichao wrote:

I guess this is the case:
Stacks A and B are processing the same inode, but stack C is a different one. The inode corresponding to stack A may not have dirty data, but it is waiting for readahead pages that can't be handled because all the workqueues are blocked.

I don't think that can be the case. If the inode is being evicted we would not be doing readahead against it as it can have no outstanding references.

I think the problem boils down to:

You have an inode being evicted and can't lock a page in its mapping. The question is: what's holding that page lock and why isn't it being released?

In some filesystems, you can evict the inode before all of the pages are done being written back, but that's not the case in ceph as the wrbuffer cap refs imply an inode reference.

Actions #12

Updated by geng jichao over 3 years ago

iget5_locked, called by ceph_get_inode, may block in some cases, so no locks should be held when ceph_get_inode is called. I have an untested patch; could you review it?

Actions #13

Updated by Jeff Layton over 3 years ago

geng jichao wrote:

iget5_locked, called by ceph_get_inode, may block in some cases, so no locks should be held when ceph_get_inode is called. I have an untested patch; could you review it?

I don't think that's what we want. That may work around the deadlock, but those locks are supposed to be held to protect certain data (though what that data is is not always clear). Releasing and reacquiring them like this is not likely to be safe.

If those locks are not actually needed at this point in the processing (and I suspect that they really aren't), what I'd prefer to see is a patch to reduce the scope of these locks so that iget5_locked is never called with them held. That will take careful analysis of the code however.

You mentioned readahead in an earlier post though and I had a look...

The readahead code does not hold a reference to the inode while the I/O is in flight. I think we need it to, or we could end up with the inode being evicted while I/O is still in progress (with a locked page), and that could set us up for a scenario very similar to the one you've described.

Part of the problem is the single-threaded nature of the OSD reply processing, but the s_mutex really has too much coverage overall. Those will take more structural fixes however.

What I think we probably want to do for now is something like this patch, which I plan to test today. You may want to try it out as well.
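For illustration, the rough shape of such a change might look like the sketch below. This is an assumption-laden outline, not the attached RFC patch; submit_readahead and readahead_done are hypothetical names standing in for the readahead submission and completion paths in fs/ceph/addr.c:

#include <linux/fs.h>
#include <linux/ceph/osd_client.h>

/* Pin the inode for the lifetime of a readahead OSD request and drop the
 * reference once the reply has been handled. */
static void readahead_done(struct ceph_osd_request *req)
{
        struct inode *inode = req->r_inode;

        /* ... unlock and mark up-to-date the pages that were read in, as the
         * existing completion callback does today ... */

        /* Drop the reference taken at submission time. The real fix may need
         * an asynchronous iput here to avoid doing the final eviction from
         * the message-handling context (see the later comments about
         * ceph_async_iput vs. iput). */
        iput(inode);
}

static int submit_readahead(struct inode *inode, struct ceph_osd_request *req)
{
        ihold(inode);                   /* keep the inode from being evicted mid-I/O */
        req->r_inode = inode;
        req->r_callback = readahead_done;
        return ceph_osdc_start_request(req->r_osdc, req, false);
}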

Actions #14

Updated by Jeff Layton over 3 years ago

  • Status changed from New to In Progress
Actions #15

Updated by Jeff Layton over 3 years ago

Oh, a friend pointed out that this patch may end up giving us "Busy inodes after umount" problems if the filesystem is unmounted before readahead completes. We'll probably need to mitigate that in order to fix this correctly, but if this patch avoids the deadlock on long-running mountpoints then that would be good to know.

Actions #16

Updated by Jeff Layton over 3 years ago

I'm now wondering whether you might be hitting the problem in 3e1d0452edcee after all. Have you tested a kernel with that patch and determined whether this is a problem on it?

Actions #17

Updated by geng jichao over 3 years ago

Jeff Layton wrote:

I'm now wondering whether you might be hitting the problem in 3e1d0452edcee after all. Have you tested a kernel with that patch and determined whether this is a problem on it?

It hasn't been tested yet. I will probably upgrade the kernel version to 5.4 (which includes 3e1d0452edcee) in the production environment next week. The probability of reproducing this problem in the test environment is very low.

Actions #18

Updated by Jeff Layton over 3 years ago

My guess is that it probably won't help, actually. I think that you're correct that this is a similar problem but not the same one that 3e1d0452edcee is intended to fix.

One thing that might fix this particular issue is moving the iget5_locked call to before we take the s_mutex in handle_reply. I think that'd be safe, and would sidestep the main problem in the call stacks above. I do wonder if we have other, similar deadlocks lurking in here though.
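A hypothetical sketch of that reordering is below (simplified and assumption-based; it is not the eventual patch, and the variable names in/tvino are illustrative). The point is simply that the inode lookup, which can sleep in __wait_on_freeing_inode(), happens before s_mutex is taken:

/* Current shape (simplified): the lookup happens with s_mutex held. */
mutex_lock(&session->s_mutex);
err = ceph_fill_trace(sb, req);        /* may call ceph_get_inode -> iget5_locked,
                                          which can block on a freeing inode */
mutex_unlock(&session->s_mutex);

/* Proposed shape (simplified): resolve the target inode first, with no MDS
 * session locks held, then take s_mutex to fill in the trace. */
in = ceph_get_inode(sb, tvino);        /* iget5_locked() can safely wait here */
mutex_lock(&session->s_mutex);
err = ceph_fill_trace(sb, req);        /* uses the already-instantiated inode */
mutex_unlock(&session->s_mutex);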

Actions #19

Updated by Luis Henriques over 3 years ago

Jeff Layton wrote:

My guess is that it probably won't help, actually. I think that you're correct that this is a similar problem but not the same one that 3e1d0452edcee is intended to fix.

One thing that might fix this particular issue is moving the iget5_locked call to before we take the s_mutex in handle_reply. I think that'd be safe, and would sidestep the main problem in the call stacks above. I do wonder if we have other, similar deadlocks lurking in here though.

I may be hitting a similar issue with xfstests generic/014. Here are the two splats I managed to collect:

[ 1212.651594] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kwo]
[ 1212.653267] Kernel panic - not syncing: softlockup: hung tasks
[ 1212.654206] CPU: 0 PID: 5050 Comm: kworker/0:3 Tainted: G             L    5.10.0-6
[ 1212.655501] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.04
[ 1212.657125] Workqueue: events delayed_work
[ 1212.657714] Call Trace:
[ 1212.658077]  <IRQ>
[ 1212.658398]  dump_stack+0x9a/0xcc
[ 1212.658905]  panic+0x1b3/0x3b5
[ 1212.659425]  ? print_oops_end_marker.cold+0x10/0x10
[ 1212.660221]  ? add_taint+0x26/0x70
[ 1212.660714]  watchdog_timer_fn.cold+0xa8/0xc1
[ 1212.661369]  ? lockup_detector_update_enable+0x50/0x50
[ 1212.662095]  __hrtimer_run_queues+0x3a6/0x730
[ 1212.662642]  ? enqueue_hrtimer+0x1a0/0x1a0
[ 1212.663200]  ? recalibrate_cpu_khz+0x10/0x10
[ 1212.663806]  ? ktime_get_update_offsets_now+0xe4/0x1e0
[ 1212.664483]  hrtimer_run_queues+0x100/0x120
[ 1212.665156]  run_local_timers+0x1b/0x50
[ 1212.665748]  update_process_times+0x5f/0xa0
[ 1212.666469]  tick_handle_periodic+0x29/0x70
[ 1212.667149]  __sysvec_apic_timer_interrupt+0xda/0x300
[ 1212.667799]  asm_call_irq_on_stack+0x12/0x20
[ 1212.668369]  </IRQ>
[ 1212.668737]  sysvec_apic_timer_interrupt+0x84/0xa0
[ 1212.669414]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1212.670230] RIP: 0010:lock_acquire+0x191/0x4d0
[ 1212.670886] Code: d5 f9 ef 7e e8 30 bb ff ff b8 ff ff ff ff 48 83 c4 20 65 0f c1 07
[ 1212.673306] RSP: 0018:ffffc900067578d8 EFLAGS: 00000246
[ 1212.674043] RAX: 0000000000000001 RBX: 1ffff92000ceaf1d RCX: dffffc0000000000
[ 1212.675301] RDX: 0000000000000003 RSI: 0000000000000008 RDI: ffff888112b45d38
[ 1212.676356] RBP: 0000000000000001 R08: ffffffff8111bc34 R09: fffffbfff0681c0e
[ 1212.677448] R10: 0000000000000246 R11: fffffbfff0681c0d R12: 0000000000000000
[ 1212.678477] R13: 0000000000000000 R14: ffff888110c2a408 R15: 0000000000000000
[ 1212.679527]  ? __lock_acquire+0x8a4/0x39b0
[ 1212.680193]  ? lock_release+0x3c0/0x3c0
[ 1212.680667]  ? do_raw_spin_lock+0x119/0x1d0
[ 1212.681355]  ? rwlock_bug.part.0+0x60/0x60
[ 1212.682082]  ? ceph_check_caps+0x37e/0xfc0
[ 1212.682737]  ? do_raw_spin_unlock+0xa3/0x130
[ 1212.683323]  _raw_spin_lock+0x2c/0x40
[ 1212.683893]  ? __cap_delay_requeue+0x68/0x280
[ 1212.684605]  __cap_delay_requeue+0x68/0x280
[ 1212.685207]  ceph_check_caps+0xda0/0xfc0
[ 1212.685911]  ? __ceph_should_report_size+0x90/0x90
[ 1212.686770]  ? lockdep_enabled+0x39/0x50
[ 1212.687501]  ? rcu_read_lock_sched_held+0xa1/0xd0
[ 1212.688323]  ? rcu_read_lock_bh_held+0xb0/0xb0
[ 1212.689072]  ? igrab+0x19/0x60
[ 1212.689609]  ? find_held_lock+0x85/0xa0
[ 1212.690167]  ? find_held_lock+0x85/0xa0
[ 1212.690789]  ? ceph_check_delayed_caps+0x119/0x1d0
[ 1212.691397]  ? lock_downgrade+0x370/0x370
[ 1212.691980]  ? rwlock_bug.part.0+0x60/0x60
[ 1212.692503]  ? __asan_store1+0x32/0x50
[ 1212.693071]  ceph_check_delayed_caps+0x13c/0x1d0
[ 1212.693777]  delayed_work+0x1fd/0x300
[ 1212.694308]  process_one_work+0x5dd/0xae0
[ 1212.694935]  ? queue_delayed_work_on+0x90/0x90
[ 1212.695669]  ? rwlock_bug.part.0+0x60/0x60
[ 1212.696265]  worker_thread+0x78/0x670
[ 1212.696736]  ? process_one_work+0xae0/0xae0
[ 1212.697352]  kthread+0x1af/0x200
[ 1212.697783]  ? kthread_create_on_node+0xd0/0xd0
[ 1212.698445]  ret_from_fork+0x22/0x30
[ 1212.699110] Kernel Offset: disabled
[ 1212.699723] ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---

And another one:
[ 1044.619044] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kwor]
[ 1044.620428] Kernel panic - not syncing: softlockup: hung tasks
[ 1044.621344] CPU: 0 PID: 7321 Comm: kworker/0:1 Tainted: G             L    5.10.0-r6
[ 1044.622598] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-4
[ 1044.624212] Workqueue: events delayed_work
[ 1044.624759] Call Trace:                                                             
[ 1044.625198]  <IRQ>                                                                  
[ 1044.625536]  dump_stack+0x9a/0xcc    
[ 1044.626070]  panic+0x1b3/0x3b5                                                      
[ 1044.626482]  ? print_oops_end_marker.cold+0x10/0x10
[ 1044.627089]  ? add_taint+0x26/0x70
[ 1044.627524]  watchdog_timer_fn.cold+0xa8/0xc1
[ 1044.628037]  ? lockup_detector_update_enable+0x50/0x50
[ 1044.628798]  __hrtimer_run_queues+0x3a6/0x730
[ 1044.629348]  ? enqueue_hrtimer+0x1a0/0x1a0
[ 1044.629838]  ? recalibrate_cpu_khz+0x10/0x10
[ 1044.630506]  ? ktime_get_update_offsets_now+0xe4/0x1e0
[ 1044.631222]  hrtimer_run_queues+0x100/0x120
[ 1044.631851]  run_local_timers+0x1b/0x50
[ 1044.632430]  update_process_times+0x5f/0xa0
[ 1044.632995]  tick_handle_periodic+0x29/0x70
[ 1044.633543]  __sysvec_apic_timer_interrupt+0xda/0x300
[ 1044.634201]  asm_call_irq_on_stack+0x12/0x20
[ 1044.634897]  </IRQ>
[ 1044.635184]  sysvec_apic_timer_interrupt+0x84/0xa0
[ 1044.635950]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1044.636648] RIP: 0010:lock_acquire+0x191/0x4d0
[ 1044.637387] Code: d5 f9 ef 7e e8 30 bb ff ff b8 ff ff ff ff 48 83 c4 20 65 0f c1 057
[ 1044.640183] RSP: 0018:ffffc90003847b88 EFLAGS: 00000246
[ 1044.640841] RAX: 0000000000000001 RBX: 1ffff92000708f73 RCX: dffffc0000000000
[ 1044.641851] RDX: 0000000000000003 RSI: 0000000000000008 RDI: ffff888121e25d38
[ 1044.642895] RBP: 0000000000000001 R08: ffffffff8111bc34 R09: fffffbfff0681c0e
[ 1044.643862] R10: 0000000000000246 R11: fffffbfff0681c0d R12: 0000000000000000
[ 1044.644698] R13: 0000000000000000 R14: ffff88812b3aab48 R15: 0000000000000000
[ 1044.645759]  ? __lock_acquire+0x8a4/0x39b0
[ 1044.646255]  ? lock_release+0x3c0/0x3c0
[ 1044.646845]  ? lock_release+0x3c0/0x3c0
[ 1044.647411]  ? ceph_check_delayed_caps+0x119/0x1d0
[ 1044.648051]  ? do_raw_spin_lock+0x119/0x1d0
[ 1044.648648]  ? rwlock_bug.part.0+0x60/0x60
[ 1044.649212]  ? __asan_store1+0x32/0x50
[ 1044.649724]  _raw_spin_lock+0x2c/0x40
[ 1044.650152]  ? igrab+0x19/0x60
[ 1044.650621]  igrab+0x19/0x60
[ 1044.651073]  ceph_check_delayed_caps+0xd7/0x1d0
[ 1044.651653]  delayed_work+0x1fd/0x300
[ 1044.652222]  process_one_work+0x5dd/0xae0
[ 1044.652839]  ? queue_delayed_work_on+0x90/0x90
[ 1044.653373]  ? rwlock_bug.part.0+0x60/0x60
[ 1044.653885]  worker_thread+0x78/0x670
[ 1044.654385]  ? process_one_work+0xae0/0xae0
[ 1044.654899]  kthread+0x1af/0x200
[ 1044.655312]  ? kthread_create_on_node+0xd0/0xd0
[ 1044.656027]  ret_from_fork+0x22/0x30
[ 1044.656601] Kernel Offset: disabled
[ 1044.657215] ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---

Note that I was very lucky to trigger it twice -- it happens only very sporadically.

I started looking at this yesterday (I wasn't aware of this bug), and my theory was that there were too many caps to handle in ceph_check_delayed_caps(). This function is called from a workqueue and then ceph_async_iput() queues work to yet another workqueue (within a loop that is very hard to follow). Anyway, I haven't managed to reach a solid conclusion yet.

Actions #20

Updated by Luis Henriques over 3 years ago

Luis Henriques wrote:
...

I started looking at this yesterday (I wasn't aware of this bug), and my theory was that there were too many caps to handle in ceph_check_delayed_caps(). This function is called from a workqueue and then ceph_async_iput() queues work to yet another workqueue (within a loop that is very hard to follow). Anyway, I haven't managed to reach a solid conclusion yet.

Ok, sorry for the noise. This doesn't seem to be related to the original problem.

I'm convinced that my problem is related to the fact that I'm running my kernel with KASAN, which makes it really slow. ceph_check_delayed_caps then takes a really long time to drain the caps list, which I believe may get really big when running the xfstests generic tests (I always saw this with generic/014, but only when running tests 001-014).

Actions #21

Updated by Jeff Layton over 3 years ago

Yeah, those look like they are stuck on spinlocks. This bug is more about sleeping locks (mutexes, rwsems and the page locks). That alone makes those unlikely to be the same problem, I think.

Note that I'm also running a KASAN-enabled kernel with xfstests and haven't seen this.

Actions #22

Updated by Luis Henriques over 3 years ago

Jeff Layton wrote:

Yeah, those look like they are stuck on spinlocks. This bug is more about sleeping locks (mutexes, rwsems and the page locks). That alone makes those unlikely to be the same problem, I think.

Note that I'm also running a KASAN-enabled kernel with xfstests and haven't seen this.

Yeah, I'm seeing this issue while running xfstests in a very unorthodox way: several clients running the same tests at the same time against a vstart cluster, which means that most of the tests would fail as the clients were stepping on each other's toes. Anyway, that's off-topic for this bug and I'll stop now. Sorry for the noise :-)

Actions #23

Updated by geng jichao over 3 years ago

Jeff Layton wrote:

My guess is that it probably won't help, actually. I think that you're correct that this is a similar problem but not the same one that 3e1d0452edcee is intended to fix.

One thing that might fix this particular issue is moving the iget5_locked call to before we take the s_mutex in handle_reply. I think that'd be safe, and would sidestep the main problem in the call stacks above. I do wonder if we have other, similar deadlocks lurking in here though.

Finally, I didn't upgrade the kernel to 5.4. I just merged the patch you committed, with one small difference: it doesn't call ceph_async_iput but uses iput directly. It has been working well for six days.

Actions #24

Updated by Jeff Layton over 3 years ago

Ok, I sent a patch to the mailing list (and to you) this morning that may help with this. The basic idea is to move the iget5_locked call outside the mutexes so that we don't end up holding the s_mutex while waiting for the eviction to finish. It seems like this might have fixed the deadlock you originally reported, but I do wonder whether that is just one of a class of similar deadlocks that might not be helped by this.

Still, this patch shouldn't hurt anything and may very well fix this particular issue.

Actions #25

Updated by Jeff Layton over 3 years ago

  • Status changed from In Progress to Fix Under Review
Actions #26

Updated by Jeff Layton over 2 years ago

Fixed in mainline in commit bca9fc14c70fc.

Actions #27

Updated by Jeff Layton over 2 years ago

  • Status changed from Fix Under Review to Resolved