Bug #3370: All nfsd hung trying to lock page(s) on export of kclient ceph - CephFS - Ceph

closed

Added by David Zafman over 11 years ago. Updated over 6 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Development

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

The bonnie workunit hung on the NFS client, with NFS reads being retransmitted:

ubuntu 2667 2572 0 Oct18 ? 00:00:00 bash -c mkdir -- /tmp/cephtest/mnt.1/client.1/tmp && cd -- /tmp/cephtest/mnt.1/cli
ubuntu 2669 2667 0 Oct18 ? 00:00:00 /bin/bash /tmp/cephtest/workunit.client.1/suites/bonnie.sh
ubuntu 2672 2669 0 Oct18 ? 00:01:09 /usr/sbin/bonnie++ -n 100

In the syslog the kernel noticed nfsd not making progress:

INFO: task nfsd:1181 blocked for more than 120 seconds.

All 8 nfsd processes look like this:
[<ffffffff8112a20e>] sleep_on_page+0xe/0x20
[<ffffffff8112a1f7>] __lock_page+0x67/0x70
[<ffffffff811aaa2f>] __generic_file_splice_read+0x59f/0x5d0
[<ffffffff811aaa9e>] generic_file_splice_read+0x3e/0x80
[<ffffffff811a921b>] do_splice_to+0x7b/0xa0
[<ffffffff811a94d7>] splice_direct_to_actor+0xa7/0x1c0
[<ffffffffa036b762>] nfsd_vfs_read.isra.13+0x112/0x160 [nfsd]
[<ffffffffa036dc98>] nfsd_read_file+0x88/0xb0 [nfsd]
[<ffffffffa037c7a2>] nfsd4_encode_read+0x132/0x1f0 [nfsd]
[<ffffffffa03815dd>] nfsd4_encode_operation+0x5d/0xa0 [nfsd]
[<ffffffffa037851a>] nfsd4_proc_compound+0x25a/0x630 [nfsd]
[<ffffffffa0367b4e>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
[<ffffffffa025ab19>] svc_process+0x489/0x7a0 [sunrpc]
[<ffffffffa036718d>] nfsd+0xbd/0x1a0 [nfsd]
[<ffffffff810791fe>] kthread+0xae/0xc0
[<ffffffff8163f3c4>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

A direct read attempt through the ceph client:

dd if=/tmp/cephtest/mnt.0/client.1/tmp/Bonnie.2672 of=/dev/null

It hung here:
[<ffffffff8112a22e>] sleep_on_page_killable+0xe/0x40
[<ffffffff8112a187>] __lock_page_killable+0x67/0x70
[<ffffffff8112c63e>] generic_file_aio_read+0x48e/0x730
[<ffffffffa03f1d54>] ceph_aio_read+0x654/0x880 [ceph]
[<ffffffff8117b703>] do_sync_read+0xa3/0xe0
[<ffffffff8117c060>] vfs_read+0xb0/0x180
[<ffffffff8117c17a>] sys_read+0x4a/0x90
[<ffffffff8163e1e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
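
To make the comparison between the two traces explicit, here is a minimal sketch that parses frames in this format and reports where each task is waiting. It is only an illustration: `parse_frames` and `FRAME_RE` are hypothetical names, the regular expression assumes the `[<addr>] symbol+0xoff/0xlen [module]` layout seen above, and the nfsd trace is abridged to a few frames copied from the report.

```python
import re

# Matches frames such as "[<ffffffffa036b762>] nfsd_vfs_read.isra.13+0x112/0x160 [nfsd]".
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_frames(trace):
    """Return (function, module) pairs, innermost frame first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        # The trailing 0xffffffffffffffff sentinel has no symbol and is skipped.
        if match:
            frames.append((match.group(1), match.group(2) or "kernel"))
    return frames


# Abridged copy of the nfsd trace from the report.
NFSD_TRACE = """\
[<ffffffff8112a20e>] sleep_on_page+0xe/0x20
[<ffffffff8112a1f7>] __lock_page+0x67/0x70
[<ffffffff811aaa2f>] __generic_file_splice_read+0x59f/0x5d0
[<ffffffffa036b762>] nfsd_vfs_read.isra.13+0x112/0x160 [nfsd]
[<ffffffffffffffff>] 0xffffffffffffffff
"""

# Abridged copy of the dd trace from the report.
DD_TRACE = """\
[<ffffffff8112a22e>] sleep_on_page_killable+0xe/0x40
[<ffffffff8112a187>] __lock_page_killable+0x67/0x70
[<ffffffff8112c63e>] generic_file_aio_read+0x48e/0x730
[<ffffffffa03f1d54>] ceph_aio_read+0x654/0x880 [ceph]
[<ffffffffffffffff>] 0xffffffffffffffff
"""

if __name__ == "__main__":
    for name, trace in (("nfsd", NFSD_TRACE), ("dd", DD_TRACE)):
        frames = parse_frames(trace)
        waiting_in = frames[1][0]  # caller of the sleep_on_page* helper
        print(f"{name}: waiting in {waiting_in}, entered via {frames[2][0]}")
```

Both tasks turn out to be sleeping in a page-lock wait (`__lock_page` for nfsd, `__lock_page_killable` for dd), reached through different read paths.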

I'm categorizing this as a ceph client issue; it is likely an interaction with the kernel NFS server.

Updated by David Zafman over 11 years ago

Description updated (diff)

I verified that PG_locked was set in the flags field of struct page. I suspected that ceph_readpages() was leaving pages locked, so I ran my test case with that function disabled. That function is not called in a read through the ceph kernel client, but it is part of the readahead that ends up in the code path the kernel NFS server uses to read files.

My Bonnie run with that function disabled got past the I/O portion of the test without hanging. During some earlier testing I didn't see the function finish_read() getting called at all. I presume that's where the unlock_page() for the completed I/O is supposed to occur.
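
The suspected failure mode can be illustrated with a small userspace simulation. This is a sketch under stated assumptions, not kernel code: `Page`, `start_async_read`, and `try_read` are hypothetical names, a `threading.Lock` stands in for the PG_locked bit, and the `completion_runs` flag models whether a finish_read()-style callback ever fires.

```python
import threading


class Page:
    """Stand-in for struct page; the lock models the PG_locked bit."""

    def __init__(self, index):
        self.index = index
        self.lock = threading.Lock()
        self.uptodate = False


def start_async_read(pages, completion_runs):
    """Model a readpages-style readahead: lock each page, then issue async I/O.

    The completion callback is the only place the pages get unlocked, so if
    it never runs (completion_runs=False) the pages stay locked forever.
    """
    for page in pages:
        page.lock.acquire()

    def finish_read():
        for page in pages:
            page.uptodate = True
            page.lock.release()  # analogous to unlock_page()

    if completion_runs:
        worker = threading.Thread(target=finish_read)
        worker.start()
        return worker
    return None


def try_read(page, timeout):
    """Model a reader waiting on the page lock; True means data was read."""
    if not page.lock.acquire(timeout=timeout):
        return False  # in the kernel this reader would simply stay blocked
    try:
        return page.uptodate
    finally:
        page.lock.release()


if __name__ == "__main__":
    good = [Page(i) for i in range(4)]
    start_async_read(good, completion_runs=True).join()
    print("completion ran:    ", all(try_read(p, 0.1) for p in good))

    stuck = [Page(i) for i in range(4)]
    start_async_read(stuck, completion_runs=False)
    print("completion missing:", any(try_read(p, 0.1) for p in stuck))
```

In the second case every reader times out on the lock, which is the userspace analogue of the nfsd threads and `dd` sleeping in the page-lock wait. The simulation says nothing about why the callback would not run in the real client; it only shows the consequence if it does not.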

Updated by Sage Weil over 11 years ago

It might be that leaving the pages locked for the duration of the read is the wrong thing. My recollection is vague, but I think we've switched this behavior around a few different times. In 7c272194e66e91830b90f6202e61c69f8590f1eb we switched from a blocking implementation (which sucked for obvious reasons, but left the pages locked for the duration of the read) to an async one, which still left them locked. I suggest checking other file systems to see what their readpages behavior is...
