Bug #15255 (closed)

kclient: lost filesystem operations (on mds disconnect?)

Added by Greg Farnum about 8 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Crash signature (v1): -
Crash signature (v2): -

Description

We saw this on our sepia teuthology kernel client mount. I saved the dmesg output at teuthology:~ubuntu/322_kernel_hang.log, but we don't have MDS logging turned on so there was nothing useful there. :(
Looking at the debugfs ceph output, there were no outstanding mdsc requests. The MDS didn't have any either. The MDS was not aware of any issues. But access to many files (not all of them!) was hanging. New file creation worked. We eventually had to reboot the box; a lazy reboot did not repair all issues.
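For anyone re-checking this, the debugfs inspection is easy to script. A minimal sketch (not part of the original report) that dumps any in-flight mdsc/osdc requests per kernel-client mount, assuming the usual /sys/kernel/debug/ceph/<fsid>.client<id>/ layout with plain-text "mdsc" and "osdc" files:

#!/usr/bin/env python3
# Minimal sketch: list outstanding MDS and OSD requests for every kernel-client
# mount via the ceph debugfs directory. Run as root; the directory layout and
# "one request per line" file format are assumptions about this kernel version.
import glob
import os

DEBUGFS = "/sys/kernel/debug/ceph"

for client_dir in sorted(glob.glob(os.path.join(DEBUGFS, "*"))):
    if not os.path.isdir(client_dir):
        continue
    print(f"== {os.path.basename(client_dir)} ==")
    for name in ("mdsc", "osdc"):
        path = os.path.join(client_dir, name)
        try:
            with open(path) as f:
                lines = [l.rstrip("\n") for l in f if l.strip()]
        except OSError as e:
            print(f"  {name}: unreadable ({e})")
            continue
        print(f"  {name}: {len(lines)} outstanding request(s)")
        for line in lines:
            print(f"    {line}")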

Judging by the caps going stale and then a swarm of OSD disconnects, it looks like a timeout occurred (...for unknown reasons) and then the client got blacklisted (starting at Tue Mar 22 08:00:10 2016). Then we start getting backtraces:

[Tue Mar 22 08:02:38 2016] INFO: task teuthology:27943 blocked for more than 120 seconds.
[Tue Mar 22 08:02:38 2016]       Tainted: G        W   E   4.2.0-rc8-ceph-00011-g4a18ede #1
[Tue Mar 22 08:02:38 2016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Mar 22 08:02:38 2016] teuthology      D ffff88036d78cdd8     0 27943   2902 0x00000000
[Tue Mar 22 08:02:38 2016]  ffff8802c8bbf938 0000000000000092 ffff8806a3f817e8 0000000000000042
[Tue Mar 22 08:02:38 2016]  ffffffff81c13580 ffff88036d78c880 ffff88081b68d098 ffff8802c8bc0000
[Tue Mar 22 08:02:38 2016]  7fffffffffffffff 7fffffffffffffff 0000000000000002 ffffffff817bd3d0
[Tue Mar 22 08:02:38 2016] Call Trace:
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bd3d0>] ? bit_wait_io_timeout+0x60/0x60
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bcc6d>] schedule+0x6d/0x90
[Tue Mar 22 08:02:38 2016]  [<ffffffff817c0a6f>] schedule_timeout+0x3f/0x350
[Tue Mar 22 08:02:38 2016]  [<ffffffff81025139>] ? read_tsc+0x9/0x10
[Tue Mar 22 08:02:38 2016]  [<ffffffff81106764>] ? ktime_get+0xc4/0x110
[Tue Mar 22 08:02:38 2016]  [<ffffffff81152a13>] ? __delayacct_blkio_start+0x23/0x30
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bd3d0>] ? bit_wait_io_timeout+0x60/0x60
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bc0e4>] io_schedule_timeout+0xa4/0x110
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bd418>] bit_wait_io+0x48/0x50
[Tue Mar 22 08:02:38 2016]  [<ffffffff817bcf8b>] __wait_on_bit_lock+0x4b/0xa0
[Tue Mar 22 08:02:38 2016]  [<ffffffff8119ed3e>] __lock_page+0xae/0xb0
[Tue Mar 22 08:02:38 2016]  [<ffffffff810ca690>] ? wake_atomic_t_function+0x30/0x30
[Tue Mar 22 08:02:38 2016]  [<ffffffff811a062a>] pagecache_get_page+0x9a/0x1e0
[Tue Mar 22 08:02:38 2016]  [<ffffffff811a079e>] grab_cache_page_write_begin+0x2e/0x50
[Tue Mar 22 08:02:38 2016]  [<ffffffffc046dc43>] ceph_write_begin+0x43/0x100 [ceph]
[Tue Mar 22 08:02:38 2016]  [<ffffffff8119e92f>] generic_perform_write+0xaf/0x1c0
[Tue Mar 22 08:02:38 2016]  [<ffffffffc0467f32>] ceph_write_iter+0x9d2/0xc80 [ceph]
[Tue Mar 22 08:02:38 2016]  [<ffffffff81231025>] ? putname+0x55/0x60
[Tue Mar 22 08:02:38 2016]  [<ffffffff812220e6>] ? vfs_write+0xb6/0x190
[Tue Mar 22 08:02:38 2016]  [<ffffffff81221256>] __vfs_write+0xa6/0xd0
[Tue Mar 22 08:02:38 2016]  [<ffffffff812220f7>] vfs_write+0xc7/0x190
[Tue Mar 22 08:02:38 2016]  [<ffffffff81222402>] SyS_write+0x52/0xb0
[Tue Mar 22 08:02:38 2016]  [<ffffffff817c26f2>] entry_SYSCALL_64_fastpath+0x16/0x7a

The pagecache_get_page frame makes it look like this isn't actually the Ceph client getting stuck, but it comes just about exactly 2 minutes after caps go stale, so I presume it's actually blocked on caps?
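To make that timeline easier to see from the saved log, here is a minimal sketch (not part of the report) that scans a dmesg capture such as 322_kernel_hang.log for ceph/libceph messages and hung-task reports; the exact message wording is an assumption and may need tweaking per kernel version:

#!/usr/bin/env python3
# Minimal sketch: extract a rough "caps stale -> OSD resets/blacklist ->
# hung tasks" timeline from a saved dmesg log. The regexes are guesses at the
# message formats, not taken from this kernel's sources.
import re
import sys

PATTERNS = [
    ("ceph",      re.compile(r"ceph: .*")),
    ("libceph",   re.compile(r"libceph: .*")),
    ("hung task", re.compile(r"INFO: task .* blocked for more than \d+ seconds")),
]

def scan(path):
    with open(path, errors="replace") as f:
        for line in f:
            for label, pat in PATTERNS:
                if pat.search(line):
                    # Keep the [timestamp] prefix dmesg already provides.
                    print(f"{label:10s} {line.rstrip()}")
                    break

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "322_kernel_hang.log")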

...okay, I missed this yesterday:

[Tue Mar 22 08:12:34 2016] ceph: mds0 closed our session
[Tue Mar 22 08:12:34 2016] ceph: mds0 reconnect start
[Tue Mar 22 08:12:36 2016] ceph: mds0 reconnect denied
[Tue Mar 22 08:12:36 2016] ceph:  dropping unsafe request 25351907
[Tue Mar 22 08:12:36 2016] ceph:  dropping dirty+flushing Fw state for ffff8802db23db08 1099864856894

So, this seems a bit late to be starting a reconnect, but anything after this point (e.g., the hung file accesses) is probably by deliberate choice, right? So the question becomes how the kernel client got so stuck to begin with.
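Since only some paths hung while new file creation worked, it helps to map out which paths are affected without wedging the shell doing the checking. A minimal sketch (an assumed helper, not part of the report) that stat()s each path in a child process with a timeout:

#!/usr/bin/env python3
# Minimal sketch: probe paths on the kclient mount and report which ones hang.
# Each stat() runs in a child process so a stuck call only costs the timeout.
import multiprocessing as mp
import os
import sys

TIMEOUT = 10  # seconds before a path is declared hung

def _stat(path):
    os.stat(path)

def probe(path):
    p = mp.Process(target=_stat, args=(path,))
    p.start()
    p.join(TIMEOUT)
    if p.is_alive():
        p.terminate()  # may not help if the task is stuck in D state
        p.join(1)      # don't block forever waiting for it; it may linger
        return "HUNG"
    return "ok" if p.exitcode == 0 else f"error (exit {p.exitcode})"

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{probe(path):12s} {path}")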


Related issues: 1 (0 open, 1 closed)

Related to CephFS - Bug #16186: kclient: drops requests without poking system calls on reconnect (Duplicate, Jeff Layton, 06/08/2016)
