Bug #51279

kclient hangs on umount (testing branch)

Added by Patrick Donnelly about 1 year ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 100%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

2021-06-17T00:31:39.045 DEBUG:teuthology.run_tasks:Unwinding manager kclient
2021-06-17T00:31:39.054 INFO:tasks.kclient:Unmounting kernel clients...
2021-06-17T00:31:39.054 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2021-06-17T00:31:39.055 INFO:teuthology.orchestra.run:Running command with timeout 900
2021-06-17T00:31:39.055 DEBUG:teuthology.orchestra.run.smithi042:> sudo umount /home/ubuntu/cephtest/mnt.0
...
2021-06-17T00:46:39.071 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_f359b10daba6e0103d42ccfc021bc797f3cd7edc/virtualenv/lib/python3.6/site-packages/paramiko/channel.py", line 747, in recv_stderr
    out = self.in_stderr_buffer.read(nbytes, self.timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_f359b10daba6e0103d42ccfc021bc797f3cd7edc/virtualenv/lib/python3.6/site-packages/paramiko/buffered_pipe.py", line 164, in read
    raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout

From: /ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175751/teuthology.log

This was with the kernel testing branch; the test had otherwise completed successfully.


Subtasks

Bug #52431: cephfs client must clean up bogus caps after failed async create (status: Duplicate, assignee: Jeff Layton)


Related issues

Related to Linux kernel client - Bug #52997: rhel: hanging umount (Duplicate)

History

#1 Updated by Jeff Layton about 1 year ago

Definitely hung. The console log has this:

[ 1600.021975] INFO: task umount:44870 blocked for more than 122 seconds.
[ 1600.028834]       Not tainted 5.13.0-rc4-ceph-g78ad69ef5540 #1
[ 1600.034828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1600.042830] task:umount          state:D stack:    0 pid:44870 ppid: 44854 flags:0x00004000
[ 1600.051391] Call Trace:
[ 1600.053987]  __schedule+0x43a/0xaa0
[ 1600.057628]  schedule+0x51/0xc0
[ 1600.060910]  ceph_mdsc_sync+0x44f/0x4e0 [ceph]
[ 1600.065598]  ? do_wait_intr_irq+0xc0/0xc0
[ 1600.069905]  ceph_sync_fs+0x33/0x110 [ceph]
[ 1600.074262]  sync_filesystem+0x6b/0x90
[ 1600.078150]  generic_shutdown_super+0x22/0x110
[ 1600.082747]  kill_anon_super+0x14/0x30
[ 1600.086641]  ceph_kill_sb+0x39/0xa0 [ceph]
[ 1600.090883]  deactivate_locked_super+0x2e/0x70
[ 1600.095448]  cleanup_mnt+0xb8/0x140
[ 1600.099090]  task_work_run+0x6d/0xb0
[ 1600.102801]  exit_to_user_mode_prepare+0x210/0x220
[ 1600.107712]  syscall_exit_to_user_mode+0x27/0x70
[ 1600.112461]  do_syscall_64+0x42/0x70
[ 1600.116149]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1600.121313] RIP: 0033:0x7f1589743f5b
[ 1600.124992] RSP: 002b:00007ffc5b7f2398 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 1600.132691] RAX: 0000000000000000 RBX: 00005647d6fae5d0 RCX: 00007f1589743f5b
[ 1600.139933] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005647d6fae7b0
[ 1600.147165] RBP: 0000000000000000 R08: 00005647d6fae7e0 R09: 00007f15897c4d40
[ 1600.154412] R10: 0000000000000000 R11: 0000000000000246 R12: 00005647d6fae7b0
[ 1600.161651] R13: 00007f158a4ef184 R14: 0000000000000000 R15: 00000000ffffffff
[ 1600.168900] 
[ 1600.168900] Showing all locks held in the system:
[ 1600.175211] 1 lock held by khungtaskd/56:
[ 1600.179318]  #0: ffffffff82759d60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x183
[ 1600.188447] 1 lock held by systemd-journal/546:
[ 1600.193203] 1 lock held by umount/44870:
[ 1600.197224]  #0: ffff88811b1a10e0 (&type->s_umount_key#60){++++}-{3:3}, at: deactivate_super+0x43/0x50
[ 1600.206664] 

It's probably one of these calls in ceph_mdsc_sync:

        wait_unsafe_requests(mdsc, want_tid);
        wait_caps_flush(mdsc, want_flush);

...but I'm not sure which.

This kernel doesn't have the mutex elision patches that I recently proposed, so the problem is not related to those.

When this occurs, we may need to collect a kernel coredump. Is that possible?

#2 Updated by Patrick Donnelly about 1 year ago

Maybe. I think this job is dead, so nothing was collected. We could rerun it, wait for the hang, and then ssh into the smithi node to collect a coredump manually.

#3 Updated by Patrick Donnelly about 1 year ago

Dead: 2021-07-04T14:42:31.307 DEBUG:teuthology.exit:Finished running handlers
2 jobs: ['6251283', '6251296']
suites intersection: ['clusters/1a11s-mds-1c-client-3node', 'conf/{client', 'fs/workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{frag_enable', 'ranks/5', 'session_timeout', 'standby-replay', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'wsync/{no}}']
suites union: ['clusters/1a11s-mds-1c-client-3node', 'conf/{client', 'distro/{rhel_8}', 'distro/{ubuntu_latest}', 'fs/workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-comp', 'objectstore-ec/bluestore-comp-ec-root', 'omap_limit/10', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{frag_enable', 'ranks/5', 'replication/always', 'replication/default', 'scrub/no', 'scrub/yes', 'session_timeout', 'standby-replay', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/kernel_untar_build}', 'workunit/suites/blogbench}', 'wsync/{no}}']

From: https://pulpito.ceph.com/pdonnell-2021-07-04_02:32:34-fs-wip-pdonnell-testing-20210703.052904-distro-basic-smithi/

#4 Updated by Jeff Layton about 1 year ago

Same sort of hang:

[ 1722.635192] INFO: task umount:47989 blocked for more than 122 seconds.
[ 1722.642008]       Not tainted 5.13.0-ceph-g023b468d6934 #1
[ 1722.647653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1722.655702] task:umount          state:D stack:    0 pid:47989 ppid: 47973 flags:0x00004000
[ 1722.664281] Call Trace:
[ 1722.666955]  __schedule+0x43a/0xaa0
[ 1722.670622]  schedule+0x51/0xc0
[ 1722.673924]  ceph_mdsc_sync+0x44f/0x4e0 [ceph]
[ 1722.678541]  ? do_wait_intr_irq+0xc0/0xc0
[ 1722.682697]  ceph_sync_fs+0x33/0x110 [ceph]
[ 1722.687057]  sync_filesystem+0x6b/0x90
[ 1722.690941]  generic_shutdown_super+0x22/0x110
[ 1722.695517]  kill_anon_super+0x14/0x30
[ 1722.699403]  ceph_kill_sb+0x39/0xa0 [ceph]
[ 1722.703640]  deactivate_locked_super+0x2e/0x70
[ 1722.708206]  cleanup_mnt+0xb8/0x140
[ 1722.711820]  task_work_run+0x6d/0xb0
[ 1722.715523]  exit_to_user_mode_prepare+0x264/0x270
[ 1722.720436]  syscall_exit_to_user_mode+0x27/0x70
[ 1722.725168]  do_syscall_64+0x42/0x70
[ 1722.728852]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1722.734019] RIP: 0033:0x7fa69cc27f5b
[ 1722.737699] RSP: 002b:00007ffd0ea19738 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 1722.745400] RAX: 0000000000000000 RBX: 0000561510e7b5d0 RCX: 00007fa69cc27f5b
[ 1722.752649] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000561510e7b7b0
[ 1722.759884] RBP: 0000000000000000 R08: 0000561510e7b7e0 R09: 00007fa69cca8d40
[ 1722.767133] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561510e7b7b0
[ 1722.774374] R13: 00007fa69d9d3184 R14: 0000000000000000 R15: 00000000ffffffff
[ 1722.781618] 
[ 1722.781618] Showing all locks held in the system:
[ 1722.787930] 1 lock held by khungtaskd/56:
[ 1722.792035]  #0: ffffffff82759d60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x183
[ 1722.801127] 2 locks held by systemd-journal/522:
[ 1722.805853] 2 locks held by kworker/u16:2/20495:
[ 1722.810562]  #0: ffff88810f1fa538 ((wq_completion)ixgbe){+.+.}-{0:0}, at: process_one_work+0x1bf/0x580
[ 1722.819993]  #1: ffffc90003173e70 ((work_completion)(&adapter->service_task)){+.+.}-{0:0}, at: process_one_work+0x1bf/0x580
[ 1722.831324] 1 lock held by daemon-helper/37551:
[ 1722.835962]  #0: ffff88885fd6cd98 (&rq->lock){-.-.}-{2:2}, at: __schedule+0x11e/0xaa0
[ 1722.843946] 1 lock held by umount/47989:
[ 1722.847968]  #0: ffff88810690b0e0 (&type->s_umount_key#60){++++}-{3:3}, at: deactivate_super+0x43/0x50

Unfortunately, the kernel-debug packages for this build have been reaped, so we can't quite tell where it was waiting.

#5 Updated by Jeff Layton about 1 year ago

Actually, the kernel was not reaped. We just don't build separate debuginfo packages. The faddr2line script says:

wait_caps_flush at /home/jenkins-build/build/workspace/kernel/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/fs/ceph/mds_client.c:2048

Looking at that file/line for this commit, that lands smack in the middle of the wait_event call in here:

static void wait_caps_flush(struct ceph_mds_client *mdsc,
                            u64 want_flush_tid)
{
        dout("check_caps_flush want %llu\n", want_flush_tid);

        wait_event(mdsc->cap_flushing_wq,
                   check_caps_flush(mdsc, want_flush_tid));

        dout("check_caps_flush ok, flushed thru %llu\n", want_flush_tid);
}

So it sounds likely that this is hung waiting for caps to flush.

#6 Updated by Jeff Layton about 1 year ago

One thing we could do is merge a patch that makes this dump out some info about what it's waiting on when this is blocked for "too long" (>60s or so).
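
A minimal sketch of that idea (not the patch that was eventually merged, and dump_cap_flushes() here is a hypothetical helper): turn the plain wait_event() shown above into a loop that wakes up every 60 seconds and reports what it is still waiting on.

static void wait_caps_flush(struct ceph_mds_client *mdsc,
                            u64 want_flush_tid)
{
        long ret;

        do {
                ret = wait_event_timeout(mdsc->cap_flushing_wq,
                                         check_caps_flush(mdsc, want_flush_tid),
                                         60 * HZ);
                if (!ret) {
                        /* timed out with the condition still false */
                        pr_info("ceph: still waiting for cap flushes through %llu\n",
                                want_flush_tid);
                        dump_cap_flushes(mdsc, want_flush_tid); /* hypothetical */
                }
        } while (!ret);
}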

#7 Updated by Jeff Layton about 1 year ago

I have a theory as to what this is:

Once we removed the s_mutex from the cap-sending codepath, I think it became possible for cap flushes to be reordered such that a larger cap flush tid ends up being sent before a smaller one. Eventually the MDS sends a FLUSH_ACK only for the smaller value (since it was the last one received), assuming it covers both of them, but the client won't see it that way.

I first need to look over the MDS code to see whether this is possible. If it is, then we'll probably need to do something to ensure that cap flushes don't get reordered this way, since that's apparently what the protocol mandates. We should be able to do that without requiring heavier locking, however.

#8 Updated by Jeff Layton about 1 year ago

To be clear, we did add a patch to the testing kernel to give us more info on these hangs. What it showed was that there were a few leftover cap flushes stuck on the list even though the current tid counter was well beyond that point. I've proposed a new patch to add a bit more info when this occurs. That's now merged into the testing branch, so hopefully we'll have some info to go on when it does occur again.

#9 Updated by Jeff Layton about 1 year ago

The hang has been reproduced and we see this in dmesg:

[34268.021758] ceph: dump_cap_flushes: still waiting for cap flushes through 47750:
[34268.021771] ceph: 0x20000004f4e Fw 42889 0 0
[34268.021778] ceph: 0x20000004f48 Fw 46249 0 0
[34268.021783] ceph: 0x20000004f49 Fw 46402 0 1

The second-to-last column is the tid of the last FLUSH_ACK. It's initialized to 0 when the inode is initialized, and the first thing ceph_handle_cap_flush_ack does is update it to the tid of the call if that tid is newer. These inodes have never received a FLUSH_ACK.

Either the client silently dropped this on the floor, or the MDS never sent one at all.
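
For reference, the bookkeeping described above amounts to something like this (i_last_flush_ack_tid is a hypothetical field name, standing in for whatever the debug patch actually records):

static void record_flush_ack_tid(struct ceph_inode_info *ci, u64 ack_tid)
{
        /* starts at 0 when the inode is initialized and only moves forward */
        if (ack_tid > ci->i_last_flush_ack_tid)         /* hypothetical field */
                ci->i_last_flush_ack_tid = ack_tid;
}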

#10 Updated by Jeff Layton about 1 year ago

There are a couple of cases in the kernel where we might get a cap message and drop it on the floor without logging anything:

- it can't find the inode for the cap message
- it can't find the cap on that inode

I'm not sure how often this might happen, but perhaps we can alter the debugging patch to log those events as well. I'll spin something up.
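
A rough sketch of that extra logging (the handler skeleton and lookup helpers here are hypothetical, not the real code): rather than silently returning when a lookup fails, leave a ratelimited note in dmesg so a dropped cap message at least leaves a trace.

static void handle_cap_message(struct super_block *sb, u64 ino, u64 flush_tid)
{
        struct inode *inode = lookup_cap_msg_inode(sb, ino);    /* hypothetical */

        if (!inode) {
                pr_warn_ratelimited("ceph: dropping cap msg, no inode 0x%llx (tid %llu)\n",
                                    ino, flush_tid);
                return;
        }
        if (!lookup_cap_for_inode(inode)) {                     /* hypothetical */
                pr_warn_ratelimited("ceph: dropping cap msg, no cap on inode 0x%llx (tid %llu)\n",
                                    ino, flush_tid);
                iput(inode);
                return;
        }

        /* ... normal cap message handling ... */
        iput(inode);
}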

#11 Updated by Jeff Layton about 1 year ago

Note that this means that my initial theory is probably wrong. This is not an ordering issue with us dropping a FLUSH_ACK that mistakenly had a later tid. The client in this case seems to have never received a FLUSH_ACK at all for these inodes.

#12 Updated by Jeff Layton 12 months ago

We made some progress on this late last week.

There was a hung umount and I was able to log into the machine while it was in this state. It showed a bunch of async creates that had gotten ENOSPC errors back from the MDS. The client thought that it had dirty caps from writes and got stuck flushing them back to the MDS. Those FLUSH requests never got a response of any sort. Either they were dropped on the floor by the MDS, or the MDS was too badly hosed from the ENOSPC problem to reply.

So, we need to remove all of the caps when this occurs, but I think we also need to go further:

We may have multiple open file descriptors for an inode that doesn't exist at that point. We need to shut those down so that I/O and other operations involving them get back an error. mmap reads and writes should get a SIGBUS. Any cached data for it should be invalidated (including dirty data).

I think we'll need to add a per-inode flag for inodes that are in this state, and we'll need to add checks in the appropriate places for this flag. We'll also need to consider how to handle races between that and dirtying new pagecache pages. Maybe we can use the DIO locking for that.
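
As a very rough sketch of the per-inode flag idea (the flag name, bit value, and helper are all hypothetical, and locking around the flags word is glossed over here):

#define CEPH_I_CREATE_FAILED    (1 << 12)       /* hypothetical bit */

static int ceph_inode_is_usable(struct inode *inode)
{
        struct ceph_inode_info *ci = ceph_inode(inode);

        /* read/write/mmap/page-dirtying paths would bail out early here */
        if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_CREATE_FAILED)
                return -EIO;    /* a fault handler would turn this into SIGBUS */
        return 0;
}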

Probably, we'll also need to consider what to do with OSD objects that may have been created for this now-bogus inode. The MDS won't know anything about them, so asking it to clean up is probably not appropriate. Does the MDS have a way to purge the data pool of objects for inodes that don't exist?

There is also the CEPH_MOUNT_SHUTDOWN flag for forced umounts. I don't think we want to use that here, as we might have dirty data for inodes that do exist, but we may want to unify the handling of both cases a bit, since they are somewhat similar.

#13 Updated by Jeff Layton 12 months ago

First, I changed the debug patch to only dump out cap flushes for the first 5 wait timeouts. That should cut down the log spam with that patch.

I cooked up a small fault injection patch that makes the MDS throw ENOSPC errors on any openc request. With that, I was able to reproduce the problem (and a nasty softlockup to go with it).

I refactored the code that handles forced umounts to also handle this case. When we get a failed create, we purge the caps and pagecache for that inode. With that, I'm able to umount after a failed create, even if I have cached data/dirty caps.
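
Roughly, the purge step amounts to something like this (a sketch only; ceph_drop_dirty_caps() is a hypothetical stand-in for the real cap-removal code):

static void ceph_purge_failed_create(struct inode *inode)
{
        /* make later writeback/fsync on this mapping report -EIO */
        mapping_set_error(inode->i_mapping, -EIO);

        /* throw away cached pages, including dirty ones */
        truncate_inode_pages(&inode->i_data, 0);

        /* drop dirty/flushing cap state so ceph_mdsc_sync() stops waiting */
        ceph_drop_dirty_caps(inode);                    /* hypothetical */
}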

I need to do a bit more cleanup and testing of the set, but I should hopefully have something to post next week.

#14 Updated by Jeff Layton 11 months ago

  • Project changed from CephFS to Linux kernel client
  • Status changed from New to Fix Under Review

#15 Updated by Patrick Donnelly 10 months ago

  • Related to Bug #52997: rhel: hanging umount added

#16 Updated by Jeff Layton about 2 months ago

  • Status changed from Fix Under Review to Resolved
