Bug #51279

open

kclient hangs on umount (testing branch)

Added by Patrick Donnelly almost 3 years ago. Updated 11 days ago.

Status: In Progress
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 100%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

2021-06-17T00:31:39.045 DEBUG:teuthology.run_tasks:Unwinding manager kclient
2021-06-17T00:31:39.054 INFO:tasks.kclient:Unmounting kernel clients...
2021-06-17T00:31:39.054 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2021-06-17T00:31:39.055 INFO:teuthology.orchestra.run:Running command with timeout 900
2021-06-17T00:31:39.055 DEBUG:teuthology.orchestra.run.smithi042:> sudo umount /home/ubuntu/cephtest/mnt.0
...
2021-06-17T00:46:39.071 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_f359b10daba6e0103d42ccfc021bc797f3cd7edc/virtualenv/lib/python3.6/site-packages/paramiko/channel.py", line 747, in recv_stderr
    out = self.in_stderr_buffer.read(nbytes, self.timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_f359b10daba6e0103d42ccfc021bc797f3cd7edc/virtualenv/lib/python3.6/site-packages/paramiko/buffered_pipe.py", line 164, in read
    raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout

From: /ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175751/teuthology.log

With the kernel testing branch. The test had otherwise completed successfully.


Subtasks 1 (0 open, 1 closed)

Bug #52431: cephfs client must clean up bogus caps after failed async create (Duplicate) - Jeff Layton

Related issues 1 (0 open, 1 closed)

Related to Linux kernel client - Bug #52997: rhel: hanging umount (Duplicate) - Jeff Layton

Actions #1

Updated by Jeff Layton almost 3 years ago

Definitely hung. The console log has this:

[ 1600.021975] INFO: task umount:44870 blocked for more than 122 seconds.
[ 1600.028834]       Not tainted 5.13.0-rc4-ceph-g78ad69ef5540 #1
[ 1600.034828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1600.042830] task:umount          state:D stack:    0 pid:44870 ppid: 44854 flags:0x00004000
[ 1600.051391] Call Trace:
[ 1600.053987]  __schedule+0x43a/0xaa0
[ 1600.057628]  schedule+0x51/0xc0
[ 1600.060910]  ceph_mdsc_sync+0x44f/0x4e0 [ceph]
[ 1600.065598]  ? do_wait_intr_irq+0xc0/0xc0
[ 1600.069905]  ceph_sync_fs+0x33/0x110 [ceph]
[ 1600.074262]  sync_filesystem+0x6b/0x90
[ 1600.078150]  generic_shutdown_super+0x22/0x110
[ 1600.082747]  kill_anon_super+0x14/0x30
[ 1600.086641]  ceph_kill_sb+0x39/0xa0 [ceph]
[ 1600.090883]  deactivate_locked_super+0x2e/0x70
[ 1600.095448]  cleanup_mnt+0xb8/0x140
[ 1600.099090]  task_work_run+0x6d/0xb0
[ 1600.102801]  exit_to_user_mode_prepare+0x210/0x220
[ 1600.107712]  syscall_exit_to_user_mode+0x27/0x70
[ 1600.112461]  do_syscall_64+0x42/0x70
[ 1600.116149]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1600.121313] RIP: 0033:0x7f1589743f5b
[ 1600.124992] RSP: 002b:00007ffc5b7f2398 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 1600.132691] RAX: 0000000000000000 RBX: 00005647d6fae5d0 RCX: 00007f1589743f5b
[ 1600.139933] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005647d6fae7b0
[ 1600.147165] RBP: 0000000000000000 R08: 00005647d6fae7e0 R09: 00007f15897c4d40
[ 1600.154412] R10: 0000000000000000 R11: 0000000000000246 R12: 00005647d6fae7b0
[ 1600.161651] R13: 00007f158a4ef184 R14: 0000000000000000 R15: 00000000ffffffff
[ 1600.168900] 
[ 1600.168900] Showing all locks held in the system:
[ 1600.175211] 1 lock held by khungtaskd/56:
[ 1600.179318]  #0: ffffffff82759d60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x183
[ 1600.188447] 1 lock held by systemd-journal/546:
[ 1600.193203] 1 lock held by umount/44870:
[ 1600.197224]  #0: ffff88811b1a10e0 (&type->s_umount_key#60){++++}-{3:3}, at: deactivate_super+0x43/0x50
[ 1600.206664] 

It's probably one of these calls in ceph_mdsc_sync:

        wait_unsafe_requests(mdsc, want_tid);
        wait_caps_flush(mdsc, want_flush);

...but I'm not sure which.

This kernel doesn't have the mutex elision patches that I recently proposed, so the problem is not related to those.

When this occurs, we may need to collect a kernel coredump. Is that possible?

Actions #2

Updated by Patrick Donnelly almost 3 years ago

Maybe. I think this job is dead, so nothing was collected. We could rerun, wait for the hang, and then ssh into the smithi node to collect a coredump manually.

Actions #3

Updated by Patrick Donnelly almost 3 years ago

Dead: 2021-07-04T14:42:31.307 DEBUG:teuthology.exit:Finished running handlers
2 jobs: ['6251283', '6251296']
suites intersection: ['clusters/1a11s-mds-1c-client-3node', 'conf/{client', 'fs/workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{frag_enable', 'ranks/5', 'session_timeout', 'standby-replay', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'wsync/{no}}']
suites union: ['clusters/1a11s-mds-1c-client-3node', 'conf/{client', 'distro/{rhel_8}', 'distro/{ubuntu_latest}', 'fs/workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-comp', 'objectstore-ec/bluestore-comp-ec-root', 'omap_limit/10', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{frag_enable', 'ranks/5', 'replication/always', 'replication/default', 'scrub/no', 'scrub/yes', 'session_timeout', 'standby-replay', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/kernel_untar_build}', 'workunit/suites/blogbench}', 'wsync/{no}}']

From: https://pulpito.ceph.com/pdonnell-2021-07-04_02:32:34-fs-wip-pdonnell-testing-20210703.052904-distro-basic-smithi/

Actions #4

Updated by Jeff Layton almost 3 years ago

Same sort of hang:

[ 1722.635192] INFO: task umount:47989 blocked for more than 122 seconds.
[ 1722.642008]       Not tainted 5.13.0-ceph-g023b468d6934 #1
[ 1722.647653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1722.655702] task:umount          state:D stack:    0 pid:47989 ppid: 47973 flags:0x00004000
[ 1722.664281] Call Trace:
[ 1722.666955]  __schedule+0x43a/0xaa0
[ 1722.670622]  schedule+0x51/0xc0
[ 1722.673924]  ceph_mdsc_sync+0x44f/0x4e0 [ceph]
[ 1722.678541]  ? do_wait_intr_irq+0xc0/0xc0
[ 1722.682697]  ceph_sync_fs+0x33/0x110 [ceph]
[ 1722.687057]  sync_filesystem+0x6b/0x90
[ 1722.690941]  generic_shutdown_super+0x22/0x110
[ 1722.695517]  kill_anon_super+0x14/0x30
[ 1722.699403]  ceph_kill_sb+0x39/0xa0 [ceph]
[ 1722.703640]  deactivate_locked_super+0x2e/0x70
[ 1722.708206]  cleanup_mnt+0xb8/0x140
[ 1722.711820]  task_work_run+0x6d/0xb0
[ 1722.715523]  exit_to_user_mode_prepare+0x264/0x270
[ 1722.720436]  syscall_exit_to_user_mode+0x27/0x70
[ 1722.725168]  do_syscall_64+0x42/0x70
[ 1722.728852]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1722.734019] RIP: 0033:0x7fa69cc27f5b
[ 1722.737699] RSP: 002b:00007ffd0ea19738 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 1722.745400] RAX: 0000000000000000 RBX: 0000561510e7b5d0 RCX: 00007fa69cc27f5b
[ 1722.752649] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000561510e7b7b0
[ 1722.759884] RBP: 0000000000000000 R08: 0000561510e7b7e0 R09: 00007fa69cca8d40
[ 1722.767133] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561510e7b7b0
[ 1722.774374] R13: 00007fa69d9d3184 R14: 0000000000000000 R15: 00000000ffffffff
[ 1722.781618] 
[ 1722.781618] Showing all locks held in the system:
[ 1722.787930] 1 lock held by khungtaskd/56:
[ 1722.792035]  #0: ffffffff82759d60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x183
[ 1722.801127] 2 locks held by systemd-journal/522:
[ 1722.805853] 2 locks held by kworker/u16:2/20495:
[ 1722.810562]  #0: ffff88810f1fa538 ((wq_completion)ixgbe){+.+.}-{0:0}, at: process_one_work+0x1bf/0x580
[ 1722.819993]  #1: ffffc90003173e70 ((work_completion)(&adapter->service_task)){+.+.}-{0:0}, at: process_one_work+0x1bf/0x580
[ 1722.831324] 1 lock held by daemon-helper/37551:
[ 1722.835962]  #0: ffff88885fd6cd98 (&rq->lock){-.-.}-{2:2}, at: __schedule+0x11e/0xaa0
[ 1722.843946] 1 lock held by umount/47989:
[ 1722.847968]  #0: ffff88810690b0e0 (&type->s_umount_key#60){++++}-{3:3}, at: deactivate_super+0x43/0x50

Unfortunately, the kernel-debug packages for this build have been reaped, so we can't quite tell where it was waiting.

Actions #5

Updated by Jeff Layton almost 3 years ago

Actually, the kernel was not reaped; we just don't build separate debuginfo packages. The faddr2line script says:

wait_caps_flush at /home/jenkins-build/build/workspace/kernel/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/fs/ceph/mds_client.c:2048

Looking at that file/line for this commit, it lands smack in the middle of the wait_event call here:

static void wait_caps_flush(struct ceph_mds_client *mdsc,
                            u64 want_flush_tid)
{
        dout("check_caps_flush want %llu\n", want_flush_tid);

        wait_event(mdsc->cap_flushing_wq,
                   check_caps_flush(mdsc, want_flush_tid));

        dout("check_caps_flush ok, flushed thru %llu\n", want_flush_tid);
}

So it sounds likely that this is hung waiting for caps to flush.
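
For reference, the condition that wait_event() keeps re-evaluating is check_caps_flush(). A simplified sketch of what that check does in kernels of this vintage (reconstructed from memory, not copied verbatim from the tree):

static bool check_caps_flush(struct ceph_mds_client *mdsc,
                             u64 want_flush_tid)
{
        bool ret = true;

        spin_lock(&mdsc->cap_dirty_lock);
        if (!list_empty(&mdsc->cap_flush_list)) {
                struct ceph_cap_flush *cf =
                        list_first_entry(&mdsc->cap_flush_list,
                                         struct ceph_cap_flush, g_list);

                /* oldest outstanding flush still at or below the wanted tid? */
                if (cf->tid <= want_flush_tid)
                        ret = false;
        }
        spin_unlock(&mdsc->cap_dirty_lock);
        return ret;
}

In other words, a single cap flush that never gets acked pins the head of mdsc->cap_flush_list and keeps the umount waiting indefinitely.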

Actions #6

Updated by Jeff Layton almost 3 years ago

One thing we could do is merge a patch that makes this dump out some info about what it's waiting on when it has been blocked for "too long" (>60s or so).
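
A rough sketch of the idea (dump_cap_flushes() here is a hypothetical helper name; the real patch may look quite different):

static void wait_caps_flush(struct ceph_mds_client *mdsc,
                            u64 want_flush_tid)
{
        dout("check_caps_flush want %llu\n", want_flush_tid);

        /* re-check every 60s; if still blocked, dump what we're waiting on */
        while (wait_event_timeout(mdsc->cap_flushing_wq,
                                  check_caps_flush(mdsc, want_flush_tid),
                                  60 * HZ) == 0)
                dump_cap_flushes(mdsc, want_flush_tid); /* hypothetical helper */

        dout("check_caps_flush ok, flushed thru %llu\n", want_flush_tid);
}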

Actions #7

Updated by Jeff Layton almost 3 years ago

I have a theory as to what this is:

Once we removed the s_mutex from the cap sending codepath, I think it became possible for cap flushes to be reordered such that a larger cap flush tid ends up being sent before a smaller one. Eventually the MDS sends a FLUSH_ACK only for the smaller value (since it was the last one received), thinking that it covers both of them, but the client won't see it that way.

I first need to look over the MDS code to see if this is possible. If it is, then we'll probably need to do something to ensure that cap flushes don't get reordered this way, since that's apparently what the protocol mandates. We should be able to do that without requiring more heavyweight locking, however.
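
The reason ordering matters: when a FLUSH_ACK for a given tid arrives, the client treats it as completing every outstanding flush on that inode with a tid at or below it, along the lines of this simplified walk (not verbatim from the tree):

        /* simplified: an ack for flush_tid completes all earlier flushes too */
        list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
                if (cf->tid <= flush_tid) {
                        /* this flush (and everything before it) is done */
                        list_del_init(&cf->i_list);
                        wake = true;
                } else {
                        /* a flush with a larger tid that was sent earlier
                         * stays on the list, and no further ack is coming */
                        break;
                }
        }

If the larger tid was sent first but the MDS only acks the smaller one, the entry with the larger tid never leaves the list and the umount waits on it forever.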

Actions #8

Updated by Jeff Layton over 2 years ago

To be clear, we did add a patch to the testing kernel to give us more info on these hangs. What it showed was that there were a few leftover cap flushes stuck on the list even though the current tid counter was well beyond that point. I've proposed a new patch to add a bit more info when this occurs. That's now merged into the testing branch, so hopefully we'll have some info to go on when it does occur again.

Actions #9

Updated by Jeff Layton over 2 years ago

The hang has been reproduced and we see this in dmesg:

[34268.021758] ceph: dump_cap_flushes: still waiting for cap flushes through 47750:
[34268.021771] ceph: 0x20000004f4e Fw 42889 0 0
[34268.021778] ceph: 0x20000004f48 Fw 46249 0 0
[34268.021783] ceph: 0x20000004f49 Fw 46402 0 1

The second-to-last column is the tid of the last FLUSH_ACK. It's initialized to 0 when the inode is set up, and the first thing ceph_handle_cap_flush_ack does is update it to the tid of the call if that's newer. These inodes have never received a FLUSH_ACK.

Either the client silently dropped this on the floor, or the MDS never sent one at all.

Actions #10

Updated by Jeff Layton over 2 years ago

There are a couple of cases in the kernel where we might get a cap message and drop it on the floor without logging anything:

- it can't find the inode for the cap message
- it can't find the cap

I'm not sure how often this might happen, but perhaps we can alter the debugging patch to log those events as well. I'll spin something up.
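
Roughly, the extra logging would go where the cap message handler currently bails out silently. A sketch of the two spots (ceph_find_inode() and __get_cap_for_mds() are the real lookup helpers, but the message text and exact placement here are illustrative only):

        /* in the cap message handler, after decoding vino/op/tid ... */
        inode = ceph_find_inode(mdsc->fsc->sb, vino);
        if (!inode) {
                pr_info("ceph: dropping cap msg op %d tid %llu: no inode %llx.%llx\n",
                        op, flush_tid, vino.ino, vino.snap);
                goto done;      /* currently dropped without a peep */
        }
        ci = ceph_inode(inode);

        spin_lock(&ci->i_ceph_lock);
        cap = __get_cap_for_mds(ci, session->s_mds);
        if (!cap) {
                pr_info("ceph: dropping cap msg op %d tid %llu: no cap for mds%d on %llx.%llx\n",
                        op, flush_tid, session->s_mds, vino.ino, vino.snap);
                spin_unlock(&ci->i_ceph_lock);
                goto done;
        }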

Actions #11

Updated by Jeff Layton over 2 years ago

Note that this means my initial theory is probably wrong. This is not an ordering issue where we drop a FLUSH_ACK that mistakenly had a later tid; the client in this case seems to have never received a FLUSH_ACK at all for these inodes.

Actions #12

Updated by Jeff Layton over 2 years ago

We made some progress on this late last week.

There was a hung umount and I was able to log into the machine while it was in this state. It showed a bunch of async creates that had gotten ENOSPC errors back from the MDS. The client thought that it had dirty caps from writes and got stuck flushing them back to the MDS. Those FLUSH requests never got a response of any sort. Either they were dropped on the floor by the MDS, or the MDS was too badly hosed from the ENOSPC problem to reply.

So, we need to remove all of the caps when this occurs, but I think we also need to go further:

We may have multiple open file descriptors for an inode that doesn't exist at that point. We need to shut those down so that I/O and other operations involving them get back an error. mmap reads and writes should get a SIGBUS. Any cached data for it should be invalidated (including dirty data).

I think we'll need to add a per-inode flag for inodes that are in this state, and we'll need to add checks in the appropriate places for this flag. We'll also need to consider how to handle races between that and dirtying new pagecache pages. Maybe we can use the DIO locking for that.
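
Something like a new i_ceph_flags bit plus a helper that the I/O entry points can test, as in the sketch below (the flag name and bit value are placeholders, not from a real patch):

/* placeholder flag: the async create for this inode failed on the MDS */
#define CEPH_I_CREATE_FAILED    (1 << 15)       /* bit value arbitrary here */

static inline bool ceph_inode_is_dead(struct inode *inode)
{
        struct ceph_inode_info *ci = ceph_inode(inode);
        bool dead;

        spin_lock(&ci->i_ceph_lock);
        dead = !!(ci->i_ceph_flags & CEPH_I_CREATE_FAILED);
        spin_unlock(&ci->i_ceph_lock);
        return dead;
}

Write, read, and fault paths could then check this early and return -EIO (or VM_FAULT_SIGBUS from the mmap fault handler) instead of touching the pagecache.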

Probably, we'll also need to consider what to do with OSD objects that may have been created for this now-bogus inode. The MDS won't know anything about them, so asking it to clean up is probably not appropriate. Does the MDS have a way to purge the data pool of objects for inodes that don't exist?

There is also the CEPH_MOUNT_SHUTDOWN flag for forced umounts. I don't think we want to use that here, as we might have dirty data for inodes that do exist, but we may want to unify the handling of both cases a bit, since they are somewhat similar.

Actions #13

Updated by Jeff Layton over 2 years ago

First, I changed the debug patch to only dump out cap flushes for the first 5 wait timeouts. That should cut down the log spam with that patch.

I cooked up a small fault injection patch that makes the MDS throw ENOSPC errors on any openc request. With that, I was able to reproduce the problem (and a nasty softlockup to go with it).

I refactored the code that handles forced umounts to also handle this case. When we get a failed create, we purge the caps and pagecache for that inode. With that, I'm able to umount after a failed create, even if I have cached data/dirty caps.
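
In rough terms the failed-create cleanup ends up doing something like the sketch below. The function name and the final cap-cleanup call are invented placeholders; the pagecache calls are ordinary VFS APIs:

/* hypothetical sketch of the failed-async-create cleanup */
static void ceph_inode_create_failed(struct inode *inode, int err)
{
        struct ceph_inode_info *ci = ceph_inode(inode);

        pr_warn("ceph: async create of %llx.%llx failed: %d\n",
                ceph_vinop(inode), err);

        /* mark the inode dead so new I/O is refused (see the flag sketch above) */
        spin_lock(&ci->i_ceph_lock);
        ci->i_ceph_flags |= CEPH_I_CREATE_FAILED;
        spin_unlock(&ci->i_ceph_lock);

        /* toss cached and dirty pages; they can never be written back */
        mapping_set_error(inode->i_mapping, err);
        truncate_inode_pages(inode->i_mapping, 0);

        /* drop the caps and any pending cap flushes for this inode so that
         * wait_caps_flush() is no longer stuck behind them */
        ceph_purge_inode_caps(ci);      /* placeholder for the real cleanup */
}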

I need to do a bit more cleanup and testing of the set, but I should hopefully have something to post next week.

Actions #14

Updated by Jeff Layton over 2 years ago

  • Project changed from CephFS to Linux kernel client
  • Status changed from New to Fix Under Review
Actions #15

Updated by Patrick Donnelly over 2 years ago

  • Related to Bug #52997: rhel: hanging umount added
Actions #16

Updated by Jeff Layton almost 2 years ago

  • Status changed from Fix Under Review to Resolved
Actions #18

Updated by Xiubo Li 11 months ago

  • Assignee changed from Jeff Layton to Xiubo Li

It seems we have reproduced it again.

Actions #20

Updated by Kotresh Hiremath Ravishankar 11 months ago

Xiubo Li wrote:

It seems we have reproduced it again.

Is this a new issue, and do you think it's a blocker for the reef release?

Actions #21

Updated by Xiubo Li 11 months ago

Kotresh Hiremath Ravishankar wrote:

Xiubo Li wrote:

It seems we have reproduced it again.

Is this a new issue, and do you think it's a blocker for the reef release?

It seems to be a new one, but I don't think it will block the reef release.

Actions #22

Updated by Kotresh Hiremath Ravishankar 11 months ago

  • Status changed from Need More Info to Resolved

It's a new issue tracked via https://tracker.ceph.com/issues/61591

Actions #23

Updated by Xiubo Li 18 days ago

  • Status changed from Resolved to In Progress

Reopening since we are seeing this again.

Actions #25

Updated by Xiubo Li 18 days ago

Xiubo Li wrote:

https://pulpito.ceph.com/mchangir-2024-03-30_16:00:44-fs:functional-wip-mchangir-testing1-main-20240328-072412-distro-default-smithi/7631112

From console_logs/smithi086.log:

[ 3567.940448] libceph: client34365 fsid 9985ea3e-ddc9-4da2-9c92-d318f0a6434d^M
[ 3652.310742] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: still waiting for cap flushes through 479:^M
[ 3652.320763] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 10000001902:fffffffffffffffe - 441 430 0 is_capsnap^M
[ 3652.331334] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 10000001901:fffffffffffffffe - 442 417 0 is_capsnap^M
[ 3652.341776] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 10000001900:fffffffffffffffe - 443 424 0 is_capsnap^M
[ 3652.352208] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ff:fffffffffffffffe - 444 414 0 is_capsnap^M
[ 3652.362639] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018fe:fffffffffffffffe - 445 425 0 is_capsnap^M
[ 3652.373079] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018fd:fffffffffffffffe - 446 434 0 is_capsnap^M
[ 3652.383521] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018fc:fffffffffffffffe - 447 415 0 is_capsnap^M
[ 3652.393961] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018fb:fffffffffffffffe - 448 438 0 is_capsnap^M
[ 3652.404395] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018fa:fffffffffffffffe - 449 413 0 is_capsnap^M
[ 3652.414822] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f9:fffffffffffffffe - 450 435 0 is_capsnap^M
[ 3652.425253] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f8:fffffffffffffffe - 451 420 0 is_capsnap^M
[ 3652.435681] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f7:fffffffffffffffe - 452 426 0 is_capsnap^M
[ 3652.446120] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f6:fffffffffffffffe - 453 422 0 is_capsnap^M
[ 3652.456559] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f5:fffffffffffffffe - 454 419 0 is_capsnap^M
[ 3652.466985] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f4:fffffffffffffffe - 455 439 0 is_capsnap^M
[ 3652.477420] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f3:fffffffffffffffe - 456 428 0 is_capsnap^M
[ 3652.487849] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f2:fffffffffffffffe - 457 436 0 is_capsnap^M
[ 3652.498283] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f1:fffffffffffffffe - 458 423 0 is_capsnap^M
[ 3652.508709] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018f0:fffffffffffffffe - 459 418 0 is_capsnap^M
[ 3652.519145] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ef:fffffffffffffffe - 460 433 0 is_capsnap^M
[ 3652.529568] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ee:fffffffffffffffe - 461 427 0 is_capsnap^M
[ 3652.539996] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ed:fffffffffffffffe - 462 416 0 is_capsnap^M
[ 3652.550422] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ec:fffffffffffffffe - 463 421 0 is_capsnap^M
[ 3652.560850] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018eb:fffffffffffffffe - 464 431 0 is_capsnap^M
[ 3652.571275] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ea:fffffffffffffffe - 465 437 0 is_capsnap^M
[ 3652.581726] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018e9:fffffffffffffffe - 466 432 0 is_capsnap^M
[ 3652.592161] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018e8:fffffffffffffffe - 467 429 0 is_capsnap^M
[ 3652.602590] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018e7:fffffffffffffffe - 468 440 0 is_capsnap^M
[ 3652.613022] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ce:fffffffffffffffe - 469 404 0 is_capsnap^M
[ 3652.623451] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018cb:fffffffffffffffe - 470 403 0 is_capsnap^M
[ 3652.633900] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018c7:fffffffffffffffe - 471 405 0 is_capsnap^M
[ 3652.644340] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018c6:fffffffffffffffe - 472 408 0 is_capsnap^M
[ 3652.654780] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018c3:fffffffffffffffe - 473 402 0 is_capsnap^M
[ 3652.665214] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018c1:fffffffffffffffe - 474 407 0 is_capsnap^M
[ 3652.675640] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018bf:fffffffffffffffe - 475 410 0 is_capsnap^M
[ 3652.686068] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018bc:fffffffffffffffe - 476 409 0 is_capsnap^M
[ 3652.696493] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018bb:fffffffffffffffe - 477 406 0 is_capsnap^M
[ 3652.706923] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ba:fffffffffffffffe - 478 411 0 is_capsnap^M
[ 3652.717356] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 10000000070:fffffffffffffffe - 479 412 1 is_capsnap^M
...
[ 3836.968329] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018c6:fffffffffffffffe - 472 408 0 is_capsnap^M
[ 3836.978782] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018c3:fffffffffffffffe - 473 402 0 is_capsnap^M
[ 3836.989212] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018c1:fffffffffffffffe - 474 407 0 is_capsnap^M
[ 3836.999633] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018bf:fffffffffffffffe - 475 410 0 is_capsnap^M
[ 3837.010070] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018bc:fffffffffffffffe - 476 409 0 is_capsnap^M
[ 3837.020495] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018bb:fffffffffffffffe - 477 406 0 is_capsnap^M
[ 3837.030956] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 100000018ba:fffffffffffffffe - 478 411 0 is_capsnap^M
[ 3837.041385] ceph: [9985ea3e-ddc9-4da2-9c92-d318f0a6434d 34365]: 10000000070:fffffffffffffffe - 479 412 1 is_capsnap^M
[ 3893.623362] libceph: mds1 (1)172.21.15.196:6835 socket closed (con state OPEN)^M
[ 3893.633998] libceph: mds0 (1)172.21.15.86:6836 socket closed (con state OPEN)^M

When unmounting, the client was just stuck waiting for acks to the flushsnap cap requests.

But on the MDS side, for example for tid 479, the MDS finished it in 3 seconds and successfully acked it:

2024-03-31T07:43:02.332+0000 7f08a1e00640  5 mds.1.log _submit_thread 8954495~1128 : EUpdate snap flush [metablob 0x1000000006f, 1 dirs]
2024-03-31T07:43:02.332+0000 7f08aaa00640  1 -- [v2:172.21.15.196:6833/109675901,v1:172.21.15.196:6835/109675901] <== client.34365 v1:192.168.0.1:0/2992703401 281 ==== client_caps(flushsnap ino 0x10000000070 0 seq 0 tid 479 caps=pFc dirty=Fxw wanted=- follows 226 mseq 1 size 1481/0 mtime 2019-11-25T00:32:01.000000+0000 ctime 2024-03-31T07:42:56.342229+0000 change_attr 2 tws 1) ==== 268+0+0 (unknown 419906353 0 0) 0x55ca9638c380 con 0x55ca9634ec00
2024-03-31T07:43:02.332+0000 7f08aaa00640  7 mds.1.locker handle_client_caps on 0x10000000070 tid 479 follows 226 op flushsnap flags 0x0
2024-03-31T07:43:02.332+0000 7f08aaa00640 20 mds.1.56 get_session have 0x55ca96120d00 client.34365 v1:192.168.0.1:0/2992703401 state open
...
2024-03-31T07:43:05.302+0000 7f08a2800640 10 mds.1.56 send_message_client_counted client.34365 seq 287 client_caps(flushsnap_ack ino 0x10000000070 0 seq 0 tid 479 caps=- dirty=Fxw wanted=- follows 226 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0)
2024-03-31T07:43:05.302+0000 7f08a2800640  1 -- [v2:172.21.15.196:6833/109675901,v1:172.21.15.196:6835/109675901] --> v1:192.168.0.1:0/2992703401 -- client_caps(flushsnap_ack ino 0x10000000070 0 seq 0 tid 479 caps=- dirty=Fxw wanted=- follows 226 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) -- 0x55ca96303500 con 0x55ca9634ec00

So it's not clear why the kclient didn't advance it and just kept waiting.

Actions #27

Updated by Xiubo Li 17 days ago

Milind Changire wrote:

main: (Data Uninlining PR)
https://pulpito.ceph.com/mchangir-2024-03-30_16:00:44-fs:functional-wip-mchangir-testing1-main-20240328-072412-distro-default-smithi/7631112

As discussed in the slack thread, when unmounting there are still some files being opened in background.

Actions #28

Updated by Xiubo Li 11 days ago

I have added more debug logs that will dump out why the flushsnap_ack was directly dropped:

@@ -3978,8 +3983,8 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
        list_for_each_entry(iter, &ci->i_cap_snaps, ci_item) {
                if (iter->follows == follows) {
                        if (iter->cap_flush.tid != flush_tid) {
-                               doutc(cl, " cap_snap %p follows %lld " 
-                                     "tid %lld != %lld\n", iter,
+                               pr_info_client(cl, " %llx.%llx cap_snap %p follows %lld " 
+                                     "tid %lld != %lld\n", ceph_vinop(inode), iter,
                                      follows, flush_tid,
                                      iter->cap_flush.tid);
                                break;
@@ -3987,8 +3992,10 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
                        capsnap = iter;
                        break;
                } else {
-                       doutc(cl, " skipping cap_snap %p follows %lld\n",
-                             iter, iter->follows);
+                       pr_info_client(cl, " %llx.%llx skipping cap_snap %p follows %lld, " 
+                             "iter->follows %lld tid %lld iter->cap_flush.tid %lld\n",
+                             ceph_vinop(inode), iter, follows, iter->follows, flush_tid,
+                             iter->cap_flush.tid);
                }
        }
        if (capsnap)

Based on Jeff's previous debug commit:

commit ca18c83ae7caee39bcd64423e799456251ffa9b9
Author: Jeff Layton <jlayton@kernel.org>
Date:   Wed Jul 7 09:20:34 2021 -0400

    [DO NOT MERGE] ceph: dump info about cap flushes when we're waiting too long for them

    We've had some cases of hung umounts in teuthology testing. It looks
    like client is waiting for cap flushes to complete, but they aren't.

    Add a field to the inode to track the highest cap flush tid seen for
    that inode. Also, add a backpointer to the inode to the ceph_cap_flush
    struct.

    Change wait_caps_flush to wait 60s, and then dump info about the
    condition of the list.

    Also, print pr_info messages if we end up dropping a FLUSH_ACK for an
    inode onto the floor, or if we get a message on an unregistered
    session.

    Reported-by: Patrick Donnelly <pdonnell@redhat.com>
    URL: https://tracker.ceph.com/issues/51279
    Signed-off-by: Jeff Layton <jlayton@kernel.org>