Bug #51183
qa: FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions'
Status: Closed
Description
2021-06-11T19:40:01.304 DEBUG:teuthology.orchestra.run.smithi093:> sudo python3 -c '
2021-06-11T19:40:01.305 DEBUG:teuthology.orchestra.run.smithi093:> import glob
2021-06-11T19:40:01.305 DEBUG:teuthology.orchestra.run.smithi093:> import os
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> import json
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> def get_id_to_dir():
2021-06-11T19:40:01.307 DEBUG:teuthology.orchestra.run.smithi093:>     result = {}
2021-06-11T19:40:01.307 DEBUG:teuthology.orchestra.run.smithi093:>     for dir in glob.glob("/sys/kernel/debug/ceph/*"):
2021-06-11T19:40:01.308 DEBUG:teuthology.orchestra.run.smithi093:>         mds_sessions_lines = open(os.path.join(dir, "mds_sessions")).readlines()
2021-06-11T19:40:01.308 DEBUG:teuthology.orchestra.run.smithi093:>         client_id = mds_sessions_lines[1].split()[1].strip('"'"'"'"'"')
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>         result[client_id] = dir
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>     return result
2021-06-11T19:40:01.310 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.310 DEBUG:teuthology.orchestra.run.smithi093:> print(json.dumps(get_id_to_dir()))
2021-06-11T19:40:01.311 DEBUG:teuthology.orchestra.run.smithi093:> '
2021-06-11T19:40:01.387 INFO:teuthology.orchestra.run.smithi093.stderr:Traceback (most recent call last):
2021-06-11T19:40:01.387 INFO:teuthology.orchestra.run.smithi093.stderr:  File "<string>", line 15, in <module>
2021-06-11T19:40:01.388 INFO:teuthology.orchestra.run.smithi093.stderr:  File "<string>", line 9, in get_id_to_dir
2021-06-11T19:40:01.388 INFO:teuthology.orchestra.run.smithi093.stderr:FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions'
2021-06-11T19:40:01.390 DEBUG:teuthology.orchestra.run:got remote process result: 1
From: /ceph/teuthology-archive/pdonnell-2021-06-11_18:02:10-fs-wip-pdonnell-testing-20210611.162716-distro-basic-smithi/6166803/teuthology.log
Updated by Patrick Donnelly almost 3 years ago
- Status changed from New to Triaged
- Assignee set to Jeff Layton
Updated by Jeff Layton almost 3 years ago
Weird:
2021-06-11T19:40:01.047 DEBUG:teuthology.orchestra.run.smithi093:> sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/ /home/ubuntu/cephtest/mnt.0 -v -o norequire_active_mds,name=0,conf=/etc/ceph/ceph.conf,norbytes
2021-06-11T19:40:01.126 INFO:teuthology.orchestra.run.smithi093.stdout:parsing options: rw,norequire_active_mds,name=0,conf=/etc/ceph/ceph.conf,norbytes
2021-06-11T19:40:01.126 INFO:teuthology.orchestra.run.smithi093.stdout:mount.ceph: options "norequire_active_mds,name=0,norbytes" will pass to kernel.
2021-06-11T19:40:01.237 INFO:teuthology.orchestra.run.smithi093.stdout:parsing options: rw,norequire_active_mds,name=0,conf=/etc/ceph/ceph.conf,norbytes
2021-06-11T19:40:01.238 INFO:teuthology.orchestra.run.smithi093.stdout:mount.ceph: options "norequire_active_mds,name=0,norbytes" will pass to kernel.
2021-06-11T19:40:01.240 INFO:tasks.cephfs.kernel_mount:mount command passed
2021-06-11T19:40:01.240 INFO:teuthology.orchestra.run:Running command with timeout 300
2021-06-11T19:40:01.241 DEBUG:teuthology.orchestra.run.smithi093:> sudo chmod 1777 /home/ubuntu/cephtest/mnt.0
2021-06-11T19:40:01.304 INFO:teuthology.orchestra.run:Running command with timeout 300
2021-06-11T19:40:01.304 DEBUG:teuthology.orchestra.run.smithi093:> sudo python3 -c '
2021-06-11T19:40:01.305 DEBUG:teuthology.orchestra.run.smithi093:> import glob
2021-06-11T19:40:01.305 DEBUG:teuthology.orchestra.run.smithi093:> import os
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> import json
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> def get_id_to_dir():
2021-06-11T19:40:01.307 DEBUG:teuthology.orchestra.run.smithi093:>     result = {}
2021-06-11T19:40:01.307 DEBUG:teuthology.orchestra.run.smithi093:>     for dir in glob.glob("/sys/kernel/debug/ceph/*"):
2021-06-11T19:40:01.308 DEBUG:teuthology.orchestra.run.smithi093:>         mds_sessions_lines = open(os.path.join(dir, "mds_sessions")).readlines()
2021-06-11T19:40:01.308 DEBUG:teuthology.orchestra.run.smithi093:>         client_id = mds_sessions_lines[1].split()[1].strip('"'"'"'"'"')
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>         result[client_id] = dir
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>     return result
2021-06-11T19:40:01.310 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.310 DEBUG:teuthology.orchestra.run.smithi093:> print(json.dumps(get_id_to_dir()))
2021-06-11T19:40:01.311 DEBUG:teuthology.orchestra.run.smithi093:> '
2021-06-11T19:40:01.387 INFO:teuthology.orchestra.run.smithi093.stderr:Traceback (most recent call last):
2021-06-11T19:40:01.387 INFO:teuthology.orchestra.run.smithi093.stderr:  File "<string>", line 15, in <module>
2021-06-11T19:40:01.388 INFO:teuthology.orchestra.run.smithi093.stderr:  File "<string>", line 9, in get_id_to_dir
2021-06-11T19:40:01.388 INFO:teuthology.orchestra.run.smithi093.stderr:FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions'
2021-06-11T19:40:01.390 DEBUG:teuthology.orchestra.run:got remote process result: 1
2021-06-11T19:40:01.392 INFO:teuthology.nuke.actions:Clearing teuthology firewall rules...
2021-06-11T19:40:01.393 DEBUG:teuthology.orchestra.run.smithi093:> sudo sh -c 'iptables-save | grep -v teuthology | iptables-restore'
2021-06-11T19:40:01.458 DEBUG:teuthology.orchestra.run.smithi183:> sudo sh -c 'iptables-save | grep -v teuthology | iptables-restore'
2021-06-11T19:40:01.491 INFO:teuthology.nuke.actions:Cleared teuthology firewall rules.
2021-06-11T19:40:01.491 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
So the python function walks all of the dirs in /sys/kernel/debug/ceph/* and then tries to open the mds_session file in each, but it couldn't find this one. My guess is that this may have something to do with the namespace handling in this code, but I'll need to see if there is a way to reproduce it.
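This glob-then-open pattern is a classic time-of-check/time-of-use gap: the listing is a snapshot, and nothing keeps the entry alive until the open(). A minimal standalone demonstration of the failure mode, using a temporary directory as a stand-in for debugfs (which needs root):

```python
import glob
import os
import shutil
import tempfile

# Stand-in for /sys/kernel/debug/ceph: a throwaway temp dir.
base = tempfile.mkdtemp()
client_dir = os.path.join(base, "3fab6bea.client4721")
os.mkdir(client_dir)
with open(os.path.join(client_dir, "mds_sessions"), "w") as f:
    f.write("instance: kclient\nclient_id 4721\n")

# Step 1: snapshot the directory listing, as get_id_to_dir() does.
paths = glob.glob(os.path.join(base, "*"))

# Step 2: a concurrent teardown removes the entry before the open().
shutil.rmtree(paths[0])

# Step 3: the open() now fails exactly like the teuthology run did.
race_error = None
try:
    open(os.path.join(paths[0], "mds_sessions"))
except FileNotFoundError as err:
    race_error = err
print("raced:", race_error)
```

Here the removal is forced deterministically; in the test run it would happen asynchronously between the glob() and the open().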
Updated by Jeff Layton almost 3 years ago
Kernel log from that period:
[ 640.286619] ceph-brx: port 2(brx.1) entered blocking state
[ 640.292521] ceph-brx: port 2(brx.1) entered disabled state
[ 640.298432] device brx.1 entered promiscuous mode
[ 640.303869] ceph-brx: port 2(brx.1) entered blocking state
[ 640.310010] ceph-brx: port 2(brx.1) entered forwarding state
[ 640.478182] libceph: mon2 (1)172.21.15.183:6790 session established
[ 640.501800] libceph: mon2 (1)172.21.15.183:6790 socket closed (con state OPEN)
[ 640.509413] libceph: mon2 (1)172.21.15.183:6790 session lost, hunting for new mon
[ 640.535151] libceph: mon1 (1)172.21.15.183:6789 session established
[ 640.558832] libceph: client4466 fsid 3fab6bea-f243-47a4-a956-8c03a62b61b5
[ 670.513695] libceph: mon2 (1)172.21.15.183:6790 session established
[ 670.537322] libceph: mon2 (1)172.21.15.183:6790 socket closed (con state OPEN)
[ 670.544892] libceph: mon2 (1)172.21.15.183:6790 session lost, hunting for new mon
[ 670.570191] libceph: mon0 (1)172.21.15.93:6789 session established
[ 670.578656] libceph: client4721 fsid 3fab6bea-f243-47a4-a956-8c03a62b61b5
[ 671.847561] libceph: mon2 (1)172.21.15.183:6790 session established
[ 671.871258] libceph: mon2 (1)172.21.15.183:6790 socket closed (con state OPEN)
[ 671.878851] libceph: mon2 (1)172.21.15.183:6790 session lost, hunting for new mon
[ 671.904405] libceph: mon1 (1)172.21.15.183:6789 session established
[ 671.928027] libceph: client4724 fsid 3fab6bea-f243-47a4-a956-8c03a62b61b5
[ 681.324297] ceph-brx: port 2(brx.1) entered disabled state
[ 681.348218] device brx.1 left promiscuous mode
[ 681.353044] ceph-brx: port 2(brx.1) entered disabled state
[ 681.443269] device brx.0 left promiscuous mode
[ 681.448144] ceph-brx: port 1(brx.0) entered disabled state
[ 710.437319] XFS (dm-0): Unmounting Filesystem
[ 710.531215] XFS (dm-1): Unmounting Filesystem
[ 710.596178] XFS (dm-2): Unmounting Filesystem
[ 710.654181] XFS (dm-3): Unmounting Filesystem
We can see that client4721 was opened. Could we just have had that session be closed after the readdir but before the pathwalk to the mds_sessions file?
I could use a sense of how often and under what conditions this is failing. Does it happen every time, or is there some sort of raciness involved? Does it only happen when things are using the namespace stuff?
Updated by Patrick Donnelly almost 3 years ago
https://pulpito.ceph.com/pdonnell-2021-06-12_02:45:35-fs-wip-pdonnell-testing-20210612.002809-distro-basic-smithi/6167552/
https://pulpito.ceph.com/pdonnell-2021-06-12_02:45:35-fs-wip-pdonnell-testing-20210612.002809-distro-basic-smithi/6167679/
The common thread here is the RHEL stock kernel. We just updated the tests to use RHEL 8.4, which is probably related.
Updated by Jeff Layton almost 3 years ago
This sounds unlikely to be a ceph bug. Its usage of debugfs is pretty straightforward. I'd be more inclined to think that this is a race in the RHEL debugfs implementation. Perhaps mounts are racing with a teardown of the previous mount, such that the mds_sessions file ended up being a negative dentry by the time we got there.
In fact, looking at kernel_mount.py:
# force delete the netns and umount
self.client_remote.run(args=['sudo', 'umount', '-f', '-l', self.mountpoint],
                       timeout=(15*60), omit_sudo=False)
...when you do a lazy umount (-l), the umount syscall may return before the superblock has been torn down. That could leave the debugfs directory for the mount intact for a little while (until the last reference is put). That may be where the race is here.
It may be best to stop using lazy umounts like this. If there are refcounting problems, this may end up papering over them and you might never know.
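One way to follow that advice, sketched as a hypothetical helper (the name and retry policy are illustrative, not the actual kernel_mount.py API): drop -f/-l and retry a plain umount, so a stuck reference surfaces as a hard failure instead of being deferred.

```python
import subprocess
import time

def umount_with_retry(mountpoint, attempts=5, delay=2.0, run=subprocess.run):
    """Plain (non-lazy) umount with retries.

    Without -l the syscall fails immediately if the superblock is
    still referenced, so refcount leaks show up as errors instead of
    being papered over.  `run` is injectable so the retry logic can
    be exercised without real mounts.
    """
    last = None
    for _ in range(attempts):
        last = run(["sudo", "umount", mountpoint],
                   capture_output=True, text=True)
        if last.returncode == 0:
            return
        time.sleep(delay)  # let transient busy references drain, then retry
    raise RuntimeError(f"could not unmount {mountpoint}: {last.stderr}")
```

The retry loop covers the legitimate transient-EBUSY case, while a mount that never releases its references still fails the test loudly.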
Updated by Jeff Layton almost 3 years ago
- Pull request ID set to 41860
I submitted a PR to add log messages in this scenario. Once we merge that, perhaps we can clarify what happened.
Updated by Patrick Donnelly almost 3 years ago
No lazy umount involved: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193795/teuthology.log
Updated by Patrick Donnelly almost 3 years ago
- Status changed from Triaged to Fix Under Review
- Assignee changed from Jeff Layton to Patrick Donnelly
- Pull request ID changed from 41860 to 42081
Updated by Jeff Layton almost 3 years ago
This is clearly a race of some sort. Either it is finding the directory and the mds_sessions file (and possibly the dir itself) are gone by the time of the open(), or it found the directory and the mds_sessions file hadn't shown up yet by the time of the open().
Either way, my guess is that the ones that aren't being found are not the ones that we're interested in anyway -- probably from dead mounts that are still in the process of being torn down.
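If the entries that vanish really are dying mounts we don't care about, the helper can simply tolerate their disappearance. A sketch of a race-tolerant variant of the quoted get_id_to_dir() (an illustration of the idea, not necessarily the actual change in the PR):

```python
import glob
import os

def get_id_to_dir(debugfs_root="/sys/kernel/debug/ceph"):
    """Map client ids to their debugfs dirs, skipping entries that
    vanish between the glob() snapshot and the open()."""
    result = {}
    for path in glob.glob(os.path.join(debugfs_root, "*")):
        try:
            with open(os.path.join(path, "mds_sessions")) as f:
                lines = f.readlines()
        except FileNotFoundError:
            # Mount torn down mid-walk (or not fully up yet); ignore it.
            continue
        client_id = lines[1].split()[1].strip('"')
        result[client_id] = path
    return result
```

Any client id the caller actually needs will still be missing from the returned dict if its directory raced away, so the caller gets a KeyError it can report, rather than the whole walk exploding on an unrelated dead mount.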
Updated by Patrick Donnelly almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to pacific
Updated by Backport Bot almost 3 years ago
- Copied to Backport #51500: pacific: qa: FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions' added
Updated by Loïc Dachary over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".