Bug #51183

qa: FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions'

Added by Patrick Donnelly 5 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-06-11T19:40:01.304 DEBUG:teuthology.orchestra.run.smithi093:> sudo python3 -c '
2021-06-11T19:40:01.305 DEBUG:teuthology.orchestra.run.smithi093:> import glob
2021-06-11T19:40:01.305 DEBUG:teuthology.orchestra.run.smithi093:> import os
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> import json
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> def get_id_to_dir():
2021-06-11T19:40:01.307 DEBUG:teuthology.orchestra.run.smithi093:>     result = {}
2021-06-11T19:40:01.307 DEBUG:teuthology.orchestra.run.smithi093:>     for dir in glob.glob("/sys/kernel/debug/ceph/*"):
2021-06-11T19:40:01.308 DEBUG:teuthology.orchestra.run.smithi093:>         mds_sessions_lines = open(os.path.join(dir, "mds_sessions")).readlines()
2021-06-11T19:40:01.308 DEBUG:teuthology.orchestra.run.smithi093:>         client_id = mds_sessions_lines[1].split()[1].strip('"'"'"'"'"')
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>         result[client_id] = dir
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>     return result
2021-06-11T19:40:01.310 DEBUG:teuthology.orchestra.run.smithi093:>
2021-06-11T19:40:01.310 DEBUG:teuthology.orchestra.run.smithi093:> print(json.dumps(get_id_to_dir()))
2021-06-11T19:40:01.311 DEBUG:teuthology.orchestra.run.smithi093:> '
2021-06-11T19:40:01.387 INFO:teuthology.orchestra.run.smithi093.stderr:Traceback (most recent call last):
2021-06-11T19:40:01.387 INFO:teuthology.orchestra.run.smithi093.stderr:  File "<string>", line 15, in <module>
2021-06-11T19:40:01.388 INFO:teuthology.orchestra.run.smithi093.stderr:  File "<string>", line 9, in get_id_to_dir
2021-06-11T19:40:01.388 INFO:teuthology.orchestra.run.smithi093.stderr:FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions'
2021-06-11T19:40:01.390 DEBUG:teuthology.orchestra.run:got remote process result: 1

From: /ceph/teuthology-archive/pdonnell-2021-06-11_18:02:10-fs-wip-pdonnell-testing-20210611.162716-distro-basic-smithi/6166803/teuthology.log


Related issues

Copied to CephFS - Backport #51500: pacific: qa: FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions' Resolved

History

#1 Updated by Patrick Donnelly 5 months ago

  • Status changed from New to Triaged
  • Assignee set to Jeff Layton

#2 Updated by Jeff Layton 5 months ago

Weird:

2021-06-11T19:40:01.047 DEBUG:teuthology.orchestra.run.smithi093:> sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/ /home/ubuntu/cephtest/mnt.0 -v -o norequire_active_mds,name=0,conf=/etc/ceph/ceph.conf,norbytes
2021-06-11T19:40:01.126 INFO:teuthology.orchestra.run.smithi093.stdout:parsing options: rw,norequire_active_mds,name=0,conf=/etc/ceph/ceph.conf,norbytes
2021-06-11T19:40:01.126 INFO:teuthology.orchestra.run.smithi093.stdout:mount.ceph: options "norequire_active_mds,name=0,norbytes" will pass to kernel.
2021-06-11T19:40:01.237 INFO:teuthology.orchestra.run.smithi093.stdout:parsing options: rw,norequire_active_mds,name=0,conf=/etc/ceph/ceph.conf,norbytes
2021-06-11T19:40:01.238 INFO:teuthology.orchestra.run.smithi093.stdout:mount.ceph: options "norequire_active_mds,name=0,norbytes" will pass to kernel.
2021-06-11T19:40:01.240 INFO:tasks.cephfs.kernel_mount:mount command passed
2021-06-11T19:40:01.240 INFO:teuthology.orchestra.run:Running command with timeout 300
2021-06-11T19:40:01.241 DEBUG:teuthology.orchestra.run.smithi093:> sudo chmod 1777 /home/ubuntu/cephtest/mnt.0
2021-06-11T19:40:01.304 INFO:teuthology.orchestra.run:Running command with timeout 300
2021-06-11T19:40:01.304 DEBUG:teuthology.orchestra.run.smithi093:> sudo python3 -c '
2021-06-11T19:40:01.305 DEBUG:teuthology.orchestra.run.smithi093:> import glob
2021-06-11T19:40:01.305 DEBUG:teuthology.orchestra.run.smithi093:> import os
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> import json
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> 
2021-06-11T19:40:01.306 DEBUG:teuthology.orchestra.run.smithi093:> def get_id_to_dir():
2021-06-11T19:40:01.307 DEBUG:teuthology.orchestra.run.smithi093:>     result = {}
2021-06-11T19:40:01.307 DEBUG:teuthology.orchestra.run.smithi093:>     for dir in glob.glob("/sys/kernel/debug/ceph/*"):
2021-06-11T19:40:01.308 DEBUG:teuthology.orchestra.run.smithi093:>         mds_sessions_lines = open(os.path.join(dir, "mds_sessions")).readlines()
2021-06-11T19:40:01.308 DEBUG:teuthology.orchestra.run.smithi093:>         client_id = mds_sessions_lines[1].split()[1].strip('"'"'"'"'"')
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:> 
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>         result[client_id] = dir
2021-06-11T19:40:01.309 DEBUG:teuthology.orchestra.run.smithi093:>     return result
2021-06-11T19:40:01.310 DEBUG:teuthology.orchestra.run.smithi093:> 
2021-06-11T19:40:01.310 DEBUG:teuthology.orchestra.run.smithi093:> print(json.dumps(get_id_to_dir()))
2021-06-11T19:40:01.311 DEBUG:teuthology.orchestra.run.smithi093:> '
2021-06-11T19:40:01.387 INFO:teuthology.orchestra.run.smithi093.stderr:Traceback (most recent call last):
2021-06-11T19:40:01.387 INFO:teuthology.orchestra.run.smithi093.stderr:  File "<string>", line 15, in <module>
2021-06-11T19:40:01.388 INFO:teuthology.orchestra.run.smithi093.stderr:  File "<string>", line 9, in get_id_to_dir
2021-06-11T19:40:01.388 INFO:teuthology.orchestra.run.smithi093.stderr:FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions'
2021-06-11T19:40:01.390 DEBUG:teuthology.orchestra.run:got remote process result: 1
2021-06-11T19:40:01.392 INFO:teuthology.nuke.actions:Clearing teuthology firewall rules...
2021-06-11T19:40:01.393 DEBUG:teuthology.orchestra.run.smithi093:> sudo sh -c 'iptables-save | grep -v teuthology | iptables-restore'
2021-06-11T19:40:01.458 DEBUG:teuthology.orchestra.run.smithi183:> sudo sh -c 'iptables-save | grep -v teuthology | iptables-restore'
2021-06-11T19:40:01.491 INFO:teuthology.nuke.actions:Cleared teuthology firewall rules.
2021-06-11T19:40:01.491 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...

So the Python function walks all of the dirs in /sys/kernel/debug/ceph/* and then tries to open the mds_sessions file in each, but it couldn't find this one. My guess is that this may have something to do with the namespace handling in this code, but I'll need to see if there is a way to reproduce it.

#3 Updated by Jeff Layton 5 months ago

Kernel log from that period:

[  640.286619] ceph-brx: port 2(brx.1) entered blocking state
[  640.292521] ceph-brx: port 2(brx.1) entered disabled state
[  640.298432] device brx.1 entered promiscuous mode
[  640.303869] ceph-brx: port 2(brx.1) entered blocking state
[  640.310010] ceph-brx: port 2(brx.1) entered forwarding state
[  640.478182] libceph: mon2 (1)172.21.15.183:6790 session established
[  640.501800] libceph: mon2 (1)172.21.15.183:6790 socket closed (con state OPEN)
[  640.509413] libceph: mon2 (1)172.21.15.183:6790 session lost, hunting for new mon
[  640.535151] libceph: mon1 (1)172.21.15.183:6789 session established
[  640.558832] libceph: client4466 fsid 3fab6bea-f243-47a4-a956-8c03a62b61b5
[  670.513695] libceph: mon2 (1)172.21.15.183:6790 session established
[  670.537322] libceph: mon2 (1)172.21.15.183:6790 socket closed (con state OPEN)
[  670.544892] libceph: mon2 (1)172.21.15.183:6790 session lost, hunting for new mon
[  670.570191] libceph: mon0 (1)172.21.15.93:6789 session established
[  670.578656] libceph: client4721 fsid 3fab6bea-f243-47a4-a956-8c03a62b61b5
[  671.847561] libceph: mon2 (1)172.21.15.183:6790 session established
[  671.871258] libceph: mon2 (1)172.21.15.183:6790 socket closed (con state OPEN)
[  671.878851] libceph: mon2 (1)172.21.15.183:6790 session lost, hunting for new mon
[  671.904405] libceph: mon1 (1)172.21.15.183:6789 session established
[  671.928027] libceph: client4724 fsid 3fab6bea-f243-47a4-a956-8c03a62b61b5
[  681.324297] ceph-brx: port 2(brx.1) entered disabled state
[  681.348218] device brx.1 left promiscuous mode
[  681.353044] ceph-brx: port 2(brx.1) entered disabled state
[  681.443269] device brx.0 left promiscuous mode
[  681.448144] ceph-brx: port 1(brx.0) entered disabled state
[  710.437319] XFS (dm-0): Unmounting Filesystem
[  710.531215] XFS (dm-1): Unmounting Filesystem
[  710.596178] XFS (dm-2): Unmounting Filesystem
[  710.654181] XFS (dm-3): Unmounting Filesystem

We can see that client4721 was opened. Could we just have had that session be closed after the readdir but before the pathwalk to the mds_sessions file?

I could use a sense of how often and under what conditions this is failing. Does it happen every time, or is there some sort of raciness involved? Does it only happen when things are using the namespace stuff?

#5 Updated by Jeff Layton 5 months ago

This sounds unlikely to be a ceph bug. Its usage of debugfs is pretty straightforward. I'd be more inclined to think that this is a race in the RHEL debugfs implementation. Perhaps mounts are racing with a teardown of the previous mount, such that the mds_session file ended up being a negative dentry by the time we got there.

In fact, looking at kernel_mount.py:

            # force delete the netns and umount
            self.client_remote.run(args=['sudo', 'umount', '-f', '-l',
                                         self.mountpoint],
                                   timeout=(15*60), omit_sudo=False)

...when you do a lazy umount (-l), the umount syscall may return before the superblock has been torn down. That could leave the debugfs directory for the mount intact for a little while (until the last reference is put). That may be where the race is here.

It may be best to stop using lazy umounts like this. If there are refcounting problems, this may end up papering over them and you might never know.

#6 Updated by Jeff Layton 4 months ago

  • Pull request ID set to 41860

I submitted a PR to add log messages in this scenario. Once we merge that, perhaps we can clarify what happened.

#7 Updated by Patrick Donnelly 4 months ago

No lazy umount involved: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193795/teuthology.log

#8 Updated by Patrick Donnelly 4 months ago

  • Status changed from Triaged to Fix Under Review
  • Assignee changed from Jeff Layton to Patrick Donnelly
  • Pull request ID changed from 41860 to 42081

#9 Updated by Jeff Layton 4 months ago

This is clearly a race of some sort. Either the script found the directory but the mds_sessions file (and possibly the directory itself) was gone by the time of the open(), or it found the directory before the mds_sessions file had shown up.

Either way, my guess is that the ones that aren't being found are not the ones that we're interested in anyway -- probably from dead mounts that are still in the process of being torn down.

#10 Updated by Patrick Donnelly 4 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to pacific

#11 Updated by Backport Bot 4 months ago

  • Copied to Backport #51500: pacific: qa: FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions' added

#12 Updated by Loïc Dachary 3 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
