Ceph CephFS - Bug #46883: kclient: ghost kernel mount
https://tracker.ceph.com/issues/46883

----
Comment by Jeff Layton (jlayton@redhat.com), 2020-08-10T20:12:33Z:
I think the problem here is that teuthology is using umount -l. That just detaches the mount from the tree but defers the cleanup. The superblock can be found again, and if the options, etc. match, it will simply be reused instead of creating a new client.
For this test, it may be simplest to create and mount a subdirectory and work under there, or perhaps to use a throwaway mds_namespace= mount option. That should ensure that the mount doesn't match the superblock of the blacklisted client.
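[Editor's note: a minimal sketch of the second idea; the monitor address, secret file, and file system name are illustrative, not taken from this run.]

    # A throwaway mds_namespace= (file system name) changes the mount options,
    # so the kernel cannot match the blacklisted client's superblock and has
    # to create a fresh client instance.
    sudo mount -t ceph mon-a:6789:/ /home/ubuntu/cephtest/mnt.0 \
        -o name=admin,secretfile=/etc/ceph/secret,mds_namespace=cephfs_scratch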
----
Comment by Patrick Donnelly (pdonnell@redhat.com), 2020-08-17T13:45:59Z:

So there are two issues here:
- umount should not use -l, so that we aren't papering over bugs. Use -f to umount. If -f fails or hangs, collect debug information and reboot the machine (see the sketch after this list).
- Use separate auth credentials for each mount so the superblocks are always different.
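[Editor's note: a hedged sketch of both points; the helper flow and the cephx capability string are assumptions for illustration, not the actual teuthology change.]

    # 1) Forced unmount, never lazy; on failure, collect state and reboot.
    if ! sudo umount -f /home/ubuntu/cephtest/mnt.0; then
        sudo PATH=/usr/sbin:$PATH lsof
        ps auxf
        sudo reboot
    fi

    # 2) A distinct cephx identity per mount makes the name= mount option
    #    differ, so two mounts can never share a superblock.
    sudo ceph auth get-or-create client.mnt0 \
        mon 'allow r' mds 'allow rw' osd 'allow rw'
    sudo mount -t ceph mon-a:6789:/ /home/ubuntu/cephtest/mnt.0 \
        -o name=mnt0,secretfile=/etc/ceph/mnt0.secret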
----
Comment by Jeff Layton (jlayton@redhat.com), 2020-08-17T14:33:30Z:

Patrick Donnelly wrote:
> So there are two issues here:
>
> - umount should not use -l, so that we aren't papering over bugs. Use -f to umount. If -f fails or hangs, collect debug information and reboot the machine.
> - Use separate auth credentials for each mount so the superblocks are always different.
Bear in mind too that with a forced umount, the cancellation may still not be immediate. So you will need to wait a bit, possibly even a few minutes.
If we do need to reboot at that point, we should forcibly crash the box (echo c > /proc/sysrq-trigger) and collect a core dump via kdump. With that, we could analyze the core to determine the cause.
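[Editor's note: a sketch of that flow, assuming kdump has already been set up on the node; the crashkernel size and dump path are typical defaults, not details from this ticket.]

    # Prerequisite: reserve crash-kernel memory (e.g. crashkernel=256M on the
    # kernel command line) and enable the kdump service ahead of time.
    sudo systemctl enable --now kdump

    # Force an immediate kernel crash; the machine reboots into the crash
    # kernel and saves a vmcore (commonly under /var/crash/) for analysis.
    echo 1 | sudo tee /proc/sys/kernel/sysrq
    echo c | sudo tee /proc/sysrq-trigger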
----
Comment by Patrick Donnelly (pdonnell@redhat.com), 2020-10-09T18:49:53Z:

- Status changed from New to Triaged
- Assignee set to Xiubo Li
- Labels (FS) qa, qa-failure added
----
Comment by Xiubo Li (xiubli@redhat.com), 2020-10-12T06:53:38Z:

Will work on it.
----
Comment by Xiubo Li (xiubli@redhat.com), 2020-10-12T09:02:29Z:

Patrick Donnelly wrote:
> So there are two issues here:
>
> [...]
>
> - Use separate auth credentials for each mount so the superblocks are always different.
To keep it from reusing the existing superblock, how about adding a `noshare` mount option instead?
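[Editor's note: `noshare` already exists as a kernel client mount option (documented in mount.ceph(8)); it creates a new client instance instead of sharing a compatible existing one. A sketch with illustrative addresses and paths:]

    # With noshare, the kernel ceph client skips superblock matching and
    # always instantiates a new client for this mount.
    sudo mount -t ceph mon-a:6789:/ /home/ubuntu/cephtest/mnt.0 \
        -o name=admin,secretfile=/etc/ceph/secret,noshare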
----
Comment by Patrick Donnelly (pdonnell@redhat.com), 2020-10-13T15:20:02Z:

- Status changed from Triaged to Fix Under Review
- Backport set to octopus,nautilus
- Pull request ID set to 37652
----
Comment by Jeff Layton (jlayton@redhat.com), 2020-10-14T19:06:42Z:

I'm not a fan of this noshare option. That seems like a hacky workaround for a problem that I'm not sure any of us fully understands. Also, why do we need special options to umount at all? That should just work.
I could understand it if we were (e.g.) dealing with a test where the MDS has gone unresponsive. You might need a -f in that case, or something, but ordinarily it shouldn't be needed.
----
Comment by Xiubo Li (xiubli@redhat.com), 2020-10-15T08:27:46Z:

From /ceph/teuthology-archive/pdonnell-2020-08-08_02:19:19-kcephfs-wip-pdonnell-testing-20200808.001303-distro-basic-smithi/5319130/teuthology.log:
2020-08-08T16:09:18.434 INFO:teuthology.orchestra.run.smithi072:> sudo umount /home/ubuntu/cephtest/mnt.0 -f
2020-08-08T16:09:18.479 INFO:teuthology.orchestra.run.smithi072.stderr:umount: /home/ubuntu/cephtest/mnt.0: target is busy.
2020-08-08T16:09:18.481 DEBUG:teuthology.orchestra.run:got remote process result: 32
2020-08-08T16:09:18.482 INFO:teuthology.orchestra.run:Running command with timeout 900
2020-08-08T16:09:18.483 INFO:teuthology.orchestra.run.smithi072:> sudo PATH=/usr/sbin:$PATH lsof ; ps auxf
2020-08-08T16:09:18.523 INFO:teuthology.orchestra.run.smithi072.stderr:lsof: WARNING: can't stat() ceph file system /home/ubuntu/cephtest/mnt.0
2020-08-08T16:09:18.524 INFO:teuthology.orchestra.run.smithi072.stderr: Output information may be incomplete.
2020-08-08T16:09:18.848 INFO:teuthology.orchestra.run.smithi072.stdout:COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
2020-08-08T16:09:18.849 INFO:teuthology.orchestra.run.smithi072.stdout:systemd 1 root cwd DIR 8,1 4096 2 /
......
2020-08-08T16:09:19.944 INFO:teuthology.orchestra.run.smithi072.stdout:python3 33373 root 3w unknown /home/ubuntu/cephtest/mnt.0/background_file-1 (stat: Input/output error)
2020-08-08T16:09:19.944 INFO:teuthology.orchestra.run.smithi072.stdout:python3 33373 root 4w unknown /home/ubuntu/cephtest/mnt.0/background_file-2 (stat: Input/output error)
......
2020-08-08T16:09:20.016 INFO:teuthology.orchestra.run.smithi072.stdout:root 33355 0.0 0.0 144436 7676 ? Ss 16:09 0:00 \_ sudo adjust-ulimits daemon-helper kill python3 -c import time import fcntl import struct f1 = open("/home/ubuntu/cephtest/mnt.0/background_file-1", 'w') fcntl.flock(f1, fcntl.LOCK_EX | fcntl.LOCK_NB) f2 = open("/home/ubuntu/cephtest/mnt.0/background_file-2", 'w') lockdata = struct.pack('hhllhh', fcntl.F_WRLCK, 0, 0, 0, 0, 0) fcntl.fcntl(f2, fcntl.F_SETLK, lockdata) while True: time.sleep(1)
2020-08-08T16:09:20.017 INFO:teuthology.orchestra.run.smithi072.stdout:root 33371 0.3 0.0 40060 10848 ? S 16:09 0:00 | \_ /usr/bin/python3 /bin/daemon-helper kill python3 -c import time import fcntl import struct f1 = open("/home/ubuntu/cephtest/mnt.0/background_file-1", 'w') fcntl.flock(f1, fcntl.LOCK_EX | fcntl.LOCK_NB) f2 = open("/home/ubuntu/cephtest/mnt.0/background_file-2", 'w') lockdata = struct.pack('hhllhh', fcntl.F_WRLCK, 0, 0, 0, 0, 0) fcntl.fcntl(f2, fcntl.F_SETLK, lockdata) while True: time.sleep(1)
2020-08-08T16:09:20.017 INFO:teuthology.orchestra.run.smithi072.stdout:root 33373 0.1 0.0 40308 8744 ? Ss 16:09 0:00 | \_ python3 -c import time import fcntl import struct f1 = open("/home/ubuntu/cephtest/mnt.0/background_file-1", 'w') fcntl.flock(f1, fcntl.LOCK_EX | fcntl.LOCK_NB) f2 = open("/home/ubuntu/cephtest/mnt.0/background_file-2", 'w') lockdata = struct.pack('hhllhh', fcntl.F_WRLCK, 0, 0, 0, 0, 0) fcntl.fcntl(f2, fcntl.F_SETLK, lockdata) while True: time.sleep(1)
2020-08-08T16:09:20.017 INFO:teuthology.orchestra.run.smithi072.stdout:ubuntu 33450 0.0 0.0 12696 3076 ? Ss 16:09 0:00 \_ bash -c sudo PATH=/usr/sbin:$PATH lsof ; ps auxf
Checking the log again: several python3 processes are still running and holding files open on the mount point, which is why the forced umount above fails with "target is busy".
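[Editor's note: for readability, here is the inline script from the ps output above, reconstructed as a normal Python file; indentation is inferred from the one-liner, and the paths are the ones in the log.]

    import fcntl
    import struct
    import time

    # Hold a BSD-style flock on one file...
    f1 = open("/home/ubuntu/cephtest/mnt.0/background_file-1", "w")
    fcntl.flock(f1, fcntl.LOCK_EX | fcntl.LOCK_NB)

    # ...and a POSIX record lock on a second, then sleep forever so both
    # files stay open and the mount point stays busy.
    f2 = open("/home/ubuntu/cephtest/mnt.0/background_file-2", "w")
    lockdata = struct.pack("hhllhh", fcntl.F_WRLCK, 0, 0, 0, 0, 0)
    fcntl.fcntl(f2, fcntl.F_SETLK, lockdata)

    while True:
        time.sleep(1)

As long as these processes are alive, the mount cannot be unmounted cleanly, matching the "target is busy" error in the log.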
----
Comment by Patrick Donnelly (pdonnell@redhat.com), 2020-10-25T23:26:09Z:

- Status changed from Fix Under Review to Resolved
- Backport deleted (octopus,nautilus)
----
Comment by Ramana Raja (rraja@redhat.com), 2021-04-28T01:29:42Z:

I saw the following failure multiple times in Yuri's nautilus runs:
https://pulpito.ceph.com/yuriw-2021-04-21_16:19:50-kcephfs-wip-yuri2-testing-2021-04-20-0721-nautilus-distro-basic-smithi/6062417/
https://pulpito.ceph.com/yuriw-2021-04-21_16:19:50-kcephfs-wip-yuri2-testing-2021-04-20-0721-nautilus-distro-basic-smithi/6062456/
2021-04-22T10:55:20.583 INFO:tasks.cephfs_test_runner:======================================================================
2021-04-22T10:55:20.583 INFO:tasks.cephfs_test_runner:ERROR: test_evicted_caps (tasks.cephfs.test_client_recovery.TestClientRecovery)
2021-04-22T10:55:20.583 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-04-22T10:55:20.584 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-04-22T10:55:20.584 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/github.com_ceph_ceph-c_b0d014fe25986033c2db9422289a173f45eea553/qa/tasks/cephfs/test_client_recovery.py", line 335, in test_evicted_caps
2021-04-22T10:55:20.584 INFO:tasks.cephfs_test_runner: cap_holder.wait()
2021-04-22T10:55:20.584 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_git_teuthology_2713a3cd31b17738a50039eaa9d859b5dc39fb8a/teuthology/orchestra/run.py", line 161, in wait
2021-04-22T10:55:20.584 INFO:tasks.cephfs_test_runner: self._raise_for_status()
2021-04-22T10:55:20.584 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_git_teuthology_2713a3cd31b17738a50039eaa9d859b5dc39fb8a/teuthology/orchestra/run.py", line 179, in _raise_for_status
2021-04-22T10:55:20.584 INFO:tasks.cephfs_test_runner: raise CommandCrashedError(command=self.command)
2021-04-22T10:55:20.584 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandCrashedError: Command crashed: 'sudo adjust-ulimits daemon-helper kill python3 -c \'\nimport time\n\nwith open("/home/ubuntu/cephtest/mnt.0/background_file", \'"\'"\'w\'"\'"\') as f:\n f.write(\'"\'"\'content\'"\'"\')\n f.flush()\n f.write(\'"\'"\'content2\'"\'"\')\n while True:\n time.sleep(1)\n\''
The failing test was rewritten by this tracker's fix:
https://github.com/ceph/ceph/pull/37652/commits/def177ff3ba32a9441cb3bfb05a8dd993b27d994#diff-dbce7a954ca60840ad53bc499ed18b959217ceb56c0b0e0e425c4d0ca986deb2L268
Should we backport this to octopus/nautilus?

----
Comment by Xiubo Li (xiubli@redhat.com), 2021-04-28T01:47:56Z:
Ramana Raja wrote:
> I saw the following failure multiple times in Yuri's nautilus runs:
> https://pulpito.ceph.com/yuriw-2021-04-21_16:19:50-kcephfs-wip-yuri2-testing-2021-04-20-0721-nautilus-distro-basic-smithi/6062417/
> https://pulpito.ceph.com/yuriw-2021-04-21_16:19:50-kcephfs-wip-yuri2-testing-2021-04-20-0721-nautilus-distro-basic-smithi/6062456/
>
> [...]
>
> The failing test was rewritten by this tracker's fix:
> https://github.com/ceph/ceph/pull/37652/commits/def177ff3ba32a9441cb3bfb05a8dd993b27d994#diff-dbce7a954ca60840ad53bc499ed18b959217ceb56c0b0e0e425c4d0ca986deb2L268
>
> Should we backport this to octopus/nautilus?
Yeah, I checked the logs; they are the same issue as this one.

I will backport it to octopus/nautilus.

----
Comment by Ramana Raja (rraja@redhat.com), 2021-04-28T21:36:12Z:
Xiubo Li wrote:
> Ramana Raja wrote:
>
>> I saw the following failure multiple times in Yuri's nautilus runs:
>> https://pulpito.ceph.com/yuriw-2021-04-21_16:19:50-kcephfs-wip-yuri2-testing-2021-04-20-0721-nautilus-distro-basic-smithi/6062417/
>> https://pulpito.ceph.com/yuriw-2021-04-21_16:19:50-kcephfs-wip-yuri2-testing-2021-04-20-0721-nautilus-distro-basic-smithi/6062456/
>>
>> [...]
>>
>> The failing test was rewritten by this tracker's fix:
>> https://github.com/ceph/ceph/pull/37652/commits/def177ff3ba32a9441cb3bfb05a8dd993b27d994#diff-dbce7a954ca60840ad53bc499ed18b959217ceb56c0b0e0e425c4d0ca986deb2L268
>>
>> Should we backport this to octopus/nautilus?
>
> Yeah, I checked the logs; they are the same issue as this one.
>
> I will backport it to octopus/nautilus.
Thanks a lot, Xiubo!