Bug #55804
qa failure: pjd link tests failed
Status: Closed
Description
Bug discovered during a QA run for PR https://github.com/ceph/ceph/pull/45556.
Teuthology run: https://pulpito.ceph.com/vshankar-2022-04-26_06:23:29-fs:workload-wip-45556-20220418-102656-testing-default-smithi/. These four jobs failed with the same error: 6806467, 6806495, 6806509, 6806523.
The failure can be reproduced easily. Mount a subvolume and fetch pjdfstest -
$ wget http://download.ceph.com/qa/pjd-fstest-20090130-RC-aclfixes.tgz
$ tar zxvf pjd*.tgz
And then run -
$ # /mnt/cephfs is the cephfs mountpoint
$ cd /mnt/cephfs && sudo cp -r ~/pjd-fstest-20090130-RC ./ && cd pjd-fstest-20090130-RC/ && sudo make clean && sudo make && cd ../ && sudo mkdir tmp && cd tmp
And then run the failing test suites -
$ sudo prove -r -v --exec 'bash -x' ../pjd*/tests/link/02.t
$ sudo prove -r -v --exec 'bash -x' ../pjd*/tests/link/03.t
2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:../pjd-fstest-20090130-RC/tests/link/02.t ......
2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:1..10
2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:ok 1
2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:ok 2
2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:ok 3
2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:not ok 4
2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:ok 5
2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:not ok 6
2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:ok 7
2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:ok 8
2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:ok 9
2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:ok 10
2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:Failed 2/10 subtests
02.t test #4: expect 0 link ${n0} ${name255}
02.t test #6: expect 0 unlink ${name255}
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:../pjd-fstest-20090130-RC/tests/link/03.t ......
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:1..16
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 1
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 2
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 3
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 4
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 5
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 6
2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 7
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:not ok 8
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:not ok 9
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 10
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 11
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 12
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 13
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 14
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 15
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 16
2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:Failed 2/16 subtests
03.t test #8: expect 0 link ${n0} ${path1023}
03.t test #9: expect 0 unlink ${path1023}
Final lines of output of pjd.sh -
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:ok 40
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:ok 41
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:ok 42
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:ok
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:Test Summary Report
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:-------------------
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:../pjd-fstest-20090130-RC/tests/link/02.t (Wstat: 0 Tests: 10 Failed: 2)
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:  Failed tests: 4, 6
2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:../pjd-fstest-20090130-RC/tests/link/03.t (Wstat: 0 Tests: 16 Failed: 2)
2022-04-26T06:59:51.016 INFO:tasks.workunit.client.0.smithi047.stdout:  Failed tests: 8-9
2022-04-26T06:59:51.016 INFO:tasks.workunit.client.0.smithi047.stdout:Files=191, Tests=2286, 69 wallclock secs ( 1.30 usr 0.61 sys + 8.17 cusr 6.84 csys = 16.92 CPU)
2022-04-26T06:59:51.016 INFO:tasks.workunit.client.0.smithi047.stdout:Result: FAIL
2022-04-26T06:59:51.016 INFO:tasks.workunit:Stopping ['suites/pjd.sh'] on client.0...
2022-04-26T06:59:51.016 DEBUG:teuthology.orchestra.run.smithi047:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
Error -
teuthology.exceptions.CommandFailedError: Command failed (workunit test suites/pjd.sh) on smithi047 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=1ccbc711b8876e630c0358e1d8d923daa34dca1e TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/pjd.sh'
Traceback from job 6806467 -
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/run_tasks.py", line 103, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/run_tasks.py", line 82, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_1ccbc711b8876e630c0358e1d8d923daa34dca1e/qa/tasks/workunit.py", line 148, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_1ccbc711b8876e630c0358e1d8d923daa34dca1e/qa/tasks/workunit.py", line 298, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_1ccbc711b8876e630c0358e1d8d923daa34dca1e/qa/tasks/workunit.py", line 427, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/orchestra/run.py", line 183, in _raise_for_status
Updated by Rishabh Dave almost 2 years ago
- Subject changed from pjd failure: link tests failed to qa failure: pjd link tests failed
Updated by Venky Shankar almost 2 years ago
This failure is closely related to CephFS subvolumes. Recent test runs:
Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/kclient/{base/{mount-syntax/{v1} mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} ms_mode/legacy wsync/yes} objectstore-ec/bluestore-comp-ec-root omap_limit/10000 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/3 scrub/yes standby-replay subvolume/{with-no-extra-options} tasks/{0-check-counter workunit/suites/pjd}}
- https://pulpito.ceph.com/vshankar-2022-06-07_00:25:50-fs-wip-vshankar-testing-20220606-223254-testing-default-smithi/6865138
and
Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-ec-root omap_limit/10000 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/1 scrub/yes standby-replay subvolume/{with-quota} tasks/{0-check-counter workunit/suites/pjd}}
- https://pulpito.ceph.com/vshankar-2022-06-07_00:25:50-fs-wip-vshankar-testing-20220606-223254-testing-default-smithi/6865202/
Interestingly, this does not require any special options when creating the subvolume. A plain vanilla subvolume is enough to reproduce the failure.
Updated by Patrick Donnelly over 1 year ago
- Related to Bug #56695: [RHEL stock] pjd test failures(a bug that need to wait the unlink to finish) added
Updated by Venky Shankar over 1 year ago
This issue is probably fixed by PR: https://github.com/ceph/ceph/pull/46331 ("mds: wait unlink to finish to avoid conflict when creating same dentries")
The problem is that the target inode is a stray inode. The pjd test does the following:
expect 0 create ${name255} 0644
expect 0 link ${name255} ${n0}
expect 0 unlink ${name255}
expect 0 link ${n0} ${name255}
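The same sequence can be sketched with plain POSIX calls (a minimal illustration, not the pjd harness itself; on an unaffected filesystem every step succeeds, while on an affected CephFS mount the final link failed with EXDEV):

```python
# Sketch of the pjd link/02.t sequence using POSIX calls.
# On a healthy filesystem every step succeeds; on an affected CephFS
# subvolume mount the last os.link() returned EXDEV.
import os
import tempfile

d = tempfile.mkdtemp()
name255 = os.path.join(d, "a" * 255)  # 255-character file name
n0 = os.path.join(d, "n0")

open(name255, "w").close()  # expect 0 create ${name255} 0644
os.link(name255, n0)        # expect 0 link ${name255} ${n0}
os.unlink(name255)          # expect 0 unlink -> inode moves to the stray dir
os.link(n0, name255)        # expect 0 link   -> this step hit EXDEV on CephFS
print("sequence completed")
```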
The `unlink` causes the inode to be linked into the stray directory. The subsequent `link` call then finds the stray inode:
2022-09-14T13:51:23.153+0000 7f5be44a9700 7 mds.0.server target is [inode 0x1000000033b [2,head] ~mds0/stray1/1000000033b auth v2 pv541 ap=4 s=0 n(v0 1=1+0)->n(v0 rc2022-09-14T13:51:23.126227+0000 1=1+0) (ilink xlockdone x=2) (iversion lock w=2 last_client=15396) caps={15396=pAsXsFscr/-@6} | request=4 lock=5 caps=1 waiter=0 authpin=1 0x5584d73f5600]
And the check:
CInode* target_pin = targeti->get_projected_parent_dir()->inode;
SnapRealm *target_realm = target_pin->find_snaprealm();
if (target_pin != dir->inode &&
    target_realm->get_subvolume_ino() !=
    dir->inode->find_snaprealm()->get_subvolume_ino()) {
  dout(7) << "target is in different subvolume, failing..." << dendl;
  respond_to_request(mdr, -CEPHFS_EXDEV);
  return;
}
results in EXDEV. The PR mentioned above waits for a pending unlink to finish, which should fix this problem.
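A toy model of that check (hypothetical names, not actual Ceph code) illustrates why a stray-parented target can never pass: the stray directory belongs to no subvolume, so its subvolume id never matches the destination directory's.

```python
# Toy model of the MDS subvolume check above (simplified, hypothetical).
# A target inode whose parent is the stray directory reports no
# subvolume, so the comparison fails and link() is refused with EXDEV.
def subvolume_ino(path):
    # Stray directories (~mds*/stray*) live outside every subvolume.
    if path.startswith("~mds"):
        return None
    return path.split("/")[0]  # pretend the first component is the subvolume

def check_link(target_parent, dest_dir):
    if target_parent != dest_dir and \
       subvolume_ino(target_parent) != subvolume_ino(dest_dir):
        return "EXDEV"
    return "OK"

print(check_link("~mds0/stray1", "subvol1/tmp"))  # stray target -> "EXDEV"
print(check_link("subvol1/tmp", "subvol1/tmp"))   # same dir -> "OK"
print(check_link("subvol1/dirA", "subvol1/tmp"))  # same subvolume -> "OK"
```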
Updated by Venky Shankar over 1 year ago
- Related to Bug #55332: Failure in snaptest-git-ceph.sh (it's an async unlink/create bug) added