Project

General

Profile

Bug #55804

qa failure: pjd link tests failed

Added by Rishabh Dave 8 months ago. Updated 3 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
qa-suite
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Bug discovered on QA run for PR - https://github.com/ceph/ceph/pull/45556

Teuthology run: https://pulpito.ceph.com/vshankar-2022-04-26_06:23:29-fs:workload-wip-45556-20220418-102656-testing-default-smithi/. These 4 jobs failed with same error: 6806467, 6806495, 6806509, 6806523.

The failure can be reproduced easily. Mount a subvolume, get pjdfstests -

$ wget http://download.ceph.com/qa/pjd-fstest-20090130-RC-aclfixes.tgz
$ tar zxvf pjd*.tgz

And then run -

$ # /mnt/cephfs is the cephfs mountpoint
$ cd /mnt/cephfs && sudo cp -r ~/pjd-fstest-20090130-RC ./ && cd pjd-fstest-20090130-RC/ && sudo make clean && sudo make && cd ../ && sudo mkdir tmp && cd tmp

And then run the failing testsuites -

$ sudo prove -r -v --exec 'bash -x' ../pjd*/tests/link/02.t
$ sudo prove -r -v --exec 'bash -x' ../pjd*/tests/link/03.t

    2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:../pjd-fstest-20090130-RC/tests/link/02.t ......
    2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:1..10
    2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:ok 1
    2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:ok 2
    2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:ok 3
    2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:not ok 4
    2022-04-26T06:59:11.148 INFO:tasks.workunit.client.0.smithi047.stdout:ok 5
    2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:not ok 6
    2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:ok 7
    2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:ok 8
    2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:ok 9
    2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:ok 10
    2022-04-26T06:59:11.149 INFO:tasks.workunit.client.0.smithi047.stdout:Failed 2/10 subtests

02.t test #4: expect 0 link ${n0} ${name255}
02.t test #6: expect 0 unlink ${name255}

    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:../pjd-fstest-20090130-RC/tests/link/03.t ......
    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:1..16
    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 1
    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 2
    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 3
    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 4
    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 5
    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 6
    2022-04-26T06:59:11.346 INFO:tasks.workunit.client.0.smithi047.stdout:ok 7
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:not ok 8
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:not ok 9
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 10
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 11
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 12
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 13
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 14
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 15
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:ok 16
    2022-04-26T06:59:11.347 INFO:tasks.workunit.client.0.smithi047.stdout:Failed 2/16 subtests

03.t test #8: expect 0 link ${n0} ${path1023}
03.t test #9: expect 0 unlink ${path1023}

Final lines of output of pjd.sh -

    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:ok 40
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:ok 41
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:ok 42
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:ok
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:Test Summary Report
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:-------------------
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:../pjd-fstest-20090130-RC/tests/link/02.t    (Wstat: 0 Tests: 10 Failed: 2)
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:  Failed tests:  4, 6
    2022-04-26T06:59:51.015 INFO:tasks.workunit.client.0.smithi047.stdout:../pjd-fstest-20090130-RC/tests/link/03.t    (Wstat: 0 Tests: 16 Failed: 2)
    2022-04-26T06:59:51.016 INFO:tasks.workunit.client.0.smithi047.stdout:  Failed tests:  8-9
    2022-04-26T06:59:51.016 INFO:tasks.workunit.client.0.smithi047.stdout:Files=191, Tests=2286, 69 wallclock secs ( 1.30 usr  0.61 sys +  8.17 cusr  6.84 csys = 16.92 CPU)
    2022-04-26T06:59:51.016 INFO:tasks.workunit.client.0.smithi047.stdout:Result: FAIL
    2022-04-26T06:59:51.016 INFO:tasks.workunit:Stopping ['suites/pjd.sh'] on client.0...
    2022-04-26T06:59:51.016 DEBUG:teuthology.orchestra.run.smithi047:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0

Error -

    teuthology.exceptions.CommandFailedError: Command failed (workunit test suites/pjd.sh) on smithi047 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=1ccbc711b8876e630c0358e1d8d923daa34dca1e TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/pjd.sh'

Traceback from the job 6806467 -

    Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/run_tasks.py", line 103, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/run_tasks.py", line 82, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_1ccbc711b8876e630c0358e1d8d923daa34dca1e/qa/tasks/workunit.py", line 148, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_1ccbc711b8876e630c0358e1d8d923daa34dca1e/qa/tasks/workunit.py", line 298, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_1ccbc711b8876e630c0358e1d8d923daa34dca1e/qa/tasks/workunit.py", line 427, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_788cfdd8098ad222aa448289edcfa4436091c32c/teuthology/orchestra/run.py", line 183, in _raise_for_status


Related issues

Related to CephFS - Bug #56695: [RHEL stock] pjd test failures Fix Under Review
Related to CephFS - Bug #55332: Failure in snaptest-git-ceph.sh (it's an async unlink/create bug) Pending Backport

History

#1 Updated by Rishabh Dave 8 months ago

  • Description updated (diff)

#2 Updated by Rishabh Dave 8 months ago

  • Description updated (diff)

#3 Updated by Rishabh Dave 8 months ago

  • Description updated (diff)

#4 Updated by Rishabh Dave 8 months ago

  • Description updated (diff)

#5 Updated by Rishabh Dave 8 months ago

  • Subject changed from pjd failure: link tests failed to qa failure: pjd link tests failed

#6 Updated by Venky Shankar 8 months ago

This failure is pretty much related to cephfs subvolumes. Recent test runs:

Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/kclient/{base/{mount-syntax/{v1} mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} ms_mode/legacy wsync/yes} objectstore-ec/bluestore-comp-ec-root omap_limit/10000 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/3 scrub/yes standby-replay subvolume/{with-no-extra-options} tasks/{0-check-counter workunit/suites/pjd}}

- https://pulpito.ceph.com/vshankar-2022-06-07_00:25:50-fs-wip-vshankar-testing-20220606-223254-testing-default-smithi/6865138

and

Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-ec-root omap_limit/10000 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/1 scrub/yes standby-replay subvolume/{with-quota} tasks/{0-check-counter workunit/suites/pjd}}

- https://pulpito.ceph.com/vshankar-2022-06-07_00:25:50-fs-wip-vshankar-testing-20220606-223254-testing-default-smithi/6865202/

Interesting thing is that this does not require special options when creating the subvolume. Just using a plain vanilla subvolume is good enough to reproduce the failure.

#7 Updated by Patrick Donnelly 6 months ago

  • Related to Bug #56695: [RHEL stock] pjd test failures added

#8 Updated by Venky Shankar 3 months ago

This issue is probably fixed by PR: https://github.com/ceph/ceph/pull/46331 ("mds: wait unlink to finish to avoid conflict when creating same dentries")

The problem is that the target inode is a stray inode. pjd test does the following:

expect 0 create ${name255} 0644
expect 0 link ${name255} ${n0}
expect 0 unlink ${name255}
expect 0 link ${n0} ${name255}

The `unlink` causes the inode to get linked in the stray dir. The subsequent `link` call finds the stray inode:

022-09-14T13:51:23.153+0000 7f5be44a9700  7 mds.0.server target is [inode 0x1000000033b [2,head] ~mds0/stray1/1000000033b auth v2 pv541 ap=4 s=0 n(v0 1=1+0)->n(v0 rc2022-09-14T13:51:23.126227+0000 1=1+0) (ilink xlockdone x=2) (iversion lock w=2 last_client=15396) caps={15396=pAsXsFscr/-@6} | request=4 lock=5 caps=1 waiter=0 authpin=1 0x5584d73f5600]

And the check:

  CInode* target_pin = targeti->get_projected_parent_dir()->inode;
  SnapRealm *target_realm = target_pin->find_snaprealm();
  if (target_pin != dir->inode &&
      target_realm->get_subvolume_ino() !=
      dir->inode->find_snaprealm()->get_subvolume_ino()) {
    dout(7) << "target is in different subvolume, failing..." << dendl;
    respond_to_request(mdr, -CEPHFS_EXDEV);
    return;

results in EXDEV. The PR mentioned above, waits for a pending unlink to finish and that should fix this problem.

#9 Updated by Venky Shankar 3 months ago

  • Related to Bug #55332: Failure in snaptest-git-ceph.sh (it's an async unlink/create bug) added

#10 Updated by Venky Shankar 3 months ago

  • Status changed from New to Duplicate

Also available in: Atom PDF