
Bug #62262

workunits/fs/test_o_trunc.sh failed with timedout

Added by Xiubo Li 7 months ago. Updated 7 months ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/vshankar-2023-07-25_11:29:34-fs-wip-vshankar-testing-20230725.043804-testing-default-smithi/7350758/

2023-07-27T22:01:39.149 INFO:tasks.workunit.client.0.smithi031.stdout:123/600: open fd = 3
2023-07-27T22:01:39.149 INFO:tasks.workunit.client.0.smithi031.stdout:write ret = 32
2023-07-27T22:01:39.149 INFO:tasks.workunit.client.0.smithi031.stdout:write ret = 32
2023-07-27T22:01:39.149 INFO:tasks.workunit.client.0.smithi031.stdout:pread ret = 64
2023-07-27T22:01:39.149 INFO:tasks.workunit.client.0.smithi031.stdout:124/600: open fd = 3
2023-07-27T22:01:39.149 INFO:tasks.workunit.client.0.smithi031.stdout:write ret = 32
2023-07-27T22:01:39.149 INFO:tasks.workunit.client.0.smithi031.stdout:write ret = 32
2023-07-27T22:01:39.149 INFO:tasks.workunit.client.0.smithi031.stdout:pread ret = 64
2023-07-27T22:01:39.150 INFO:tasks.workunit.client.0.smithi031.stdout:125/600: open fd = 3
2023-07-27T22:01:39.150 INFO:tasks.workunit.client.0.smithi031.stdout:write ret = 32
2023-07-27T22:01:39.150 INFO:tasks.workunit.client.0.smithi031.stdout:write ret = 32
2023-07-27T22:01:39.150 INFO:tasks.workunit.client.0.smithi031.stdout:pread ret = 64
2023-07-27T22:01:39.150 INFO:tasks.workunit.client.0.smithi031.stdout:126/600: open fd = 3
2023-07-27T22:01:39.150 INFO:tasks.workunit.client.0.smithi031.stdout:write ret = 32
...
2023-07-27T23:01:59.338 DEBUG:teuthology.orchestra.run:got remote process result: 124
2023-07-27T23:01:59.339 INFO:tasks.workunit.client.0.smithi031.stdout:write ret = 32
2023-07-27T23:01:59.339 INFO:tasks.workunit:Stopping ['fs/test_o_trunc.sh'] on client.0...
2023-07-27T23:01:59.340 DEBUG:teuthology.orchestra.run.smithi031:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2023-07-27T23:01:59.605 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/run_tasks.py", line 84, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_a5d48e8c688c2543997a98b8acbae2ce9635578c/qa/tasks/workunit.py", line 145, in task
    _spawn_on_all_clients(ctx, refspec, all_tasks, config.get('env'),
  File "/home/teuthworker/src/git.ceph.com_ceph-c_a5d48e8c688c2543997a98b8acbae2ce9635578c/qa/tasks/workunit.py", line 295, in _spawn_on_all_clients
    p.spawn(_run_tests, ctx, refspec, role, [unit], env,
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_a5d48e8c688c2543997a98b8acbae2ce9635578c/qa/tasks/workunit.py", line 424, in _run_tests
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/orchestra/remote.py", line 522, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_407880c6d3fb77318fff01c863715090f9c2de69/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed (workunit test fs/test_o_trunc.sh) on smithi031 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=a5d48e8c688c2543997a98b8acbae2ce9635578c TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/test_o_trunc.sh'
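
The per-iteration pattern in the log above ("open fd = 3", two 32-byte writes, "pread ret = 64") can be reconstructed as a minimal sketch. This is a hypothetical illustration only: the real workunit is qa/workunits/fs/test_o_trunc.sh, which drives a compiled test binary, and the filename, buffer contents, and loop count here are assumptions inferred from the log output.

```python
# Hypothetical sketch of the I/O pattern visible in the workunit log.
# Not the actual test source; names and sizes are inferred from the log.
import os

def o_trunc_round(path, i, total=600):
    # Each round reopens the file with O_TRUNC, matching "N/600: open fd = 3".
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
    print(f"{i}/{total}: open fd = {fd}")
    for _ in range(2):
        ret = os.write(fd, b"x" * 32)   # log shows "write ret = 32" twice
        print(f"write ret = {ret}")
    ret = len(os.pread(fd, 64, 0))      # log shows "pread ret = 64"
    print(f"pread ret = {ret}")
    os.close(fd)
    return ret

o_trunc_round("/tmp/trunc_sketch.foo", 123)
```

On CephFS, each O_TRUNC open of this loop turns into a truncating setattr on the MDS, which is where the cap-revoke cycle described in comment #2 below comes in.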

History

#1 Updated by Xiubo Li 7 months ago

  • Assignee set to Xiubo Li

#2 Updated by Xiubo Li 7 months ago

Just copied my analysis from https://github.com/ceph/ceph/pull/45073/commits/f064cd9a78ae475c574d1d46db18748fe9c0014c#r1281465207:

This is causing a dead loop between setattr and cap revoke:

https://pulpito.ceph.com/vshankar-2023-07-25_11:29:34-fs-wip-vshankar-testing-20230725.043804-testing-default-smithi/7350758/

If there is only a single old client, then that client can always hold the Asx caps.

1. When truncating the file, the MDS tries to acquire the rdlock on the authlock, so it must revoke the Ax caps from the old client, and the truncating request goes to sleep. Since the old client is the only client, the Fsxrwcb caps are granted to it.

2. The client releases the Ax caps, which wakes up the above truncating request.

3. This time the request successfully sees that fscrypt is not enabled and continues into handle_client_setattr(). But now it tries to xlock the filelock, which requires revoking the Fsxrw caps from the old client, so it goes to sleep again. Since there is only one client, the Asx caps are granted back to the old client along with the revoke request.

4. The client releases the Fsxrw caps, which wakes up the truncating request, and the cycle goes back to step 1.

So it spins in a dead loop until the QA test times out.
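
The four steps above can be sketched as a tiny state simulation. This is not Ceph code, just a hedged model of the cycle: because there is a single client, every cap revoke one lock step needs is immediately followed by a re-grant of the complementary caps, recreating the conflict for the next lock step.

```python
# Hypothetical model of the setattr/cap-revoke dead loop (not Ceph code).
# One client holds caps; the truncate request alternates between two lock
# steps, each of which re-creates the cap conflict the other just resolved.

def run_truncate(max_rounds=5):
    caps = {"Ax", "Fsxrw"}          # caps held by the sole old client
    rounds = 0
    while rounds < max_rounds:      # in the real bug this never exits
        # Step 1: rdlock authlock -> revoke Ax; request sleeps.
        caps.discard("Ax")
        caps.add("Fsxrw")           # sole client: file caps granted back
        # Step 2: client releases Ax; request wakes and retries.
        # Step 3: xlock filelock -> revoke Fsxrw; request sleeps again.
        caps.discard("Fsxrw")
        caps.add("Ax")              # sole client: Ax granted back with revoke
        # Step 4: client releases Fsxrw; request wakes -> back to step 1.
        rounds += 1
    return rounds                   # bounded here only by max_rounds

print(run_truncate())
```

The `max_rounds` bound exists only so the sketch terminates; the actual request loops until the 3-hour `timeout` wrapper kills the workunit with status 124, as seen in the traceback above.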

#3 Updated by Xiubo Li 7 months ago

  • Status changed from New to Won't Fix

This is not an issue upstream; it is only caused by the new PR: https://github.com/ceph/ceph/pull/45073.

Will close it.
