Bug #48203

qa: quota failure

Added by Patrick Donnelly 2 months ago. Updated 16 days ago.

Status: Resolved
Priority: High
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, kceph, qa-suite
Labels (FS):
Pull request ID:
Crash signature:

Description

2020-11-04T18:03:28.550 INFO:tasks.workunit.client.1.smithi096.stderr:+ mv files limit/
2020-11-04T18:03:28.559 INFO:tasks.workunit.client.1.smithi096.stderr:+ return 1
2020-11-04T18:03:28.561 DEBUG:teuthology.orchestra.run:got remote process result: 1
...
2020-11-04T18:03:29.472 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 90, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 69, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20201103.210407/qa/tasks/workunit.py", line 147, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20201103.210407/qa/tasks/workunit.py", line 297, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20201103.210407/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 215, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 446, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 160, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 182, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test fs/quota/quota.sh) on smithi096 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d2e2c4f1d55b90d4d72fec898522c82d26aa11c4 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/fs/quota/quota.sh'

From: /ceph/teuthology-archive/pdonnell-2020-11-04_17:39:34-fs-wip-pdonnell-testing-20201103.210407-distro-basic-smithi/5590496/teuthology.log

See also discussion here: https://tracker.ceph.com/issues/36593#note-13

History

#1 Updated by Patrick Donnelly 2 months ago

  • Description updated (diff)

#2 Updated by Patrick Donnelly 2 months ago

In response to: https://tracker.ceph.com/issues/36593#note-14

Yes, there is no easy solution here. I guess we have to transmit the truncate as soon as it occurs.

#3 Updated by Luis Henriques 2 months ago

After discussing this with Jeff on the mailing list [1], we agreed that the best thing to do is to simply revert to returning -EXDEV when doing cross-quota-realm renames. Any other solution would have a big performance impact. This effectively means reverting commit dffdcd71458e ("ceph: allow rename operation under different quota realms").

[1] https://www.spinics.net/lists/ceph-devel/msg49850.html
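The behaviour being restored can be illustrated with a toy model. This is only a sketch, not the kernel or fuse client code: the quota table, the helper names `quota_realm` and `check_rename`, and the rule that a path's realm is its nearest quota-bearing ancestor are all assumptions made for illustration.

```python
import errno
from pathlib import PurePosixPath

# Hypothetical quota table: directory -> max_bytes (illustrative values only).
QUOTAS = {PurePosixPath("/limit"): 1_000_000}


def quota_realm(path, quotas=QUOTAS):
    """Return the path itself or its nearest ancestor that has a quota, else None."""
    p = PurePosixPath(path)
    for candidate in (p, *p.parents):
        if candidate in quotas:
            return candidate
    return None


def check_rename(src, dst, quotas=QUOTAS):
    """Raise OSError(EXDEV) when src and dst fall under different quota realms."""
    if quota_realm(src, quotas) != quota_realm(dst, quotas):
        raise OSError(errno.EXDEV, "rename across quota realms")


# Moving into a quota directory from outside is refused in this model.
try:
    check_rename("/files", "/limit/files")
except OSError as exc:
    print(errno.errorcode[exc.errno])

# Renaming inside a single realm is allowed.
check_rename("/limit/a", "/limit/b")
```

In this model the rename itself never has to account for file sizes, which is the appeal of the revert: the decision depends only on which realm each path belongs to.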

#4 Updated by Patrick Donnelly 2 months ago

Luis Henriques wrote:

After discussing this with Jeff on the mailing list [1], we agreed that the best thing to do is to simply revert to returning -EXDEV when doing cross-quota-realm renames. Any other solution would have a big performance impact. This effectively means reverting commit dffdcd71458e ("ceph: allow rename operation under different quota realms").

[1] https://www.spinics.net/lists/ceph-devel/msg49850.html

We also need to revert the userspace change, which was made for issue #39715.

To avoid forgetting why this is not a good idea, let's also add some good comments about why we don't allow cross-quota renames.
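For context on what callers see after the revert: EXDEV is the same error a rename across filesystems returns, and tools such as mv usually react by copying the data and removing the source. A minimal sketch of that caller-side handling follows; `move_with_fallback` is a hypothetical helper for regular files only and is not part of Ceph or the quota.sh workunit.

```python
import errno
import os
import shutil


def move_with_fallback(src, dst):
    """Rename a regular file; on EXDEV, copy it to dst and remove the original."""
    try:
        os.rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # any other failure is not handled here
    # Fallback path: the data is written again at the destination,
    # so it goes through the normal write path instead of a metadata-only rename.
    shutil.copy2(src, dst)
    os.remove(src)
    return "copied"
```

Because the fallback rewrites the data at the destination, a quota on the destination directory can be enforced as the bytes are written.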

#5 Updated by Luis Henriques 2 months ago

Patrick Donnelly wrote:

Luis Henriques wrote:

After discussing this with Jeff on the mailing list [1], we agreed that the best thing to do is to simply revert to returning -EXDEV when doing cross-quota-realm renames. Any other solution would have a big performance impact. This effectively means reverting commit dffdcd71458e ("ceph: allow rename operation under different quota realms").

[1] https://www.spinics.net/lists/ceph-devel/msg49850.html

We also need to revert the userspace change, which was made for issue #39715.

I can submit the revert for the fuse client, but I wonder if that's really necessary (other than for consistency between the two clients, of course). AFAICS the fuse client does not have the same problem because the truncate operations are sync. Or did you see a similar failure?

To avoid forgetting why this is not a good idea, let's also add some good comments about why we don't allow cross-quota renames.

The kernel client revert is currently queued on the testing branch, and I tried to describe there the problem with the commit.

#6 Updated by Patrick Donnelly 2 months ago

Luis Henriques wrote:

Patrick Donnelly wrote:

Luis Henriques wrote:

After discussing this with Jeff on the mailing list [1], we agreed that the best thing to do is to simply revert to returning -EXDEV when doing cross-quota-realm renames. Any other solution would have a big performance impact. This effectively means reverting commit dffdcd71458e ("ceph: allow rename operation under different quota realms").

[1] https://www.spinics.net/lists/ceph-devel/msg49850.html

We also need to revert the userspace change, which was made for issue #39715.

I can submit the revert for the fuse client, but I wonder if that's really necessary (other than for consistency between the two clients, of course). AFAICS the fuse client does not have the same problem because the truncate operations are sync. Or did you see a similar failure?

The issue is not confined to a single client's consistent view of the quotas. We could have another client truncating files during the rename.

To avoid forgetting why this is not a good idea, let's also add some good comments about why we don't allow cross-quota renames.

The kernel client revert is currently queued on the testing branch, and I tried to describe there the problem with the commit.

Thanks!

#7 Updated by Luis Henriques 2 months ago

  • Pull request ID set to 38112

Created pull-request https://github.com/ceph/ceph/pull/38112 that simply reverts the fuse-client commit b8954e5734b3 ("client: optimize rename operation under different quota root").

#8 Updated by Patrick Donnelly about 2 months ago

  • Status changed from New to Resolved
  • Component(FS) Client added

#9 Updated by Patrick Donnelly 16 days ago

  • Status changed from Resolved to Need More Info

Hey Luis, I think this is still broken; the revert didn't work: https://pulpito.ceph.com/teuthology-2021-01-03_03:15:02-fs-master-distro-basic-smithi/5751916/

Can you take a look?

#10 Updated by Luis Henriques 16 days ago

Patrick Donnelly wrote:

Hey Luis, I think this is still broken; the revert didn't work: https://pulpito.ceph.com/teuthology-2021-01-03_03:15:02-fs-master-distro-basic-smithi/5751916/

Can you take a look?

Hmm... It looks like a distro kernel (kernel-4.18.0-240.1.1.el8_3.x86_64) is being used. Maybe the fix hasn't been backported to it? Or am I not looking at it right?

#11 Updated by Patrick Donnelly 16 days ago

  • Status changed from Need More Info to Resolved

Luis Henriques wrote:

Patrick Donnelly wrote:

Hey Luis, I think this is still broken; the revert didn't work: https://pulpito.ceph.com/teuthology-2021-01-03_03:15:02-fs-master-distro-basic-smithi/5751916/

Can you take a look?

Hmm... It looks like a distro kernel (kernel-4.18.0-240.1.1.el8_3.x86_64) is being used. Maybe the fix hasn't been backported to it? Or am I not looking at it right?

Ah, good catch. That's why. Thanks for looking!
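When triaging a similar failure, one quick check is whether the client ran a distro kernel that may lack the fix. The sketch below is only a heuristic based on the release string quoted above; `is_el_distro_kernel` is a hypothetical helper, not something teuthology provides, and it only recognises the `.elN` tag used by RHEL-style kernels.

```python
import re


def is_el_distro_kernel(release):
    """Heuristic: True if the release string carries an '.elN' or '.elN_M' tag."""
    return re.search(r"\.el\d+(_\d+)?\.", release) is not None


# Release string taken from the discussion above.
print(is_el_distro_kernel("4.18.0-240.1.1.el8_3.x86_64"))
```

A match does not prove the fix is missing; it only signals that the kernel's backport status should be checked before reopening the bug.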
