Bug #36593

qa: quota failure caused by clients stepping on each other

Added by Patrick Donnelly about 2 months ago. Updated about 1 month ago.

Status: New
Priority: High
Category: -
Target version:
Start date: 10/24/2018
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport: mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:

Description

2018-10-24T03:08:42.204 INFO:tasks.workunit.client.0.smithi071.stderr:100+0 records in
2018-10-24T03:08:42.204 INFO:tasks.workunit.client.0.smithi071.stderr:100+0 records out
2018-10-24T03:08:42.204 INFO:tasks.workunit.client.0.smithi071.stderr:104857600 bytes (105 MB) copied, 1.68958 s, 62.1 MB/s
2018-10-24T03:08:42.211 INFO:tasks.workunit.client.0.smithi071.stderr:+ rm -rf big big2 second third
2018-10-24T03:08:42.244 INFO:tasks.workunit.client.0.smithi071.stderr:+ setfattr . -n ceph.quota.max_files -v 5
2018-10-24T03:08:42.252 INFO:tasks.workunit.client.0.smithi071.stderr:+ mkdir ok
2018-10-24T03:08:42.254 INFO:tasks.workunit.client.0.smithi071.stderr:+ touch ok/1
2018-10-24T03:08:42.262 INFO:tasks.workunit.client.0.smithi071.stderr:+ touch ok/2
2018-10-24T03:08:42.266 INFO:tasks.workunit.client.0.smithi071.stderr:+ touch 3
2018-10-24T03:08:42.271 INFO:tasks.workunit.client.0.smithi071.stderr:+ expect_false touch shouldbefail
2018-10-24T03:08:42.271 INFO:tasks.workunit.client.0.smithi071.stderr:+ set -x
2018-10-24T03:08:42.271 INFO:tasks.workunit.client.0.smithi071.stderr:+ touch shouldbefail
2018-10-24T03:08:42.276 INFO:tasks.workunit.client.0.smithi071.stderr:+ return 1
2018-10-24T03:08:42.279 DEBUG:teuthology.orchestra.run:got remote process result: 1
2018-10-24T03:08:42.279 INFO:tasks.workunit:Stopping ['fs/quota'] on client.0...
2018-10-24T03:08:42.279 INFO:teuthology.orchestra.run.smithi071:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2018-10-24T03:08:42.340 INFO:tasks.workunit.client.3.smithi086.stderr:100+0 records in
2018-10-24T03:08:42.340 INFO:tasks.workunit.client.3.smithi086.stderr:100+0 records out
2018-10-24T03:08:42.340 INFO:tasks.workunit.client.3.smithi086.stderr:104857600 bytes (105 MB) copied, 1.69412 s, 61.9 MB/s
2018-10-24T03:08:42.342 INFO:tasks.workunit.client.3.smithi086.stderr:+ rm -rf big big2 second third
2018-10-24T03:08:42.373 INFO:tasks.workunit.client.3.smithi086.stderr:+ setfattr . -n ceph.quota.max_files -v 5
2018-10-24T03:08:42.397 INFO:tasks.workunit.client.3.smithi086.stderr:+ mkdir ok
2018-10-24T03:08:42.400 INFO:tasks.workunit.client.3.smithi086.stderr:+ touch ok/1
2018-10-24T03:08:42.408 INFO:tasks.workunit.client.3.smithi086.stderr:+ touch ok/2
2018-10-24T03:08:42.412 INFO:tasks.workunit.client.3.smithi086.stderr:+ touch 3
2018-10-24T03:08:42.418 INFO:tasks.workunit.client.3.smithi086.stderr:+ expect_false touch shouldbefail
2018-10-24T03:08:42.418 INFO:tasks.workunit.client.3.smithi086.stderr:+ set -x
2018-10-24T03:08:42.418 INFO:tasks.workunit.client.3.smithi086.stderr:+ touch shouldbefail
2018-10-24T03:08:42.422 INFO:tasks.workunit.client.3.smithi086.stderr:+ return 1
2018-10-24T03:08:42.423 DEBUG:teuthology.orchestra.run:got remote process result: 1
2018-10-24T03:08:42.423 INFO:tasks.workunit:Stopping ['fs/quota'] on client.3...
2018-10-24T03:08:42.424 INFO:teuthology.orchestra.run.smithi086:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.3 /home/ubuntu/cephtest/clone.client.3'
2018-10-24T03:08:42.484 INFO:tasks.workunit.client.1.smithi071.stderr:100+0 records in
2018-10-24T03:08:42.484 INFO:tasks.workunit.client.1.smithi071.stderr:100+0 records out
2018-10-24T03:08:42.485 INFO:tasks.workunit.client.1.smithi071.stderr:104857600 bytes (105 MB) copied, 1.61699 s, 64.8 MB/s
2018-10-24T03:08:42.486 INFO:tasks.workunit.client.1.smithi071.stderr:+ rm -rf big big2 second third
2018-10-24T03:08:42.517 INFO:tasks.workunit.client.1.smithi071.stderr:+ setfattr . -n ceph.quota.max_files -v 5
2018-10-24T03:08:42.535 INFO:tasks.workunit.client.1.smithi071.stderr:+ mkdir ok
2018-10-24T03:08:42.536 INFO:tasks.workunit.client.1.smithi071.stderr:+ touch ok/1
...
CommandFailedError: Command failed (workunit test fs/quota/quota.sh) on smithi071 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=307f3fef8e789fb91a70d2316de219f1d0e5899b TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/quota/quota.sh'

From: /ceph/teuthology-archive/pdonnell-2018-10-24_02:35:37-fs-wip-pdonnell-testing-20181023.224346-distro-basic-smithi/3177753/teuthology.log
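
For reading the trace above: the `+ set -x`, `+ touch shouldbefail`, `+ return 1` sequence is consistent with an `expect_false` helper that inverts the exit status of the command it wraps. The sketch below is an assumption about how such a helper behaves, not a copy of quota.sh. Under that assumption, `return 1` means `touch shouldbefail` succeeded, i.e. the max_files quota was not enforced for that client.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the helper; the real quota.sh may differ.
expect_false() {
    # Succeed only when the wrapped command fails.
    if "$@"; then
        return 1   # command unexpectedly succeeded
    else
        return 0   # command failed, as the test expects
    fi
}

expect_false false && echo "false failed, as expected"
expect_false true  || echo "true succeeded, so expect_false returns 1"
```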

History

#1 Updated by Luis Henriques about 2 months ago

A quick look at the logs shows that there are 4 clients running this test simultaneously. I wonder if this is something that used to succeed before, because these clients seem to be interfering with each other, setting and removing quotas.

If that's the case, a possible fix would be to have each client create its own test directory. Something like the diff below:

diff --git a/qa/workunits/fs/quota/quota.sh b/qa/workunits/fs/quota/quota.sh
index 1315be6d8609..d6e59317ecdf 100755
--- a/qa/workunits/fs/quota/quota.sh
+++ b/qa/workunits/fs/quota/quota.sh
@@ -25,8 +25,9 @@ function write_file()
        return 0
 }

-mkdir quota-test
-cd quota-test
+testdir="$(hostname -s)-quota-test"
+mkdir "$testdir"
+cd "$testdir"

 # bytes
 setfattr . -n ceph.quota.max_bytes -v 100000000  # 100m
@@ -123,6 +124,6 @@ expect_false setfattr -n ceph.quota -v "max_bytes=-1 max_files=-1" .
 #addme

 cd ..
-rm -rf quota-test
+rm -rf "$testdir"

 echo OK
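
One caveat with a hostname-based name: the logs show client.0 and client.1 both running on smithi071, so the host name alone would not distinguish them. A minimal sketch of an alternative, assuming `mktemp -d` is acceptable in the workunit environment; the `quota-test.XXXXXX` template and the `make_testdir` name are illustrative only and are not part of quota.sh:

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper: create a uniquely named test directory under the
# given parent (default: current directory) and print its path.
make_testdir() {
    local parent="${1:-.}"
    mktemp -d "$parent/quota-test.XXXXXX"
}

testdir="$(make_testdir .)"
cd "$testdir"
# ... quota checks would run here ...
cd ..
rm -rf -- "$testdir"
echo OK
```

Because `mktemp` picks the random suffix, two clients sharing a host and a parent directory would still get distinct directories.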

#2 Updated by Patrick Donnelly about 2 months ago

  • Subject changed from quota failure to qa: quota failure caused by clients stepping on each other

#3 Updated by Patrick Donnelly about 2 months ago

  • Assignee set to Patrick Donnelly

#4 Updated by Patrick Donnelly about 1 month ago

  • Assignee deleted (Patrick Donnelly)

Luis, I took a look. Each client gets its own subdirectory in the CephFS mount:

2018-10-24T03:08:34.009 INFO:teuthology.orchestra.run.smithi086:Running (workunit test fs/quota/quota.sh): 'mkdir -p -- /home/ubuntu/cephtest/mnt.2/client.2/tmp && cd -- /home/ubuntu/cephtest/mnt.2/client.2/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=307f3fef8e789fb91a70d2316de219f1d0e5899b TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="2" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.2 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.2 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.2/qa/workunits/fs/quota/quota.sh'

Emphasis on "mkdir -p -- /home/ubuntu/cephtest/mnt.2/client.2/tmp && cd -- /home/ubuntu/cephtest/mnt.2/client.2/tmp". mnt.2 is the CephFS root.

Mind taking another look?
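
The layout described in this comment can be sketched as follows. The `client_workdir` helper is hypothetical; only the `mnt.N/client.N/tmp` pattern is taken from the command lines in the log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the per-client working directory the workunit
# changes into, following the pattern seen in the teuthology command line.
client_workdir() {
    local testdir="$1" id="$2"
    printf '%s/mnt.%s/client.%s/tmp\n' "$testdir" "$id" "$id"
}

for id in 0 1 2 3; do
    client_workdir /home/ubuntu/cephtest "$id"
done
```

Each client therefore runs `mkdir quota-test` under a different parent directory, so the test directories should not collide by name.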

#5 Updated by Luis Henriques about 1 month ago

Ah, sorry! I missed that. Sure, I'll have another look at the logs.

#6 Updated by Patrick Donnelly about 1 month ago

Great, thanks for having a look! I'll assign this to you for now :)

#7 Updated by Patrick Donnelly about 1 month ago

  • Assignee set to Luis Henriques

#8 Updated by Luis Henriques about 1 month ago

Quick update: looking further at the logs only got me more confused :-)

So, all 4 clients fail at the step where they try to create the 'shouldbefail' file; that operation is expected to fail because the max_files quota is set.

What I'm seeing is that both client.0 and client.1 call insert_dentry_inode() for the 'quota-test' dir several times, each time with a different vino. Of the 4 clients, clients 2 and 3 call this function only once (with a unique vino each); the other 2 call it 4 times, with the vinos of all 4 clients:

  • client.2: 0x100000007d4
  • client.3: 0x10000000bbd
  • client.0: 0x10000000002, 0x100000003eb, 0x100000007d4, 0x10000000bbd
  • client.1: 0x100000003eb, 0x10000000002, 0x100000007d4, 0x10000000bbd

This looks wrong to me: a client shouldn't be reading inside another client's client.$ID directory, where the 'quota-test' dirs live.

So, my current theory is that the MDS is doing something wrong when setting the snaprealms, and the clients are receiving the wrong snaprealms for their quota-test dirs. I'll try to dig a bit deeper into the MDS code and see if I can find something, but maybe the above description rings a bell for someone (Yan? :-) )
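
The per-client vino lists above can be checked mechanically. This is a minimal sketch that hard-codes the values from the list in this comment and counts the distinct vinos each client saw; it does not parse the client logs, and the `count_vinos` name is illustrative.

```shell
#!/usr/bin/env bash
# Count the distinct vinos in a space-separated list.
count_vinos() {
    tr -s ' ' '\n' <<<"$1" | sort -u | grep -c .
}

# Values copied from the list in the comment above.
clients=(client.0 client.1 client.2 client.3)
vinos=(
    "0x10000000002 0x100000003eb 0x100000007d4 0x10000000bbd"
    "0x100000003eb 0x10000000002 0x100000007d4 0x10000000bbd"
    "0x100000007d4"
    "0x10000000bbd"
)

for i in "${!clients[@]}"; do
    n="$(count_vinos "${vinos[$i]}")"
    if [ "$n" -gt 1 ]; then
        echo "${clients[$i]}: $n distinct vinos (more than its own quota-test dir)"
    else
        echo "${clients[$i]}: $n distinct vino"
    fi
done
```

A client that only ever touched its own quota-test directory would report a single vino; client.0 and client.1 report four.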

#9 Updated by Luis Henriques about 1 month ago

And another update: I cannot understand why there are two clients (both on smithi071, by the way) that do a readdir in the root directory. See, for example, around 03:08:36.514 in client.0.

Initially I thought this could also be a problem with casting/truncation, because the readdir was being done on 0x1 instead of 0x10000000001 (I even thought about 32-bit clients, but they all seem to be 64-bit). But then client.1 should be readdir'ing 0x100000003ea, and it is still doing it on 0x1.
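
The truncation argument can be checked with a little arithmetic. A minimal sketch, assuming a hypothetical truncation would keep only the low 32 bits of the inode number; the `low32` name is illustrative.

```shell
#!/usr/bin/env bash
# Keep only the low 32 bits of an inode number and print the result in hex.
low32() {
    printf '0x%x\n' $(( $1 & 0xffffffff ))
}

low32 0x10000000001   # prints 0x1
low32 0x100000003ea   # prints 0x3ea
```

Truncation would turn 0x10000000001 into 0x1, but it would turn 0x100000003ea into 0x3ea rather than 0x1, so it cannot explain client.1's readdir on 0x1.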
