Bug #56695


[RHEL stock] pjd test failures (a bug that needs to wait for the unlink to finish)

Added by Patrick Donnelly almost 2 years ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Assignee: Xiubo Li
Category: Correctness/Safety
Target version: v19.0.0
% Done: 0%
Source:
Tags: backport_processed
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID: 47399
Crash signature (v1):
Crash signature (v2):

Description

pdonnell@teuthology /ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi$ grep 'not ok' {'6945847','6945828','6945853','6945837','6945822','6945804','6945862','6945812'}/teu*
6945847/teuthology.log:2022-07-22T20:52:25.833 INFO:tasks.workunit.client.0.smithi135.stderr:+ echo 'not ok 4'
6945847/teuthology.log:2022-07-22T20:52:25.838 INFO:tasks.workunit.client.0.smithi135.stderr:+ echo 'not ok 6'
6945847/teuthology.log:2022-07-22T20:52:25.851 INFO:tasks.workunit.client.0.smithi135.stdout:not ok 4
6945847/teuthology.log:2022-07-22T20:52:25.852 INFO:tasks.workunit.client.0.smithi135.stdout:not ok 6
6945847/teuthology.log:2022-07-22T20:52:26.141 INFO:tasks.workunit.client.0.smithi135.stderr:+ echo 'not ok 8'
6945847/teuthology.log:2022-07-22T20:52:26.143 INFO:tasks.workunit.client.0.smithi135.stderr:+ echo 'not ok 9'
6945847/teuthology.log:2022-07-22T20:52:26.175 INFO:tasks.workunit.client.0.smithi135.stdout:not ok 8
6945847/teuthology.log:2022-07-22T20:52:26.175 INFO:tasks.workunit.client.0.smithi135.stdout:not ok 9
6945828/teuthology.log:2022-07-22T20:32:24.810 INFO:tasks.workunit.client.0.smithi100.stderr:+ echo 'not ok 4'
6945828/teuthology.log:2022-07-22T20:32:24.815 INFO:tasks.workunit.client.0.smithi100.stderr:+ echo 'not ok 6'
6945828/teuthology.log:2022-07-22T20:32:24.829 INFO:tasks.workunit.client.0.smithi100.stdout:not ok 4
6945828/teuthology.log:2022-07-22T20:32:24.829 INFO:tasks.workunit.client.0.smithi100.stdout:not ok 6
6945828/teuthology.log:2022-07-22T20:32:25.129 INFO:tasks.workunit.client.0.smithi100.stderr:+ echo 'not ok 8'
6945828/teuthology.log:2022-07-22T20:32:25.132 INFO:tasks.workunit.client.0.smithi100.stderr:+ echo 'not ok 9'
6945828/teuthology.log:2022-07-22T20:32:25.162 INFO:tasks.workunit.client.0.smithi100.stdout:not ok 8
6945828/teuthology.log:2022-07-22T20:32:25.162 INFO:tasks.workunit.client.0.smithi100.stdout:not ok 9
6945853/teuthology.log:2022-07-22T21:04:37.135 INFO:tasks.workunit.client.0.smithi107.stderr:+ echo 'not ok 4'
6945853/teuthology.log:2022-07-22T21:04:37.141 INFO:tasks.workunit.client.0.smithi107.stderr:+ echo 'not ok 6'
6945853/teuthology.log:2022-07-22T21:04:37.154 INFO:tasks.workunit.client.0.smithi107.stdout:not ok 4
6945853/teuthology.log:2022-07-22T21:04:37.155 INFO:tasks.workunit.client.0.smithi107.stdout:not ok 6
6945853/teuthology.log:2022-07-22T21:04:37.453 INFO:tasks.workunit.client.0.smithi107.stderr:+ echo 'not ok 8'
6945853/teuthology.log:2022-07-22T21:04:37.456 INFO:tasks.workunit.client.0.smithi107.stderr:+ echo 'not ok 9'
6945853/teuthology.log:2022-07-22T21:04:37.481 INFO:tasks.workunit.client.0.smithi107.stdout:not ok 8
6945853/teuthology.log:2022-07-22T21:04:37.481 INFO:tasks.workunit.client.0.smithi107.stdout:not ok 9
6945837/teuthology.log:2022-07-22T20:37:53.954 INFO:tasks.workunit.client.0.smithi017.stderr:+ echo 'not ok 8'
6945837/teuthology.log:2022-07-22T20:37:53.960 INFO:tasks.workunit.client.0.smithi017.stderr:+ echo 'not ok 9'
6945837/teuthology.log:2022-07-22T20:37:53.987 INFO:tasks.workunit.client.0.smithi017.stdout:not ok 8
6945837/teuthology.log:2022-07-22T20:37:53.987 INFO:tasks.workunit.client.0.smithi017.stdout:not ok 9
6945822/teuthology.log:2022-07-22T20:25:33.552 INFO:tasks.workunit.client.0.smithi029.stderr:+ echo 'not ok 8'
6945822/teuthology.log:2022-07-22T20:25:33.555 INFO:tasks.workunit.client.0.smithi029.stderr:+ echo 'not ok 9'
6945822/teuthology.log:2022-07-22T20:25:33.587 INFO:tasks.workunit.client.0.smithi029.stdout:not ok 8
6945822/teuthology.log:2022-07-22T20:25:33.587 INFO:tasks.workunit.client.0.smithi029.stdout:not ok 9
6945804/teuthology.log:2022-07-22T20:13:29.326 INFO:tasks.workunit.client.0.smithi006.stderr:+ echo 'not ok 4'
6945804/teuthology.log:2022-07-22T20:13:29.331 INFO:tasks.workunit.client.0.smithi006.stderr:+ echo 'not ok 6'
6945804/teuthology.log:2022-07-22T20:13:29.386 INFO:tasks.workunit.client.0.smithi006.stdout:not ok 4
6945804/teuthology.log:2022-07-22T20:13:29.387 INFO:tasks.workunit.client.0.smithi006.stdout:not ok 6
6945804/teuthology.log:2022-07-22T20:13:29.684 INFO:tasks.workunit.client.0.smithi006.stderr:+ echo 'not ok 8'
6945804/teuthology.log:2022-07-22T20:13:29.687 INFO:tasks.workunit.client.0.smithi006.stderr:+ echo 'not ok 9'
6945804/teuthology.log:2022-07-22T20:13:29.714 INFO:tasks.workunit.client.0.smithi006.stdout:not ok 8
6945804/teuthology.log:2022-07-22T20:13:29.714 INFO:tasks.workunit.client.0.smithi006.stdout:not ok 9
6945862/teuthology.log:2022-07-22T21:14:09.659 INFO:tasks.workunit.client.0.smithi027.stderr:+ echo 'not ok 8'
6945862/teuthology.log:2022-07-22T21:14:09.663 INFO:tasks.workunit.client.0.smithi027.stderr:+ echo 'not ok 9'
6945862/teuthology.log:2022-07-22T21:14:09.710 INFO:tasks.workunit.client.0.smithi027.stdout:not ok 8
6945862/teuthology.log:2022-07-22T21:14:09.710 INFO:tasks.workunit.client.0.smithi027.stdout:not ok 9
6945812/teuthology.log:2022-07-22T20:19:02.363 INFO:tasks.workunit.client.0.smithi064.stderr:+ echo 'not ok 4'
6945812/teuthology.log:2022-07-22T20:19:02.376 INFO:tasks.workunit.client.0.smithi064.stderr:+ echo 'not ok 6'
6945812/teuthology.log:2022-07-22T20:19:02.389 INFO:tasks.workunit.client.0.smithi064.stdout:not ok 4
6945812/teuthology.log:2022-07-22T20:19:02.390 INFO:tasks.workunit.client.0.smithi064.stdout:not ok 6
6945812/teuthology.log:2022-07-22T20:19:02.689 INFO:tasks.workunit.client.0.smithi064.stderr:+ echo 'not ok 8'
6945812/teuthology.log:2022-07-22T20:19:02.692 INFO:tasks.workunit.client.0.smithi064.stderr:+ echo 'not ok 9'
6945812/teuthology.log:2022-07-22T20:19:06.580 INFO:tasks.workunit.client.0.smithi064.stdout:not ok 8
6945812/teuthology.log:2022-07-22T20:19:06.581 INFO:tasks.workunit.client.0.smithi064.stdout:not ok 9

/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945847/teuthology.log
/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945828/teuthology.log
/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945853/teuthology.log
/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945837/teuthology.log
/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945822/teuthology.log
/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945804/teuthology.log
/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945862/teuthology.log
/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945812/teuthology.log


Related issues: 6 (1 open, 5 closed)

Related to CephFS - Bug #55804: qa failure: pjd link tests failed (Duplicate; assignee: Rishabh Dave)
Related to CephFS - Bug #58340: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) (Resolved; assignee: Xiubo Li)
Related to CephFS - Bug #61818: mds: deadlock between unlink and linkmerge (Pending Backport; assignee: Xiubo Li)
Copied to CephFS - Backport #59384: reef: [RHEL stock] pjd test failures (a bug that needs to wait for the unlink to finish) (Resolved; assignee: Xiubo Li)
Copied to CephFS - Backport #59385: quincy: [RHEL stock] pjd test failures (a bug that needs to wait for the unlink to finish) (Resolved; assignee: Xiubo Li)
Copied to CephFS - Backport #59386: pacific: [RHEL stock] pjd test failures (a bug that needs to wait for the unlink to finish) (Resolved; assignee: Xiubo Li)
#1 - Updated by Patrick Donnelly almost 2 years ago

  • Related to Bug #55804: qa failure: pjd link tests failed added
#2 - Updated by Xiubo Li over 1 year ago

  • Assignee set to Xiubo Li
#3 - Updated by Xiubo Li over 1 year ago

  • Status changed from New to In Progress

Currently the kclient's testing branch has merged the fscrypt filename-related patches, which limit MAX_NAME to 240 instead of the default 256.

Will find a way to fix it.

#4 - Updated by Xiubo Li over 1 year ago

Xiubo Li wrote:

Currently the kclient's testing branch has merged the fscrypt filename-related patches, which limit MAX_NAME to 240 instead of the default 256.

Will find a way to fix it.

Sorry, it seems this was not introduced by that:

Checked the first teuthology test log:

2022-07-22T20:31:40.242 INFO:teuthology.orchestra.run.smithi135.stdout:Package kernel-4.18.0-372.9.1.el8.x86_64 is already installed.

It was using the distro kernel.

#5 - Updated by Xiubo Li over 1 year ago

Tested this with the latest testing kclient branch; I couldn't reproduce it.

Will switch to the distro kernel and have a try.

#6 - Updated by Xiubo Li over 1 year ago

Tried 4.18.0-348.20.1.el8_5.x86_64 and couldn't reproduce it.

Will try the exact kernel version from the QA test failures, kernel-4.18.0-372.9.1.el8.x86_64.

#7 - Updated by Xiubo Li over 1 year ago

Xiubo Li wrote:

Tried 4.18.0-348.20.1.el8_5.x86_64 and couldn't reproduce it.

Will try the exact kernel version from the QA test failures, kernel-4.18.0-372.9.1.el8.x86_64.

Still the same: couldn't reproduce it either.

#8 - Updated by Xiubo Li over 1 year ago

  • Project changed from Linux kernel client to CephFS
  • Component(FS) MDS added

Patrick Donnelly wrote:

[...]

/ceph/teuthology-archive/pdonnell-2022-07-22_19:42:58-fs-wip-pdonnell-testing-20220721.235756-distro-default-smithi/6945847/teuthology.log

It's an MDS-side bug, not a kclient one:

For example, for the following failure the errno is EXDEV:

2022-07-22T20:52:25.830 INFO:tasks.workunit.client.0.smithi135.stderr:++ expr 3 + 1
2022-07-22T20:52:25.831 INFO:tasks.workunit.client.0.smithi135.stderr:+ ntest=4
2022-07-22T20:52:25.831 INFO:tasks.workunit.client.0.smithi135.stderr:+ expect 0 link fstest_d22bbf65b5ea67dbd4f2534c56018c49 _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234
2022-07-22T20:52:25.831 INFO:tasks.workunit.client.0.smithi135.stderr:+ e=0
2022-07-22T20:52:25.831 INFO:tasks.workunit.client.0.smithi135.stderr:+ shift
2022-07-22T20:52:25.831 INFO:tasks.workunit.client.0.smithi135.stderr:++ /home/ubuntu/cephtest/mnt.0/client.0/tmp/tmp/../pjd-fstest-20090130-RC/tests/link/../../fstest link fstest_d22bbf65b5ea67dbd4f2534c56018c49 _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234
2022-07-22T20:52:25.832 INFO:tasks.workunit.client.0.smithi135.stderr:++ tail -1
2022-07-22T20:52:25.832 INFO:tasks.workunit.client.0.smithi135.stderr:+ r=EXDEV
2022-07-22T20:52:25.832 INFO:tasks.workunit.client.0.smithi135.stderr:+ echo EXDEV
2022-07-22T20:52:25.832 INFO:tasks.workunit.client.0.smithi135.stderr:+ egrep '^0$'
2022-07-22T20:52:25.833 INFO:tasks.workunit.client.0.smithi135.stderr:+ '[' 1 -eq 0 ']'
2022-07-22T20:52:25.833 INFO:tasks.workunit.client.0.smithi135.stderr:+ echo 'not ok 4'
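
For reference, pjd's expect helper (defined in the suite's tests/misc.sh) roughly works as follows; this is a simplified sketch reconstructed from the trace above, not the verbatim script:

    expect()
    {
        e="${1}"                                  # expected result, e.g. 0 or EXDEV
        shift
        r=`${fstest} "$@" 2>/dev/null | tail -1`  # run the fstest binary, keep the last line
        if echo "${r}" | egrep '^'${e}'$' >/dev/null 2>&1; then
            echo "ok ${ntest}"
        else
            echo "not ok ${ntest}"                # what the failed runs above print
        fi
        ntest=`expr ${ntest} + 1`
    }

So "not ok 4" here means fstest printed EXDEV where the test expected 0.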

And in remote/smithi141/log/a190153a-09fe-11ed-842f-001a4aab830c/ceph-mds.b.log.gz, before pjd test #4 runs, test #3 unlinks the file _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234:

teuthology logs:

2022-07-22T20:52:25.828 INFO:tasks.workunit.client.0.smithi135.stderr:+ ntest=3
2022-07-22T20:52:25.828 INFO:tasks.workunit.client.0.smithi135.stderr:+ expect 0 unlink _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234

mds.b logs:

2022-07-22T20:52:25.348+0000 7fe2bf6d4700  1 -- [v2:172.21.15.141:6838/1279389197,v1:172.21.15.141:6839/1279389197] <== client.24956 172.21.15.135:0/377103332 1199 ==== client_request(client.24956:592 unlink #0x10000000220/_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234 2022-07-22T20:52:25.348771+0000 caller_uid=0, caller_gid=0{0,}) v4 ==== 443+0+0 (crc 0 0 0) 0x55b102d34b00 con 0x55b10295e400
...
2022-07-22T20:52:25.348+0000 7fe2bf6d4700 10 mds.0.locker process_request_cap_release client.24956 pAsXsFscr on [inode 0x100000000d3 [2,head] /volumes/_nogroup/sv_1/6ae107c1-dbbf-417a-8a51-99df917679be/client.0/tmp/tmp/_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234 auth v455 pv458 ap=3 s=0 n(v0 1=1+0)->n(v0 rc2022-07-22T20:52:25.343772+0000 1=1+0) (ilink xlockdone x=1) (iversion lock w=1 last_client=24956) caps={24956=pAsLsXsFscr/pFscr@3},l=24956 | request=2 lock=3 caps=1 authpin=1 0x55b103050000]

After that, the inode 0x100000000d3 goes into the stray directory:

2022-07-22T20:52:25.349+0000 7fe2bf6d4700 10 mds.0.locker  got xlock on (dn xlock x=1 by 0x55b1028aec00) [dentry #0x100/stray1/100000000d3 [2,head] auth NULL (dn xlock x=1 by 0x55b1028aec00) (dversion lock) pv=0 v=492 ap=2 ino=(nil) state=1342177281|new | request=1 lock=1 authpin=1 0x55b103059b80]

So when pjd test #4 begins later:

2022-07-22T20:52:25.353+0000 7fe2bf6d4700  1 -- [v2:172.21.15.141:6838/1279389197,v1:172.21.15.141:6839/1279389197] <== client.24956 172.21.15.135:0/377103332 1200 ==== client_request(client.24956:593 link #0x10000000220/_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234 #0x10000000220/fstest_d22bbf65b5ea67dbd4f2534c56018c49 2022-07-22T20:52:25.353771+0000 caller_uid=0, caller_gid=0{0,}) v4 ==== 438+0+0 (crc 0 0 0) 0x55b102b4d8c0 con 0x55b10295e400

And when traversing path2, the MDS finds the inode in the stray directory:

2022-07-22T20:52:25.354+0000 7fe2bf6d4700  7 mds.0.cache traverse: opening base ino 0x10000000220 snap head
2022-07-22T20:52:25.354+0000 7fe2bf6d4700 10 mds.0.locker try_rdlock_snap_layout request(client.24956:593 nref=3 cr=0x55b102b4d8c0) [inode 0x10000000220 [...2,head] /volumes/_nogroup/sv_1/6ae107c1-dbbf-417a-8a51-99df917679be/client.0/tmp/tmp/ auth{1=1,2=1,3=1,4=1} v199 pv209 ap=6 DIRTYPARENT f(v0 m2022-07-22T20:52:25.274773+0000 1=0+1)->f(v0 m2022-07-22T20:52:25.348771+0000 1=1+0) n(v3 rc2022-07-22T20:52:25.274773+0000 2=0+2)->n(v3 rc2022-07-22T20:52:25.348771+0000 1=0+1) (iauth sync r=1) (isnap sync r=5) (inest mix w=4 dirty) (ipolicy sync r=1) (ifile excl w=4) (iversion lock w=4 last_client=24956) caps={24956=pAsLsXsFsxcr/pAsLsXsFsxcral@52},l=24956 | dirtyscattered=1 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 authpin=1 scrubqueue=0 0x55b1030a2c00]
2022-07-22T20:52:25.354+0000 7fe2bf6d4700 12 mds.0.cache traverse: path seg depth 0 'fstest_d22bbf65b5ea67dbd4f2534c56018c49' snapid head
2022-07-22T20:52:25.354+0000 7fe2bf6d4700 20 mds.0.cache.dir(0x10000000220) lookup (fstest_d22bbf65b5ea67dbd4f2534c56018c49, 'head')
2022-07-22T20:52:25.354+0000 7fe2bf6d4700 20 mds.0.cache.dir(0x10000000220)   hit -> (fstest_d22bbf65b5ea67dbd4f2534c56018c49,head)
2022-07-22T20:52:25.354+0000 7fe2bf6d4700  7 mds.0.cache linking in remote in [inode 0x100000000d3 [2,head] ~mds0/stray1/100000000d3 auth v455 pv493 ap=4 s=0 n(v0 1=1+0)->n(v0 rc2022-07-22T20:52:25.348771+0000 1=1+0) (ilink xlockdone x=2) (iversion lock w=2 last_client=24956) caps={24956=pAsXsFscr/pFscr@3},l=24956 | request=3 lock=5 caps=1 authpin=1 0x55b103050000]

The MDS then assumes the two paths are in two different subvolumes and fails the request with EXDEV:

2022-07-22T20:52:25.354+0000 7fe2bf6d4700  7 mds.0.server target is in different subvolume, failing...
2022-07-22T20:52:25.354+0000 7fe2bf6d4700  7 mds.0.server reply_client_request -18 ((18) Invalid cross-device link) client_request(client.24956:593 link #0x10000000220/_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234 #0x10000000220/fstest_d22bbf65b5ea67dbd4f2534c56018c49 2022-07-22T20:52:25.353771+0000 caller_uid=0, caller_gid=0{0,}) v4
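
Putting the sequence together, it mirrors pjd's link name-length test. Here is a minimal reproducer sketch; the mount point and the short file name are assumptions for illustration, not taken from the test:

    #!/bin/sh
    # Minimal sketch of the failing sequence, assuming a CephFS subvolume
    # is mounted at /mnt/cephfs (hypothetical path).
    cd /mnt/cephfs/client.0/tmp || exit 1

    # Build the 255-character name seen in the logs above.
    long=$(printf '_123456789%.0s' $(seq 1 25))_1234

    touch "${long}"          # create: the long name is the inode's primary dentry
    ln "${long}" shortname   # link: shortname becomes a remote dentry to the inode
    rm "${long}"             # unlink: the primary dentry migrates to ~mdsX/strayN
    ln shortname "${long}"   # if the MDS still sees the inode's primary path under
                             # the stray directory, the subvolume check above fails
                             # this link with EXDEV instead of succeeding

Per the updated subject, the fix is to have the link request wait for the in-flight unlink to finish, so that the target's primary path is no longer under the stray directory when the subvolume check runs.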
#9 - Updated by Xiubo Li over 1 year ago

  • Backport set to quincy,pacific
#10 - Updated by Xiubo Li over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 47399
#11 - Updated by Xiubo Li about 1 year ago

  • Subject changed from [RHEL stock] pjd test failures to [RHEL stock] pjd test failures (a bug that needs to wait for the unlink to finish)
#12 - Updated by Venky Shankar about 1 year ago

  • Category set to Correctness/Safety
  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v19.0.0
  • Backport changed from quincy,pacific to reef,quincy,pacific
#13 - Updated by Backport Bot about 1 year ago

  • Copied to Backport #59384: reef: [RHEL stock] pjd test failures (a bug that needs to wait for the unlink to finish) added
#14 - Updated by Backport Bot about 1 year ago

  • Copied to Backport #59385: quincy: [RHEL stock] pjd test failures (a bug that needs to wait for the unlink to finish) added
#15 - Updated by Backport Bot about 1 year ago

  • Copied to Backport #59386: pacific: [RHEL stock] pjd test failures (a bug that needs to wait for the unlink to finish) added
#16 - Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed
#19 - Updated by Xiubo Li 10 months ago

  • Status changed from Pending Backport to Resolved
#20 - Updated by Patrick Donnelly 8 months ago

  • Related to Bug #58340: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) added
#21 - Updated by Patrick Donnelly 8 months ago

  • Related to Bug #61818: mds: deadlock between unlink and linkmerge added