Bug #62245

qa/workunits/libcephfs/test.sh failed: [ FAILED ] LibCephFS.DelegTimeout

Added by Xiubo Li 7 months ago. Updated 7 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/teuthology/vshankar-2023-07-25_11:29:34-fs-wip-vshankar-testing-20230725.043804-testing-default-smithi/7350738

2023-07-27T19:46:25.988 INFO:tasks.workunit.client.0.smithi005.stdout:[ RUN      ] LibCephFS.DelegTimeout
2023-07-27T19:46:26.062 INFO:tasks.workunit.client.0.smithi005.stdout:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-5128-ga5d48e8c/rpm/el8/BUILD/ceph-18.0.0-5128-ga5d48e8c/src/test/libcephfs/deleg.cc:331: Failure
2023-07-27T19:46:26.063 INFO:tasks.workunit.client.0.smithi005.stdout:Expected equality of these values:
2023-07-27T19:46:26.063 INFO:tasks.workunit.client.0.smithi005.stdout:  ceph_ll_getattr(cmount, root, &stx, 0, 0, perms)
2023-07-27T19:46:26.063 INFO:tasks.workunit.client.0.smithi005.stdout:    Which is: 0
2023-07-27T19:46:26.063 INFO:tasks.workunit.client.0.smithi005.stdout:  -107
2023-07-27T19:46:26.063 INFO:tasks.workunit.client.0.smithi005.stdout:[  FAILED  ] LibCephFS.DelegTimeout (75 ms)
...
2023-07-27T19:56:05.458 INFO:teuthology.run:Summary data:
description: fs/libcephfs/{begin/{0-install 1-ceph 2-logrotate} clusters/1-mds-1-client-coloc
  conf/{client mds mon osd} distro/{rhel_8} objectstore/bluestore-bitmap overrides/{ignorelist_health
  ignorelist_wrongly_marked_down no_client_pidfile} tasks/libcephfs/{frag test}}
duration: 1900.9459915161133
failure_reason: 'Command failed (workunit test libcephfs/test.sh) on smithi005 with
  status 1: ''mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp
  && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=a5d48e8c688c2543997a98b8acbae2ce9635578c
  TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin
  CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0
  CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage
  timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh'''
flavor: default
mon.a-kernel-sha1: bbae19e0824be3daf9282aaf5330af10cdf81109
mon.b-kernel-sha1: bbae19e0824be3daf9282aaf5330af10cdf81109
owner: scheduled_vshankar@teuthology
sentry_event: https://sentry.ceph.com/organizations/ceph/?query=9ece941c976c4976b4267936221dea06
status: fail
success: false

2023-07-27T19:56:05.459 DEBUG:teuthology.report:Pushing job info to https://paddles.front.sepia.ceph.com/
2023-07-27T19:56:05.547 INFO:teuthology.run:FAIL

History

#1 Updated by Xiubo Li 7 months ago

The DelegTimeout test case itself is buggy:

299 TEST(LibCephFS, DelegTimeout) {
300   struct ceph_mount_info *cmount;
301   ASSERT_EQ(ceph_create(&cmount, NULL), 0);
302   ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
303   ASSERT_EQ(0, ceph_conf_parse_env(cmount, NULL)); 
304   ASSERT_EQ(ceph_mount(cmount, "/"), 0);
305   // tweak timeout to run quickly, since we don't plan to return it anyway
306   ASSERT_EQ(ceph_set_deleg_timeout(cmount, 2), 0);
307    
308   Inode *root, *file;
309   ASSERT_EQ(ceph_ll_lookup_root(cmount, &root), 0);
310    
311   char filename[32];
312   sprintf(filename, "delegtimeo%x", getpid());
313 
314   Fh *fh;
315   struct ceph_statx stx;
316   UserPerm *perms = ceph_mount_perms(cmount);
317 
318   ASSERT_EQ(ceph_ll_create(cmount, root, filename, 0666,
319                     O_RDWR|O_CREAT|O_EXCL, &file, &fh, &stx, 0, 0, perms), 0); 
320 
321   /* Reopen read-only */
322   ASSERT_EQ(ceph_ll_close(cmount, fh), 0);
323   ASSERT_EQ(ceph_ll_open(cmount, file, O_RDONLY, &fh, perms), 0);
324 
325   std::atomic_bool recalled(false);
326   ASSERT_EQ(ceph_ll_delegation_wait(cmount, fh, CEPH_DELEGATION_RD, dummy_deleg_cb, &recalled), 0);
327   std::atomic_bool opened(false);
328   std::thread breaker1(open_breaker_func, nullptr, filename, O_RDWR, &opened);
329   breaker1.join();
330   ASSERT_EQ(recalled.load(), true);
331   ASSERT_EQ(ceph_ll_getattr(cmount, root, &stx, 0, 0, perms), -CEPHFS_ENOTCONN);
332   ceph_release(cmount);
333 }

Because the MDS Locker allows multiple clients to open the file in rw mode, the lock state will be:

[LOCK_MIX]       = { 0,         false, LOCK_MIX,  0,    0,   REQ, ANY, 0,   0,   0,   CEPH_CAP_GRD|CEPH_CAP_GWR|CEPH_CAP_GLAZYIO,0,0,CEPH_CAP_GRD },

So when the second mounter, running in the child open_breaker_func thread, opens the file in rw mode, it does not break the delegation held by the main thread, and the child thread can finish well within the 2-second delegation timeout set at line 306 (ASSERT_EQ(ceph_set_deleg_timeout(cmount, 2), 0);) in the main thread.

That means that when the main thread reaches line 331, the delegation hasn't timed out yet. Only when the delegation does time out will the client unmount the mounter directly, after which subsequent accesses to the mounter return -CEPHFS_ENOTCONN.

#2 Updated by Xiubo Li 7 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 52723
