Bug #55971

LibRadosMiscConnectFailure.ConnectFailure test failure

Added by Sridhar Seshasayee 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: quincy,pacific
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All rados_api_tests in the run failed for the same reason, which points to a regression.
A recent earlier run (http://pulpito.front.sepia.ceph.com/yuriw-2022-06-03_14:09:08-rados-wip-yuri7-testing-2022-06-02-1633-distro-default-smithi/) did not have this failure.

/a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867183
/a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867203
/a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867215
/a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867253
/a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867284
/a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867374
/a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867409
/a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867412

Failure reason:
Log snippet from /a/yuriw-2022-06-07_23:12:59-rados-wip-yuri7-testing-2022-06-07-1325-distro-default-smithi/6867183

2022-06-08T01:26:00.304 INFO:tasks.workunit.client.0.smithi191.stdout:                 api_misc: [ RUN      ] LibRadosMiscConnectFailure.ConnectFailure
2022-06-08T01:26:00.305 INFO:tasks.workunit.client.0.smithi191.stdout:                 api_misc: /build/ceph-17.0.0-12873-g425f1005/src/test/librados/misc.cc:61: Failure
2022-06-08T01:26:00.305 INFO:tasks.workunit.client.0.smithi191.stdout:                 api_misc: Expected equality of these values:
2022-06-08T01:26:00.305 INFO:tasks.workunit.client.0.smithi191.stdout:                 api_misc:   0
2022-06-08T01:26:00.306 INFO:tasks.workunit.client.0.smithi191.stdout:                 api_misc:   rados_conf_set(cluster, "client_mount_timeout", "0.000000001")
2022-06-08T01:26:00.306 INFO:tasks.workunit.client.0.smithi191.stdout:                 api_misc:     Which is: -22
2022-06-08T01:26:00.307 INFO:tasks.workunit.client.0.smithi191.stdout:                 api_misc: [  FAILED  ] LibRadosMiscConnectFailure.ConnectFailure (34 ms)

Related issues

Copied to CephFS - Backport #56004: quincy: LibRadosMiscConnectFailure.ConnectFailure test failure Resolved
Copied to CephFS - Backport #56005: pacific: LibRadosMiscConnectFailure.ConnectFailure test failure Resolved

History

#1 Updated by Neha Ojha 6 months ago

  • Assignee set to Laura Flores

Laura, can you please triage this bug?

#2 Updated by Laura Flores 6 months ago

Ran some tests on recent main builds and pinpointed a good and bad commit. These tests go from newest main build to oldest:
  1. [BAD] 3bf1e368cf1b1780854250309eb051cea308c327: http://pulpito.front.sepia.ceph.com/lflores-2022-06-08_19:32:10-rados:monthrash-main-distro-default-smithi/
  2. [GOOD] 2469ae8b315b4d112fe4cb3bb7b6589f69c7ee5c: http://pulpito.front.sepia.ceph.com/lflores-2022-06-08_19:56:02-rados:monthrash-main-distro-default-smithi/
  3. [UNRELATED FAIL] a74fa9a66fe2f0aceffb9b297fc1560a1da19ca5: http://pulpito.front.sepia.ceph.com/lflores-2022-06-08_19:58:07-rados:monthrash-main-distro-default-smithi/
  4. [GOOD] 52f341b00a5dc4ff54351a75605d2f7dfe9b6033: http://pulpito.front.sepia.ceph.com/lflores-2022-06-08_19:59:38-rados:monthrash-main-distro-default-smithi/

Currently sifting through the commits between 1 and 2, as that seems to be where the potential regression was introduced.
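
The narrowing step above can be sketched in shell: given verdicts ordered newest to oldest, the suspect range runs from the first GOOD build found after a BAD one up to that BAD build. `find_regression_range` and the `<verdict> <sha>` input format are hypothetical, introduced only for illustration. Runs marked UNRELATED are skipped because they say nothing about this failure.

```sh
#!/bin/sh
# Hypothetical helper (not part of Ceph or teuthology).
# Reads "<verdict> <sha>" lines on stdin, ordered newest to oldest,
# and prints "<good>..<bad>" for the first BAD -> GOOD boundary.
find_regression_range() {
  awk '
    $1 == "BAD" { bad = $2; next }   # remember the oldest failing build seen so far
    $1 == "GOOD" && bad != "" {      # first passing build older than that failure
      print $2 ".." bad
      exit
    }
  '
}

# Usage with the four builds listed above:
find_regression_range <<'EOF'
BAD 3bf1e368cf1b1780854250309eb051cea308c327
GOOD 2469ae8b315b4d112fe4cb3bb7b6589f69c7ee5c
UNRELATED a74fa9a66fe2f0aceffb9b297fc1560a1da19ca5
GOOD 52f341b00a5dc4ff54351a75605d2f7dfe9b6033
EOF
```

The printed range has the `good..bad` form that `git log` accepts, so it can be used directly to list the candidate commits.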

#3 Updated by Laura Flores 6 months ago

The Tracker reports a problem with `client_mount_timeout`. Here are all of the changes that were made to the Client code between the good and bad commit:

[lflores@fedora ceph]$ git log --pretty=oneline --no-merges 2469ae8b315b4d112fe4cb3bb7b6589f69c7ee5c..3bf1e368cf1b1780854250309eb051cea308c327 src/client
c6cb986a2eea10b8ef15d1c6539873a88c5a69aa mds, client: remove useless feature required code
983b10506dc8466a0e47ff0d320d480dd09999ec client: Inode::hold_caps_until is time from monotonic clock now.
2e1f43c99b1818c2ffde64f5b01083c1907a9f87 (ci/wip-khiremat-46078-fuse-directory-dacs-override-1) client/fuse: Fix directory DACs overriding for root
a451a3670b7bb783ca6dcb8b2a31a8e6ec396899 client: allow overwrites to files with size greater than the max_file_size cfg
aabd5e9c578c2c6da9542bcb935bc36678503359 client: fix possible inifinite loop when getting an ESTALE from MDS

The only commit that changes `client_mount_timeout` is https://github.com/ceph/ceph/pull/44247/commits/983b10506dc8466a0e47ff0d320d480dd09999ec.

#4 Updated by Laura Flores 6 months ago

  • Regression changed from No to Yes

#5 Updated by Laura Flores 6 months ago

The type of `client_mount_timeout` was changed from float to seconds in that commit, so the LibRadosMiscConnectFailure.ConnectFailure test, which sets it to "0.000000001", should be modified to reflect this change.

#6 Updated by Laura Flores 6 months ago

  • Project changed from RADOS to CephFS

#7 Updated by Laura Flores 6 months ago

  • Assignee changed from Laura Flores to Venky Shankar

@Venky please reassign as needed

#8 Updated by Laura Flores 6 months ago

What I've found is that the seconds type accepts only integers. So with the change from `float` -> `secs`, it is no longer possible to set `client_mount_timeout` to a non-integer value:

[lflores@folio01 build]$ ./bin/ceph config set client client_mount_timeout 0.1
Error EINVAL: error parsing value: unexpected trailing '.1'
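
The rejection above is consistent with a parser that reads a leading run of digits and refuses anything left over. Below is a minimal shell model of that behavior, not Ceph's actual implementation. `parse_secs` is a hypothetical name, the "no leading digits" message is invented, and any unit-suffix handling the real parser may have is deliberately left out. The exit status 22 mirrors the -22 (EINVAL) seen in the test log.

```sh
#!/bin/sh
# Hypothetical model of an integer-only seconds parser (NOT Ceph code).
parse_secs() {
  _ps_value=$1
  # Keep the leading run of digits: strip from the first non-digit onward.
  _ps_digits=${_ps_value%%[!0-9]*}
  # Whatever follows the digits is the unparsed remainder.
  _ps_rest=${_ps_value#"$_ps_digits"}
  if [ -z "$_ps_digits" ]; then
    # Invented message for the no-digits case.
    echo "error parsing value: no leading digits in '$_ps_value'" >&2
    return 22
  fi
  if [ -n "$_ps_rest" ]; then
    # Same wording as the error shown above.
    echo "error parsing value: unexpected trailing '$_ps_rest'" >&2
    return 22
  fi
  echo "$_ps_digits"
}

# Usage: an integer passes, the old test value does not.
parse_secs 1
parse_secs 0.000000001 || echo "rejected with status $?"
```

Under this model, the value the test used to pass ("0.000000001") fails the same way as 0.1, which matches the -22 returned by `rados_conf_set` in the failing run.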

#10 Updated by Laura Flores 6 months ago

This can be reproduced locally on the most up-to-date version of main with:

cd ceph/build
ninja ceph_test_rados_api_misc
./bin/ceph_test_rados_api_misc --gtest_filter=*LibRadosMiscConnectFailure*

#11 Updated by Laura Flores 6 months ago

  • Pull request ID set to 46604

I have opened a possible fix, but I don't think we can keep the same near-zero `client_mount_timeout` value, since the new seconds type does not allow float values. Setting `client_mount_timeout` to 1 second passed all of my local tests, but this fix should be reviewed by the Core and CephFS teams to see whether it makes sense.

#12 Updated by Patrick Donnelly 6 months ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Venky Shankar to Laura Flores
  • Target version set to v18.0.0
  • Source set to Q/A
  • Backport set to quincy,pacific

#14 Updated by Laura Flores 6 months ago

  • Status changed from Fix Under Review to Pending Backport

#15 Updated by Backport Bot 6 months ago

  • Copied to Backport #56004: quincy: LibRadosMiscConnectFailure.ConnectFailure test failure added

#16 Updated by Backport Bot 6 months ago

  • Copied to Backport #56005: pacific: LibRadosMiscConnectFailure.ConnectFailure test failure added

#17 Updated by Laura Flores 5 months ago

  • Status changed from Pending Backport to Resolved
