Bug #59599

osd: cls_refcount unit test failures during upgrade sequence

Added by Sridhar Seshasayee 10 months ago. Updated 9 months ago.

3 - minor

Historically also seen in:

2023-05-01T19:56:01.877 INFO:tasks.workunit.client.0.smithi125.stdout:/build/ceph-16.2.12-83-g2b02306b/src/test/cls_refcount/ Failure
2023-05-01T19:56:01.877 INFO:tasks.workunit.client.0.smithi125.stdout:Expected equality of these values:
2023-05-01T19:56:01.877 INFO:tasks.workunit.client.0.smithi125.stdout:  -2
2023-05-01T19:56:01.877 INFO:tasks.workunit.client.0.smithi125.stdout:  ioctx.operate(oid, op)
2023-05-01T19:56:01.878 INFO:tasks.workunit.client.0.smithi125.stdout:    Which is: 0
2023-05-01T19:56:01.879 INFO:tasks.workunit.client.0.smithi125.stdout:[  FAILED  ] cls_rgw.test_implicit_ec (4304 ms)
2023-05-01T19:56:05.632 INFO:tasks.workunit.client.0.smithi125.stdout:/build/ceph-16.2.12-83-g2b02306b/src/test/cls_refcount/ Failure
2023-05-01T19:56:05.633 INFO:tasks.workunit.client.0.smithi125.stdout:Expected equality of these values:
2023-05-01T19:56:05.633 INFO:tasks.workunit.client.0.smithi125.stdout:  -2
2023-05-01T19:56:05.633 INFO:tasks.workunit.client.0.smithi125.stdout:  ioctx.operate(oid, op)
2023-05-01T19:56:05.633 INFO:tasks.workunit.client.0.smithi125.stdout:    Which is: 0
2023-05-01T19:56:05.634 INFO:tasks.workunit.client.0.smithi125.stdout:[  FAILED  ] cls_rgw.test_implicit_idempotent_ec (3755 ms)
2023-05-01T19:56:32.081 INFO:tasks.workunit.client.0.smithi125.stdout:[  FAILED  ] 2 tests, listed below:
2023-05-01T19:56:32.081 INFO:tasks.workunit.client.0.smithi125.stdout:[  FAILED  ] cls_rgw.test_implicit_ec
2023-05-01T19:56:32.081 INFO:tasks.workunit.client.0.smithi125.stdout:[  FAILED  ] cls_rgw.test_implicit_idempotent_ec
2023-05-01T19:56:32.081 INFO:tasks.workunit.client.0.smithi125.stdout:
2023-05-01T19:56:32.081 INFO:tasks.workunit.client.0.smithi125.stdout: 2 FAILED TESTS
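
For triage, the failing test names can be pulled out of a log like the one above with a few lines of Python. This is only a sketch: `failed_tests` and `FAILED_RE` are hypothetical names, not part of teuthology, and the regular expression assumes the gtest `[  FAILED  ] suite.test` line format seen in this log.

```python
import re

# Matches gtest failure lines such as
# "[  FAILED  ] cls_rgw.test_implicit_ec (4304 ms)".
# Summary lines like "[  FAILED  ] 2 tests, listed below:" do not match,
# because they have no "suite.test" token right after the marker.
FAILED_RE = re.compile(r"\[\s*FAILED\s*\]\s+(\w+\.\w+)")


def failed_tests(log_lines):
    """Return unique failed test names in order of first appearance."""
    seen = []
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Sample lines copied from the log above (hostname prefix kept as-is).
PREFIX = "2023-05-01T19:56:32.081 INFO:tasks.workunit.client.0.smithi125.stdout:"
sample_log = [
    PREFIX + "[  FAILED  ] 2 tests, listed below:",
    PREFIX + "[  FAILED  ] cls_rgw.test_implicit_ec",
    PREFIX + "[  FAILED  ] cls_rgw.test_implicit_idempotent_ec",
    PREFIX + " 2 FAILED TESTS",
]

if __name__ == "__main__":
    for name in failed_tests(sample_log):
        print(name)
```

De-duplicating matters here because gtest prints each failed test twice: once when it fails and once in the final summary.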


#1 Updated by Sridhar Seshasayee 10 months ago


#2 Updated by Yuri Weinstein 10 months ago

See also here:

2023-05-05T15:30:03.114 INFO:tasks.workunit.client.0.smithi046.stdout:[----------] Global test environment tear-down
2023-05-05T15:30:03.114 INFO:tasks.workunit.client.0.smithi046.stdout:[==========] 10 tests from 1 test suite ran. (55944 ms total)
2023-05-05T15:30:03.115 INFO:tasks.workunit.client.0.smithi046.stdout:[  PASSED  ] 8 tests.
2023-05-05T15:30:03.115 INFO:tasks.workunit.client.0.smithi046.stdout:[  FAILED  ] 2 tests, listed below:
2023-05-05T15:30:03.115 INFO:tasks.workunit.client.0.smithi046.stdout:[  FAILED  ] cls_rgw.test_implicit_ec
2023-05-05T15:30:03.115 INFO:tasks.workunit.client.0.smithi046.stdout:[  FAILED  ] cls_rgw.test_implicit_idempotent_ec
2023-05-05T15:30:03.115 INFO:tasks.workunit.client.0.smithi046.stdout:
2023-05-05T15:30:03.115 INFO:tasks.workunit.client.0.smithi046.stdout: 2 FAILED TESTS

#3 Updated by Laura Flores 9 months ago


#4 Updated by Laura Flores 9 months ago

  • Project changed from Ceph to rgw

#5 Updated by Laura Flores 9 months ago

  • Tags set to test-failure
  • Backport set to quincy

#6 Updated by Casey Bodley 9 months ago

  • Project changed from rgw to RADOS

I haven't seen any significant changes to this refcount object class in a long time. The test_implicit_ec test case does the same thing as the passing test_implicit test case, except against an erasure-coded pool. cls_refcount_get() returns the expected -ENOENT against a replicated pool, but returns 0 against an erasure-coded pool.

This difference in behavior must be at the RADOS level, not in rgw or cls_refcount.
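
To connect this with the assertion in the log: the expected value -2 is -ENOENT, and the observed value 0 means the operation succeeded. A minimal sketch for decoding such return codes follows; `describe_rc` is a hypothetical helper, not a Ceph or librados function, and it relies only on Python's standard `errno` and `os` modules.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Describe a negative-errno style return code in readable form."""
    if rc >= 0:
        return f"{rc} (success)"
    # errno.errorcode maps a positive errno number to its symbolic name.
    name = errno.errorcode.get(-rc, "unknown errno")
    return f"{rc} (-{name}: {os.strerror(-rc)})"


if __name__ == "__main__":
    print("expected:", describe_rc(-2))  # value the test asserts
    print("observed:", describe_rc(0))   # value seen on the erasure-coded pool
```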

#7 Updated by Radoslaw Zarzynski 9 months ago

  • Assignee set to Nitzan Mordechai

Hello Nitzan! Could it be related to

#8 Updated by Laura Flores 9 months ago


#9 Updated by Nitzan Mordechai 9 months ago

That behavior only happens during an upgrade; I'm looking into it. The error only occurs when the code that I added in PrimaryLogPG is not invoked (on the other hand, if the code is not there, the test shouldn't be there either).

#10 Updated by Nitzan Mordechai 9 months ago

  • Status changed from New to In Progress

#11 Updated by Nitzan Mordechai 9 months ago

I think I've got it. It's pretty weird (to me), but here is what I found:
All the tests that failed in this bug report are upgrade tests. The failing test is cls_rgw.test_implicit_idempotent_ec, which is related to these PRs:
main - (Merged)
quincy - (Open, not merged yet)
pacific - (Merged)

The upgrade test first installs pacific and upgrades to quincy, then runs some tests. One of them is the cls workunit, which contains the failing test:

 - workunit:
        branch: pacific
          - cls

We are running the workunit from pacific, which has test_implicit_idempotent_ec, but the cluster is running quincy, which doesn't yet have the code in PrimaryLogPG that handles that failure. That is what causes the test to fail.
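
The mismatch can be sketched as a toy model. Everything below is illustrative: the names `RELEASES_WITH_FIX`, `BRANCHES_WITH_TEST`, and `predict` are hypothetical, the release sets simply mirror the PR status listed above, and the model assumes the test exists only in branches where the PR is merged.

```python
# Releases whose OSD code contains the PrimaryLogPG change
# (mirrors the PR status above: main and pacific merged, quincy still open).
RELEASES_WITH_FIX = {"main", "pacific"}

# Assumption: the test was added by the same PR, so it exists only
# in workunit branches where that PR is merged.
BRANCHES_WITH_TEST = {"main", "pacific"}


def predict(workunit_branch: str, cluster_release: str) -> str:
    """Predict the outcome of test_implicit_idempotent_ec in this toy model."""
    if workunit_branch not in BRANCHES_WITH_TEST:
        return "not run"  # the workunit branch does not contain the test
    if cluster_release in RELEASES_WITH_FIX:
        return "pass"     # the cluster has the code the test relies on
    return "fail"         # new test running against a cluster without the fix


if __name__ == "__main__":
    # The upgrade scenario from this report: pacific workunit, quincy cluster.
    print(predict("pacific", "quincy"))
```

In this model the failure disappears either when the quincy cluster gains the fix or when the workunit is taken from the quincy branch.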

I wonder why we are running pacific tests on quincy. Don't we want to run the quincy workunit, to make sure that the new features and tests that went in are exercised after the upgrade? (In this case it's backwards, but in the normal situation the upgraded version will probably have new tests or features.)

Let's wait for the quincy PR to be merged and then re-run that upgrade.

#12 Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from In Progress to Resolved
  • Pull request ID set to 47332

Backporting has been done manually, without tracker tickets.
