Bug #63355


Bug #62449: test/cls_2pc_queue: TestCls2PCQueue.MultiProducer and TestCls2PCQueue.AsyncConsumer failure

test/cls_2pc_queue: fails during migration tests

Added by Yuval Lifshitz 6 months ago. Updated 5 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Target version:
-
% Done:
0%

Source:
Tags:
test-failure backport_processed
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

As part of the fix here: https://github.com/ceph/ceph/pull/52439/
the structure describing the entry deletion operation was changed, and a field holding the number of deleted entries was added to it.
This field is set by the client and is used to update the queue stats.
However, when an old ("reef" or earlier) client sends this operation to a new OSD, the OSD fails to parse the op structure:
it expects cls_2pc_queue_remove_op,
but the old client sends cls_queue_remove_op.

As a solution, the best option would be to first try to decode the op as cls_2pc_queue_remove_op, and if that fails, to decode it as cls_queue_remove_op (see the sketch below).
Then, use the entry listing operation to count how many entries should be deleted, and update the stats accordingly.
This would have a performance cost, but only while the cluster is running a mix of OSD and RGW versions.
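A minimal sketch of that fallback decode inside the cls handler, following the usual Ceph cls decode pattern; the field names (end_marker, entries_to_remove) and the surrounding handler are assumptions, not the actual code from PR 54459:

// try the new op format first, then fall back to the old one
cls_2pc_queue_remove_op rem_op;
auto in_iter = in->cbegin();
try {
  decode(rem_op, in_iter);
} catch (const ceph::buffer::error&) {
  // an old ("reef" or earlier) RGW sent a cls_queue_remove_op
  cls_queue_remove_op old_op;
  in_iter = in->cbegin();
  try {
    decode(old_op, in_iter);
  } catch (const ceph::buffer::error& err) {
    CLS_LOG(1, "ERROR: failed to decode remove op: %s", err.what());
    return -EINVAL;
  }
  rem_op.end_marker = old_op.end_marker;
  // the number of removed entries is unknown here; list the entries up to
  // end_marker to count them, then update the queue stats accordingly
}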

Another option is to perform the entry deletion without updating the stats, and to add a new "radosgw-admin" command that recalculates the stats by going over all of the entries in the queue.

Actions #1

Updated by Yuval Lifshitz 6 months ago

  • Assignee set to Ali Masarwa
Actions #2

Updated by Casey Bodley 6 months ago

However, when an old ("reef" or earlier) client sends this operation to a new OSD, the OSD fails to parse the op structure:
it expects cls_2pc_queue_remove_op,
but the old client sends cls_queue_remove_op.

if we need a separate cls_2pc_queue_remove_op, it should be copied from cls_queue_remove_op so the decode is identical. then you can bump the 'version' to add new fields

if we haven't backported this anywhere, we can still fix it on main

Actions #3

Updated by Yuval Lifshitz 6 months ago

Casey Bodley wrote:

if we need a separate cls_2pc_queue_remove_op, it should be copied from cls_queue_remove_op so the decode is identical. then you can bump the 'version' to add new fields

if we haven't backported this anywhere, we can still fix it on main

cls_2pc_queue_remove_op is a new data structure; before that we used cls_queue_remove_op, which we cannot change because it does not have the notion of "entries".
Even though cls_2pc_queue_remove_op is new, we would create a "version 2" of the structure and treat cls_queue_remove_op as "version 1" of cls_2pc_queue_remove_op.
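A rough sketch of what that could look like with Ceph's versioned encoding macros; the name of the new field is an assumption, and this is not the actual definition from PR 54459:

struct cls_2pc_queue_remove_op {
  std::string end_marker;         // same layout as cls_queue_remove_op ("version 1")
  uint64_t entries_to_remove{0};  // hypothetical name for the new field

  void encode(ceph::buffer::list& bl) const {
    ENCODE_START(2, 1, bl);       // version 2, compatible with version 1
    encode(end_marker, bl);
    encode(entries_to_remove, bl);
    ENCODE_FINISH(bl);
  }

  void decode(ceph::buffer::list::const_iterator& bl) {
    DECODE_START(2, bl);
    decode(end_marker, bl);
    if (struct_v >= 2) {
      // an old client encodes a cls_queue_remove_op, which decodes here as
      // version 1, so the entry count is simply left at its default
      decode(entries_to_remove, bl);
    }
    DECODE_FINISH(bl);
  }
};
WRITE_CLASS_ENCODER(cls_2pc_queue_remove_op)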

Actions #4

Updated by Casey Bodley 6 months ago

  • Status changed from New to Triaged
Actions #5

Updated by Yuval Lifshitz 6 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 54459
Actions #6

Updated by Casey Bodley 5 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Backport Bot 5 months ago

  • Tags changed from test-failure to test-failure backport_processed
Actions #8

Updated by Yuval Lifshitz 5 months ago

unless we backport the new persistent topic observability feature, we don't need to backport this fix.
this is just an upgrade issue from an older version to "squid".
