Bug #63355
Bug #62449 (open): test/cls_2pc_queue: TestCls2PCQueue.MultiProducer and TestCls2PCQueue.AsyncConsumer failure
test/cls_2pc_queue: fails during migration tests
Description
as part of the fix here: https://github.com/ceph/ceph/pull/52439/
the structure describing the entry deletion operation was changed, and a field holding the number of deleted entries was added to it.
this field is set by the client, and used to update the queue stats.
however, when an old ("reef" or earlier) client sends this operation to a new OSD, the OSD fails to parse the op structure.
it expects: cls_2pc_queue_remove_op
but the old client sends: cls_queue_remove_op
as a solution, the best option would be to try to decode the op as "cls_2pc_queue_remove_op" and, if this fails, to try to decode it as "cls_queue_remove_op".
then, use the entry listing operation in order to count how many entries should be deleted, and update the stats accordingly.
this would have a performance cost, but only during the period when the cluster runs mixed versions of OSDs and RGWs.
another option is to perform the entry deletion without updating the stats, and add a new "radosgw-admin" command that recalculates the stats by going through all of the entries in the queue.
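the first option above (try the new structure, fall back to the old one, then count entries by listing) could look roughly like the sketch below. this is not the Ceph code: the wire format, the struct layouts, the `Reader` helper, and `count_entries_before` are all hypothetical stand-ins, and the queue is modeled as a sorted list of marker strings with an exclusive end marker, which is an assumption.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the new op; the real Ceph structures differ.
struct NewRemoveOp {
  std::string end_marker;
  uint32_t entries_to_remove = 0;
};

// Toy encoders: host-endian u32, and length-prefixed strings.
void put_u32(std::vector<uint8_t>& out, uint32_t v) {
  uint8_t raw[4];
  std::memcpy(raw, &v, 4);
  out.insert(out.end(), raw, raw + 4);
}
void put_str(std::vector<uint8_t>& out, const std::string& s) {
  put_u32(out, static_cast<uint32_t>(s.size()));
  out.insert(out.end(), s.begin(), s.end());
}

// Toy decoder that throws when the buffer is too short.
class Reader {
  const std::vector<uint8_t>& buf;
  size_t pos = 0;
public:
  explicit Reader(const std::vector<uint8_t>& b) : buf(b) {}
  uint32_t u32() {
    if (buf.size() - pos < 4) throw std::runtime_error("short buffer");
    uint32_t v;
    std::memcpy(&v, buf.data() + pos, 4);
    pos += 4;
    return v;
  }
  std::string str() {
    uint32_t n = u32();
    if (buf.size() - pos < n) throw std::runtime_error("short buffer");
    std::string s(reinterpret_cast<const char*>(buf.data() + pos), n);
    pos += n;
    return s;
  }
};

// Stand-in for the entry listing operation: count markers before end_marker.
uint32_t count_entries_before(const std::vector<std::string>& queue,
                              const std::string& end_marker) {
  uint32_t n = 0;
  for (const auto& marker : queue) {
    if (marker < end_marker) ++n;
  }
  return n;
}

// Try the new layout first; on failure, decode the old layout and
// recover the entry count by listing the queue.
NewRemoveOp decode_remove(const std::vector<uint8_t>& payload,
                          const std::vector<std::string>& queue) {
  NewRemoveOp op;
  try {
    Reader r(payload);
    op.end_marker = r.str();
    op.entries_to_remove = r.u32();  // missing in payloads from old clients
  } catch (const std::runtime_error&) {
    Reader r(payload);
    op.end_marker = r.str();
    op.entries_to_remove = count_entries_before(queue, op.end_marker);
  }
  return op;
}
```

the listing cost is paid only on the fallback path, so payloads from new clients are unaffected.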
Updated by Casey Bodley 6 months ago
however, when an old ("reef" or earlier) client sends this operation to a new OSD, the OSD fails to parse the op structure.
it expects: cls_2pc_queue_remove_op
but the old client sends: cls_queue_remove_op
if we need a separate cls_2pc_queue_remove_op, it should be copied from cls_queue_remove_op so the decode is identical. then you can bump the 'version' to add new fields.
if we haven't backported this anywhere, we can still fix it on main.
Updated by Yuval Lifshitz 6 months ago
Casey Bodley wrote:
if we need a separate cls_2pc_queue_remove_op, it should be copied from cls_queue_remove_op so the decode is identical. then you can bump the 'version' to add new fields.
if we haven't backported this anywhere, we can still fix it on main.
cls_2pc_queue_remove_op is a new data structure; before that we used cls_queue_remove_op, which we cannot change because it does not have the notion of "entries".
even though cls_2pc_queue_remove_op is new, we would create a "version 2" of the structure and treat cls_queue_remove_op as "version 1" of cls_2pc_queue_remove_op.
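a minimal sketch of that "version 1" / "version 2" idea, under stated assumptions: the one-byte version header, the `RemoveOp` struct, and `encode_op` / `decode_op` are hypothetical and only illustrate the approach. the real structures would go through Ceph's own versioned encoding rather than this toy format.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical op: version 1 carries only the end marker (the old layout),
// version 2 appends the number of entries to remove.
struct RemoveOp {
  std::string end_marker;
  std::optional<uint32_t> entries_to_remove;  // empty when decoded from v1
};

void encode_op(const RemoveOp& op, uint8_t version, std::vector<uint8_t>& out) {
  out.push_back(version);
  uint32_t len = static_cast<uint32_t>(op.end_marker.size());
  uint8_t raw[4];
  std::memcpy(raw, &len, 4);
  out.insert(out.end(), raw, raw + 4);
  out.insert(out.end(), op.end_marker.begin(), op.end_marker.end());
  if (version >= 2) {
    uint32_t count = op.entries_to_remove.value_or(0);
    std::memcpy(raw, &count, 4);
    out.insert(out.end(), raw, raw + 4);
  }
}

RemoveOp decode_op(const std::vector<uint8_t>& in) {
  size_t pos = 0;
  auto need = [&](size_t n) {
    if (in.size() - pos < n) throw std::runtime_error("short buffer");
  };
  need(1);
  uint8_t version = in[pos++];
  need(4);
  uint32_t len;
  std::memcpy(&len, in.data() + pos, 4);
  pos += 4;
  need(len);
  RemoveOp op;
  op.end_marker.assign(reinterpret_cast<const char*>(in.data() + pos), len);
  pos += len;
  // Fields added in version 2 are read only when the sender encoded them.
  if (version >= 2) {
    need(4);
    uint32_t count;
    std::memcpy(&count, in.data() + pos, 4);
    op.entries_to_remove = count;
  }
  return op;
}
```

when `entries_to_remove` is empty, the receiver knows the op came from an old client and can fall back to counting the entries itself.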
Updated by Yuval Lifshitz 6 months ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 54459
Updated by Casey Bodley 5 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 5 months ago
- Tags changed from test-failure to test-failure backport_processed
Updated by Yuval Lifshitz 5 months ago
unless we backport the new persistent topic observability feature, we don't need to backport this fix.
this is just an upgrade issue from an older version to "squid".