Bug #63355
Bug #62449 (open): test/cls_2pc_queue: TestCls2PCQueue.MultiProducer and TestCls2PCQueue.AsyncConsumer failure
test/cls_2pc_queue: fails during migration tests
Description
As part of the fix in https://github.com/ceph/ceph/pull/52439/, the structure describing the entry deletion operation was changed, and a field holding the number of deleted entries was added to it.
This field is set by the client and used to update the queue stats.
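A minimal sketch of what the client-provided count buys on the OSD side. The struct names, fields and the `QueueStats` type below are hypothetical simplifications, not the actual Ceph definitions or encoding; they only illustrate that, with the new field, the OSD can update the stats without re-reading the queue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for cls_queue_remove_op (old layout).
struct OldRemoveOp {
  std::string end_marker;  // remove entries up to this marker
};

// Hypothetical stand-in for cls_2pc_queue_remove_op (new layout).
struct NewRemoveOp {
  std::string end_marker;
  uint64_t entries_to_remove = 0;  // new field, set by the client
};

// Hypothetical per-queue stats maintained by the OSD.
struct QueueStats {
  uint64_t entries = 0;
};

// With the new op, the OSD trusts the client's count instead of listing
// the queue itself. The count is clamped so that an over-reporting
// client cannot make the unsigned counter wrap around.
void apply_remove(QueueStats& stats, const NewRemoveOp& op) {
  stats.entries -= std::min(stats.entries, op.entries_to_remove);
}
```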
However, when an old ("reef" or earlier) client sends this operation to a new OSD, the OSD fails to parse the op structure.
The new OSD expects: cls_2pc_queue_remove_op
but the old client sends: cls_queue_remove_op
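The failure mode can be reproduced with a toy wire format. This is a minimal sketch assuming a hypothetical encoding (a 4-byte length-prefixed string followed by a raw little-endian 8-byte count) rather than Ceph's real versioned encoding: the old client encodes only the end marker, so the new decoder runs out of bytes when it reaches the count field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Hypothetical toy encoding: integers are raw little-endian bytes and a
// string is a 4-byte length followed by its characters.
void put_le(Buffer& b, uint64_t v, int bytes) {
  for (int i = 0; i < bytes; ++i) b.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

void put_string(Buffer& b, const std::string& s) {
  put_le(b, s.size(), 4);
  b.insert(b.end(), s.begin(), s.end());
}

// Bounds-checked reader: throws instead of reading past the end.
struct Reader {
  const Buffer& b;
  std::size_t pos = 0;

  uint64_t get_le(int bytes) {
    if (b.size() - pos < static_cast<std::size_t>(bytes))
      throw std::runtime_error("end of buffer");
    uint64_t v = 0;
    for (int i = 0; i < bytes; ++i)
      v |= static_cast<uint64_t>(b[pos + i]) << (8 * i);
    pos += bytes;
    return v;
  }

  std::string get_string() {
    const std::size_t n = get_le(4);
    if (b.size() - pos < n) throw std::runtime_error("end of buffer");
    std::string s(reinterpret_cast<const char*>(b.data()) + pos, n);
    pos += n;
    return s;
  }
};

// Old ("reef" or earlier) client: sends only the end marker.
Buffer encode_old_remove_op(const std::string& end_marker) {
  Buffer b;
  put_string(b, end_marker);
  return b;
}

struct NewRemoveOp {
  std::string end_marker;
  uint64_t entries_to_remove = 0;
};

// New OSD: expects the end marker followed by the entry count.
NewRemoveOp decode_new_remove_op(const Buffer& in) {
  Reader r{in};
  NewRemoveOp op;
  op.end_marker = r.get_string();
  op.entries_to_remove = r.get_le(8);  // an old client's op ends before this
  return op;
}
```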
As a solution, the best option would be to try to decode the op as "cls_2pc_queue_remove_op" and, if this fails, to try to decode it as "cls_queue_remove_op".
In the latter case, use the entry listing operation to count how many entries should be deleted, and update the stats accordingly.
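The proposed fallback could look roughly like the sketch below. It uses the same hypothetical toy encoding (not Ceph's) and models the queue as an ordered list of entry markers; `count_entries_up_to()` is a hypothetical stand-in for counting entries with the listing operation, assuming removal goes up to and including `end_marker`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Hypothetical toy encoding: little-endian integers, length-prefixed strings.
void put_le(Buffer& b, uint64_t v, int bytes) {
  for (int i = 0; i < bytes; ++i) b.push_back(static_cast<uint8_t>(v >> (8 * i)));
}
void put_string(Buffer& b, const std::string& s) {
  put_le(b, s.size(), 4);
  b.insert(b.end(), s.begin(), s.end());
}

struct Reader {
  const Buffer& b;
  std::size_t pos = 0;
  uint64_t get_le(int bytes) {
    if (b.size() - pos < static_cast<std::size_t>(bytes))
      throw std::runtime_error("end of buffer");
    uint64_t v = 0;
    for (int i = 0; i < bytes; ++i)
      v |= static_cast<uint64_t>(b[pos + i]) << (8 * i);
    pos += bytes;
    return v;
  }
  std::string get_string() {
    const std::size_t n = get_le(4);
    if (b.size() - pos < n) throw std::runtime_error("end of buffer");
    std::string s(reinterpret_cast<const char*>(b.data()) + pos, n);
    pos += n;
    return s;
  }
  void expect_end() const {
    if (pos != b.size()) throw std::runtime_error("trailing bytes");
  }
};

struct OldRemoveOp { std::string end_marker; };  // ~ cls_queue_remove_op
struct NewRemoveOp {                              // ~ cls_2pc_queue_remove_op
  std::string end_marker;
  uint64_t entries_to_remove = 0;
};

NewRemoveOp decode_new(const Buffer& in) {
  Reader r{in};
  NewRemoveOp op;
  op.end_marker = r.get_string();
  op.entries_to_remove = r.get_le(8);
  r.expect_end();
  return op;
}

OldRemoveOp decode_old(const Buffer& in) {
  Reader r{in};
  OldRemoveOp op{r.get_string()};
  r.expect_end();
  return op;
}

// Hypothetical stand-in for the entry listing op: count entries from the
// head of the queue up to and including end_marker (0 if it is absent).
uint64_t count_entries_up_to(const std::vector<std::string>& queue,
                             const std::string& end_marker) {
  auto it = std::find(queue.begin(), queue.end(), end_marker);
  return it == queue.end() ? 0 : static_cast<uint64_t>(it - queue.begin()) + 1;
}

// Try the new layout first; if that fails, fall back to the old layout
// and pay for an extra listing pass to recover the count.
uint64_t resolve_entries_to_remove(const Buffer& in,
                                   const std::vector<std::string>& queue) {
  try {
    return decode_new(in).entries_to_remove;
  } catch (const std::runtime_error&) {
    const OldRemoveOp op = decode_old(in);  // throws again if malformed
    return count_entries_up_to(queue, op.end_marker);
  }
}
```

Only ops from old clients take the slower path with the extra count, so once all clients are upgraded the listing pass is never hit.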
This would have a performance cost, but only while the cluster runs a mix of old and new OSD and RGW versions.
Another option is to perform the entry deletion without updating the stats, and to add a new "radosgw-admin" command that recalculates the stats by going through all of the entries in the queue.
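The recalculation behind such a command could be a paged walk over the queue. This is a minimal sketch: `Entry`, `QueueStats` and `list_page()` are hypothetical stand-ins (a real command would page through the queue with the listing op using markers, not indices), and the stats are assumed here to be an entry count and a byte total.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical queue entry: a marker plus its payload.
struct Entry {
  std::string marker;
  std::string data;
};

// Hypothetical stats assumed to be tracked per queue.
struct QueueStats {
  uint64_t entries = 0;
  uint64_t bytes = 0;
};

// Hypothetical paged listing: up to max_entries entries starting at index start.
std::vector<Entry> list_page(const std::vector<Entry>& queue,
                             std::size_t start, std::size_t max_entries) {
  std::vector<Entry> out;
  for (std::size_t i = start; i < queue.size() && out.size() < max_entries; ++i)
    out.push_back(queue[i]);
  return out;
}

// Rebuild the stats from scratch by visiting every entry once.
// page_size must be greater than 0, otherwise no entries are visited.
QueueStats recalc_stats(const std::vector<Entry>& queue, std::size_t page_size) {
  QueueStats stats;
  for (std::size_t pos = 0;;) {
    const std::vector<Entry> page = list_page(queue, pos, page_size);
    if (page.empty()) break;
    for (const Entry& e : page) {
      ++stats.entries;
      stats.bytes += e.data.size();
    }
    pos += page.size();
  }
  return stats;
}
```

Since this visits every entry, its cost grows with the queue length, which fits running it as an explicit admin command rather than on every deletion.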