Bug #64052
osd/scrub: extreme delay of registration messages might cause a crash
Description
This is the error scenario, to the best of my understanding:
PG 23.15 [2,1,0]
For OSD 2, the primary, the sequence was:
19:47:48.3 - reservation request sent (epoch 2633) to osd.0 (that message took more than 17 minutes to arrive)
19:52:48 - the reservation timed out. The primary terminated the scrub and sent a release message.
19:53:07 - 2nd reservation attempt. We are still in the same interval. Epoch = 2795
19:58:07 - the reservation timed out. The primary terminated the scrub and sent a release message.
19:58:22 - 3rd attempt. Epoch = 2947
20:03:22 - timeout + release
20:03:31 - 4th request. Epoch = 3109
20:05:33 - a Reject message from osd.0 (with epoch = 2633)
So the first request (epoch 2633) was only now acted upon by the replica (and was refused due to active recovery).
Things go downhill from here, as primary & replica are not synchronised regarding the
reservation status.
The primary just got a reject (for epoch 2633), and is treating it as an answer to its latest request (epoch 3109).
The replica will receive (in a quick sequence, 15 minutes after the first of them was sent) multiple
request/release pairs, ending with a request (3109) that would be granted.
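
To illustrate the mismatch described above, here is a minimal sketch of how a primary could discard a reply that answers an older request. It is not Ceph code and not necessarily what the fix does: the `Primary` and `Reply` classes, their methods, and the `request_epoch` field are all hypothetical names introduced for this example. The only facts taken from the report are the four request epochs and that the Reject carried epoch 2633.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reply:
    """Hypothetical replica answer; request_epoch echoes the request it answers."""
    granted: bool
    request_epoch: int


class Primary:
    """Hypothetical primary-side reservation tracker (not Ceph's actual class)."""

    def __init__(self) -> None:
        self.pending_epoch: Optional[int] = None  # epoch of the outstanding request
        self.outcome: Optional[str] = None

    def send_request(self, epoch: int) -> None:
        self.pending_epoch = epoch
        self.outcome = None

    def release(self) -> None:
        # Timeout path: the scrub is terminated and the reservation is released.
        self.pending_epoch = None

    def handle_reply(self, reply: Reply) -> bool:
        """Return True if the reply was accepted, False if discarded as stale."""
        if reply.request_epoch != self.pending_epoch:
            return False  # answers an older request; ignore it
        self.outcome = "granted" if reply.granted else "rejected"
        self.pending_epoch = None
        return True


primary = Primary()
for epoch in (2633, 2795, 2947):
    primary.send_request(epoch)
    primary.release()           # each of the first three attempts timed out
primary.send_request(3109)      # 4th attempt, still outstanding

stale = Reply(granted=False, request_epoch=2633)  # the Reject seen at 20:05:33
accepted_stale = primary.handle_reply(stale)
fresh = Reply(granted=True, request_epoch=3109)   # the eventual grant
accepted_fresh = primary.handle_reply(fresh)
print(accepted_stale, accepted_fresh, primary.outcome)
```

With this check, the delayed Reject for epoch 2633 is dropped instead of being applied to the request sent at epoch 3109, so both sides end up agreeing that the last request was granted.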
Updated by Ronen Friedman 4 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 55217
Updated by Radoslaw Zarzynski 3 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to reef
Updated by Backport Bot 3 months ago
- Copied to Backport #64233: reef: osd/scrub: extreme delay of registration messages might cause a crash added