Bug #64052

open

osd/scrub: extreme delay of registration messages might cause a crash

Added by Ronen Friedman 4 months ago. Updated 3 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
backport_processed
Backport:
reef
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

(https://pulpito.ceph.com/rfriedma-2024-01-15_18:13:21-rados:thrash-wip-rf-rm-penaltyq-distro-default-smithi/7517129/).

This is the error scenario, to the best of my understanding:
PG 23.15 [2,1,0]

For OSD 2, the primary, the sequence was:
19:47:48.3 - reservation request sent (epoch 2633) to osd.0 (that message took more than 17 minutes to arrive)
19:52:48 - the reservation timed out. The primary terminated the scrub and sent a release message.

19:53:07 - 2nd reservation attempt. We are still in the same interval. Epoch: 2795
19:58:07 - the reservation timed out. The primary terminated the scrub and sent a release message.

19:58:22 - 3rd attempt. Epoch: 2947
20:03:22 - timeout + release

20:03:31 - 4th request. Epoch: 3109
20:05:33 - a Reject message from osd.0 (with epoch 2633).
So the first request (epoch 2633) was only now acted upon by the replica (and was refused due to active recovery).
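
The intervals implied by the timestamps above can be checked with a short script. This is only arithmetic on the values quoted in this report; the fractional second of the first timestamp is dropped, and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta


def t(clock: str) -> datetime:
    """Parse an HH:MM:SS wall-clock time from the timeline (same day assumed)."""
    return datetime.strptime(clock, "%H:%M:%S")


# (request sent, timeout observed, request epoch), copied from the timeline.
attempts = [
    ("19:47:48", "19:52:48", 2633),
    ("19:53:07", "19:58:07", 2795),
    ("19:58:22", "20:03:22", 2947),
]

# How long the primary waited before giving up on each attempt.
timeouts = [t(end) - t(start) for start, end, _ in attempts]

# Time between sending the first request and receiving the Reject that answers it.
reject_lag = t("20:05:33") - t("19:47:48")

print(timeouts)
print(reject_lag)
```

Each of the three attempts was abandoned after exactly 5 minutes, and the Reject carrying epoch 2633 reached the primary 17 minutes and 45 seconds after that request was sent, while the 4th request (epoch 3109) was outstanding.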

Things go downhill from here, as the primary and the replica are no longer synchronised regarding the
reservation status.
The primary has just received a reject, and treats it as the answer to its latest request (epoch 3109).
The replica will receive (in quick succession, 15 minutes after the first of them was sent) multiple
request/release pairs, ending with a request (epoch 3109) that would be granted.
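
The mismatch described above can be replayed with a small model. This is a hedged sketch and not Ceph code: `PrimarySketch`, `replay` and the returned strings are hypothetical names, and it assumes that a reply carries the epoch of the request it answers (as the Reject in this report did). It is not meant to represent the fix in the linked pull request; it only contrasts a primary that ignores the reply's epoch with one that compares it against the outstanding request.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrimarySketch:
    """Hypothetical model of the primary's reservation state (illustration only)."""
    check_epoch: bool
    outstanding_epoch: Optional[int] = None

    def send_request(self, epoch: int) -> None:
        # Remember which request we are currently waiting on.
        self.outstanding_epoch = epoch

    def timeout_and_release(self) -> None:
        # The reservation timed out: stop waiting (a release is sent to the replica).
        self.outstanding_epoch = None

    def on_reply(self, reply_epoch: int, granted: bool) -> str:
        if self.outstanding_epoch is None:
            return "ignored-no-request"
        if self.check_epoch and reply_epoch != self.outstanding_epoch:
            # The reply answers an older request; keep waiting for the current one.
            return "ignored-stale"
        self.outstanding_epoch = None
        return "granted" if granted else "rejected"


def replay(check_epoch: bool) -> List[str]:
    """Replay the sequence from this report against the model."""
    primary = PrimarySketch(check_epoch=check_epoch)
    for epoch in (2633, 2795, 2947):
        primary.send_request(epoch)
        primary.timeout_and_release()
    primary.send_request(3109)
    return [
        primary.on_reply(2633, granted=False),  # delayed Reject for the 1st request
        primary.on_reply(3109, granted=True),   # eventual grant of the 4th request
    ]


naive = replay(check_epoch=False)
checked = replay(check_epoch=True)
print(naive)
print(checked)
```

In the naive replay the stale Reject is taken as the answer to request 3109, so the later grant arrives when the primary is no longer waiting for anything, which mirrors the primary/replica disagreement in the description. With the epoch comparison, the stale Reject is dropped and the grant for 3109 is matched to the request that is actually outstanding.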


Related issues 1 (1 open, 0 closed)

Copied to Ceph - Backport #64233: reef: osd/scrub: extreme delay of registration messages might cause a crash - New - Ronen Friedman

#1

Updated by Ronen Friedman 4 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 55217

#2

Updated by Radoslaw Zarzynski 3 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef

#3

Updated by Backport Bot 3 months ago

  • Copied to Backport #64233: reef: osd/scrub: extreme delay of registration messages might cause a crash added

#4

Updated by Backport Bot 3 months ago

  • Tags set to backport_processed