Bug #45690

pg_interval_t::check_new_interval is overly generous about guessing when EC PGs could have gone active

Added by ming guo almost 4 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 2 - major

Description

One EC PG is stuck in peering+down forever. The problem occurs through the following steps.
Suppose the PG's acting set is [1,2,3,4,5,6] (k=4, m=2, min_size=4).
1. osd.6 goes down; peering completes successfully.
2. osd.5 goes down; peering completes successfully.
3. osd.4 goes down; the PG goes down.
4. osd.6 comes back up; the PG needs to wait for osd.4 and stays down. No problem so far.
5. osd.6 goes down again; the PG stays down.
6. osd.4 comes back up; now the problem occurs: the PG needs to wait for osd.6. But in the interval of step 4 the PG state was down, so it is unreasonable to wait for osd.6 (a sketch of the heuristic follows below).
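Below is a minimal, standalone C++ sketch of the heuristic at issue. It is not the real pg_interval_t::check_new_interval; the names IntervalInfo, ec_recoverable and maybe_went_rw_* are hypothetical. It only illustrates how a recoverability-only test can flag step 4's interval as possibly read-write even though the PG was down for that whole interval, whereas a check that also requires the interval to have actually gone active would not.

```cpp
// Standalone sketch -- NOT the real pg_interval_t::check_new_interval.
// All names here (IntervalInfo, ec_recoverable, maybe_went_rw_*) are
// hypothetical and only illustrate the heuristic under discussion.
#include <iostream>
#include <set>

struct IntervalInfo {
  std::set<int> up_shards;   // distinct EC shards present during the interval
  bool went_active = false;  // did the PG actually finish peering and go active?
};

// EC recoverability test: data can be read if at least k distinct shards exist.
bool ec_recoverable(const IntervalInfo& i, unsigned k) {
  return i.up_shards.size() >= k;
}

// The "generous" guess this ticket complains about: if the interval was merely
// recoverable, assume writes may have happened, so peering must later wait for
// every OSD from that interval.
bool maybe_went_rw_generous(const IntervalInfo& i, unsigned k) {
  return ec_recoverable(i, k);
}

// A stricter guess (an assumption for illustration, not the actual fix): only
// intervals in which the PG really became active can have accepted writes.
bool maybe_went_rw_strict(const IntervalInfo& i, unsigned k) {
  return ec_recoverable(i, k) && i.went_active;
}

int main() {
  const unsigned k = 4;  // k=4, m=2 as in the report
  // Step 4's interval: shards on osd.1, osd.2, osd.3 and osd.6 are up, but the
  // PG stayed down because it was still waiting for osd.4 to finish peering.
  IntervalInfo step4{{0, 1, 2, 5}, /*went_active=*/false};

  std::cout << "generous guess says maybe_went_rw: "
            << maybe_went_rw_generous(step4, k) << "\n";  // 1 -> wait for osd.6
  std::cout << "strict guess says maybe_went_rw:   "
            << maybe_went_rw_strict(step4, k) << "\n";    // 0 -> no need to wait
}
```

With k=4, the four shards on osd.1, osd.2, osd.3 and osd.6 satisfy the recoverability test, so the generous guess records osd.6 as possibly holding newer data; that is why peering in step 6 blocks on osd.6 even though the PG never served writes in that interval.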

#1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from "pg_interval_t::check_new_interval - for ec pool, should not rely on IsPGRecoverablePredicate to determine if the PG was active at the interval" to "pg_interval_t::check_new_interval is overly generous about guessing when EC PGs could have gone active"
  • Category deleted (OSD)

This description doesn't seem quite right to me -- OSDs 1-3 were part of the interval in step 4 so they know that nothing happened. There must be some extra pieces going wrong.
