Bug #45690

pg_interval_t::check_new_interval is overly generous about guessing when EC PGs could have gone active

Added by ming guo almost 4 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 2 - major

Description

One EC PG is stuck in peering+down forever. The problem occurs through the following steps.
Suppose the PG's acting set is [1,2,3,4,5,6] (k=4, m=2, min_size=4).
1. osd.6 goes down; peering completes successfully.
2. osd.5 goes down; peering completes successfully.
3. osd.4 goes down; the PG goes down.
4. osd.6 comes back up; the PG needs to wait for osd.4 and stays down. No problem so far.
5. osd.6 goes down again; the PG stays down.
6. osd.4 comes back up; now the problem occurs: the PG needs to wait for osd.6. But in the interval of step 4 the PG state was down, so it is unreasonable to wait for osd.6 (a sketch of the heuristic follows below).
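Below is a minimal, standalone C++ sketch of the heuristic at issue. It is not the real pg_interval_t::check_new_interval; the names IntervalInfo, ec_recoverable and maybe_went_rw_* are hypothetical. It only illustrates how a recoverability-only test can flag step 4's interval as possibly read-write even though the PG was down for that whole interval, whereas a check that also requires the interval to have actually gone active would not.

```cpp
// Standalone sketch -- NOT the real pg_interval_t::check_new_interval.
// All names here (IntervalInfo, ec_recoverable, maybe_went_rw_*) are
// hypothetical and only illustrate the heuristic under discussion.
#include <iostream>
#include <set>

struct IntervalInfo {
  std::set<int> up_shards;   // distinct EC shards present during the interval
  bool went_active = false;  // did the PG actually finish peering and go active?
};

// EC recoverability test: data can be read if at least k distinct shards exist.
bool ec_recoverable(const IntervalInfo& i, unsigned k) {
  return i.up_shards.size() >= k;
}

// The "generous" guess this ticket complains about: if the interval was merely
// recoverable, assume writes may have happened, so peering must later wait for
// every OSD from that interval.
bool maybe_went_rw_generous(const IntervalInfo& i, unsigned k) {
  return ec_recoverable(i, k);
}

// A stricter guess (an assumption for illustration, not the actual fix): only
// intervals in which the PG really became active can have accepted writes.
bool maybe_went_rw_strict(const IntervalInfo& i, unsigned k) {
  return ec_recoverable(i, k) && i.went_active;
}

int main() {
  const unsigned k = 4;  // k=4, m=2 as in the report
  // Step 4's interval: shards on osd.1, osd.2, osd.3 and osd.6 are up, but the
  // PG stayed down because it was still waiting for osd.4 to finish peering.
  IntervalInfo step4{{0, 1, 2, 5}, /*went_active=*/false};

  std::cout << "generous guess says maybe_went_rw: "
            << maybe_went_rw_generous(step4, k) << "\n";  // 1 -> wait for osd.6
  std::cout << "strict guess says maybe_went_rw:   "
            << maybe_went_rw_strict(step4, k) << "\n";    // 0 -> no need to wait
}
```

With k=4, the four shards on osd.1, osd.2, osd.3 and osd.6 satisfy the recoverability test, so the generous guess records osd.6 as possibly holding newer data; that is why peering in step 6 blocks on osd.6 even though the PG never served writes in that interval.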

#1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from "pg_interval_t::check_new_interval - for ec pool, should not rely on IsPGRecoverablePredicate to determine if the PG was active at the interval" to "pg_interval_t::check_new_interval is overly generous about guessing when EC PGs could have gone active"
  • Category deleted (OSD)

This description doesn't seem quite right to me -- OSDs 1-3 were part of the interval in step 4 so they know that nothing happened. There must be some extra pieces going wrong.
