Bug #12162

pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval

Added by Guang Yang about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 06/25/2015
Due date:
% Done: 0%
Source: other
Tags:
Backport: hammer, firefly
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:

Description

One PG on our cluster is stuck at peering+down indefinitely. The log shows that peering is blocked by an out/down OSD (osd.243):

2015-06-23 21:27:59.948809 7f26d2bd9700 10 osd.52 pg_epoch: 27468 pg[3.1e3fs0( v 25576'94308 (15251'84307,25576'94308] local-les=25568 n=73284 ec=1152 les/c 25568/18381 27457/27457/27457) [52,23,10,456,433,200,388,330,493,104,426] r=0 lpr=27457 pi=18380-27456/559 crt=25298'94302 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior final: probe 10(2),22(1),23(1),24(2),46(0),52(0),104(9),200(5),249(3),254(10),265(6),330(7),388(6),426(10),433(4),450(7),456(3),493(8) down 243 blocked_by {243=0} pg_down

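The root cause is that, when building the PriorSet, the code relies solely on the pool's min_size to decide whether the PG could have gone r/w during a past interval. In the interval below, only 4 of the 11 shards in the acting set are present (2147483647 marks a missing shard), yet the interval is still flagged maybe_went_rw: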

2015-06-23 21:28:00.357787 7f26d13d6700 10 osd.52 pg_epoch: 27471 pg[3.1e3fs0( v 25576'94308 (15251'84307,25576'94308] local-les=25568 n=73284 ec=1152 les/c 25568/18381 27471/27471/27471) [52,23,10,456,433,200,388,330,493,104,426] r=0 lpr=27471 pi=18380-27470/561 crt=25298'94302 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior interval(25614-25615 up [2147483647,2147483647,2147483647,2147483647,2147483647,200,2147483647,2147483647,243,104,426](200) acting [2147483647,2147483647,2147483647,2147483647,2147483647,200,2147483647,2147483647,243,104,426](200) maybe_went_rw)
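To make the failure mode concrete, here is a minimal Python sketch of the two checks. It is not the Ceph code: the function names are made up for illustration, and the values of k and min_size are assumptions, since the issue does not state the pool's erasure-code profile or its min_size. Only the acting set is taken from the log line above.

```python
# Illustrative sketch only; this is not the Ceph source. The function names,
# ASSUMED_K and ASSUMED_MIN_SIZE are hypothetical.

CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647: placeholder for a missing shard


def live_shards(acting):
    """Return the OSD ids in the acting set that are actually present."""
    return [osd for osd in acting if osd != CRUSH_ITEM_NONE]


def maybe_went_rw_min_size_only(acting, min_size):
    """Naive check: trust the pool's min_size alone."""
    return len(live_shards(acting)) >= min_size


def maybe_went_rw_ec_aware(acting, min_size, k):
    """EC-aware check: additionally require at least k shards, because
    fewer than k shards cannot reconstruct the data."""
    n = len(live_shards(acting))
    return n >= min_size and n >= k


NONE = CRUSH_ITEM_NONE
# Acting set of interval 25614-25615, copied from the log line above.
acting_25614 = [NONE, NONE, NONE, NONE, NONE, 200, NONE, NONE, 243, 104, 426]

ASSUMED_K = 8         # hypothetical: number of data shards (11 = k + m)
ASSUMED_MIN_SIZE = 4  # hypothetical: an improperly low min_size

print(len(live_shards(acting_25614)))                                       # 4
print(maybe_went_rw_min_size_only(acting_25614, ASSUMED_MIN_SIZE))          # True
print(maybe_went_rw_ec_aware(acting_25614, ASSUMED_MIN_SIZE, ASSUMED_K))    # False
```

Under these assumptions, the min_size-only check reports that the interval may have gone r/w, which is what pulls osd.243 into the PriorSet as a blocker. The EC-aware check rejects the interval because 4 shards are fewer than the assumed k.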

Ceph version: v0.80.4

Credit goes to Sam for the analysis, thanks Sam!


Related issues

Copied to Ceph - Backport #12488: pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval Rejected
Copied to Ceph - Backport #12489: pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval Resolved 06/25/2015

Associated revisions

Revision 68492744 (diff)
Added by Guang G Yang about 3 years ago

osd: pg_interval_t::check_new_interval should not rely on pool.min_size to determine if the PG was active

If the pool's min_size is set improperly, during peering, pg_interval_t::check_new_interval
might wrongly determine the PG's state and cause the PG to stuck at down+peering forever

Fixes: #12162
Signed-off-by: Guang Yang

Revision cd11b887 (diff)
Added by Guang G Yang about 3 years ago

osd: pg_interval_t::check_new_interval should not rely on pool.min_size to determine if the PG was active

If the pool's min_size is set improperly, during peering, pg_interval_t::check_new_interval
might wrongly determine the PG's state and cause the PG to stuck at down+peering forever

Fixes: #12162
Signed-off-by: Guang Yang
(cherry picked from commit 684927442d81ea08f95878a8af69d08d3a14d973)

Conflicts:
    src/osd/PG.cc
        because PG::start_peering_interval has an assert
        that is not found in hammer in the context
    src/test/osd/types.cc
        because include/stringify.h is not included by
        types.cc in hammer

History

#1 Updated by Guang Yang about 3 years ago

  • Subject changed from PG is stuck at down+peering forever to pg_interval_t::check_new_interval - for ec pool, should not relay on min_size to determine if the PG was active at the interval

#2 Updated by Guang Yang about 3 years ago

  • Subject changed from pg_interval_t::check_new_interval - for ec pool, should not relay on min_size to determine if the PG was active at the interval to pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval

#4 Updated by Samuel Just about 3 years ago

  • Priority changed from Normal to Urgent

#5 Updated by Samuel Just about 3 years ago

  • Status changed from New to Testing

#6 Updated by Samuel Just about 3 years ago

  • Status changed from Testing to Pending Backport
  • Backport set to hammer, firefly

#7 Updated by Loic Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved
