Bug #12162

closed

pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval

Added by Guang Yang almost 9 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer, firefly
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One PG on our cluster has been stuck in peering+down indefinitely. The log shows that peering was blocked by an out/down OSD (osd.243):

2015-06-23 21:27:59.948809 7f26d2bd9700 10 osd.52 pg_epoch: 27468 pg[3.1e3fs0( v 25576'94308 (15251'84307,25576'94308] local-les=25568 n=73284 ec=1152 les/c 25568/18381 27457/27457/27457) [52,23,10,456,433,200,388,330,493,104,426] r=0 lpr=27457 pi=18380-27456/559 crt=25298'94302 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior final: probe 10(2),22(1),23(1),24(2),46(0),52(0),104(9),200(5),249(3),254(10),265(6),330(7),388(6),426(10),433(4),450(7),456(3),493(8) down 243 blocked_by {243=0} pg_down

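The actual problem is that, when building the PriorSet, the code blindly uses the pool's min_size to check whether the PG could have gone r/w during a past interval. In the interval below, only 4 of the 11 shards are present in the acting set (2147483647 marks a missing shard), yet the interval is still flagged maybe_went_rw: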

2015-06-23 21:28:00.357787 7f26d13d6700 10 osd.52 pg_epoch: 27471 pg[3.1e3fs0( v 25576'94308 (15251'84307,25576'94308] local-les=25568 n=73284 ec=1152 les/c 25568/18381 27471/27471/27471) [52,23,10,456,433,200,388,330,493,104,426] r=0 lpr=27471 pi=18380-27470/561 crt=25298'94302 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior interval(25614-25615 up [2147483647,2147483647,2147483647,2147483647,2147483647,200,2147483647,2147483647,243,104,426](200) acting [2147483647,2147483647,2147483647,2147483647,2147483647,200,2147483647,2147483647,243,104,426](200) maybe_went_rw)
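
To illustrate the distinction, here is a minimal Python sketch. It is not the Ceph code or the eventual patch. The acting set is copied from the log line above; the function names, k = 8 and min_size = 4 are hypothetical values chosen only for illustration, since the report does not state the pool's actual profile or min_size.

```python
# Sketch only: contrasts a min_size-only check with one that also requires
# enough shards to reconstruct data in an erasure-coded pool.

CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647, the "no OSD" placeholder seen in the log


def live_shards(acting):
    """Return the OSD ids actually present in an acting set."""
    return [osd for osd in acting if osd != CRUSH_ITEM_NONE]


def maybe_went_rw_min_size_only(acting, min_size):
    """Behaviour described in this report: only min_size is consulted."""
    return len(live_shards(acting)) >= min_size


def maybe_went_rw_ec_aware(acting, min_size, k):
    """Hypothetical stricter check: an EC PG also needs at least k shards."""
    present = len(live_shards(acting))
    return present >= min_size and present >= k


# Acting set of interval 25614-25615, taken from the log line above.
acting = [CRUSH_ITEM_NONE] * 5 + [200] + [CRUSH_ITEM_NONE] * 2 + [243, 104, 426]

# Assumed values, not taken from the report.
K = 8
MIN_SIZE = 4

print(len(live_shards(acting)))                        # shards present
print(maybe_went_rw_min_size_only(acting, MIN_SIZE))   # min_size-only verdict
print(maybe_went_rw_ec_aware(acting, MIN_SIZE, K))     # EC-aware verdict
```

Under these assumed values, the min_size-only check treats the interval as possibly r/w, so osd.243 has to be probed and its absence blocks peering. The EC-aware check rules the interval out, because 4 shards cannot serve writes when 8 are needed.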

Ceph version: v0.80.4

Credit goes to Sam for the analysis, thanks Sam!


Related issues 2 (0 open, 2 closed)

Copied to Ceph - Backport #12488: pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval (Rejected, Samuel Just)
Copied to Ceph - Backport #12489: pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval (Resolved, Loïc Dachary, 06/25/2015)

#1

Updated by Guang Yang almost 9 years ago

  • Subject changed from PG is stuck at down+peering forever to pg_interval_t::check_new_interval - for ec pool, should not relay on min_size to determine if the PG was active at the interval
#2

Updated by Guang Yang almost 9 years ago

  • Subject changed from pg_interval_t::check_new_interval - for ec pool, should not relay on min_size to determine if the PG was active at the interval to pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval
#4

Updated by Samuel Just almost 9 years ago

  • Priority changed from Normal to Urgent
#5

Updated by Samuel Just almost 9 years ago

  • Status changed from New to 7
#6

Updated by Samuel Just over 8 years ago

  • Status changed from 7 to Pending Backport
  • Backport set to hammer, firefly
#7

Updated by Loïc Dachary about 8 years ago

  • Status changed from Pending Backport to Resolved