Bug #43189

pgs stuck in laggy state

Added by Sage Weil about 4 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1.b       6        0         6       0  8585216           0          0  66 active+clean+remapped+laggy    4h   74'66 5757:5982 [4,0,2,NONE,7,3]p4    [4,0,2,0,7,3]p4 2019-12-08T17:10:29.229694+0000 2019-12-08T17:10:29.229694+0000 

/a/sage-2019-12-08_16:23:05-rados:thrash-erasure-code-master-distro-basic-smithi/4581925/

Pretty reproducible with rados/thrash-erasure-code --subset 1/99 (1 of 50 runs hung).


Related issues

Related to RADOS - Bug #43213: OSDMap::pg_to_up_acting etc specify primary as osd, not pg_shard_t(osd+shard) New
Copied to RADOS - Backport #43232: nautilus: pgs stuck in laggy state Rejected

History

#1 Updated by Sage Weil about 4 years ago

more logs here:
/a/sage-2019-12-07_18:31:18-rados:thrash-erasure-code-wip-sage3-testing-2019-12-05-0959-distro-basic-smithi/4579417

PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES   OMAP_BYTES* OMAP_KEYS* LOG STATE                       SINCE VERSION REPORTED    UP                       ACTING                   SCRUB_STAMP                     DEEP_SCRUB_STAMP                
1.0       2        2         0       0 2539520           0          0  21  active+undersized+degraded    7h   53'21 10709:10736 [1,5,2,2147483647,3,4]p1 [1,5,2,2147483647,3,4]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.1       2        2         0       0 1261568           0          0  11  active+undersized+degraded    7h   68'11 10709:10775 [3,2147483647,2,4,5,1]p3 [3,2147483647,2,4,5,1]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.2       3        0         3       0 1474560           0          0  16       active+clean+remapped    7h   64'16 10709:10706 [5,4,1,2147483647,2,3]p5          [5,4,1,5,2,3]p5 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.3       4        4         0       0 3260416           0          0  23  active+undersized+degraded    7h   67'23 10709:10842 [4,1,2,5,2147483647,3]p4 [4,1,2,5,2147483647,3]p4 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.4       1        0         1       0  311296           0          0  10       active+clean+remapped    7h   53'10 10709:10735 [2,3,1,2147483647,5,4]p2          [2,3,1,5,5,4]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.5       4        4         0       0 2916352           0          0  13  active+undersized+degraded    7h   68'13 10709:10800 [2,1,3,4,2147483647,5]p2 [2,1,3,4,2147483647,5]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.6       7        0         7       0 7995392           0          0  92 active+clean+remapped+laggy    7h   69'92 10709:10905 [3,2,4,1,2147483647,5]p3          [3,2,4,1,4,5]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.7       6        0         6       0 5718016           0          0  51       active+clean+remapped    7h   69'51 10709:10899 [1,4,2147483647,3,5,2]p1          [1,4,3,3,5,2]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.8       8        8         0       0 5603328           0          0  61  active+undersized+degraded    7h   68'61 10709:10878 [1,4,5,2,3,2147483647]p1 [1,4,5,2,3,2147483647]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.9       4        0         4       0 5914624           0          0  52       active+clean+remapped    7h   69'52 10709:10888 [1,4,5,2147483647,2,3]p1          [1,4,5,2,2,3]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.a       2        2         0       0 1654784           0          0   9  active+undersized+degraded    7h    57'9 10709:10783 [1,4,2,5,3,2147483647]p1 [1,4,2,5,3,2147483647]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.b       6        0         6       0 7585792           0          0  57 active+clean+remapped+laggy    7h   69'57 10709:10933 [4,3,2,2147483647,1,5]p4          [4,3,2,3,1,5]p4 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.c       6        6         0       0 3555328           0          0  56  active+undersized+degraded    7h   62'56 10709:10725 [2147483647,5,4,1,2,3]p5 [2147483647,5,4,1,2,3]p5 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.d       4        0         4       0 4374528           0          0  36       active+clean+remapped    7h   67'36 10709:10821 [2,1,5,3,4,2147483647]p2          [2,1,5,3,4,5]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.e       3        3         0       0  704512           0          0  12  active+undersized+degraded    7h   62'12 10709:10784 [1,3,4,5,2147483647,2]p1 [1,3,4,5,2147483647,2]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.f       5        5         0       0 4259840           0          0  19  active+undersized+degraded    7h   66'19 10709:10793 [3,2,5,4,1,2147483647]p3 [3,2,5,4,1,2147483647]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 

#2 Updated by Sage Weil about 4 years ago

  • Status changed from New to In Progress

#3 Updated by Sage Weil about 4 years ago

The problem is the role. The proc_lease() method does this check:

  bool is_nonprimary() const {
    return role >= 0 && pg_whoami != primary;
  }

and for 1.6s2 role==2, but for 1.6s4 role==-1.

That appears to be due to the call to calc_pg_role in start_peering_interval():

  int role = osdmap->calc_pg_role(pg_whoami.osd, acting, acting.size());
  if (pool.info.is_replicated() || role == pg_whoami.shard)
    set_role(role);
  else
    set_role(-1);

calc_pg_role looks the osd up by id only, and osd.4 appears twice in 1.6's acting set ([3,2,4,1,4,5]), so both shards resolve to rank 2. That satisfies role == shard for 1.6s2, but 1.6s4 falls into the else branch and ends up with role=-1, so the is_nonprimary() check above never passes for that shard.
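
For concreteness, here is a minimal, self-contained sketch of the failure mode (not the Ceph code: calc_role_by_osd_only() is a made-up stand-in for the osd-id-only rank lookup, and is_nonprimary() is reduced to its role >= 0 half), replayed against PG 1.6's acting set from note #1:

  // Illustration only: replays the pre-fix role computation for an EC PG whose
  // acting set contains the same OSD twice (PG 1.6 above: acting [3,2,4,1,4,5]p3).
  #include <cstdio>
  #include <vector>

  // Rank lookup keyed on the osd id alone: index of the *first* occurrence
  // in the acting set, or -1 if the osd is not in it.
  static int calc_role_by_osd_only(int osd, const std::vector<int>& acting) {
    for (size_t i = 0; i < acting.size(); ++i)
      if (acting[i] == osd)
        return static_cast<int>(i);
    return -1;
  }

  int main() {
    const std::vector<int> acting = {3, 2, 4, 1, 4, 5};  // PG 1.6, primary is osd.3
    const int osd = 4;            // osd.4 holds both shard 2 and shard 4
    const int shards[] = {2, 4};

    for (int shard : shards) {
      int role = calc_role_by_osd_only(osd, acting);  // 2 for both shards: first hit wins
      if (role != shard)          // the EC branch of start_peering_interval() above
        role = -1;
      // Simplified is_nonprimary(): only the role >= 0 half of the real check.
      bool nonprimary = (role >= 0);
      std::printf("1.6s%d: role=%d is_nonprimary=%d\n", shard, role, (int)nonprimary);
    }
    // Prints: 1.6s2: role=2 is_nonprimary=1
    //         1.6s4: role=-1 is_nonprimary=0  -> lease never applied, PG reported laggy
    return 0;
  }

The shard id never enters the lookup, which is the same osd-vs-pg_shard_t mismatch tracked in #43213.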

#4 Updated by Sage Weil about 4 years ago

  • Related to Bug #43213: OSDMap::pg_to_up_acting etc specify primary as osd, not pg_shard_t(osd+shard) added

#5 Updated by Sage Weil about 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 32132

#6 Updated by Sage Weil about 4 years ago

  • Status changed from Fix Under Review to Resolved

#7 Updated by Sage Weil about 4 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to nautilus

I'm not sure whether we should backport this to nautilus or not. We only noticed QA failures because the new octopus laggy logic was affected by the bad role value, and I'm not sure whether there are other (user-visible) effects of this bug. Maybe once it has baked in master for a couple of months?

#8 Updated by Sage Weil about 4 years ago

  • Priority changed from Urgent to Normal

#9 Updated by Nathan Cutler about 4 years ago

#10 Updated by Neha Ojha almost 2 years ago

  • Status changed from Pending Backport to Resolved
