Bug #43189

pgs stuck in laggy state

Added by Sage Weil over 4 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: 32132
Crash signature (v1): -
Crash signature (v2): -

Description

PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES   OMAP_BYTES* OMAP_KEYS* LOG STATE                       SINCE VERSION REPORTED    UP                       ACTING                   SCRUB_STAMP                     DEEP_SCRUB_STAMP                
1.b       6        0         6       0  8585216           0          0  66 active+clean+remapped+laggy    4h   74'66 5757:5982 [4,0,2,NONE,7,3]p4    [4,0,2,0,7,3]p4 2019-12-08T17:10:29.229694+0000 2019-12-08T17:10:29.229694+0000 

/a/sage-2019-12-08_16:23:05-rados:thrash-erasure-code-master-distro-basic-smithi/4581925/

Pretty reproducible with rados/thrash-erasure-code --subset 1/99 (1 of 50 runs hung).


Related issues (2: 1 open, 1 closed)

Related to RADOS - Bug #43213: OSDMap::pg_to_up_acting etc specify primary as osd, not pg_shard_t(osd+shard) (status: New)

Copied to RADOS - Backport #43232: nautilus: pgs stuck in laggy state (status: Rejected, assignee: Neha Ojha)
Actions #1

Updated by Sage Weil over 4 years ago

more logs here:
/a/sage-2019-12-07_18:31:18-rados:thrash-erasure-code-wip-sage3-testing-2019-12-05-0959-distro-basic-smithi/4579417

PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES   OMAP_BYTES* OMAP_KEYS* LOG STATE                       SINCE VERSION REPORTED    UP                       ACTING                   SCRUB_STAMP                     DEEP_SCRUB_STAMP                
1.0       2        2         0       0 2539520           0          0  21  active+undersized+degraded    7h   53'21 10709:10736 [1,5,2,2147483647,3,4]p1 [1,5,2,2147483647,3,4]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.1       2        2         0       0 1261568           0          0  11  active+undersized+degraded    7h   68'11 10709:10775 [3,2147483647,2,4,5,1]p3 [3,2147483647,2,4,5,1]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.2       3        0         3       0 1474560           0          0  16       active+clean+remapped    7h   64'16 10709:10706 [5,4,1,2147483647,2,3]p5          [5,4,1,5,2,3]p5 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.3       4        4         0       0 3260416           0          0  23  active+undersized+degraded    7h   67'23 10709:10842 [4,1,2,5,2147483647,3]p4 [4,1,2,5,2147483647,3]p4 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.4       1        0         1       0  311296           0          0  10       active+clean+remapped    7h   53'10 10709:10735 [2,3,1,2147483647,5,4]p2          [2,3,1,5,5,4]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.5       4        4         0       0 2916352           0          0  13  active+undersized+degraded    7h   68'13 10709:10800 [2,1,3,4,2147483647,5]p2 [2,1,3,4,2147483647,5]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.6       7        0         7       0 7995392           0          0  92 active+clean+remapped+laggy    7h   69'92 10709:10905 [3,2,4,1,2147483647,5]p3          [3,2,4,1,4,5]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.7       6        0         6       0 5718016           0          0  51       active+clean+remapped    7h   69'51 10709:10899 [1,4,2147483647,3,5,2]p1          [1,4,3,3,5,2]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.8       8        8         0       0 5603328           0          0  61  active+undersized+degraded    7h   68'61 10709:10878 [1,4,5,2,3,2147483647]p1 [1,4,5,2,3,2147483647]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.9       4        0         4       0 5914624           0          0  52       active+clean+remapped    7h   69'52 10709:10888 [1,4,5,2147483647,2,3]p1          [1,4,5,2,2,3]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.a       2        2         0       0 1654784           0          0   9  active+undersized+degraded    7h    57'9 10709:10783 [1,4,2,5,3,2147483647]p1 [1,4,2,5,3,2147483647]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.b       6        0         6       0 7585792           0          0  57 active+clean+remapped+laggy    7h   69'57 10709:10933 [4,3,2,2147483647,1,5]p4          [4,3,2,3,1,5]p4 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.c       6        6         0       0 3555328           0          0  56  active+undersized+degraded    7h   62'56 10709:10725 [2147483647,5,4,1,2,3]p5 [2147483647,5,4,1,2,3]p5 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.d       4        0         4       0 4374528           0          0  36       active+clean+remapped    7h   67'36 10709:10821 [2,1,5,3,4,2147483647]p2          [2,1,5,3,4,5]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.e       3        3         0       0  704512           0          0  12  active+undersized+degraded    7h   62'12 10709:10784 [1,3,4,5,2147483647,2]p1 [1,3,4,5,2147483647,2]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 
1.f       5        5         0       0 4259840           0          0  19  active+undersized+degraded    7h   66'19 10709:10793 [3,2,5,4,1,2147483647]p3 [3,2,5,4,1,2147483647]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000 

Actions #2

Updated by Sage Weil over 4 years ago

  • Status changed from New to In Progress
Actions #3

Updated by Sage Weil over 4 years ago

The problem is the role. The proc_lease() method gates on this check:

  bool is_nonprimary() const {
    return role >= 0 && pg_whoami != primary;
  }

and for 1.6s2 role==2, but for 1.6s4 role==-1, so is_nonprimary() returns false on shard 4 and the lease from the primary is never processed.
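
For context, here is a minimal standalone sketch of that gating behavior. The types and the proc_lease() body are simplified stand-ins for illustration, not the real Ceph source, but the is_nonprimary() test matches the snippet above:

  // Minimal stand-in types to show the gate; not the real Ceph classes.
  #include <iostream>

  struct pg_shard_t {
    int osd;
    int shard;
    bool operator!=(const pg_shard_t& o) const {
      return osd != o.osd || shard != o.shard;
    }
  };

  struct ShardState {
    pg_shard_t pg_whoami;
    pg_shard_t primary;
    int role;

    bool is_nonprimary() const {
      return role >= 0 && pg_whoami != primary;
    }

    // Simplified: the real proc_lease() would record the lease and
    // extend readable_until on the non-primary shard.
    void proc_lease() {
      if (!is_nonprimary()) {
        // role == -1 fails the role >= 0 test, so a mis-ranked shard
        // drops the primary's lease and the PG is reported laggy.
        std::cout << "shard " << pg_whoami.shard << ": lease dropped\n";
        return;
      }
      std::cout << "shard " << pg_whoami.shard << ": lease applied\n";
    }
  };

  int main() {
    pg_shard_t primary{3, 0};           // pg 1.6's primary is osd.3 (shard 0)
    ShardState s2{{4, 2}, primary, 2};  // 1.6s2: role == 2, matches shard
    ShardState s4{{4, 4}, primary, -1}; // 1.6s4: role was forced to -1
    s2.proc_lease();                    // lease applied
    s4.proc_lease();                    // lease dropped -> laggy
  }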

That appears to be due to the call to calc_pg_role() in start_peering_interval():

  int role = osdmap->calc_pg_role(pg_whoami.osd, acting, acting.size());
  if (pool.info.is_replicated() || role == pg_whoami.shard)
    set_role(role);
  else
    set_role(-1);

calc_pg_role() ranks by osd id and returns the index of the first occurrence, so for osd.4 (which appears at positions 2 and 4 of 1.6's acting set) it returns 2. That works for 1.6s2, where role == shard, but for 1.6s4 the role (2) does not match the shard (4), so the else branch runs and the shard ends up with role=-1.
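
The duplicate-OSD case can be reproduced in isolation. Below is a self-contained sketch in which calc_pg_role() is a simplified stand-in that ranks by osd id and returns the first matching index (which is what produces the 2 observed above); the acting set is pg 1.6's [3,2,4,1,4,5] from the dump in comment #1:

  #include <iostream>
  #include <vector>

  // Simplified stand-in: rank by osd id, first occurrence wins.
  int calc_pg_role(int osd, const std::vector<int>& acting) {
    for (size_t i = 0; i < acting.size(); ++i)
      if (acting[i] == osd)
        return static_cast<int>(i);
    return -1;
  }

  int main() {
    // pg 1.6's acting set; osd.4 holds both shard 2 and shard 4.
    std::vector<int> acting = {3, 2, 4, 1, 4, 5};
    for (int shard : {2, 4}) {
      int role = calc_pg_role(4, acting);   // always 2, the first match
      // EC branch from start_peering_interval(): keep the role only if
      // it matches the shard id.
      if (role != shard)
        role = -1;
      std::cout << "1.6s" << shard << " -> role " << role << "\n";
    }
    // prints: 1.6s2 -> role 2
    //         1.6s4 -> role -1   (the bug: is_nonprimary() now fails)
  }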

Actions #4

Updated by Sage Weil over 4 years ago

  • Related to Bug #43213: OSDMap::pg_to_up_acting etc specify primary as osd, not pg_shard_t(osd+shard) added
Actions #5

Updated by Sage Weil over 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 32132
Actions #6

Updated by Sage Weil over 4 years ago

  • Status changed from Fix Under Review to Resolved
Actions #7

Updated by Sage Weil over 4 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to nautilus

I'm not sure whether we should backport this to nautilus or not. We only noticed qa failures because the new octopus laggy stuff was affected by the bad role value. I'm not sure if there are other (user-visible) effects of this bug. Maybe once it has baked in master for a couple of months?

Actions #8

Updated by Sage Weil over 4 years ago

  • Priority changed from Urgent to Normal
Actions #9

Updated by Nathan Cutler over 4 years ago

Actions #10

Updated by Neha Ojha about 2 years ago

  • Status changed from Pending Backport to Resolved