Bug #43189 (closed): pgs stuck in laggy state
% Done: 0%
Backport: nautilus
Regression: No
Severity: 3 - minor
Description
1.b 6 0 6 0 8585216 0 0 66 active+clean+remapped+laggy 4h 74'66 5757:5982 [4,0,2,NONE,7,3]p4 [4,0,2,0,7,3]p4 2019-12-08T17:10:29.229694+0000 2019-12-08T17:10:29.229694+0000
/a/sage-2019-12-08_16:23:05-rados:thrash-erasure-code-master-distro-basic-smithi/4581925/
pretty reproducible with rados/thrash-erasure-code --subset 1/99 (1/50 runs hung)
Updated by Sage Weil over 4 years ago
more logs here:
/a/sage-2019-12-07_18:31:18-rados:thrash-erasure-code-wip-sage3-testing-2019-12-05-0959-distro-basic-smithi/4579417
PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES   OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
1.0 2 2 0 0 2539520 0 0 21 active+undersized+degraded 7h 53'21 10709:10736 [1,5,2,2147483647,3,4]p1 [1,5,2,2147483647,3,4]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.1 2 2 0 0 1261568 0 0 11 active+undersized+degraded 7h 68'11 10709:10775 [3,2147483647,2,4,5,1]p3 [3,2147483647,2,4,5,1]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.2 3 0 3 0 1474560 0 0 16 active+clean+remapped 7h 64'16 10709:10706 [5,4,1,2147483647,2,3]p5 [5,4,1,5,2,3]p5 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.3 4 4 0 0 3260416 0 0 23 active+undersized+degraded 7h 67'23 10709:10842 [4,1,2,5,2147483647,3]p4 [4,1,2,5,2147483647,3]p4 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.4 1 0 1 0 311296 0 0 10 active+clean+remapped 7h 53'10 10709:10735 [2,3,1,2147483647,5,4]p2 [2,3,1,5,5,4]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.5 4 4 0 0 2916352 0 0 13 active+undersized+degraded 7h 68'13 10709:10800 [2,1,3,4,2147483647,5]p2 [2,1,3,4,2147483647,5]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.6 7 0 7 0 7995392 0 0 92 active+clean+remapped+laggy 7h 69'92 10709:10905 [3,2,4,1,2147483647,5]p3 [3,2,4,1,4,5]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.7 6 0 6 0 5718016 0 0 51 active+clean+remapped 7h 69'51 10709:10899 [1,4,2147483647,3,5,2]p1 [1,4,3,3,5,2]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.8 8 8 0 0 5603328 0 0 61 active+undersized+degraded 7h 68'61 10709:10878 [1,4,5,2,3,2147483647]p1 [1,4,5,2,3,2147483647]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.9 4 0 4 0 5914624 0 0 52 active+clean+remapped 7h 69'52 10709:10888 [1,4,5,2147483647,2,3]p1 [1,4,5,2,2,3]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.a 2 2 0 0 1654784 0 0 9 active+undersized+degraded 7h 57'9 10709:10783 [1,4,2,5,3,2147483647]p1 [1,4,2,5,3,2147483647]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.b 6 0 6 0 7585792 0 0 57 active+clean+remapped+laggy 7h 69'57 10709:10933 [4,3,2,2147483647,1,5]p4 [4,3,2,3,1,5]p4 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.c 6 6 0 0 3555328 0 0 56 active+undersized+degraded 7h 62'56 10709:10725 [2147483647,5,4,1,2,3]p5 [2147483647,5,4,1,2,3]p5 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.d 4 0 4 0 4374528 0 0 36 active+clean+remapped 7h 67'36 10709:10821 [2,1,5,3,4,2147483647]p2 [2,1,5,3,4,5]p2 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.e 3 3 0 0 704512 0 0 12 active+undersized+degraded 7h 62'12 10709:10784 [1,3,4,5,2147483647,2]p1 [1,3,4,5,2147483647,2]p1 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
1.f 5 5 0 0 4259840 0 0 19 active+undersized+degraded 7h 66'19 10709:10793 [3,2,5,4,1,2147483647]p3 [3,2,5,4,1,2147483647]p3 2019-12-07T19:33:55.603918+0000 2019-12-07T19:33:55.603918+0000
Updated by Sage Weil over 4 years ago
The problem is the role. The proc_lease() method is gated by this check:

bool is_nonprimary() const { return role >= 0 && pg_whoami != primary; }

For 1.6s2, role == 2, but for 1.6s4, role == -1, so is_nonprimary() returns false and the lease is never processed on that shard.
That appears to be due to the call to calc_pg_role() in start_peering_interval():

int role = osdmap->calc_pg_role(pg_whoami.osd, acting, acting.size());
if (pool.info.is_replicated() || role == pg_whoami.shard)
  set_role(role);
else
  set_role(-1);

For pg 1.6 the acting set is [3,2,4,1,4,5], with osd.4 at both shard 2 and shard 4, and calc_pg_role returns 2 (the first slot held by osd.4). That works for 1.6s2, but for 1.6s4 the role == pg_whoami.shard check fails (2 != 4) and the shard ends up with role = -1.
Updated by Sage Weil over 4 years ago
- Related to Bug #43213: OSDMap::pg_to_up_acting etc specify primary as osd, not pg_shard_t(osd+shard) added
Updated by Sage Weil over 4 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 32132
Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Resolved
Updated by Sage Weil over 4 years ago
- Status changed from Resolved to Pending Backport
- Backport set to nautilus
I'm not sure whether we should backport this to nautilus or not. We only noticed qa failures because the new octopus laggy handling was affected by the bad role value. I'm not sure if there are other (user-visible) effects of this bug. Maybe once it has baked in master for a couple of months?
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43232: nautilus: pgs stuck in laggy state added
Updated by Neha Ojha about 2 years ago
- Status changed from Pending Backport to Resolved