Bug #15523
osd: acting_primary not updated on split
Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: jewel,infernalis,hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
/a/sage-2016-04-15_05:22:29-rados-master-distro-basic-smithi/131192
a pg is stuck stale,
0.129 0 0 0 0 0 0 0 0 stale+active+clean 2016-04-15 14:32:04.537262 0'0 549:7 [0,1,3] 0 [4,1,3] 4 0'0 2016-04-15 14:26:36.590356 0'0 2016-04-15 14:26:36.590356
because the mon wrongly marks it that way
2016-04-15 14:32:01.120114 7fd35b482700 10 mon.a@0(leader).pg v731 register_pg will create 0.129 primary 0 acting [0,1,3] parent 0.29 by 1 bits
...
2016-04-15 14:32:01.239533 7fd35b482700 20 mon.a@0(leader).pg v731 refreshing pg 0.129 0:0 creating
...
2016-04-15 14:32:02.706037 7fd359c7f700 15 mon.a@0(leader).pg v733 got 0.129 reported at 548:1 state creating -> peering (osdmap is 548)
...
2016-04-15 14:32:03.408916 7fd35b482700 20 mon.a@0(leader).pg v733 refreshing pg 0.129 548:1 peering (osdmap is 548)
...
2016-04-15 14:32:07.711309 7fd359c7f700 15 mon.a@0(leader).pg v738 got 0.129 reported at 549:7 state peering -> active+clean
...
2016-04-15 14:32:07.979124 7fd35b482700 10 mon.a@0(leader).pg v738 check_down_pgs last_osdmap_epoch 552
(note: epoch 552, osd.4 is down, pg not marked stale here, ergo acting_primary != 4.. presumably 0 as intended)
2016-04-15 14:32:07.979584 7fd35b482700 10 mon.a@0(leader).paxosservice(pgmap 1..738) propose_pending
...
2016-04-15 14:32:08.097296 7fd35b482700 20 mon.a@0(leader).pg v738 refreshing pg 0.129 549:7 active+clean
...
2016-04-15 14:32:09.239834 7fd35b482700 10 mon.a@0(leader).pg v739 check_down_pgs last_osdmap_epoch 553
2016-04-15 14:32:09.239984 7fd35b482700 10 mon.a@0(leader).pg v739 marking pg 0.129 stale (acting_primary 4)
(now acting_primary is 4, but shouldn't be)
Either the osd corrupted its stats.acting_primary value (it's only set by init and start_peering_interval, neither of which happened between the reports.. maybe some race? memory corruption?), or the mon did something similarly stupid. :/
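To make the suspected failure mode concrete, here is a toy Python model. It is not Ceph code. It assumes one plausible reading of the later subject line ("acting_primary not updated on split"): the child pg created by the split inherits the parent's stats record, including the parent's acting_primary, and nothing refreshes it before the next stats report. The names PGStat, split_child_inherit, split_child_refresh, and the simplified check_down_pgs are all hypothetical, and the parent's acting set [4,1,3] is assumed from the pg dump line above.

```python
from dataclasses import dataclass, replace
from typing import List, Set


@dataclass
class PGStat:
    """Toy stand-in for the per-pg stats an osd reports to the mon."""
    pgid: str
    state: Set[str]
    up: List[int]
    up_primary: int
    acting: List[int]
    acting_primary: int


def split_child_inherit(parent: PGStat, child_pgid: str) -> PGStat:
    """Hypothetical buggy split: the child copies the parent's stats wholesale."""
    return replace(parent, pgid=child_pgid, state=set(parent.state))


def split_child_refresh(parent: PGStat, child_pgid: str,
                        up: List[int], acting: List[int]) -> PGStat:
    """Hypothetical fixed split: mapping fields are reset from the child's own mapping."""
    return replace(parent, pgid=child_pgid, state=set(parent.state),
                   up=list(up), up_primary=up[0],
                   acting=list(acting), acting_primary=acting[0])


def check_down_pgs(stats: List[PGStat], down_osds: Set[int]) -> List[str]:
    """Simplified mon-side check: mark a pg stale if its reported acting_primary is down."""
    marked = []
    for s in stats:
        if s.acting_primary in down_osds and "stale" not in s.state:
            s.state.add("stale")
            marked.append(s.pgid)
    return marked


# Parent 0.29 is assumed to have been served by osd.4 as acting primary.
parent = PGStat("0.29", {"active", "clean"}, [4, 1, 3], 4, [4, 1, 3], 4)
buggy_child = split_child_inherit(parent, "0.129")
fixed_child = split_child_refresh(parent, "0.129", [0, 1, 3], [0, 1, 3])
```

In this model, calling check_down_pgs([buggy_child], {4}) marks 0.129 stale even though its real primary would be osd.0, while the refreshed child is left alone. Whether the real code path matches this sketch is an assumption; the log above only shows that the mon acted on acting_primary 4.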
Updated by Sage Weil about 8 years ago
maybe /a/teuthology-2016-04-24_22:00:02-rados-jewel-distro-basic-smithi/147520
Updated by Samuel Just about 8 years ago
sjust@teuthology:/a/samuelj-2016-04-28_23:23:57-rados-wip-sam-testing-distro-basic-smithi/155488/remote also possibly
Updated by Sage Weil about 8 years ago
- Status changed from Need More Info to Pending Backport
- Backport set to jewel,infernalis,hammer
Updated by Sage Weil about 8 years ago
- Subject changed from osd: info.stat.acting_primary corrupted? to osd: acting_primary not updated on split
Updated by Nathan Cutler about 8 years ago
- Copied to Backport #15728: jewel: osd: acting_primary not updated on split added
Updated by Nathan Cutler about 8 years ago
- Copied to Backport #15729: infernalis: osd: acting_primary not updated on split added
Updated by Nathan Cutler about 8 years ago
- Copied to Backport #15730: hammer: osd: acting_primary not updated on split added
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved