Bug #15523

osd: acting_primary not updated on split

Added by Sage Weil over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
Start date: 04/15/2016
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport: jewel,infernalis,hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

/a/sage-2016-04-15_05:22:29-rados-master-distro-basic-smithi/131192

A pg is stuck stale:

0.129   0       0       0       0       0       0       0       0       stale+active+clean      2016-04-15 14:32:04.537262      0'0     549:7   [0,1,3] 0       [4,1,3] 4       0'0     2016-04-15 14:26:36.590356      0'0     2016-04-15 14:26:36.590356

because the mon wrongly marks it that way:
2016-04-15 14:32:01.120114 7fd35b482700 10 mon.a@0(leader).pg v731 register_pg  will create 0.129 primary 0 acting [0,1,3] parent 0.29 by 1 bits
...
2016-04-15 14:32:01.239533 7fd35b482700 20 mon.a@0(leader).pg v731  refreshing pg 0.129 0:0 creating
...
2016-04-15 14:32:02.706037 7fd359c7f700 15 mon.a@0(leader).pg v733  got 0.129 reported at 548:1 state creating -> peering
(osdmap is 548)
...
2016-04-15 14:32:03.408916 7fd35b482700 20 mon.a@0(leader).pg v733  refreshing pg 0.129 548:1 peering
(osdmap is 548)
...
2016-04-15 14:32:07.711309 7fd359c7f700 15 mon.a@0(leader).pg v738  got 0.129 reported at 549:7 state peering -> active+clean
...
2016-04-15 14:32:07.979124 7fd35b482700 10 mon.a@0(leader).pg v738 check_down_pgs last_osdmap_epoch 552
(note: epoch 552, osd.4 is down, pg not marked stale here, ergo acting_primary != 4.. presumably 0 as intended)
2016-04-15 14:32:07.979584 7fd35b482700 10 mon.a@0(leader).paxosservice(pgmap 1..738) propose_pending
...
2016-04-15 14:32:08.097296 7fd35b482700 20 mon.a@0(leader).pg v738  refreshing pg 0.129 549:7 active+clean
...
2016-04-15 14:32:09.239834 7fd35b482700 10 mon.a@0(leader).pg v739 check_down_pgs last_osdmap_epoch 553
2016-04-15 14:32:09.239984 7fd35b482700 10 mon.a@0(leader).pg v739  marking pg 0.129 stale (acting_primary 4)
(now acting_primary is 4, but shouldn't be)

Either the osd corrupted its stats.acting_primary value (it's only set by init and start_peering_interval, neither of which happened between the reports; maybe some race? memory corruption?), or the mon did something similarly stupid. :/
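
The log sequence above comes down to one rule: when an OSD goes down, the mon marks stale every PG whose reported acting_primary is that OSD. Below is a minimal Python sketch of that rule, fed with the values from this ticket. It is a simplified model, not the mon's actual check_down_pgs code: `PGStat` and `pgs_to_mark_stale` are hypothetical names, and the pg dump line is read here as up [0,1,3] / acting [4,1,3], which is an interpretation of the column order.

```python
from dataclasses import dataclass


@dataclass
class PGStat:
    """Hypothetical, simplified stand-in for the per-PG stats the mon keeps."""
    pgid: str
    acting: list
    acting_primary: int
    state: str


def pgs_to_mark_stale(pg_stats, down_osds):
    """Return pgids whose reported acting_primary is a down OSD.

    Simplified model of the stale check; not Ceph's implementation.
    """
    return [
        pg.pgid
        for pg in pg_stats
        if pg.acting_primary in down_osds and "stale" not in pg.state
    ]


# What the mon registered at creation: primary 0, acting [0,1,3].
expected = PGStat("0.129", [0, 1, 3], 0, "active+clean")
# What the pg dump line shows (as read here): acting [4,1,3], acting_primary 4.
reported = PGStat("0.129", [4, 1, 3], 4, "active+clean")

down_osds = {4}  # osd.4 is down as of epoch 552
stale_if_expected = pgs_to_mark_stale([expected], down_osds)
stale_if_reported = pgs_to_mark_stale([reported], down_osds)
print(stale_if_expected)
print(stale_if_reported)
```

With acting_primary 0 the PG is left alone when osd.4 goes down; with acting_primary 4 it gets marked stale, which matches the "marking pg 0.129 stale (acting_primary 4)" line.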


Related issues

Copied to Ceph - Backport #15728: jewel: osd: acting_primary not updated on split Resolved
Copied to Ceph - Backport #15729: infernalis: osd: acting_primary not updated on split Rejected
Copied to Ceph - Backport #15730: hammer: osd: acting_primary not updated on split Resolved

History

#1 Updated by Sage Weil over 3 years ago

maybe /a/teuthology-2016-04-24_22:00:02-rados-jewel-distro-basic-smithi/147520

#2 Updated by Samuel Just over 3 years ago

sjust@teuthology:/a/samuelj-2016-04-28_23:23:57-rados-wip-sam-testing-distro-basic-smithi/155488/remote also possibly

#3 Updated by Sage Weil over 3 years ago

  • Status changed from Need More Info to Pending Backport
  • Backport set to jewel,infernalis,hammer

#4 Updated by Sage Weil over 3 years ago

  • Subject changed from osd: info.stat.acting_primary corrupted? to osd: acting_primary not updated on split
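
The new subject points at PG split as the source of the wrong value. One way to picture it, as a toy Python model and not the OSD's actual split code: if a child PG starts from a copy of its parent's stats and nothing refreshes acting_primary, the child keeps reporting the parent's primary. All names here (`Stats`, `split_child`, the `refresh` flag) are hypothetical, and the parent values (acting [4,1,3], primary 4 for 0.29) are an assumption inferred from the pg dump line, not stated in the ticket.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Stats:
    """Hypothetical subset of a PG's reported stats."""
    acting: list = field(default_factory=list)
    acting_primary: int = -1


def split_child(parent_stats, child_acting, refresh):
    """Build a child's stats from the parent's on split.

    refresh=False models the suspected bug: the copy is kept as-is.
    refresh=True models updating the mapping fields for the child.
    Taking the first acting OSD as primary is a simplification.
    """
    child = copy.deepcopy(parent_stats)
    if refresh:
        child.acting = list(child_acting)
        child.acting_primary = child_acting[0] if child_acting else -1
    return child


parent = Stats(acting=[4, 1, 3], acting_primary=4)  # assumed parent 0.29 values
buggy = split_child(parent, [0, 1, 3], refresh=False)
fixed = split_child(parent, [0, 1, 3], refresh=True)
print(buggy.acting_primary, fixed.acting_primary)
```

In this model the unrefreshed child still names osd.4 as acting_primary, so a mon applying the stale check would mark it stale as soon as osd.4 goes down.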

#5 Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #15728: jewel: osd: acting_primary not updated on split added

#6 Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #15729: infernalis: osd: acting_primary not updated on split added

#7 Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #15730: hammer: osd: acting_primary not updated on split added

#8 Updated by Loic Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved
