Bug #24373

osd: eternal stuck PG in 'unfound_recovery'

Added by Kouya Shimura 10 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Backfill/Recovery
Target version: -
Start date: 06/01/2018
Due date:
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:

Description

A PG might be eternally stuck in 'unfound_recovery' after some OSDs are marked down.

For example, the following steps reproduce the problem (a consolidated reproduction script is sketched after step 6 below).

1) Create an EC 2+1 pool. Assume a PG has up/acting set [1,0,2].
2) Execute "ceph osd out osd.0 osd.2". The PG now has up/acting set [1,3,5].
3) Put some objects into the PG.
4) Execute "ceph osd in osd.0 osd.2". The PG starts recovering to [1,0,2].
5) Execute "ceph osd down osd.3 osd.5". These downs are only momentary: osd.3
and osd.5 boot again instantly. Nevertheless, the PG transitions to
'unfound_recovery' and stays there forever, even though all OSDs are up.

This stuck state can be resolved by marking down an OSD in the acting set:

6) Execute "ceph osd down osd.0", then unfound objects are resolved
and the PG restarts recovering.
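
For reference, the sequence above collected into one script. This is only a minimal sketch, not taken from the report: the erasure-code profile, pool name, and object names are assumptions, and the OSD ids 0, 2, 3, 5 follow the example above but will differ on another cluster, since they depend on how CRUSH maps the PG.

#!/bin/sh
# 1) Create an EC 2+1 pool and pick a PG to watch.
ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=osd
ceph osd pool create ecpool 8 8 erasure ec21
ceph pg ls-by-pool ecpool        # note a PG's up/acting set, e.g. [1,0,2]

# 2) Take two acting OSDs out so the PG remaps, e.g. to [1,3,5].
ceph osd out osd.0 osd.2

# 3) Write some objects while the PG is on the temporary acting set.
for i in $(seq 1 10); do
    rados -p ecpool put obj$i /etc/hosts
done

# 4) Bring the OSDs back in; recovery towards [1,0,2] starts.
ceph osd in osd.0 osd.2

# 5) Momentarily mark the temporary OSDs down; they rejoin immediately,
#    but the PG ends up stuck in 'unfound_recovery'.
ceph osd down osd.3 osd.5

# 6) Workaround: mark an acting-set OSD down to restart peering.
# ceph osd down osd.0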

From my investigation: if the downed OSD is not a member of the current up/acting set,
its PG may stay in 'ReplicaActive' and discard peering requests from the primary,
so the primary OSD cannot leave the unfound state.
PGs on a downed OSD should transition to the 'Reset' state and start peering.
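
The stuck state itself can be observed with the usual status commands. A minimal sketch follows; the PG id 1.0 and osd.0 match the example above and are otherwise assumptions:

# All OSDs are up, yet health still reports unfound objects.
ceph -s
ceph health detail

# Inspect the stuck PG (replace 1.0 with the affected PG id).
ceph pg 1.0 query
ceph pg 1.0 list_unfound

# Workaround from step 6: restart peering by marking an acting-set OSD down.
ceph osd down osd.0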

ceph-osd.3.log.gz (48.8 KB) Kouya Shimura, 06/06/2018 02:17 AM


Related issues

Copied to RADOS - Backport #24500: mimic: osd: eternal stuck PG in 'unfound_recovery' Resolved
Copied to RADOS - Backport #24501: luminous: osd: eternal stuck PG in 'unfound_recovery' Resolved

History

#2 Updated by Mykola Golub 10 months ago

  • Status changed from New to Need Review

#3 Updated by Kouya Shimura 10 months ago

Attached full log (download ceph-osd.3.log.gz).

The key points are:

### pg1.0s1 is in the 'Started/ReplicaActive' state, and osd.3 is not a member of the acting set [1,0,2]
2018-06-06 09:46:15.740 7ff794727700 10 osd.3 pg_epoch: 33 pg[1.0s1... [1,0,2]] state<Started/ReplicaActive>: Activate Finished
...
### Executed "ceph osd down osd.3 osd.5" at 09:46:17
...
2018-06-06 09:46:20.532 7ff79df3a700  1 osd.3 35 state: booting -> active
...
2018-06-06 09:46:20.532 7ff794727700 10 osd.3 pg_epoch: 35 pg[1.0s1...] state<Started>: Started advmap
### After this, there is no "should_restart_peering, transitioning to Reset" message (the PG stays in ReplicaActive)
...
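
One quick way to confirm this from the attached log is to search it directly. A rough sketch, assuming the attached file name and the message strings quoted above:

# The replica reaches ReplicaActive for pg 1.0s1 ...
zgrep 'pg\[1.0s1' ceph-osd.3.log.gz | grep ReplicaActive

# ... but after the "ceph osd down" at 09:46:17 there is no
# "should_restart_peering, transitioning to Reset" line for this PG.
zgrep 'should_restart_peering' ceph-osd.3.log.gz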

#4 Updated by Sage Weil 10 months ago

Okay, I see the problem. There are two fixes: first, reset every PG on down->up (the simpler approach); but the bigger issue is that MQuery processing differs between the ReplicaActive and Stray states, which is why the unfound query is ignored.

#6 Updated by Mykola Golub 10 months ago

  • Backport set to mimic,luminous

#7 Updated by Kefu Chai 10 months ago

  • Status changed from Need Review to Pending Backport

#8 Updated by Nathan Cutler 9 months ago

  • Copied to Backport #24500: mimic: osd: eternal stuck PG in 'unfound_recovery' added

#9 Updated by Nathan Cutler 9 months ago

  • Copied to Backport #24501: luminous: osd: eternal stuck PG in 'unfound_recovery' added

#10 Updated by Nathan Cutler 6 months ago

  • Status changed from Pending Backport to Resolved
