Bug #55668

osd/ec: after one-by-one adding a new osd to the ceph cluster, pg stuck recovery_unfound

Added by jianwei zhang almost 2 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

problem:

ec(4+2): 3 original nodes, EC folding enabled with a ratio of 2, then 2 nodes added to the cluster
ec(2+1): 3 original nodes, EC folding enabled with a ratio of 2, then 2 nodes added to the cluster
After the expansion, all OSDs in the cluster are exists, up, and in,
but some PGs remain stuck in the recovery_unfound state and cannot recover.

root cause:

The PG's acting set has missing objects that need to be recovered,
but the source OSD that holds those objects is in the backfill_targets (up) set.
Because an OSD in backfill_targets has to restart backfill, its pg info.last_backfill is set to MIN.
With last_backfill at MIN, MissingLoc::add_source_info filters that OSD out as a source for the missing objects.
The missing objects are left without a usable source, so they cannot be recovered and the PG state changes to recovery_unfound.
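
To make the filtering step concrete, here is a minimal Python sketch of the source-selection rule described above. It is a simplified model, not the Ceph C++ code: PeerInfo, usable_sources, HOBJ_MIN and HOBJ_MAX are hypothetical names, object names are compared as plain strings, and versions are modelled as (epoch, version) tuples. The OSD id 32(2) is used only as an illustration.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the MIN/MAX object bounds; in this toy model
# object names sort as plain strings.
HOBJ_MIN = ""
HOBJ_MAX = "\uffff"


@dataclass
class PeerInfo:
    osd: str                        # shard id, e.g. "32(2)"
    last_update: tuple              # (epoch, version)
    last_backfill: str = HOBJ_MAX   # MAX means backfill is complete
    missing: set = field(default_factory=set)


def usable_sources(oid, need, peers):
    """Return the peers that may serve `oid` at version `need`."""
    sources = []
    for p in peers:
        if p.last_update < need:
            continue  # peer never saw the needed version
        if oid >= p.last_backfill:
            continue  # object lies beyond what the peer has backfilled
        if oid in p.missing:
            continue  # peer is itself missing the object
        sources.append(p.osd)
    return sources


oid = "202000000000095.00000005"
need = (303, 1222)

# Same peer, before and after last_backfill is reset to MIN.
complete = PeerInfo("32(2)", last_update=(303, 1222))
reset = PeerInfo("32(2)", last_update=(303, 1222), last_backfill=HOBJ_MIN)

print(usable_sources(oid, need, [complete]))  # ['32(2)']
print(usable_sources(oid, need, [reset]))     # []
```

In this model every object name compares greater than or equal to MIN, so a peer whose last_backfill was reset is rejected for every object, even though it still holds the data on disk.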

fix description:

pgid : the PG id (pg.id)
objname : an object name taken from the output of ceph pg <pgid> list_unfound
example:
# ceph pg 3.20 list_unfound
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "202000000000095.00000005",
                "key": "",
                "snapid": -2,
                "hash": 1807777312,
                "max": 0,
                "pool": 3,
                "namespace": "" 
            },
            "need": "303'1222",
            "have": "0'0",
            "flags": "none",
            "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
            "locations": [
                "1(5)",
                "46(1)" 
            ]
        }
    ],
    "more": false
}
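
The object names needed for objname can be pulled out of the list_unfound JSON with a few lines of Python. This is only a sketch: the helper name unfound_objects is hypothetical, and the sample is a trimmed copy of the output above that keeps just the fields being read.

```python
import json


def unfound_objects(list_unfound_json):
    """Return (object name, needed version, locations) for each unfound object."""
    data = json.loads(list_unfound_json)
    return [
        (obj["oid"]["oid"], obj["need"], obj.get("locations", []))
        for obj in data.get("objects", [])
    ]


# Trimmed copy of the `ceph pg 3.20 list_unfound` output shown above.
sample = """
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {"oid": "202000000000095.00000005", "pool": 3},
            "need": "303'1222",
            "have": "0'0",
            "locations": ["1(5)", "46(1)"]
        }
    ],
    "more": false
}
"""

for name, need, locations in unfound_objects(sample):
    print(name, need, locations)
```

Each returned name can then be passed as objname in the command below.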

# ceph pg 3.20 find_unfound_object 202000000000095.00000005
pg has 3:047e03d6:::202000000000095.00000005:head object unfound  missing 303'1222 flags = none clean_offsets: [], clean_omap: 0, new_object: 1 old_location 1(5),46(1) new_location 1(5),32(2),34(0),42(4),46(1) and find all might have unfound source
