Bug #44286
Cache tiering shows unfound objects after OSD reboots
Status: open
% Done: 0%
Description
We've got a cluster with a 3/2 size/min_size replicated cache pool in front of an erasure-coded pool used for RBD.
Restarting OSDs sometimes results in unfound objects, example:
    2/543658058 objects unfound (0.000%)
        pg 19.12 has 1 unfound objects
        pg 19.2d has 1 unfound objects
    Possible data damage: 2 pgs recovery_unfound
        pg 19.12 is active+recovery_unfound+undersized+degraded+remapped, acting [299,310], 1 unfound
        pg 19.2d is active+recovery_unfound+undersized+degraded+remapped, acting [290,309], 1 unfound

    # ceph pg 19.12 list_unfound
    {
        "num_missing": 1,
        "num_unfound": 1,
        "objects": [
            {
                "oid": {
                    "oid": "hit_set_19.12_archive_2020-02-25 13:43:50.256316Z_2020-02-25 13:43:50.325825Z",
                    "key": "",
                    "snapid": -2,
                    "hash": 18,
                    "max": 0,
                    "pool": 19,
                    "namespace": ".ceph-internal"
                },
                "need": "3312398'55868341",
                "have": "0'0",
                "flags": "none",
                "locations": []
            }
        ],
        "more": false
    }
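For correlating unfound objects across several PGs, the JSON that `ceph pg <pgid> list_unfound` emits (with `--format json`) can be parsed with a small script. This is a hypothetical helper, not part of any Ceph tooling; it just pulls out the object id and the "need"/"have" versions from output like the dump above:

```python
import json

def unfound_objects(output: str):
    """Parse `ceph pg <pgid> list_unfound --format json` output and
    return (oid, need, have) tuples for each unfound object."""
    data = json.loads(output)
    return [(o["oid"]["oid"], o["need"], o["have"]) for o in data["objects"]]

# Sample built from the output above; note the hit_set archive oid in the
# .ceph-internal namespace -- these are cache-tier internal objects.
sample = json.dumps({
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [{
        "oid": {
            "oid": "hit_set_19.12_archive_2020-02-25 13:43:50.256316Z_2020-02-25 13:43:50.325825Z",
            "key": "", "snapid": -2, "hash": 18, "max": 0,
            "pool": 19, "namespace": ".ceph-internal"
        },
        "need": "3312398'55868341",
        "have": "0'0",
        "flags": "none",
        "locations": []
    }],
    "more": False
})

for oid, need, have in unfound_objects(sample):
    print(oid, need, have)
```

In every occurrence here the unfound object is a hit_set archive, which is why parsing out the oid is useful: it makes the pattern easy to spot across PGs.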
Both PGs affected here share an OSD (the one that's offline).
The cache tiering agent is busy flushing with around 300-500 MB/s while this happens.
The unfound objects remain unfound even after all OSDs are back online. The affected PGs never drop below two online OSDs.
Restarting the OSDs does not change the state, so it's not an instance of https://tracker.ceph.com/issues/37439
Ceph version 14.2.6 (restarting to upgrade to 14.2.7). Also seen on 14.2.4 a few months ago.
Attached is the output of pg query for a PG in this state (from an earlier occurrence of this issue, also on 14.2.6).
Files