Bug #44286

Cache tiering shows unfound objects after OSD reboots

Added by Paul Emmerich almost 4 years ago. Updated about 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've got a cluster with a 3/2 size/min_size replicated cache pool in front of an erasure coded pool used for RBD.
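For context, a writeback cache tier of this kind is wired up with the standard tiering commands; a minimal sketch (the pool names here are placeholders, not the reporter's actual pool names):

```shell
# Attach a replicated cache pool in front of an erasure-coded base pool
# (pool names "rbd-ec" and "rbd-cache" are hypothetical examples).
ceph osd tier add rbd-ec rbd-cache
ceph osd tier cache-mode rbd-cache writeback
ceph osd tier set-overlay rbd-ec rbd-cache

# The hit_set objects involved in this bug are created by the tiering
# agent's access tracking, configured per cache pool:
ceph osd pool set rbd-cache hit_set_type bloom
ceph osd pool set rbd-cache hit_set_count 12
ceph osd pool set rbd-cache hit_set_period 14400
```

The hit_set settings shown are illustrative values; the bug report does not state the reporter's hit_set configuration.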

Restarting OSDs sometimes results in unfound objects, example:

2/543658058 objects unfound (0.000%)
pg 19.12 has 1 unfound objects
pg 19.2d has 1 unfound objects

Possible data damage: 2 pgs recovery_unfound
pg 19.12 is active+recovery_unfound+undersized+degraded+remapped, acting [299,310], 1 unfound
pg 19.2d is active+recovery_unfound+undersized+degraded+remapped, acting [290,309], 1 unfound

# ceph pg 19.12 list_unfound
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "hit_set_19.12_archive_2020-02-25 13:43:50.256316Z_2020-02-25 13:43:50.325825Z",
                "key": "",
                "snapid": -2,
                "hash": 18,
                "max": 0,
                "pool": 19,
                "namespace": ".ceph-internal" 
            },
            "need": "3312398'55868341",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
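Since `list_unfound` emits JSON, the unfound hit_set archive objects can be picked out programmatically. A small sketch, using the example output above as its input:

```python
import json

# Example output of `ceph pg 19.12 list_unfound`, reproduced from this report.
list_unfound = json.loads("""
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "hit_set_19.12_archive_2020-02-25 13:43:50.256316Z_2020-02-25 13:43:50.325825Z",
                "key": "",
                "snapid": -2,
                "hash": 18,
                "max": 0,
                "pool": 19,
                "namespace": ".ceph-internal"
            },
            "need": "3312398'55868341",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
""")

# Hit-set archive objects live in the ".ceph-internal" namespace and are
# named "hit_set_<pgid>_archive_<start>_<end>".
hit_set_oids = [
    o["oid"]["oid"]
    for o in list_unfound["objects"]
    if o["oid"]["namespace"] == ".ceph-internal"
    and o["oid"]["oid"].startswith("hit_set_")
]
print(hit_set_oids)
```

On this example it prints the single hit_set archive object; the same filter confirms the pattern noted later in the thread, that only hit_set objects go unfound.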

Both PGs affected here share an OSD (the one that's offline).
The cache tiering agent is busy flushing with around 300-500 MB/s while this happens.

The unfound objects stay unfound even after all OSDs are back online. The affected PG never goes below 2 online OSDs.
Restarting the OSDs does not change the state, so it's not an instance of https://tracker.ceph.com/issues/37439.

Ceph version 14.2.6 (restarting to upgrade to 14.2.7). Also seen on 14.2.4 a few months ago.

Attached is a pg query on a PG in that state (from an earlier instance of this issue, also 14.2.6)

pg-query.json (20.1 KB) Paul Emmerich, 02/25/2020 02:22 PM

History

#1 Updated by Preben Berg almost 4 years ago

Issue still present on 14.2.8.

#2 Updated by Paul Emmerich almost 4 years ago

This occasionally comes up on the mailing list as well. It's not reproducible on my test setup, though :(

#3 Updated by Jan-Philipp Litza almost 3 years ago

We even hit that bug twice today by rebooting two of our cache servers.

What's interesting is that only hit_set objects ever went missing. What's even more peculiar is that the timestamps in their object IDs are from the downtime of the host, yet they are only reported unfound after the host rejoined the cluster.

So either the objects were never created in the first place (but Ceph somehow assumes that they must exist), or they are created on another host but then somehow get lost during recovery. But since the cache pool has a size of 2, the latter seems highly implausible.

BTW, this happened on version 14.2.16, and after understanding the situation we simply marked the objects lost without any apparent adverse consequences.
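For reference, "marking the objects lost" as described above is done with the standard PG commands; a hedged sketch for one of the PGs from this report:

```shell
# Confirm which objects are unfound in the affected PG, then mark them lost.
ceph pg 19.12 list_unfound
# "delete" discards the unfound objects; "revert" (replicated pools only)
# rolls them back to a previous version. Both discard unrecovered writes.
ceph pg 19.12 mark_unfound_lost delete
```

Since the lost objects here were only hit_set archives (internal tiering-agent statistics), discarding them is what made this workaround safe in this case.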

#4 Updated by Pawel Stefanski over 2 years ago

Jan-Philipp Litza wrote:

> We even hit that bug twice today by rebooting two of our cache servers.
>
> What's interesting is that only hit_set objects ever went missing. What's even more peculiar is that the timestamps in their object IDs are from the downtime of the host, yet they are only reported unfound after the host rejoined the cluster.
>
> So either the objects were never created in the first place (but Ceph somehow assumes that they must exist), or they are created on another host but then somehow get lost during recovery. But since the cache pool has a size of 2, the latter seems highly implausible.
>
> BTW, this happened on version 14.2.16, and after understanding the situation we simply marked the objects lost without any apparent adverse consequences.

I can confirm it still occurs on 14.2.22.

#5 Updated by Jan-Philipp Litza about 2 years ago

Update: Also happens with 16.2.5 :-(

#6 Updated by marek czardybon about 2 years ago

The problem still exists on 15.2.15.
I've also got a replicated pool with size 3, min_size 2.
The problem occurs even when only one OSD is restarted.
