Bug #44286
Cache tiering shows unfound objects after OSD reboots
Status: open
% Done: 0%
Description
We've got a cluster with a 3/2 size/min_size replicated cache pool in front of an erasure-coded pool used for RBD.
Restarting OSDs sometimes results in unfound objects, example:
    2/543658058 objects unfound (0.000%)
        pg 19.12 has 1 unfound objects
        pg 19.2d has 1 unfound objects
    Possible data damage: 2 pgs recovery_unfound
        pg 19.12 is active+recovery_unfound+undersized+degraded+remapped, acting [299,310], 1 unfound
        pg 19.2d is active+recovery_unfound+undersized+degraded+remapped, acting [290,309], 1 unfound

    # ceph pg 19.12 list_unfound
    {
        "num_missing": 1,
        "num_unfound": 1,
        "objects": [
            {
                "oid": {
                    "oid": "hit_set_19.12_archive_2020-02-25 13:43:50.256316Z_2020-02-25 13:43:50.325825Z",
                    "key": "",
                    "snapid": -2,
                    "hash": 18,
                    "max": 0,
                    "pool": 19,
                    "namespace": ".ceph-internal"
                },
                "need": "3312398'55868341",
                "have": "0'0",
                "flags": "none",
                "locations": []
            }
        ],
        "more": false
    }
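For correlating unfound objects across several PGs, the JSON that `ceph pg <pgid> list_unfound` emits (with `--format json`) can be parsed with a small script. This is a hypothetical helper, not part of any Ceph tooling; it just pulls out the object id and the "need"/"have" versions from output like the dump above:

```python
import json

def unfound_objects(output: str):
    """Parse `ceph pg <pgid> list_unfound --format json` output and
    return (oid, need, have) tuples for each unfound object."""
    data = json.loads(output)
    return [(o["oid"]["oid"], o["need"], o["have"]) for o in data["objects"]]

# Sample built from the output above; note the hit_set archive oid in the
# .ceph-internal namespace -- these are cache-tier internal objects.
sample = json.dumps({
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [{
        "oid": {
            "oid": "hit_set_19.12_archive_2020-02-25 13:43:50.256316Z_2020-02-25 13:43:50.325825Z",
            "key": "", "snapid": -2, "hash": 18, "max": 0,
            "pool": 19, "namespace": ".ceph-internal"
        },
        "need": "3312398'55868341",
        "have": "0'0",
        "flags": "none",
        "locations": []
    }],
    "more": False
})

for oid, need, have in unfound_objects(sample):
    print(oid, need, have)
```

In every occurrence here the unfound object is a hit_set archive, which is why parsing out the oid is useful: it makes the pattern easy to spot across PGs.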
Both PGs affected here share an OSD (the one that's offline).
The cache tiering agent is busy flushing with around 300-500 MB/s while this happens.
The unfound objects remain unfound even after all OSDs are back online. The affected PGs never drop below two online OSDs.
Restarting the OSDs does not change the state, so it's not an instance of https://tracker.ceph.com/issues/37439
Ceph version 14.2.6 (restarting to upgrade to 14.2.7). Also seen on 14.2.4 a few months ago.
Attached is the output of pg query for a PG in this state (from an earlier occurrence of this issue, also on 14.2.6).
Files