Bug #56452 (open): Ceph objects unfound
Added by Martin Culcea almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: common
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-deploy
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

After a host reboot, the cluster could not find an object. The cluster was in a stable state with all OSDs active+clean; no OSD was out, and no other OSD was restarted during the host reboot. This was 3 weeks ago; we hoped the cluster would eventually find the object, but it did not.
Cluster version: ceph version 16.2.9, ceph-deploy cluster, pool size 2.

Attached are ceph.log, the OSD logs, the pg query output, and other logs.

Cluster status:
  cluster:
    id:     2517da9e-af62-405e-b71f-1f2e145822f7
    health: HEALTH_ERR
            client is using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            1/606943089 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 7252/1219946300 objects degraded (0.001%), 1 pg degraded, 1 pg undersized
            1 pgs not deep-scrubbed in time
            1 pgs not scrubbed in time

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 6560 pgs
    objects: 606.94M objects, 85 TiB
    usage:   169 TiB used, 268 TiB / 438 TiB avail
    pgs:     7252/1219946300 objects degraded (0.001%)
             7250/1219946300 objects misplaced (0.001%)
             1/606943089 objects unfound (0.000%)
             6554 active+clean
             4    active+clean+scrubbing+deep
             1    active+recovery_unfound+undersized+degraded+remapped
             1    active+clean+scrubbing

  io:
    client: 1.2 GiB/s rd, 1.4 GiB/s wr, 40.87k op/s rd, 72.80k op/s wr

  progress:
    Global Recovery Event (2w)
      [===========================.] (remaining: 4m)

Ceph health detail:

HEALTH_ERR clients are using insecure global_id reclaim; mons are allowing insecure global_id reclaim; 1/606997573 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 7294/1220048932 objects degraded (0.001%), 1 pg degraded, 1 pg undersized; 1 pgs not deep-scrubbed in time; 1 pgs not scrubbed in time
...
[WRN] OBJECT_UNFOUND: 1/606997573 objects unfound (0.000%)
    pg 16.1e has 1 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 16.1e is active+recovery_unfound+undersized+degraded+remapped, acting [131], 1 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 7294/1220048932 objects degraded (0.001%), 1 pg degraded, 1 pg undersized
    pg 16.1e is stuck undersized for 3h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [131]
[WRN] PG_NOT_DEEP_SCRUBBED: 1 pgs not deep-scrubbed in time
    pg 16.1e not deep-scrubbed since 2022-06-03T01:20:13.786232+0300
[WRN] PG_NOT_SCRUBBED: 1 pgs not scrubbed in time
    pg 16.1e not scrubbed since 2022-06-09T03:27:36.771392+0300
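
For reference, the unfound object reported above can be listed directly from the affected PG with standard commands (a sketch; output omitted here):

ceph pg 16.1e list_unfound        # names the missing object(s) and which OSDs have been probed
ceph health detail | grep 16.1e   # quick view of the PG's warnings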

The PG is acting only on osd.131, even if we move the PG to another OSD:
ceph pg map 16.1e
osdmap e723093 pg 16.1e (16.1e) -> up [41,141] acting [131]
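
For context, the same up/acting discrepancy can be read from the PG query output (a sketch; the jq filter is just one convenient way to pull the two fields and is not required):

ceph pg 16.1e query | jq '{up: .up, acting: .acting}'
ceph pg 16.1e query > pg_16.1e_query.json   # full peering state, includes "up" and "acting"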

In the ceph osd dump output, the PG is mapped as a pg_temp:

ceph osd dump | grep -w 16.1e
pg_temp 16.1e [131]
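
For reference, pg_temp and any explicit upmap entries can be inspected in the osdmap, and an upmap entry added with pg-upmap-items can be removed again; pg_temp itself is requested by the OSDs during peering/backfill and is normally cleared by the cluster once the PG recovers, so the commands below are only a sketch of the mechanics, not a proposed fix (the /tmp path is just an example):

ceph osd dump | grep -E 'pg_temp|pg_upmap'    # show pg_temp and upmap entries in the current map
ceph osd getmap -o /tmp/osdmap                # export the osdmap ...
osdmaptool /tmp/osdmap --print | grep 16.1e   # ... and inspect it offline
ceph osd rm-pg-upmap-items 16.1e              # remove an explicit upmap entry created with pg-upmap-items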

What we did (the exact command forms are sketched after this list):
- restarted all OSDs and hosts involved
- forced a deep-scrub on the PG (the PG can no longer be scrubbed)
- if we stop osd.131, the PG becomes inactive and down (as if it were the only OSD containing the object): Reduced data availability: 1 pg inactive, 1 pg down
- if we mark osd.131 out, the PG does not move to a new OSD; the data stays only on osd.131
- ceph force recovery
- ceph force repeer
- ceph pg repair 16.1e
- used ceph-objectstore-tool to search for the unfound object (rbd_data.ad5ab66b8b4567.0000000000011055) on all OSDs involved; the object is present only on osd.41 and osd.131, even though the PG is mapped to other OSDs
- ceph-objectstore-tool op fix-lost
- ceph pg remap: we tried to remap the PG to other OSDs (ceph osd pg-upmap-items 16.1e 131 141), but the PG does not move to the new OSDs; it remains on osd.41 and osd.131 (ceph pg map 16.1e: osdmap e723093 pg 16.1e (16.1e) -> up [41,141] acting [131])
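
For clarity, a minimal sketch of the command forms behind the steps above, assuming pg 16.1e and the OSD IDs from this cluster; the ceph-objectstore-tool commands must be run with the target OSD stopped, and the data path shown is only the conventional default:

ceph pg force-recovery 16.1e
ceph pg repeer 16.1e
ceph pg repair 16.1e
ceph osd pg-upmap-items 16.1e 131 141

# with osd.131 stopped:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-131 --pgid 16.1e --op list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-131 --pgid 16.1e --op fix-lost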

Why is this happening?
How can we help the cluster find the lost object?
Can we remove the pg_temp 16.1e [131] entry from the osdmap (ceph osd dump)?

Thank you,
Martin Culcea


Files

ceph_osd_log.txt (9.32 KB) - Martin Culcea, 07/04/2022 07:45 AM
ceph_pg_query_log.txt (39 KB) - Martin Culcea, 07/04/2022 07:45 AM
ceph_log.txt (14.2 KB) - Martin Culcea, 07/04/2022 07:45 AM
