Bug #53924

Updated by Vikhyat Umrao 11 months ago

<pre>
# ceph -s
  cluster:
    id:     433323be-7878-11ec-b17f-000af7995756
    health: HEALTH_ERR
            Reduced data availability: 1 pg inactive
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 8886/10297821 objects degraded (0.086%), 1 pg degraded, 1 pg undersized

  services:
    mon: 5 daemons, quorum f28-h28-000-r630.rdu2.scalelab.redhat.com,f28-h29-000-r630,f28-h30-000-r630,f22-h21-000-6048r,f22-h25-000-6048r (age 4h)
    mgr: f28-h28-000-r630.rdu2.scalelab.redhat.com.vqxcfs(active, since 4h), standbys: f28-h29-000-r630.gxhqto
    osd: 192 osds: 192 up (since 4h), 192 in (since 4h); 1 remapped pgs
    rgw: 8 daemons active (8 hosts, 1 zones)

  data:
    pools:   7 pools, 931 pgs
    objects: 1.72M objects, 6.2 TiB
    usage:   12 TiB used, 343 TiB / 355 TiB avail
    pgs:     0.107% pgs not active
             8886/10297821 objects degraded (0.086%)
             930 active+clean
             1   recovery_unfound+undersized+degraded+remapped+peered

  progress:
    Global Recovery Event (2h)
      [===========================.] (remaining: 11s)

</pre>

- Health detail

<pre>
# ceph health detail
HEALTH_ERR Reduced data availability: 1 pg inactive; Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 8886/10310745 objects degraded (0.086%), 1 pg degraded, 1 pg undersized
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 13.2eb is stuck inactive for 3h, current state recovery_unfound+undersized+degraded+remapped+peered, last acting [33,103,NONE,123,66,NONE]
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 13.2eb is recovery_unfound+undersized+degraded+remapped+peered, acting [33,103,NONE,123,66,NONE]
[WRN] PG_DEGRADED: Degraded data redundancy: 8886/10310745 objects degraded (0.086%), 1 pg degraded, 1 pg undersized
    pg 13.2eb is stuck undersized for 3h, current state recovery_unfound+undersized+degraded+remapped+peered, last acting [33,103,NONE,123,66,NONE]

</pre>
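For triage, the per-PG lines in the health detail above can be picked apart with a short script. This is only a sketch: the regular expression is written against the exact line shapes shown in this report (it is not an official Ceph parser), and the helper name `parse_pg_line` is made up for illustration. It reports which positions of the acting set have no OSD (`NONE`).

```python
import re

# Assumption: matches only the two line shapes seen in the
# `ceph health detail` output quoted above.
PG_LINE = re.compile(
    r"pg (?P<pgid>\S+) is "
    r"(?:stuck \w+ for \S+, current state )?"
    r"(?P<state>[\w+]+), "
    r"(?:last )?acting \[(?P<acting>[^\]]*)\]"
)


def parse_pg_line(line):
    """Return pgid, state flags, acting set and the NONE positions, or None."""
    m = PG_LINE.search(line)
    if m is None:
        return None
    acting = m.group("acting").split(",")
    return {
        "pgid": m.group("pgid"),
        "state": m.group("state").split("+"),
        "acting": acting,
        # Positions in the acting set that currently have no OSD.
        "missing_positions": [i for i, osd in enumerate(acting) if osd == "NONE"],
    }


if __name__ == "__main__":
    sample = (
        "pg 13.2eb is stuck inactive for 3h, current state "
        "recovery_unfound+undersized+degraded+remapped+peered, "
        "last acting [33,103,NONE,123,66,NONE]"
    )
    print(parse_pg_line(sample))
```

For the PG in this report, the sketch shows that positions 2 and 5 of the six-entry acting set are unfilled.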

<pre>
# ceph version
ceph version 17.0.0-10229-g7e035110 (7e035110784fba02ba81944e444be9a36932c6a3) quincy (dev)

</pre>

- No OSD flapped. This PG appears to have entered the recovery_unfound state while the autoscaler was changing the PG count, though that is not confirmed.

<pre>

2022-01-18T16:54:01.511939+0000 mgr.f28-h28-000-r630.rdu2.scalelab.redhat.com.vqxcfs (mgr.14222) 1808 : cluster [DBG] pgmap v2900: 1762 pgs: 1 activating, 2 peering, 1 clean+premerge+peered, 1758 active+clean; 176 GiB data, 6.0 TiB used, 349 TiB / 355 TiB avail; 327 KiB/s rd, 3.0 GiB/s wr, 1.91k op/s; 186/288795 objects degraded (0.064%); 31/48198 objects unfound (0.064%)

2022-01-18T17:38:30.310339+0000 mgr.f28-h28-000-r630.rdu2.scalelab.redhat.com.vqxcfs (mgr.14222) 3155 : cluster [DBG] pgmap v6499: 963 pgs: 1 recovery_unfound+undersized+degraded+remapped+peered, 10 recovering+undersized+remapped+peered, 8 recovering+undersized+peered, 944 active+clean; 5.6 TiB data, 11 TiB used, 344 TiB / 355 TiB avail; 60 KiB/s rd, 8.6 GiB/s wr, 3.46k op/s; 8886/9263667 objects degraded (0.096%); 159823/9263667 objects misplaced (1.725%); 376 MiB/s, 103 objects/s recovering

</pre>
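To compare the two pgmap entries above, the total PG count and per-state counts can be extracted from each cluster log line. The following is a minimal sketch under the assumption that the lines have exactly the shape quoted in this report; `parse_pgmap` is a hypothetical helper, not part of any Ceph tooling.

```python
import re

# Assumption: cluster log lines look like the two pgmap entries quoted above.
PGMAP = re.compile(
    r"^(?P<ts>\S+) .*pgmap v(?P<version>\d+): "
    r"(?P<pgs>\d+) pgs: (?P<states>[^;]+);"
)


def parse_pgmap(line):
    """Return timestamp, pgmap version, total PGs and a state -> count dict."""
    m = PGMAP.search(line.strip())
    if m is None:
        return None
    states = {}
    for item in m.group("states").split(", "):
        count, name = item.split(" ", 1)
        states[name] = int(count)
    return {
        "ts": m.group("ts"),
        "version": int(m.group("version")),
        "pgs": int(m.group("pgs")),
        "states": states,
    }


if __name__ == "__main__":
    sample = (
        "2022-01-18T16:54:01.511939+0000 mgr.f28-h28-000-r630.rdu2.scalelab."
        "redhat.com.vqxcfs (mgr.14222) 1808 : cluster [DBG] pgmap v2900: "
        "1762 pgs: 1 activating, 2 peering, 1 clean+premerge+peered, "
        "1758 active+clean; 176 GiB data, 6.0 TiB used, 349 TiB / 355 TiB avail"
    )
    info = parse_pgmap(sample)
    print(info["ts"], info["pgs"], info["states"])
```

Applied to the two entries, this gives 1762 PGs at 16:54 (including one `clean+premerge+peered`) and 963 PGs at 17:38. A shrinking PG count together with a `premerge` state would be consistent with PGs being merged during that window, which fits the autoscaler suspicion but does not prove it.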
