Bug #53924

open

EC PG stuck recovery_unfound+undersized+degraded+remapped+peered

Added by Vikhyat Umrao over 2 years ago. Updated about 2 years ago.

Status: Need More Info
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

# ceph -s
  cluster:
    id:     433323be-7878-11ec-b17f-000af7995756
    health: HEALTH_ERR
            Reduced data availability: 1 pg inactive
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 8886/10297821 objects degraded (0.086%), 1 pg degraded, 1 pg undersized

  services:
    mon: 5 daemons, quorum f28-h28-000-r630.rdu2.scalelab.redhat.com,f28-h29-000-r630,f28-h30-000-r630,f22-h21-000-6048r,f22-h25-000-6048r (age 4h)
    mgr: f28-h28-000-r630.rdu2.scalelab.redhat.com.vqxcfs(active, since 4h), standbys: f28-h29-000-r630.gxhqto
    osd: 192 osds: 192 up (since 4h), 192 in (since 4h); 1 remapped pgs
    rgw: 8 daemons active (8 hosts, 1 zones)

  data:
    pools:   7 pools, 931 pgs
    objects: 1.72M objects, 6.2 TiB
    usage:   12 TiB used, 343 TiB / 355 TiB avail
    pgs:     0.107% pgs not active
             8886/10297821 objects degraded (0.086%)
             930 active+clean
             1   recovery_unfound+undersized+degraded+remapped+peered

  progress:
    Global Recovery Event (2h)
      [===========================.] (remaining: 11s)

- Health detail

# ceph health detail
HEALTH_ERR Reduced data availability: 1 pg inactive; Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 8886/10310745 objects degraded (0.086%), 1 pg degraded, 1 pg undersized
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 13.2eb is stuck inactive for 3h, current state recovery_unfound+undersized+degraded+remapped+peered, last acting [33,103,NONE,123,66,NONE]
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 13.2eb is recovery_unfound+undersized+degraded+remapped+peered, acting [33,103,NONE,123,66,NONE]
[WRN] PG_DEGRADED: Degraded data redundancy: 8886/10310745 objects degraded (0.086%), 1 pg degraded, 1 pg undersized
    pg 13.2eb is stuck undersized for 3h, current state recovery_unfound+undersized+degraded+remapped+peered, last acting [33,103,NONE,123,66,NONE]
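
When triaging output like the above, it can help to pull the PG id, state flags, and acting set out of the `ceph health detail` text programmatically. The following is a minimal sketch, not part of the original report: the regular expression and the helper name `parse_pg_lines` are illustrative assumptions, and it only handles PG lines with compound states shaped like the ones pasted here. It treats `NONE` in the acting set as a shard position with no OSD currently assigned.

```python
import re

# Sample lines copied from the `ceph health detail` output in this report.
HEALTH_DETAIL = """\
    pg 13.2eb is stuck inactive for 3h, current state recovery_unfound+undersized+degraded+remapped+peered, last acting [33,103,NONE,123,66,NONE]
    pg 13.2eb is recovery_unfound+undersized+degraded+remapped+peered, acting [33,103,NONE,123,66,NONE]
"""

# Hypothetical pattern: matches "pg <id> is ... <state+state+...>, [last ]acting [...]".
PG_LINE = re.compile(
    r"pg (?P<pgid>\d+\.[0-9a-f]+) is .*?"
    r"(?P<state>[a-z_]+(?:\+[a-z_]+)+), "
    r"(?:last )?acting \[(?P<acting>[^\]]*)\]"
)


def parse_pg_lines(text):
    """Return {pgid: {"states", "acting", "missing_shards"}} for matching lines."""
    results = {}
    for line in text.splitlines():
        m = PG_LINE.search(line)
        if not m:
            continue
        acting = m.group("acting").split(",")
        results[m.group("pgid")] = {
            "states": m.group("state").split("+"),
            "acting": acting,
            # Index positions in the acting set that are reported as NONE.
            "missing_shards": [i for i, osd in enumerate(acting) if osd == "NONE"],
        }
    return results


if __name__ == "__main__":
    for pgid, info in parse_pg_lines(HEALTH_DETAIL).items():
        print(pgid, info["states"], info["acting"], info["missing_shards"])
```

For pg 13.2eb this reports positions 2 and 5 of the six-entry acting set as `NONE`, which matches the undersized flag in the state string.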

# ceph version
ceph version 17.0.0-10229-g7e035110 (7e035110784fba02ba81944e444be9a36932c6a3) quincy (dev)

- No OSD flapped. The PG went into the recovery_unfound state, possibly while the autoscaler was changing the PG count?


2022-01-18T16:54:01.511939+0000 mgr.f28-h28-000-r630.rdu2.scalelab.redhat.com.vqxcfs (mgr.14222) 1808 : cluster [DBG] pgmap v2900: 1762 pgs: 1 activating, 2 peering, 1 clean+premerge+peered, 1758 active+clean; 176 GiB data, 6.0 TiB used, 349 TiB / 355 TiB avail; 327 KiB/s rd, 3.0 GiB/s wr, 1.91k op/s; 186/288795 objects degraded (0.064%); 31/48198 objects unfound (0.064%)

2022-01-18T17:38:30.310339+0000 mgr.f28-h28-000-r630.rdu2.scalelab.redhat.com.vqxcfs (mgr.14222) 3155 : cluster [DBG] pgmap v6499: 963 pgs: 1 recovery_unfound+undersized+degraded+remapped+peered, 10 recovering+undersized+remapped+peered, 8 recovering+undersized+peered, 944 active+clean; 5.6 TiB data, 11 TiB used, 344 TiB / 355 TiB avail; 60 KiB/s rd, 8.6 GiB/s wr, 3.46k op/s; 8886/9263667 objects degraded (0.096%); 159823/9263667 objects misplaced (1.725%); 376 MiB/s, 103 objects/s recovering
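
To compare the two pgmap entries above, the degraded, misplaced, and unfound counters can be extracted from the log text. This is a rough sketch under the assumption that the counters always appear as `<n>/<total> objects <kind> (<pct>%)`, as they do in these two lines; the function name `object_counters` is hypothetical and the sample strings are trimmed copies of the log lines in this report.

```python
import re

# Counter fragments copied from the two pgmap log lines in this report.
PGMAP_V2900 = "186/288795 objects degraded (0.064%); 31/48198 objects unfound (0.064%)"
PGMAP_V6499 = (
    "8886/9263667 objects degraded (0.096%); "
    "159823/9263667 objects misplaced (1.725%); 376 MiB/s, 103 objects/s recovering"
)

# Assumed counter shape: "<count>/<total> objects <kind> (<percent>%)".
COUNTER = re.compile(r"(\d+)/(\d+) objects (degraded|misplaced|unfound) \([\d.]+%\)")


def object_counters(line):
    """Return {kind: (count, total)} for every counter found in a pgmap line."""
    return {kind: (int(n), int(total)) for n, total, kind in COUNTER.findall(line)}


if __name__ == "__main__":
    for label, line in (("v2900", PGMAP_V2900), ("v6499", PGMAP_V6499)):
        print(label, object_counters(line))
```

Run against these samples, the earlier entry (v2900) yields an unfound counter while the later one (v6499) yields only degraded and misplaced counters.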


Files

13.2eb.query.txt (68.1 KB), Vikhyat Umrao, 01/18/2022 09:14 PM
ceph-osd.33.unfound.log (227 KB), Vikhyat Umrao, 01/18/2022 09:20 PM
7.dc4.query.txt (68.5 KB), Vikhyat Umrao, 02/07/2022 08:27 PM
7.dc4.osds.logs.txt (108 KB), Vikhyat Umrao, 02/07/2022 11:33 PM
1711.png (311 KB), jianwei zhang, 03/09/2022 05:42 AM
1715.png (311 KB), jianwei zhang, 03/09/2022 05:42 AM