Bug #58460

open

LRC cluster: cluster is in HEALTH_ERR

Added by Prashant D over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

- cluster status:

 cluster:
    id:     28f7427e-5558-4ffd-ae1a-51ec3042759a
    health: HEALTH_ERR
            full ratio(s) out of order
            Low space hindering backfill (add storage if this doesn't resolve itself): 21 pgs backfill_toofull
            Degraded data redundancy: 137452/498931805 objects degraded (0.028%), 21 pgs degraded, 21 pgs undersized

  services:
    mon: 5 daemons, quorum reesi003,reesi002,reesi001,ivan02,ivan01 (age 9d)
    mgr: reesi006.erytot(active, since 7d), standbys: reesi005.xxyjcw, reesi004.tplfrt
    mds: 4/4 daemons up, 5 standby, 1 hot standby
    osd: 166 osds: 166 up (since 41h), 165 in (since 15h); 67 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 4/4 healthy
    pools:   24 pools, 2965 pgs
    objects: 104.01M objects, 118 TiB
    usage:   207 TiB used, 850 TiB / 1.0 PiB avail
    pgs:     137452/498931805 objects degraded (0.028%)
             2768577/498931805 objects misplaced (0.555%)
             2898 active+clean
             46   active+remapped+backfilling
             21   active+undersized+degraded+remapped+backfill_toofull

  io:
    client:   4.6 KiB/s rd, 84 B/s wr, 5 op/s rd, 0 op/s wr
    recovery: 0 B/s, 4 objects/s

  progress:
    Global Recovery Event (22h)
      [===========================.] (remaining: 31m)

- ceph health detail

HEALTH_ERR full ratio(s) out of order; Low space hindering backfill (add storage if this doesn't resolve itself): 21 pgs backfill_toofull; Degraded data redundancy: 137452/498931805 objects degraded (0.028%), 21 pgs degraded, 21 pgs undersized
[ERR] OSD_OUT_OF_ORDER_FULL: full ratio(s) out of order
    osd_failsafe_full_ratio (0.97) < full_ratio (0.99), increased
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 21 pgs backfill_toofull
    pg 124.4 is active+undersized+degraded+remapped+backfill_toofull, acting [41,52]
    pg 124.9 is active+undersized+degraded+remapped+backfill_toofull, acting [41,52]
...
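
The OSD_OUT_OF_ORDER_FULL error fires when the ratios no longer follow the expected ordering nearfull < backfillfull < full < osd_failsafe_full; here full_ratio had been raised to 0.99, above the 0.97 failsafe. A quick way to inspect the current values (a sketch; output will differ per cluster):

ceph osd dump | grep ratio                     # osdmap ratios: full_ratio, backfillfull_ratio, nearfull_ratio
ceph config get osd osd_failsafe_full_ratio    # per-OSD failsafe ratio (0.97 by default)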

Actions #1

Updated by Prashant D over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Prashant D over 1 year ago

osd.137 utilization was over 97%:

$ cat ceph_osd_df_tree.2023-01-04_10-20-28 | awk -F' ' '$17>50'
ID   CLASS  WEIGHT      REWEIGHT  SIZE     RAW USE   DATA      OMAP     META     AVAIL     %USE   VAR   PGS  STATUS  TYPE NAME        
137    hdd     3.66899   1.00000  3.6 TiB   3.6 TiB   3.6 TiB    7 KiB  6.6 GiB    75 GiB  97.98  5.00  151      up          osd.137  
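
The same check can be run against the live cluster instead of a saved file; a sketch (the $17 column matches the ceph osd df tree text layout above and may shift with different size units), plus listing the PGs held by the full OSD:

ceph osd df tree | awk '$17 > 90'    # OSDs above 90% utilization
ceph pg ls-by-osd 137                # PGs mapped to osd.137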

Actions #3

Updated by Prashant D over 1 year ago

Adam and I discussed this issue over g-chat last week. The LRC cluster is now in a healthy state.

Documenting the steps followed to return the LRC cluster to a healthy state; verification commands follow the steps below:

- Reset the ratios to their default values
ceph osd set-backfillfull-ratio 0.9
ceph osd set-full-ratio 0.95
ceph osd set-nearfull-ratio 0.85

- Reweight osd.137 to offload PGs to other OSDs
ceph osd reweight-by-utilization 102 0.05 1 --no-increasing
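
To verify the recovery (a sketch, assuming the default values set above), the osdmap ratios and cluster health can be re-checked; ceph also offers a dry-run variant of the reweight command for previewing changes before applying them:

ceph osd dump | grep ratio        # expect full_ratio 0.95, backfillfull_ratio 0.9, nearfull_ratio 0.85
ceph health detail                # OSD_OUT_OF_ORDER_FULL and PG_BACKFILL_FULL should clear as backfill completes
ceph osd test-reweight-by-utilization 102 0.05 1 --no-increasing    # dry run (args: utilization threshold %, max weight change, max OSDs to adjust)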
