Bug #56995
PGs go inactive after failed OSD comes up and is marked as in
Description
I observe a problem with peering after an OSD goes down and comes back up again. A varying number of PGs end up inactive.
Expected behaviour is that all PGs become active.
This is observed on an Octopus test cluster [ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)] with CephFS as the only application. Here is a full session showing the problem and how to reproduce it:
# On a ceph fs client run a benchmark that produces high IOPs load. Then:

[root@tceph-01 ~]# ceph osd set noup
noup is set
[root@tceph-01 ~]# ceph osd down 6
marked down osd.6.
[root@tceph-01 ~]# ceph osd out 6
marked out osd.6.
[root@tceph-01 ~]# ceph osd unset noup
noup is unset
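For repeated runs it is convenient to script these steps. A rough sketch of the part of the session above (osd.6 is just the OSD used here; the client benchmark is assumed to be running already):

#!/bin/bash
# Sketch of the reproduction steps shown above (osd.6 as in this session).
# A high-IOPs client benchmark against the file system is assumed to be running.
OSD=6

ceph osd set noup        # keep the OSD from coming straight back up
ceph osd down "$OSD"
ceph osd out "$OSD"
ceph osd unset noup      # now the OSD boots and rejoins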
At this point in time the OSD comes up again and all PGs become active.
[root@tceph-01 ~]# ceph osd in 6
marked in osd.6.
At this point things go wrong. PGs peer, but a random number of PGs remains stuck inactive (random in the sense that repeating this procedure leaves a different number of PGs inactive each time). A "ceph pg repeer" does not help. Forcing recovery at least schedules the inactive PGs for recovery sooner, but the "slow OPS" problems remain until recovery completes. Here is the last part of the session:
[root@tceph-01 ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Degraded data redundancy: 63459/4563189 objects degraded (1.391%), 138 pgs degraded

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 5d)
    mgr: tceph-01(active, since 6d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 65s), 9 in (since 43s); 16 remapped pgs

  data:
    pools:   4 pools, 321 pgs
    objects: 1.07M objects, 49 GiB
    usage:   232 GiB used, 2.2 TiB / 2.4 TiB avail
    pgs:     4.050% pgs not active
             63459/4563189 objects degraded (1.391%)
             10/4563189 objects misplaced (0.000%)
             183 active+clean
             118 active+recovery_wait+degraded
             13  recovery_wait+undersized+degraded+remapped+peered
             4   active+recovering+degraded
             2   active+recovery_wait+undersized+degraded+remapped
             1   active+undersized+degraded+remapped+backfill_wait

  io:
    client:   0 B/s rd, 19 MiB/s wr, 11 op/s rd, 619 op/s wr
    recovery: 2.0 MiB/s, 529 keys/s, 65 objects/s

[root@tceph-01 ~]# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 13 pgs inactive; Degraded data redundancy: 62454/4594533 objects degraded (1.359%), 130 pgs degraded, 15 pgs undersized
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.tceph-03(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 65 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive
    pg 4.1 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,1,5,0,2147483647,2]
    pg 4.2 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,5,2147483647,0,3,2]
    pg 4.14 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,4,2147483647,2,5,1]
    pg 4.1c is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,4,2,3,0]
    pg 4.33 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,4,2147483647,1,2,0]
    pg 4.3e is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,2147483647,2147483647,1,0,3]
    pg 4.4c is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,2147483647,1,2147483647,2,3]
    pg 4.4d is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,1,3,0]
    pg 4.53 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,4,2147483647,2147483647,1,5]
    pg 4.54 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,3,4,2147483647,5]
    pg 4.66 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2,5,0,4,2147483647]
    pg 4.68 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,0,4,3]
    pg 4.70 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,2147483647,2,4,3]
[WRN] PG_DEGRADED: Degraded data redundancy: 62454/4594533 objects degraded (1.359%), 130 pgs degraded, 15 pgs undersized
    pg 3.50 is active+recovery_wait+degraded, acting [6,5,8]
    pg 3.51 is active+recovery_wait+degraded, acting [6,4,1]
    pg 3.53 is active+recovery_wait+degraded, acting [5,6,4]
    pg 3.55 is active+recovery_wait+degraded, acting [8,1,6]
    pg 3.59 is active+recovery_wait+degraded, acting [5,6,8]
    pg 3.5b is active+recovery_wait+degraded, acting [7,6,8]
    pg 3.5e is active+recovery_wait+degraded, acting [5,6,4]
    pg 3.5f is active+recovery_wait+degraded, acting [2,6,1]
    pg 3.64 is active+recovery_wait+degraded, acting [4,6,1]
    pg 3.66 is active+recovery_wait+degraded, acting [5,6,8]
    pg 3.68 is active+recovery_wait+degraded, acting [4,7,6]
    pg 3.6b is active+recovery_wait+degraded, acting [7,8,6]
    pg 3.70 is active+recovery_wait+degraded, acting [5,4,6]
    pg 3.71 is active+recovery_wait+degraded, acting [5,4,6]
    pg 3.72 is active+recovery_wait+degraded, acting [8,7,6]
    pg 3.7c is active+recovery_wait+degraded, acting [6,5,2]
    pg 3.7e is active+recovery_wait+degraded, acting [6,2,7]
    pg 4.51 is active+recovery_wait+degraded, acting [7,6,1,5,4,2]
    pg 4.53 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,4,2147483647,2147483647,1,5]
    pg 4.54 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,3,4,2147483647,5]
    pg 4.55 is active+recovery_wait+degraded, acting [3,8,5,4,7,6]
    pg 4.56 is active+recovery_wait+degraded, acting [4,8,1,0,6,3]
    pg 4.57 is active+recovery_wait+degraded, acting [3,7,6,5,8,0]
    pg 4.59 is active+recovery_wait+degraded, acting [4,6,2,1,3,7]
    pg 4.5a is active+recovery_wait+degraded, acting [5,7,4,3,0,6]
    pg 4.5b is active+recovery_wait+degraded, acting [3,7,6,5,1,2]
    pg 4.5c is active+recovery_wait+degraded, acting [0,5,4,7,6,1]
    pg 4.5d is active+recovery_wait+degraded, acting [1,8,5,6,2,4]
    pg 4.5f is active+recovery_wait+degraded, acting [2,6,1,7,5,3]
    pg 4.63 is active+recovery_wait+degraded, acting [7,5,6,0,1,2]
    pg 4.66 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2,5,0,4,2147483647]
    pg 4.67 is active+recovery_wait+degraded, acting [8,0,7,6,3,4]
    pg 4.68 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,0,4,3]
    pg 4.69 is active+recovery_wait+degraded, acting [3,1,6,4,8,2]
    pg 4.6a is active+recovery_wait+degraded, acting [4,0,2,1,3,6]
    pg 4.6b is active+recovery_wait+degraded, acting [2,4,7,6,0,3]
    pg 4.6c is active+recovery_wait+degraded, acting [0,1,5,6,4,3]
    pg 4.6d is active+recovery_wait+degraded, acting [0,1,6,8,2,4]
    pg 4.6f is active+recovery_wait+degraded, acting [2,6,1,7,8,0]
    pg 4.70 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,2147483647,2,4,3]
    pg 4.71 is active+recovery_wait+degraded, acting [5,6,2,8,4,7]
    pg 4.72 is active+recovery_wait+degraded, acting [5,0,1,6,2,7]
    pg 4.74 is active+recovery_wait+degraded, acting [4,2,6,1,8,7]
    pg 4.76 is active+recovery_wait+degraded, acting [0,4,6,7,1,8]
    pg 4.77 is active+recovery_wait+degraded, acting [3,5,1,4,7,6]
    pg 4.78 is active+recovery_wait+degraded, acting [7,1,3,4,6,0]
    pg 4.7a is active+recovery_wait+degraded, acting [0,1,6,8,2,4]
    pg 4.7c is active+recovery_wait+degraded, acting [4,8,1,7,3,6]
    pg 4.7d is active+recovery_wait+degraded, acting [5,4,1,0,6,8]
    pg 4.7e is active+recovery_wait+degraded, acting [1,8,3,6,7,2]
    pg 4.7f is active+recovery_wait+degraded, acting [7,0,6,3,2,4]

[root@tceph-01 ~]# ceph pg force-recovery 4.1 4.2 4.14 4.1c 4.33
instructing pg(s) [4.1s1] on osd.1 to force-recovery;
instructing pg(s) [4.14s1,4.1cs2,4.33s1] on osd.4 to force-recovery;
instructing pg(s) [4.2s1] on osd.5 to force-recovery;

[root@tceph-01 ~]# ceph pg force-recovery 4.3e 4.4c 4.4d 4.53 4.54 4.66 4.68 4.70
instructing pg(s) [4.54s1,4.70s1] on osd.0 to force-recovery;
instructing pg(s) [4.66s1] on osd.2 to force-recovery;
instructing pg(s) [4.4ds2,4.68s2] on osd.5 to force-recovery;
instructing pg(s) [4.3es0,4.4cs0,4.53s0] on osd.6 to force-recovery;

[root@tceph-01 ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 MDSs behind on trimming
            Reduced data availability: 11 pgs inactive
            Degraded data redundancy: 58914/4613532 objects degraded (1.277%), 123 pgs degraded, 12 pgs undersized

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 6d)
    mgr: tceph-01(active, since 6d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 2m), 9 in (since 2m); 11 remapped pgs

  data:
    pools:   4 pools, 321 pgs
    objects: 1.08M objects, 50 GiB
    usage:   235 GiB used, 2.2 TiB / 2.4 TiB avail
    pgs:     3.427% pgs not active
             58914/4613532 objects degraded (1.277%)
             197 active+clean
             112 active+recovery_wait+degraded
             6   recovery_wait+forced_recovery+undersized+degraded+remapped+peered
             4   recovering+forced_recovery+undersized+degraded+remapped+peered
             1   active+undersized+degraded+remapped+backfill_wait
             1   undersized+remapped+peered

  io:
    client:   1.2 KiB/s rd, 7.7 MiB/s wr, 66 op/s rd, 1.26k op/s wr
    recovery: 17 MiB/s, 46 objects/s
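Collecting the stuck PG ids by hand is tedious; something like the following would force recovery for all of them at once (a sketch, assuming the plain-text output of "ceph pg dump_stuck inactive" where the PG id is the first column):

# Force recovery for every PG that is currently stuck inactive.
# Assumes the first column of 'ceph pg dump_stuck inactive' is the PG id.
PGS=$(ceph pg dump_stuck inactive 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $1}')
[ -n "$PGS" ] && ceph pg force-recovery $PGS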
After all inactive PGs have recovered, operation returns to normal and the "slow OPS" warnings disappear. What is striking is that the inactive PGs show 2 OSDs as missing instead of just 1 (and therefore don't accept writes). It is also worrying that these more heavily degraded PGs are not scheduled for recovery with high priority right away; I have to force that by hand.
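To dig into why two shards are reported missing, querying one of the stuck PGs shows its up/acting sets and recovery state; a minimal example (pg 4.1 is taken from the listing above, and jq is optional, only used to trim the JSON output):

# Inspect the peering/recovery details of one of the stuck PGs (pg 4.1 from above).
ceph pg 4.1 query | jq '{state, up, acting, recovery_state}'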
History
#1 Updated by Frank Schilder over 1 year ago
Some additional info: the test cluster has 3 nodes with 3 OSDs, 1 MON and 1 MGR each. Network: 1G public and 1G replication network. Plenty of RAM and CPU.
[root@rit-tceph bench]# ceph osd crush tree
ID  CLASS  WEIGHT   TYPE NAME
-1         2.44707  root default
-7         0.81569      host tceph-01
 0    hdd  0.27190          osd.0
 3    hdd  0.27190          osd.3
 6    hdd  0.27190          osd.6
-3         0.81569      host tceph-02
 2    hdd  0.27190          osd.2
 4    hdd  0.27190          osd.4
 8    hdd  0.27190          osd.8
-5         0.81569      host tceph-03
 1    hdd  0.27190          osd.1
 5    hdd  0.27190          osd.5
 7    hdd  0.27190          osd.7
[root@rit-tceph bench]# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 555 flags hashpspool,nodelete stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'fs-meta1' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 592 lfor 0/458/589 flags hashpspool,nodelete max_bytes 53687091200 stripe_width 0 expected_num_objects 3 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'fs-meta2' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 613 lfor 0/553/610 flags hashpspool,nodelete max_bytes 10737418240 stripe_width 0 expected_num_objects 3 application cephfs
pool 4 'fs-data' erasure profile ec-4-2 size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 619 lfor 0/586/616 flags hashpspool,ec_overwrites,nodelete max_bytes 2199023255552 stripe_width 16384 fast_read 1 compression_mode aggressive application cephfs
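Pool 4 (fs-data) is the 4+2 EC data pool with size 6 / min_size 5, which is why a PG missing two shards drops below min_size and cannot serve writes. These values can be double-checked with:

# Confirm the EC data pool's size/min_size.
ceph osd pool get fs-data size
ceph osd pool get fs-data min_size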
[root@rit-tceph bench]# ceph osd crush rule dump fs-data
{
    "rule_id": 1,
    "rule_name": "fs-data",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "choose_indep",
            "num": 0,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}
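The erasure-code profile itself is not included above. Judging from the pool (size 6, min_size 5) and the rule (failure domain osd), a profile along these lines would reproduce the layout; the plugin and technique values are assumptions, not taken from the cluster:

# Hypothetical reconstruction of the 'ec-4-2' profile behind pool 'fs-data';
# the actual profile is not shown in this report, so plugin/technique are guesses.
ceph osd erasure-code-profile set ec-4-2 \
    k=4 m=2 \
    plugin=jerasure technique=reed_sol_van \
    crush-failure-domain=osd
ceph osd erasure-code-profile get ec-4-2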
#2 Updated by Frank Schilder over 1 year ago
I cannot reproduce this with mimic-13.2.10.
#3 Updated by Frank Schilder about 1 year ago
The problem is still present in octopus 15.2.17. Almost certainly all newer versions are affected.