Bug #56995 (open)

PGs go inactive after failed OSD comes up and is marked as in

Added by Frank Schilder almost 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I observe a problem with peering after an OSD goes down and comes back up again. A varying number of PGs end up inactive.

Expected behaviour is that all PGs become active.

This is observed on an octopus test cluster [ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)] with ceph fs as the only application. Here is a full session showing the problem and how to reproduce it:

# On a ceph fs client run a benchmark that produces high IOPs load. Then:

[root@tceph-01 ~]# ceph osd set noup
noup is set
[root@tceph-01 ~]# ceph osd down 6
marked down osd.6. 
[root@tceph-01 ~]# ceph osd out 6
marked out osd.6. 
[root@tceph-01 ~]# ceph osd unset noup
noup is unset

At this point in time the OSD comes up again and all PGs become active.

[root@tceph-01 ~]# ceph osd in 6
marked in osd.6. 
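
For reference, the whole reproduction as a single script (a minimal sketch only; osd.6 stands in for any OSD, the sleep interval is an arbitrary pause to let the OSD rejoin, and the high-IOPS ceph fs benchmark is assumed to already be running on a client):

#!/bin/bash
# Reproduction sketch: take an OSD down/out while "noup" is set, let it
# come back up, then mark it in again while client load is high.
OSD=6                        # any OSD will do
ceph osd set noup
ceph osd down "$OSD"
ceph osd out "$OSD"
ceph osd unset noup          # OSD comes back up; all PGs become active
sleep 30                     # arbitrary pause to let peering settle
ceph osd in "$OSD"           # after this, some PGs stay stuck inactive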

After marking the OSD in, things go wrong. PGs peer, but a random number of PGs remains stuck inactive (random: repeating this procedure leaves a different number of PGs inactive each time). A "ceph pg repeer" does not help. Forcing recovery at least schedules the inactive PGs for recovery faster, but the "slow OPS" problems remain. Here is the last part of the session:

[root@tceph-01 ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Degraded data redundancy: 63459/4563189 objects degraded (1.391%), 138 pgs degraded

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 5d)
    mgr: tceph-01(active, since 6d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 65s), 9 in (since 43s); 16 remapped pgs

  data:
    pools:   4 pools, 321 pgs
    objects: 1.07M objects, 49 GiB
    usage:   232 GiB used, 2.2 TiB / 2.4 TiB avail
    pgs:     4.050% pgs not active
             63459/4563189 objects degraded (1.391%)
             10/4563189 objects misplaced (0.000%)
             183 active+clean
             118 active+recovery_wait+degraded
             13  recovery_wait+undersized+degraded+remapped+peered
             4   active+recovering+degraded
             2   active+recovery_wait+undersized+degraded+remapped
             1   active+undersized+degraded+remapped+backfill_wait

  io:
    client:   0 B/s rd, 19 MiB/s wr, 11 op/s rd, 619 op/s wr
    recovery: 2.0 MiB/s, 529 keys/s, 65 objects/s

[root@tceph-01 ~]# ceph health detail             
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 13 pgs inactive; Degraded data redundancy: 62454/4594533 objects degraded (1.359%), 130 pgs degraded, 15 pgs undersized
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.tceph-03(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 65 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive
    pg 4.1 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,1,5,0,2147483647,2]
    pg 4.2 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,5,2147483647,0,3,2]
    pg 4.14 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,4,2147483647,2,5,1]
    pg 4.1c is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,4,2,3,0]
    pg 4.33 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,4,2147483647,1,2,0]
    pg 4.3e is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,2147483647,2147483647,1,0,3]
    pg 4.4c is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,2147483647,1,2147483647,2,3]
    pg 4.4d is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,1,3,0]
    pg 4.53 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,4,2147483647,2147483647,1,5]
    pg 4.54 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,3,4,2147483647,5]
    pg 4.66 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2,5,0,4,2147483647]
    pg 4.68 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,0,4,3]
    pg 4.70 is stuck inactive for 67s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,2147483647,2,4,3]
[WRN] PG_DEGRADED: Degraded data redundancy: 62454/4594533 objects degraded (1.359%), 130 pgs degraded, 15 pgs undersized
    pg 3.50 is active+recovery_wait+degraded, acting [6,5,8]
    pg 3.51 is active+recovery_wait+degraded, acting [6,4,1]
    pg 3.53 is active+recovery_wait+degraded, acting [5,6,4]
    pg 3.55 is active+recovery_wait+degraded, acting [8,1,6]
    pg 3.59 is active+recovery_wait+degraded, acting [5,6,8]
    pg 3.5b is active+recovery_wait+degraded, acting [7,6,8]
    pg 3.5e is active+recovery_wait+degraded, acting [5,6,4]
    pg 3.5f is active+recovery_wait+degraded, acting [2,6,1]
    pg 3.64 is active+recovery_wait+degraded, acting [4,6,1]
    pg 3.66 is active+recovery_wait+degraded, acting [5,6,8]
    pg 3.68 is active+recovery_wait+degraded, acting [4,7,6]
    pg 3.6b is active+recovery_wait+degraded, acting [7,8,6]
    pg 3.70 is active+recovery_wait+degraded, acting [5,4,6]
    pg 3.71 is active+recovery_wait+degraded, acting [5,4,6]
    pg 3.72 is active+recovery_wait+degraded, acting [8,7,6]
    pg 3.7c is active+recovery_wait+degraded, acting [6,5,2]
    pg 3.7e is active+recovery_wait+degraded, acting [6,2,7]
    pg 4.51 is active+recovery_wait+degraded, acting [7,6,1,5,4,2]
    pg 4.53 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [6,4,2147483647,2147483647,1,5]
    pg 4.54 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,3,4,2147483647,5]
    pg 4.55 is active+recovery_wait+degraded, acting [3,8,5,4,7,6]
    pg 4.56 is active+recovery_wait+degraded, acting [4,8,1,0,6,3]
    pg 4.57 is active+recovery_wait+degraded, acting [3,7,6,5,8,0]
    pg 4.59 is active+recovery_wait+degraded, acting [4,6,2,1,3,7]
    pg 4.5a is active+recovery_wait+degraded, acting [5,7,4,3,0,6]
    pg 4.5b is active+recovery_wait+degraded, acting [3,7,6,5,1,2]
    pg 4.5c is active+recovery_wait+degraded, acting [0,5,4,7,6,1]
    pg 4.5d is active+recovery_wait+degraded, acting [1,8,5,6,2,4]
    pg 4.5f is active+recovery_wait+degraded, acting [2,6,1,7,5,3]
    pg 4.63 is active+recovery_wait+degraded, acting [7,5,6,0,1,2]
    pg 4.66 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2,5,0,4,2147483647]
    pg 4.67 is active+recovery_wait+degraded, acting [8,0,7,6,3,4]
    pg 4.68 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,2147483647,5,0,4,3]
    pg 4.69 is active+recovery_wait+degraded, acting [3,1,6,4,8,2]
    pg 4.6a is active+recovery_wait+degraded, acting [4,0,2,1,3,6]
    pg 4.6b is active+recovery_wait+degraded, acting [2,4,7,6,0,3]
    pg 4.6c is active+recovery_wait+degraded, acting [0,1,5,6,4,3]
    pg 4.6d is active+recovery_wait+degraded, acting [0,1,6,8,2,4]
    pg 4.6f is active+recovery_wait+degraded, acting [2,6,1,7,8,0]
    pg 4.70 is stuck undersized for 64s, current state recovery_wait+undersized+degraded+remapped+peered, last acting [2147483647,0,2147483647,2,4,3]
    pg 4.71 is active+recovery_wait+degraded, acting [5,6,2,8,4,7]
    pg 4.72 is active+recovery_wait+degraded, acting [5,0,1,6,2,7]
    pg 4.74 is active+recovery_wait+degraded, acting [4,2,6,1,8,7]
    pg 4.76 is active+recovery_wait+degraded, acting [0,4,6,7,1,8]
    pg 4.77 is active+recovery_wait+degraded, acting [3,5,1,4,7,6]
    pg 4.78 is active+recovery_wait+degraded, acting [7,1,3,4,6,0]
    pg 4.7a is active+recovery_wait+degraded, acting [0,1,6,8,2,4]
    pg 4.7c is active+recovery_wait+degraded, acting [4,8,1,7,3,6]
    pg 4.7d is active+recovery_wait+degraded, acting [5,4,1,0,6,8]
    pg 4.7e is active+recovery_wait+degraded, acting [1,8,3,6,7,2]
    pg 4.7f is active+recovery_wait+degraded, acting [7,0,6,3,2,4]

[root@tceph-01 ~]# ceph pg force-recovery 4.1 4.2 4.14 4.1c 4.33
instructing pg(s) [4.1s1] on osd.1 to force-recovery; instructing pg(s) [4.14s1,4.1cs2,4.33s1] on osd.4 to force-recovery;
  instructing pg(s) [4.2s1] on osd.5 to force-recovery; 

[root@tceph-01 ~]# ceph pg force-recovery 4.3e 4.4c 4.4d 4.53 4.54 4.66 4.68 4.70
instructing pg(s) [4.54s1,4.70s1] on osd.0 to force-recovery; instructing pg(s) [4.66s1] on osd.2 to force-recovery;
  instructing pg(s) [4.4ds2,4.68s2] on osd.5 to force-recovery; instructing pg(s) [4.3es0,4.4cs0,4.53s0] on osd.6 to force-recovery; 

[root@tceph-01 ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 MDSs behind on trimming
            Reduced data availability: 11 pgs inactive
            Degraded data redundancy: 58914/4613532 objects degraded (1.277%), 123 pgs degraded, 12 pgs undersized

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 6d)
    mgr: tceph-01(active, since 6d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 2m), 9 in (since 2m); 11 remapped pgs

  data:
    pools:   4 pools, 321 pgs
    objects: 1.08M objects, 50 GiB
    usage:   235 GiB used, 2.2 TiB / 2.4 TiB avail
    pgs:     3.427% pgs not active
             58914/4613532 objects degraded (1.277%)
             197 active+clean
             112 active+recovery_wait+degraded
             6   recovery_wait+forced_recovery+undersized+degraded+remapped+peered
             4   recovering+forced_recovery+undersized+degraded+remapped+peered
             1   active+undersized+degraded+remapped+backfill_wait
             1   undersized+remapped+peered

  io:
    client:   1.2 KiB/s rd, 7.7 MiB/s wr, 66 op/s rd, 1.26k op/s wr
    recovery: 17 MiB/s, 46 objects/s

After all inactive PGs are recovered, operation returns to normal and the "slow OPS" warnings disappear. What is striking is that the inactive PGs show 2 OSDs as missing instead of just 1 (the entries 2147483647 in the acting sets above stand for shards with no OSD assigned) and therefore do not accept writes. It is also worrying that these more heavily degraded PGs are not scheduled for recovery with high priority right away; I have to do that by hand.
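
As a stop-gap, the manual force-recovery step can be scripted. This is only a sketch, assuming that each data row in the plain-text output of "ceph pg dump_stuck inactive" starts with the PG id (e.g. 4.1c); adjust the filter if the output format differs on your version:

#!/bin/bash
# Sketch: force-recovery every PG that is currently stuck inactive,
# instead of listing the PG ids by hand as in the session above.
pgs=$(ceph pg dump_stuck inactive 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}')
if [ -n "$pgs" ]; then
    ceph pg force-recovery $pgs
fi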
