Support #64378
Slow / Single backfilling on Reef (18.2.1-pve2)
Status: New
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:
Description
Hi,
Despite running:
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
I reweighted several OSDs, so I expected backfills to run on at least 2 OSDs in parallel.
I'm still stuck with at most one PG backfilling at a time.
Why?
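For reference, I assume the values actually in effect can be checked per OSD with something like the following (osd.0 is only an example):
ceph tell osd.0 config get osd_max_backfills
ceph config show osd.0 osd_max_backfills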
Setup is 3 nodes with 3 fast SSDs (WD_Black SN850X) on a full-mesh 1 Gb/s full-duplex network (with FRR). Recovery is typically between 5 and 20 MB/s, probably because it is limited to a single PG backfill at a time.
How can I parallelize backfilling on Reef?
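Side question, in case it is relevant: could the mClock scheduler (the Reef default) be ignoring the injected backfill/recovery values? If that is the cause, I assume something along these lines would be needed, though I have not tried it yet:
# Check which op scheduler the OSDs are running (Reef defaults to mclock_scheduler)
ceph config show osd.0 osd_op_queue
# Allow osd_max_backfills / osd_recovery_max_active to override the mClock-enforced limits
ceph config set osd osd_mclock_override_recovery_settings true
# Alternatively, switch the mClock profile to favour recovery over client I/O
ceph config set osd osd_mclock_profile high_recovery_ops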
root@pve1:~# ceph status
  cluster:
    id:     e7628d51-32b5-4f5c-8eec-1cafb41ead74
    health: HEALTH_WARN
            Degraded data redundancy: 4510616/37577132 objects degraded (12.004%), 39 pgs degraded, 42 pgs undersized
            101 pgs not deep-scrubbed in time
            77 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum pve3,pve2,pve1 (age 16h)
    mgr: pve1(active, since 19h), standbys: pve3, pve2
    mds: 1/1 daemons up, 2 standby
    osd: 5 osds: 4 up (since 3h), 3 in (since 3h); 64 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 179 pgs
    objects: 12.55M objects, 1.2 TiB
    usage:   3.1 TiB used, 4.2 TiB / 7.3 TiB avail
    pgs:     4510616/37577132 objects degraded (12.004%)
             6561953/37577132 objects misplaced (17.463%)
             115 active+clean
             38  active+undersized+degraded+remapped+backfill_wait
             21  active+remapped+backfill_wait
             3   active+undersized+remapped+backfill_wait
             1   active+clean+remapped
             1   active+undersized+degraded+remapped+backfilling

  io:
    client:   6.6 KiB/s rd, 2.6 MiB/s wr, 10 op/s rd, 241 op/s wr
    recovery: 14 MiB/s, 3 objects/s