Bug #56530

Quincy: High CPU and slow progress during backfill

Added by Chris Palmer 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: High
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: RADOS backport_processed
Backport: quincy
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm seeing a similar problem on a small cluster just upgraded from Pacific 16.2.9 to Quincy 17.2.1 (non-cephadm). The cluster was only very lightly loaded during and after the upgrade. The affected OSDs are all BlueStore HDDs sharing an NVMe DB/WAL, and all were created on Pacific.

The upgrade was uneventful except that some rebalancing started after the OSDs on the first node were restarted. I don't know why that happened, so I set norebalance for the remainder of the upgrade.

When the entire upgrade was complete I unset norebalance. A number of things then became apparent:

  • Startup appeared normal, with health returning to OK in the expected
    time. Apart from the points below, things seemed normal.
  • A few OSDs are consuming around 95% CPU, almost all of which is User
    mode. System, iowait and interrupt are all minimal.
  • Rebalancing is occurring, but only at about 1 object/second.
  • Disk IO rates are minimal.
  • We originally had osd_compact_on_start set to true, but then
    disabled it and restarted the CPU-bound OSDs. They all restarted
    OK, and then resumed their high CPU load. Rebooting an OSD node
    didn't change anything either.
  • There are currently 3 PGs that are remapping. The
    Acting Primary OSDs of each of those PGs (5, 7 & 8) are the ones consuming the CPU.
  • Setting nobackfill causes the CPU consumption to drop to normal idle
    levels; unsetting it returns it to the high levels (see the command
    sketch after this list).
  • We have not changed any other settings during the upgrade. All of
    the OSDs are using mclock high_client_ops profile.
  • The backfills did complete some hours later, with the backfill rate dropping to only one object every couple of seconds.
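
For reference, a minimal sketch of the commands behind the flag and profile observations above (osd.5 is just one of the affected acting primaries; the exact sequence used during the upgrade may have differed):

    # pause and resume all backfill cluster-wide; CPU on the acting
    # primaries dropped to idle while the flag was set
    ceph osd set nobackfill
    ceph osd unset nobackfill

    # the same pattern was used for rebalancing during the upgrade
    ceph osd set norebalance
    ceph osd unset norebalance

    # confirm the mclock profile in effect on an affected OSD
    ceph daemon osd.5 config show | grep osd_mclock_profile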

On Pacific, the rebalancing was often up to 100 objects per second with fairly minimal CPU load.

ceph pg dump pgs_brief|grep remap
8.38     active+remapped+backfilling     [7,0,4]           7 [7,0,3]               7
8.5f     active+remapped+backfilling     [8,3,2]           8 [8,3,0]               8
8.63     active+remapped+backfilling     [0,6,5]           0 [5,6,2]               5

ceph -s
  cluster:
    id:     9208361c-5b68-41ed-8155-cc246a3fe538
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 46m)
    mgr: pve1(active, since 2h), standbys: pve2, pve3
    mds: 1/1 daemons up, 2 standby
    osd: 18 osds: 18 up (since 45m), 18 in (since 2w); 3 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   14 pools, 617 pgs
    objects: 2.49M objects, 5.9 TiB
    usage:   18 TiB used, 25 TiB / 43 TiB avail
    pgs:     16826/7484589 objects misplaced (0.225%)
             614 active+clean
             3   active+remapped+backfilling

  io:
    client:   341 B/s rd, 256 KiB/s wr, 0 op/s rd, 20 op/s wr
    recovery: 4.8 MiB/s, 1 objects/s

I've also now replicated the problem on a larger 17.2.1 test cluster containing only minimal data. Starting from a completely clean state, merely applying a slight reweight to an OSD to trigger remap/backfill was sufficient. All participating OSDs had high CPU briefly, but then the Acting Primary ones continued at 90-105% for the duration of the backfill, which only progressed at around 0.25-1 objects per second.
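
A sketch of that reproduction, with illustrative values (the OSD id and weight shown are examples, not the exact ones used):

    # nudge one OSD's reweight slightly to trigger remap/backfill
    ceph osd reweight 3 0.95

    # watch the backfill rate and the remapped PGs
    ceph -s
    ceph pg dump pgs_brief | grep remap

    # restore the original weight once done
    ceph osd reweight 3 1.0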

This CPU load and backfill rate is not sustainable, and I have put the upgrade of our main production cluster on hold.


Related issues

Copied to RADOS - Backport #57052: quincy: Quincy: High CPU and slow progress during backfill (Resolved)

History

#1 Updated by Sridhar Seshasayee 7 months ago

While we look into this, I have a couple of questions:

1. Did the recovery rate stay at 1 object/sec throughout? In our tests we have seen that the rate is higher during the starting phase of recovery and eventually tapers off due to throttling by mclock.

2. Can you try speeding up the recovery by changing to the "high_recovery_ops" profile on all the OSDs to see if it improves things (both CPU load and recovery rate)? (See the command sketch after these questions.)

3. On the OSDs that showed high CPU usage, can you run the following command and report back? This just dumps the mclock settings on the OSDs.

    sudo ceph daemon osd.N config show | grep osd_mclock
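
For question 2, a sketch of how the profile switch can be applied via the central config (assuming it is set for all OSDs at once; it can also be targeted per OSD):

    # switch every OSD to the high_recovery_ops mclock profile
    ceph config set osd osd_mclock_profile high_recovery_ops

    # or target a single OSD, e.g. one of the affected acting primaries
    ceph config set osd.5 osd_mclock_profile high_recovery_ops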

#2 Updated by Chris Palmer 7 months ago

Thanks for looking at this. Answers to your questions:

1. Backfill started at around 4-5 objects per second, and tailed off to about one object every few seconds.

2. I forgot to mention this earlier, but I did try changing to high_recovery_ops and it made no difference to either CPU load or backfill rate.

3. mclock settings for the 3 OSDs are below.

ceph daemon osd.5 config show | grep osd_mclock
    "osd_mclock_cost_per_byte_usec": "0.000000",
    "osd_mclock_cost_per_byte_usec_hdd": "5.200000",
    "osd_mclock_cost_per_byte_usec_ssd": "0.011000",
    "osd_mclock_cost_per_io_usec": "0.000000",
    "osd_mclock_cost_per_io_usec_hdd": "25000.000000",
    "osd_mclock_cost_per_io_usec_ssd": "50.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_max_capacity_iops_hdd": "217.167905",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_profile": "high_client_ops",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "999999",
    "osd_mclock_scheduler_background_best_effort_res": "11",
    "osd_mclock_scheduler_background_best_effort_wgt": "2",
    "osd_mclock_scheduler_background_recovery_lim": "43",
    "osd_mclock_scheduler_background_recovery_res": "11",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "999999",
    "osd_mclock_scheduler_client_res": "22",
    "osd_mclock_scheduler_client_wgt": "2",
    "osd_mclock_skip_benchmark": "false",

ceph daemon osd.7 config show | grep osd_mclock
    "osd_mclock_cost_per_byte_usec": "0.000000",
    "osd_mclock_cost_per_byte_usec_hdd": "5.200000",
    "osd_mclock_cost_per_byte_usec_ssd": "0.011000",
    "osd_mclock_cost_per_io_usec": "0.000000",
    "osd_mclock_cost_per_io_usec_hdd": "25000.000000",
    "osd_mclock_cost_per_io_usec_ssd": "50.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_max_capacity_iops_hdd": "275.572166",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_profile": "high_client_ops",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "999999",
    "osd_mclock_scheduler_background_best_effort_res": "14",
    "osd_mclock_scheduler_background_best_effort_wgt": "2",
    "osd_mclock_scheduler_background_recovery_lim": "55",
    "osd_mclock_scheduler_background_recovery_res": "14",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "999999",
    "osd_mclock_scheduler_client_res": "28",
    "osd_mclock_scheduler_client_wgt": "2",
    "osd_mclock_skip_benchmark": "false",

ceph daemon osd.8 config show | grep osd_mclock
    "osd_mclock_cost_per_byte_usec": "0.000000",
    "osd_mclock_cost_per_byte_usec_hdd": "5.200000",
    "osd_mclock_cost_per_byte_usec_ssd": "0.011000",
    "osd_mclock_cost_per_io_usec": "0.000000",
    "osd_mclock_cost_per_io_usec_hdd": "25000.000000",
    "osd_mclock_cost_per_io_usec_ssd": "50.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_max_capacity_iops_hdd": "277.604269",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_profile": "high_client_ops",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "999999",
    "osd_mclock_scheduler_background_best_effort_res": "14",
    "osd_mclock_scheduler_background_best_effort_wgt": "2",
    "osd_mclock_scheduler_background_recovery_lim": "56",
    "osd_mclock_scheduler_background_recovery_res": "14",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "999999",
    "osd_mclock_scheduler_client_res": "28",
    "osd_mclock_scheduler_client_wgt": "2",
    "osd_mclock_skip_benchmark": "false",

#3 Updated by Neha Ojha 7 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#4 Updated by Neha Ojha 7 months ago

  • Assignee set to Sridhar Seshasayee
  • Priority changed from Normal to High

#6 Updated by Sridhar Seshasayee 7 months ago

  • Pull request ID set to 47216

The issue is currently addressed in Ceph's main branch; please see the linked PR. The fix will be backported to Quincy once it has gone through the required testing process.

#7 Updated by Sridhar Seshasayee 7 months ago

  • Status changed from New to Fix Under Review
  • Backport set to quincy

#8 Updated by Sridhar Seshasayee 6 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Component(RADOS) OSD added

#9 Updated by Backport Bot 6 months ago

  • Copied to Backport #57052: quincy: Quincy: High CPU and slow progress during backfill added

#10 Updated by Backport Bot 6 months ago

  • Tags changed from RADOS to RADOS backport_processed

#11 Updated by Neha Ojha 6 months ago

  • Status changed from Pending Backport to Resolved
