Bug #56530
Quincy: High CPU and slow progress during backfill
Description
I'm seeing a similar problem on a small cluster just upgraded from Pacific 16.2.9 to Quincy 17.2.1 (non-cephadm). The cluster was only very lightly loaded during and after the upgrade. The affected OSDs are all bluestore, HDDs sharing an NVMe DB/WAL, and all were created on Pacific.
The upgrade was uneventful, except that after the OSDs on the first node were restarted some rebalancing started; I don't know why. I set norebalance for the remainder of the upgrade.
When the entire upgrade was complete I unset norebalance. A number of things then became apparent:
- Startup appeared normal, with health returning to OK in the expected time. Apart from the points below, things seem normal.
- A few OSDs are consuming around 95% CPU, almost all of it in user mode. System, iowait and interrupt time are all minimal.
- Rebalancing is occurring, but only at about 1 object/second.
- Disk IO rates are minimal.
- We originally had osd_compact_on_start set to true, but then disabled it and restarted the CPU-bound OSDs. They all restarted OK and then resumed their high CPU load. Rebooting an OSD node didn't change anything either.
- There are currently 3 PGs that are remapping. The Acting Primary OSDs of each of those PGs (5, 7 & 8) are the ones consuming the CPU.
- Setting nobackfill causes the CPU consumption to drop to normal idle levels; unsetting it returns it to the high levels. (The flag commands used are sketched below.)
- We have not changed any other settings during the upgrade. All of the OSDs are using the mclock high_client_ops profile.
- The backfills did complete some hours later, with the backfill rate dropping to only one object every couple of seconds.
On Pacific, the rebalancing was often up to 100 objects per second with fairly minimal CPU load.
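For reference, a rough sketch of the cluster-wide commands behind the flag and setting changes described above (the exact invocations were not recorded here; N is a placeholder OSD id, and the systemd unit name assumes the non-cephadm deployment mentioned earlier):

    # pause/resume data movement during the upgrade
    ceph osd set norebalance
    ceph osd unset norebalance
    # pause/resume backfill while observing the CPU load
    ceph osd set nobackfill
    ceph osd unset nobackfill
    # disable compaction on start, then restart an affected OSD
    ceph config set osd osd_compact_on_start false
    systemctl restart ceph-osd@N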
ceph pg dump pgs_brief | grep remap
    8.38  active+remapped+backfilling  [7,0,4]  7  [7,0,3]  7
    8.5f  active+remapped+backfilling  [8,3,2]  8  [8,3,0]  8
    8.63  active+remapped+backfilling  [0,6,5]  0  [5,6,2]  5
ceph -s
  cluster:
    id:     9208361c-5b68-41ed-8155-cc246a3fe538
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 46m)
    mgr: pve1(active, since 2h), standbys: pve2, pve3
    mds: 1/1 daemons up, 2 standby
    osd: 18 osds: 18 up (since 45m), 18 in (since 2w); 3 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   14 pools, 617 pgs
    objects: 2.49M objects, 5.9 TiB
    usage:   18 TiB used, 25 TiB / 43 TiB avail
    pgs:     16826/7484589 objects misplaced (0.225%)
             614 active+clean
             3   active+remapped+backfilling

  io:
    client:   341 B/s rd, 256 KiB/s wr, 0 op/s rd, 20 op/s wr
    recovery: 4.8 MiB/s, 1 objects/s
I've also now replicated the problem on a larger 17.2.1 test cluster containing only minimal data. Starting from a completely clean state, merely applying a slight reweight to an OSD to trigger remap/backfill was sufficient. All participating OSDs had high CPU briefly, but then the Acting Primary ones continued at 90-105% for the duration of the backfill, which only progressed at around 0.25-1 objects per second.
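For anyone wanting to reproduce this, a minimal sketch of the trigger used on the test cluster (the OSD id N and the weight value are illustrative; any small change that causes a few PGs to remap should behave the same way):

    # nudge one OSD's reweight value slightly to force a small remap/backfill
    ceph osd reweight N 0.95
    # watch the backfill rate and misplaced-object count
    ceph -s
    # restore the original weight once finished
    ceph osd reweight N 1.0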
This CPU load and backfill rate is not sustainable, and I have put the upgrade of our main production cluster on hold.
Related issues
- Copied to Backport #57052: quincy: Quincy: High CPU and slow progress during backfill
History
#1 Updated by Sridhar Seshasayee 7 months ago
While we look into this, I have a couple of questions:
1. Did the recovery rate stay at 1 object/sec throughout? In our tests we have seen that the rate is higher during the starting phase of recovery and eventually tapers off due to throttling by mclock.
2. Can you try speeding up the recovery by changing to the "high_recovery_ops" profile on all the OSDs to see if it improves things (both CPU load and recovery rate)? (One way to switch the profile is sketched below.)
3. On the OSDs that showed high CPU usage, can you run the following command and report back the output? This just dumps the mclock settings on the OSDs.
sudo ceph daemon osd.N config show | grep osd_mclock
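For question 2, one way to switch the profile cluster-wide is sketched below (this assumes the centralized config database is in use; N is a placeholder OSD id):

    # switch all OSDs to the high_recovery_ops mclock profile
    ceph config set osd osd_mclock_profile high_recovery_ops
    # verify the active profile on one OSD
    ceph daemon osd.N config show | grep osd_mclock_profile
    # revert to the previous profile afterwards
    ceph config set osd osd_mclock_profile high_client_ops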
#2 Updated by Chris Palmer 7 months ago
Thanks for looking at this. Answers to your questions:
1. Backfill started at around 4-5 objects per second, and tailed off to about one object every few seconds.
2. Forgot to say, but I did try changing to high_recovery_ops and it didn't make any difference to either CPU or rate.
3. mclock settings for the 3 OSDs are below.
ceph daemon osd.5 config show | grep osd_mclock
    "osd_mclock_cost_per_byte_usec": "0.000000",
    "osd_mclock_cost_per_byte_usec_hdd": "5.200000",
    "osd_mclock_cost_per_byte_usec_ssd": "0.011000",
    "osd_mclock_cost_per_io_usec": "0.000000",
    "osd_mclock_cost_per_io_usec_hdd": "25000.000000",
    "osd_mclock_cost_per_io_usec_ssd": "50.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_max_capacity_iops_hdd": "217.167905",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_profile": "high_client_ops",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "999999",
    "osd_mclock_scheduler_background_best_effort_res": "11",
    "osd_mclock_scheduler_background_best_effort_wgt": "2",
    "osd_mclock_scheduler_background_recovery_lim": "43",
    "osd_mclock_scheduler_background_recovery_res": "11",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "999999",
    "osd_mclock_scheduler_client_res": "22",
    "osd_mclock_scheduler_client_wgt": "2",
    "osd_mclock_skip_benchmark": "false",

ceph daemon osd.7 config show | grep osd_mclock
    "osd_mclock_cost_per_byte_usec": "0.000000",
    "osd_mclock_cost_per_byte_usec_hdd": "5.200000",
    "osd_mclock_cost_per_byte_usec_ssd": "0.011000",
    "osd_mclock_cost_per_io_usec": "0.000000",
    "osd_mclock_cost_per_io_usec_hdd": "25000.000000",
    "osd_mclock_cost_per_io_usec_ssd": "50.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_max_capacity_iops_hdd": "275.572166",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_profile": "high_client_ops",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "999999",
    "osd_mclock_scheduler_background_best_effort_res": "14",
    "osd_mclock_scheduler_background_best_effort_wgt": "2",
    "osd_mclock_scheduler_background_recovery_lim": "55",
    "osd_mclock_scheduler_background_recovery_res": "14",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "999999",
    "osd_mclock_scheduler_client_res": "28",
    "osd_mclock_scheduler_client_wgt": "2",
    "osd_mclock_skip_benchmark": "false",

ceph daemon osd.8 config show | grep osd_mclock
    "osd_mclock_cost_per_byte_usec": "0.000000",
    "osd_mclock_cost_per_byte_usec_hdd": "5.200000",
    "osd_mclock_cost_per_byte_usec_ssd": "0.011000",
    "osd_mclock_cost_per_io_usec": "0.000000",
    "osd_mclock_cost_per_io_usec_hdd": "25000.000000",
    "osd_mclock_cost_per_io_usec_ssd": "50.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_max_capacity_iops_hdd": "277.604269",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_profile": "high_client_ops",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "999999",
    "osd_mclock_scheduler_background_best_effort_res": "14",
    "osd_mclock_scheduler_background_best_effort_wgt": "2",
    "osd_mclock_scheduler_background_recovery_lim": "56",
    "osd_mclock_scheduler_background_recovery_res": "14",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "999999",
    "osd_mclock_scheduler_client_res": "28",
    "osd_mclock_scheduler_client_wgt": "2",
    "osd_mclock_skip_benchmark": "false",
#5 Updated by Vikhyat Umrao 7 months ago
#6 Updated by Sridhar Seshasayee 7 months ago
- Pull request ID set to 47216
The issue is currently addressed in Ceph's main branch; please see the linked PR. The fix will be backported to Quincy once it goes through the required testing process.
#7 Updated by Sridhar Seshasayee 7 months ago
- Status changed from New to Fix Under Review
- Backport set to quincy
#8 Updated by Sridhar Seshasayee 6 months ago
- Status changed from Fix Under Review to Pending Backport
- Component(RADOS) OSD added
#9 Updated by Backport Bot 6 months ago
- Copied to Backport #57052: quincy: Quincy: High CPU and slow progress during backfill added
#10 Updated by Backport Bot 6 months ago
- Tags changed from RADOS to RADOS backport_processed