Bug #41183

pg autoscale on EC pools

Added by imirc tw 8 months ago. Updated about 1 month ago.

Status:
New
Priority:
High
Assignee:
Category:
EC Pools
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature:

Description

The pg_autoscaler plugin wants to dramatically increase pg_num on my EC pool, from 8192 to 65536, but it apparently does not account for the extra shards placed on other OSDs, which leads to far more PGs per OSD than the configured target of 100.

For example, I've got 624 OSDs and pg_num 8192 on the EC 6+2 pool. My calculation is that this results in (6+2) * 8192 / 624 = ~105 PGs per OSD (which is also what I observe).
Following the pg_autoscaler suggestion would lead to 65536 PGs and ~840 PGs per OSD, which seems crazy.

Can anyone explain what I am missing here, or is this an autoscaler miscalculation? The logic it uses treats the pool size as 1.33 (the 33% raw-space overhead of 6+2), from which it calculates an expected 1.33 * 65536 / 624 = ~139 PGs per OSD.
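For reference, the per-OSD load calculation above can be sketched as follows (a hypothetical helper for illustration, not autoscaler code):

```python
# Each PG of a k+m erasure-coded pool stores k+m shards, each on a
# different OSD, so the real per-OSD load scales with (k + m), not
# with the ~1.33 raw-space overhead ratio.

def pgs_per_osd(k: int, m: int, pg_num: int, num_osds: int) -> float:
    """Average number of PG shards each OSD carries for one k+m EC pool."""
    return (k + m) * pg_num / num_osds

# The 6+2 pool on 624 OSDs described above:
print(round(pgs_per_osd(6, 2, 8192, 624)))   # current pg_num: ~105 PGs per OSD
print(round(pgs_per_osd(6, 2, 65536, 624)))  # suggested pg_num: ~840 PGs per OSD
```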

History

#1 Updated by Neha Ojha 8 months ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to High

#2 Updated by Brian Koebbe about 1 month ago

I seem to have the same issue here.

158 OSDs with one main pool, an EC 5+2 pool with pg_num 2048, but the autoscaler wants to use 8192.

(5+2)*2048/158 = ~90
(5+2)*8192/158 = ~362

POOL                   SIZE TARGET SIZE          RATE RAW CAPACITY  RATIO TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
name.rgw.meta           365                        3.0        1724T 0.0000               1.0      8            warn      
cephfs.stuff.meta     333.2M                       3.0        1724T 0.0000               1.0      4            warn      
name.rgw.control          0                        3.0        1724T 0.0000               1.0      8            warn      
.rgw.root              3765                        3.0        1724T 0.0000               1.0      8            warn      
rbd-data              859.7T             1.39999997616        1724T 0.6980               1.0   2048       8192 off       
cephfs.stuff.data     46025M                       3.0        1724T 0.0001               1.0      4            warn      
rbd                    3584k                       3.0        1724T 0.0000               1.0     64          4 off       
name.rgw.buckets.data     0                       1.25        1724T 0.0000               1.0      8            warn      
name.rgw.log              0                        3.0        1724T 0.0000               1.0      8            warn      
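A rough illustration of the mismatch for the rbd-data pool above (not autoscaler source code, and assuming, per the description, that the autoscaler multiplies pg_num by the RATE column, i.e. the raw-capacity overhead (k+m)/k, rather than by the shard count k+m):

```python
# 5+2 rbd-data pool: RATE ~= 1.4 = (5+2)/5, 158 OSDs, proposed pg_num 8192.
k, m, osds, new_pg_num = 5, 2, 158, 8192
rate = (k + m) / k                       # raw-capacity overhead, ~1.4

assumed = rate * new_pg_num / osds       # what a "size 1.4" model predicts
actual = (k + m) * new_pg_num / osds     # one shard per OSD per PG

print(f"assumed ~{assumed:.0f} PGs/OSD, actual ~{actual:.0f} PGs/OSD")
```

The shard-based count is roughly five times the overhead-based estimate, which matches the discrepancy reported in the description.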

#3 Updated by Brian Koebbe about 1 month ago

Looks like a fix is going in: https://github.com/ceph/ceph/pull/33170
