Bug #57937

pg autoscaler of rgw pools doesn't work after creating otp pool

Added by Satoru Takeuchi 3 months ago. Updated about 1 month ago.

Status:
Rejected
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
container
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This concerns the following post of mine to the ceph-users ML:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/Q7A3TM6Z3XMRJPRBSHWGGACR653ICWXT/

The pg autoscale settings of the RGW-related pools were disabled in my clusters.
This problem was caused by overlapping roots. I created two device classes, "hdd" and "ssd";
I use the former for data and the latter for the bucket index.

After running `radosgw-admin mfa list --uid=rgw-admin-ops-user`, the pg autoscaler was disabled for the RGW-related
pools due to overlapping roots. This is because `radosgw-admin mfa list` created a pool suffixed
with ".rgw.otp", and the root of that pool's CRUSH rule is not one of the shadow roots of the {hdd,ssd} device classes but the default root.

A possible fix is to create the "*.rgw.otp" pool with a CRUSH rule whose root corresponds to a device class.

In addition, I would be glad if there were a workaround for my situation, e.g. changing the root of the CRUSH rule
of the "*.rgw.otp" pool (a sketch follows below). Since I only issued `radosgw-admin mfa list` to see its output and I'm not planning to use MFA,
I'm also fine with deleting the "*.rgw.otp" pool if necessary.
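
For reference, here is a rough sketch of that kind of workaround. It assumes the rule name "rgw-otp-ssd" is free (the name is made up here) and, for the delete variant, that `mon_allow_pool_delete` is set to true; I have not verified it on this cluster:

# create a replicated rule restricted to the "ssd" device class
$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd crush rule create-replicated rgw-otp-ssd default host ssd
# point the otp pool at that rule, so its root becomes the ssd shadow root
$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd pool set ceph-poc-object-store-ssd-index.rgw.otp crush_rule rgw-otp-ssd

# alternatively, since MFA isn't used here, the pool could simply be removed
$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd pool delete ceph-poc-object-store-ssd-index.rgw.otp ceph-poc-object-store-ssd-index.rgw.otp --yes-i-really-really-mean-it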

Here are the details (mostly the same as the above-mentioned post):

  • software versions
    • ceph: 16.2.10
    • rook: 1.9.6

The result of `ceph osd pool ls`.

$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd pool ls
ceph-poc-block-pool
ceph-poc-object-store-ssd-index.rgw.control
ceph-poc-object-store-ssd-index.rgw.meta
ceph-poc-object-store-ssd-index.rgw.log
ceph-poc-object-store-ssd-index.rgw.buckets.index
ceph-poc-object-store-ssd-index.rgw.buckets.non-ec
.rgw.root
ceph-poc-object-store-ssd-index.rgw.buckets.data
ceph-poc-object-store-hdd-index.rgw.control
ceph-poc-object-store-hdd-index.rgw.meta
ceph-poc-object-store-hdd-index.rgw.log
ceph-poc-object-store-hdd-index.rgw.buckets.index
ceph-poc-object-store-hdd-index.rgw.buckets.non-ec
ceph-poc-object-store-hdd-index.rgw.buckets.data
device_health_metrics
ceph-poc-object-store-ssd-index.rgw.otp

Some pools are missing from the result of `ceph osd pool autoscale-status`.

$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd pool autoscale-status
POOL                                                  SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
ceph-poc-object-store-ssd-index.rgw.control              0                3.0         6144G  0.0000                                  1.0       8              on         False
ceph-poc-object-store-ssd-index.rgw.meta              3910                3.0         6144G  0.0000                                  1.0       8              on         False
ceph-poc-object-store-ssd-index.rgw.log             29328M                3.0         6144G  0.0140                                  1.0       8              on         False
ceph-poc-object-store-ssd-index.rgw.buckets.index     4042                3.0         6144G  0.0000                                  1.0     128           8  off        False
ceph-poc-object-store-ssd-index.rgw.buckets.non-ec       0                3.0         6144G  0.0000                                  1.0       8              on         False
.rgw.root                                             9592                3.0         6144G  0.0000                                  1.0       8              on         False
device_health_metrics                                8890k                3.0         6144G  0.0000                                  1.0      32              on         False
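
One way to see why these pools are skipped is to correlate every pool with its CRUSH rule. This is only an illustrative loop run in the same toolbox pod:

$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- bash -c 'for p in $(ceph osd pool ls); do echo -n "$p: "; ceph osd pool get "$p" crush_rule; done'

This makes it easy to spot which pool ended up on "replicated_rule", whose root is the plain "default" root shown below.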

The CRUSH tree, including shadow roots, is as follows.

$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd crush tree --show-shadow
ID    CLASS  WEIGHT     TYPE NAME
  -3    ssd    6.00000  root default~ssd
  -9    ssd    2.00000      zone rack0~ssd
  -8    ssd    1.00000          host 10-69-0-10~ssd
  14    ssd    1.00000              osd.14
 -51    ssd          0          host 10-69-0-22~ssd
...
...
  -2    hdd  781.32037  root default~hdd
  -7    hdd  130.99301      zone rack0~hdd
  -6    hdd          0          host 10-69-0-10~hdd
 -50    hdd   14.55478          host 10-69-0-22~hdd
   8    hdd    7.27739              osd.8
...
  -1         787.32037  root default
  -5         132.99301      zone rack0
  -4           1.00000          host 10-69-0-10
  14    ssd    1.00000              osd.14
 -49          14.55478          host 10-69-0-22
   8    hdd    7.27739              osd.8
...
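
Each device class has its own shadow root here: "default~ssd" is -3, "default~hdd" is -2, and -1 is the plain "default" root. For completeness, the device classes themselves can be listed with:

$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd crush class ls

which should show just "hdd" and "ssd" in this cluster.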

The CRUSH rule of the `...rgw.otp` pool is "replicated_rule".

$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd pool get ceph-poc-object-store-ssd-index.rgw.otp crush_rule
crush_rule: replicated_rule

The root of this rule is "default".

$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host" 
        },
        {
            "op": "emit" 
        }
    ]
}
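
For comparison, a rule bound to a device class takes a shadow root instead of "default". For example, dumping a rule created like the hypothetical "rgw-otp-ssd" rule sketched above should contain a "take" step whose "item_name" is "default~ssd" (item -3 in the tree above):

$ kubectl -n ceph-poc exec deploy/rook-ceph-tools -- ceph osd crush rule dump rgw-otp-ssd

I believe this is why the otp pool, which takes the plain "default" root (-1), overlaps with both shadow roots and makes the autoscaler skip scaling.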

The mgr daemon complains about the overlapping roots. Here is the mgr log.

...
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.720+0000 7fe64d5d5700  0 [progress INFO root] Processing OSDMap change 175926..175926
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.445+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 17 contains an overlapping root -1... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.444+0000 7fe64de96700  0 [pg_autoscaler INFO root] Pool 'device_health_metrics' root_id -3 using 4.139842076256173e-06 of space, bias 1.0, pg target 0.000816928836381218 quantized to 32 (current 32)
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.444+0000 7fe64de96700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 6597069766656
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.442+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 15 contains an overlapping root -2... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.441+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 14 contains an overlapping root -2... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.440+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 13 contains an overlapping root -2... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.438+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 12 contains an overlapping root -2... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.437+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 11 contains an overlapping root -2... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.436+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 10 contains an overlapping root -2... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.434+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 9 contains an overlapping root -2... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.433+0000 7fe64de96700  0 [pg_autoscaler INFO root] Pool '.rgw.root' root_id -3 using 4.361936589702964e-09 of space, bias 1.0, pg target 8.607554870347182e-07 quantized to 8 (current 8)
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.433+0000 7fe64de96700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 6597069766656
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.432+0000 7fe64de96700  0 [pg_autoscaler INFO root] Pool 'ceph-poc-object-store-ssd-index.rgw.buckets.non-ec' root_id -3 using 0.0 of space, bias 1.0, pg target 0.0 quantized to 8 (current 8)
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.432+0000 7fe64de96700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 6597069766656
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.430+0000 7fe64de96700  0 [pg_autoscaler INFO root] Pool 'ceph-poc-object-store-ssd-index.rgw.buckets.index' root_id -3 using 1.838088792283088e-09 of space, bias 1.0, pg target 3.627161883438627e-07 quantized to 8 (current 128)
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.430+0000 7fe64de96700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 6597069766656
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.429+0000 7fe64de96700  0 [pg_autoscaler INFO root] Pool 'ceph-poc-object-store-ssd-index.rgw.log' root_id -3 using 0.013985070909257047 of space, bias 1.0, pg target 2.7970141818514094 quantized to 8 (current 8)
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.429+0000 7fe64de96700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 6597069766656
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.427+0000 7fe64de96700  0 [pg_autoscaler INFO root] Pool 'ceph-poc-object-store-ssd-index.rgw.meta' root_id -3 using 1.7780621419660747e-09 of space, bias 1.0, pg target 3.5561242839321494e-07 quantized to 8 (current 8)
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.427+0000 7fe64de96700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 6597069766656
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.426+0000 7fe64de96700  0 [pg_autoscaler INFO root] Pool 'ceph-poc-object-store-ssd-index.rgw.control' root_id -3 using 0.0 of space, bias 1.0, pg target 0.0 quantized to 8 (current 8)
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.426+0000 7fe64de96700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 6597069766656
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.425+0000 7fe64de96700  0 [pg_autoscaler WARNING root] pool 2 contains an overlapping root -2... skipping scaling
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.421+0000 7fe64de96700  0 [pg_autoscaler ERROR root] pool 17 has overlapping roots: {-2, -1}
2022-10-24T10:42:31+09:00 debug 2022-10-24T01:42:31.356+0000 7fe64de96700  0 [pg_autoscaler INFO root] _maybe_adjust
...
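
To spot this condition without reading the whole log, one can grep the active mgr's log for the pg_autoscaler warnings. In a Rook cluster the mgr deployment is typically named "rook-ceph-mgr-a" (that name is my assumption here):

$ kubectl -n ceph-poc logs deploy/rook-ceph-mgr-a | grep 'overlapping root'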

History

#1 Updated by Casey Bodley 3 months ago

  • Project changed from rgw to RADOS

#2 Updated by Kamoltat (Junior) Sirivadhna 3 months ago

  • Assignee set to Kamoltat (Junior) Sirivadhna

#3 Updated by Satoru Takeuchi 3 months ago

Are there any updates? Please let me know if there is anything I can do.

#4 Updated by Satoru Takeuchi about 2 months ago

This problem was fixed in Rook v1.10.2. I updated my Rook/Ceph cluster to v1.10.5 and confirmed that this problem disappeared.
Please close this ticket.

#5 Updated by Radoslaw Zarzynski about 1 month ago

  • Status changed from New to Rejected

Not a Ceph issue per the last comment.
