Bug #56650

ceph df reports invalid MAX AVAIL value for stretch mode crush rule

Added by Prashant D over 1 year ago. Updated 7 months ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If a CRUSH rule for a stretch mode cluster is defined with multiple take steps, MAX AVAIL for the pools that use this rule reports only the space available in a single datacenter.

Consider the CRUSH rule stretch_rule, defined as described in the https://docs.ceph.com/en/latest/rados/operations/stretch-mode/ documentation:

rule stretch_rule {
        id 1
        type replicated
        step take DC1
        step chooseleaf firstn 2 type host
        step emit
        step take DC2
        step chooseleaf firstn 2 type host
        step emit
}

and another CRUSH rule, stretch_replicated_rule, with a similar placement strategy:

rule stretch_replicated_rule {
        id 2
        type replicated
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

Then "MAX AVAIL" for pools that use stretch_rule shows an incorrect value, whereas pools that use stretch_replicated_rule show the correct value.

Because of the way stretch_rule is defined, PGMap::get_rule_avail considers only one datacenter's available space rather than the total available space from both datacenters.
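
The halving can be illustrated with a small model. This is a sketch under stated assumptions, not the Ceph source: it assumes that capacity is projected from the fullest OSD divided by that OSD's share of the rule's weight, and that the faulty path normalizes the OSD weights separately for each take step instead of across the whole rule. The AVAIL figures are copied from the `ceph osd df tree` output in comment #4; the function names are hypothetical.

```python
# Simplified model, NOT the Ceph implementation. Assumption: the pool's
# capacity is projected from the OSD with the least free space, scaled by
# that OSD's share of the rule's total weight, then divided by the pool size.

# AVAIL in GiB per OSD, copied from the `ceph osd df tree` output in comment #4.
AVAIL_GIB = {
    "DC1": {0: 79, 3: 73, 1: 83, 2: 83, 7: 77, 11: 85},
    "DC2": {4: 82, 8: 74, 5: 81, 9: 81, 6: 85, 10: 77},
}
CRUSH_WEIGHT = 0.09859  # every OSD has the same CRUSH weight in that output
POOL_SIZE = 4           # replica count of the stretched pools


def max_avail(weight_share):
    """Smallest projected capacity over all OSDs, divided by the replica count."""
    avail = {osd: a for dc in AVAIL_GIB.values() for osd, a in dc.items()}
    projected = min(avail[osd] / share for osd, share in weight_share.items())
    return projected / POOL_SIZE


def shares_per_take():
    """Assumed faulty behaviour: each take subtree is normalized to 1.0."""
    shares = {}
    for osds in AVAIL_GIB.values():
        total = CRUSH_WEIGHT * len(osds)
        for osd in osds:
            shares[osd] = CRUSH_WEIGHT / total
    return shares


def shares_whole_rule():
    """Assumed intended behaviour: all OSDs of the rule are normalized together."""
    count = sum(len(osds) for osds in AVAIL_GIB.values())
    total = CRUSH_WEIGHT * count
    return {osd: CRUSH_WEIGHT / total for osds in AVAIL_GIB.values() for osd in osds}


per_take = max_avail(shares_per_take())
whole_rule = max_avail(shares_whole_rule())
print(f"per-take normalization:   {per_take:.1f} GiB")
print(f"whole-rule normalization: {whole_rule:.1f} GiB")
```

In this model the per-take variant yields exactly half of the whole-rule value, which matches the ratio between the 108 GiB and 216 GiB figures reported by `ceph df`. The absolute numbers differ slightly from the reported ones because the model ignores any further adjustments the real calculation makes.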

More details:

$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -5,
            "item_name": "DC1" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        },
        {
            "op": "take",
            "item": -6,
            "item_name": "DC2" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default" 
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "datacenter" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 88 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB   252 GiB      20.81
TOTAL  1.2 TiB  960 GiB  252 GiB   252 GiB      20.81

--- POOLS ---
POOL                          ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                           1    1  1.5 MiB        2  4.5 MiB      0    289 GiB
cephfs.a.meta                  2   16  2.3 KiB       22   96 KiB      0    289 GiB
cephfs.a.data                  3   32      0 B        0      0 B      0    289 GiB
rbdpool                        4   32      0 B        0      0 B      0    216 GiB
rbdtest                        5   32   20 GiB    5.14k   80 GiB   8.46    216 GiB
stretched_rbdpool              6   32      0 B        0      0 B      0    108 GiB
stretched_rbdtest              7   32   20 GiB    5.14k   80 GiB  15.60    108 GiB                  
stretched_replicated_rbdpool   8   32      0 B        0      0 B      0    216 GiB
stretched_replicated_rbdtest   9   32   20 GiB    5.14k   80 GiB   8.46    216 GiB
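
The %USED column is affected as well: stretched_rbdtest shows 15.60 while the other size-4 pools holding the same data show 8.46. The figures above are consistent with %USED being derived from USED and the raw available space (MAX AVAIL multiplied by the replica count). That formula is an assumption inferred from this output, not taken from the Ceph source, and `pct_used` is a hypothetical helper:

```python
def pct_used(used_gib, max_avail_gib, size):
    """Assumed formula: USED / (USED + MAX AVAIL * replica count), in percent."""
    raw_avail = max_avail_gib * size
    return 100.0 * used_gib / (used_gib + raw_avail)


# Values taken from the POOLS table above (80 GiB USED, pool size 4).
stretched = pct_used(80, 108, 4)   # stretched_rbdtest, rule with two take steps
reference = pct_used(80, 216, 4)   # rbdtest / stretched_replicated_rbdtest
print(f"stretched_rbdtest: {stretched:.2f}")
print(f"rbdtest:           {reference:.2f}")
```

Both results land within rounding distance of the reported 15.60 and 8.46, which suggests that the inflated %USED is a direct consequence of the halved MAX AVAIL rather than a separate problem.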

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=2100920

History

#1 Updated by Prashant D over 1 year ago

  • Description updated (diff)

#2 Updated by Prashant D over 1 year ago

  • Pull request ID set to 47189

#3 Updated by Prashant D over 1 year ago

  • Status changed from New to Fix Under Review

#4 Updated by Prashant D over 1 year ago

Before applying PR#47189, MAX AVAIL for stretch_rule pools is incorrect:

$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -5,
            "item_name": "DC1" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        },
        {
            "op": "take",
            "item": -6,
            "item_name": "DC2" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default" 
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "datacenter" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 88 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB   252 GiB      20.81
TOTAL  1.2 TiB  960 GiB  252 GiB   252 GiB      20.81

--- POOLS ---
POOL                          ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                           1    1  1.5 MiB        2  4.5 MiB      0    289 GiB
cephfs.a.meta                  2   16  2.3 KiB       22   96 KiB      0    289 GiB
cephfs.a.data                  3   32      0 B        0      0 B      0    289 GiB
rbdpool                        4   32      0 B        0      0 B      0    216 GiB
rbdtest                        5   32   20 GiB    5.14k   80 GiB   8.46    216 GiB
stretched_rbdpool              6   32      0 B        0      0 B      0    108 GiB
stretched_rbdtest              7   32   20 GiB    5.14k   80 GiB  15.60    108 GiB                  
stretched_replicated_rbdpool   8   32      0 B        0      0 B      0    216 GiB
stretched_replicated_rbdtest   9   32   20 GiB    5.14k   80 GiB   8.46    216 GiB

After applying PR#47189, MAX AVAIL is shown correctly for stretch_rule pools:

$ ceph -s
  cluster:
    id:     3d796a5c-2ad9-4c37-b671-45446254a11b
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 19m)
    mgr: x(active, since 19m)
    mds: 1/1 daemons up
    osd: 12 osds: 12 up (since 18m), 12 in (since 18m)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 241 pgs
    objects: 15.43k objects, 60 GiB
    usage:   252 GiB used, 960 GiB / 1.2 TiB avail
    pgs:     241 active+clean

$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -5,
            "item_name": "DC1" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        },
        {
            "op": "take",
            "item": -6,
            "item_name": "DC2" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default" 
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "datacenter" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host" 
        },
        {
            "op": "emit" 
        }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 90 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME         
 -1         1.18301         -  1.2 TiB  252 GiB  240 GiB   0 B  1.8 GiB  960 GiB  20.81  1.00    -          root default      
 -5         0.59151         -  606 GiB  127 GiB  121 GiB   0 B  928 MiB  479 GiB  20.90  1.00    -              datacenter DC1
 -9         0.19716         -  202 GiB   50 GiB   48 GiB   0 B  440 MiB  152 GiB  24.64  1.18    -                  host osd-0
  0    hdd  0.09859   1.00000  101 GiB   22 GiB   21 GiB   0 B  214 MiB   79 GiB  21.73  1.04   71      up              osd.0 
  3    hdd  0.09859   1.00000  101 GiB   28 GiB   27 GiB   0 B  226 MiB   73 GiB  27.54  1.32   99      up              osd.3 
-10         0.19716         -  202 GiB   36 GiB   34 GiB   0 B  191 MiB  166 GiB  18.05  0.87    -                  host osd-1
  1    hdd  0.09859   1.00000  101 GiB   18 GiB   17 GiB   0 B   95 MiB   83 GiB  17.79  0.86   72      up              osd.1 
  2    hdd  0.09859   1.00000  101 GiB   18 GiB   17 GiB   0 B   96 MiB   83 GiB  18.31  0.88   66      up              osd.2 
-11         0.19716         -  202 GiB   40 GiB   38 GiB   0 B  297 MiB  162 GiB  20.00  0.96    -                  host osd-2
  7    hdd  0.09859   1.00000  101 GiB   24 GiB   23 GiB   0 B  160 MiB   77 GiB  24.16  1.16   80      up              osd.7 
 11    hdd  0.09859   1.00000  101 GiB   16 GiB   15 GiB   0 B  137 MiB   85 GiB  15.85  0.76   70      up              osd.11
 -6         0.59151         -  606 GiB  126 GiB  120 GiB   0 B  923 MiB  480 GiB  20.71  1.00    -              datacenter DC2
-12         0.19716         -  202 GiB   46 GiB   44 GiB   0 B  368 MiB  156 GiB  22.64  1.09    -                  host osd-3
  4    hdd  0.09859   1.00000  101 GiB   19 GiB   18 GiB   0 B  199 MiB   82 GiB  18.58  0.89   69      up              osd.4 
  8    hdd  0.09859   1.00000  101 GiB   27 GiB   26 GiB   0 B  169 MiB   74 GiB  26.69  1.28   89      up              osd.8 
-13         0.19716         -  202 GiB   40 GiB   38 GiB   0 B  258 MiB  162 GiB  19.98  0.96    -                  host osd-4
  5    hdd  0.09859   1.00000  101 GiB   20 GiB   19 GiB   0 B  104 MiB   81 GiB  19.70  0.95   74      up              osd.5 
  9    hdd  0.09859   1.00000  101 GiB   20 GiB   19 GiB   0 B  154 MiB   81 GiB  20.25  0.97   78      up              osd.9 
-14         0.19716         -  202 GiB   39 GiB   37 GiB   0 B  297 MiB  163 GiB  19.52  0.94    -                  host osd-5
  6    hdd  0.09859   1.00000  101 GiB   16 GiB   15 GiB   0 B   82 MiB   85 GiB  15.48  0.74   66      up              osd.6 
 10    hdd  0.09859   1.00000  101 GiB   24 GiB   23 GiB   0 B  215 MiB   77 GiB  23.57  1.13   81      up              osd.10
                        TOTAL  1.2 TiB  252 GiB  240 GiB   0 B  1.8 GiB  960 GiB  20.81                                       
MIN/MAX VAR: 0.74/1.32  STDDEV: 3.81

$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB   252 GiB      20.81
TOTAL  1.2 TiB  960 GiB  252 GiB   252 GiB      20.81

--- POOLS ---
POOL                          ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                           1    1  1.5 MiB        2  4.5 MiB      0    289 GiB
cephfs.a.meta                  2   16  2.3 KiB       22   96 KiB      0    289 GiB
cephfs.a.data                  3   32      0 B        0      0 B      0    289 GiB
rbdpool                        4   32      0 B        0      0 B      0    217 GiB
rbdtest                        5   32   20 GiB    5.14k   80 GiB   8.46    217 GiB
stretched_rbdpool              6   32      0 B        0      0 B      0    217 GiB
stretched_rbdtest              7   32   20 GiB    5.14k   80 GiB   8.46    217 GiB
stretched_replicated_rbdpool   8   32      0 B        0      0 B      0    217 GiB
stretched_replicated_rbdtest   9   32   20 GiB    5.13k   80 GiB   8.46    217 GiB

#5 Updated by Radoslaw Zarzynski over 1 year ago

  • Backport set to pacific,quincy

#6 Updated by Radoslaw Zarzynski 8 months ago

A workaround from Prashant:

As a workaround, instead of defining stretch_rule for the stretch cluster as follows:

rule stretch_rule {
        id 1
        type replicated
        step take DC1
        step chooseleaf firstn 2 type host
        step emit
        step take DC2
        step chooseleaf firstn 2 type host
        step emit
}

define it as:

rule stretch_rule {
        id 2
        type replicated
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}
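
To find out which existing rules would need this workaround, one option is to count the take steps in each rule's JSON dump. The following is a minimal sketch that assumes the input is the parsed output of `ceph osd crush rule dump <name>` as shown in the description; the helper names are hypothetical and the sample data is abbreviated from the dumps in this ticket.

```python
import json

# Rule dumps abbreviated from the `ceph osd crush rule dump` output in this ticket.
STRETCH_RULE = json.loads("""
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "steps": [
        {"op": "take", "item": -5, "item_name": "DC1"},
        {"op": "chooseleaf_firstn", "num": 2, "type": "host"},
        {"op": "emit"},
        {"op": "take", "item": -6, "item_name": "DC2"},
        {"op": "chooseleaf_firstn", "num": 2, "type": "host"},
        {"op": "emit"}
    ]
}
""")

STRETCH_REPLICATED_RULE = json.loads("""
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "steps": [
        {"op": "take", "item": -1, "item_name": "default"},
        {"op": "choose_firstn", "num": 0, "type": "datacenter"},
        {"op": "chooseleaf_firstn", "num": 2, "type": "host"},
        {"op": "emit"}
    ]
}
""")


def count_take_steps(rule):
    """Number of 'take' operations in one parsed rule dump."""
    return sum(1 for step in rule.get("steps", []) if step.get("op") == "take")


def rules_with_multiple_takes(rules):
    """Names of the rules that contain more than one 'take' step."""
    return [rule["rule_name"] for rule in rules if count_take_steps(rule) > 1]


affected = rules_with_multiple_takes([STRETCH_RULE, STRETCH_REPLICATED_RULE])
print(affected)
```

With the two rules from this ticket, only stretch_rule is flagged; stretch_replicated_rule has a single take step and is therefore not affected according to the description.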

#7 Updated by Sake Paulusma 7 months ago

Radoslaw Zarzynski wrote:

A workaround from Prashant:

[...]

Thank you for the workaround! Once the issue is fixed, will this workaround become the default way to define the CRUSH rule for a stretched cluster, or should we revert to the original rule in the documentation?

#8 Updated by Radoslaw Zarzynski 7 months ago

The fix is undergoing extended testing.
