Bug #52969

"ceph df" reports an increased pool MAX AVAIL when the pool contains degraded objects

Added by minghang zhao over 2 years ago. Updated over 1 year ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
octopus,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Before the OSD went down:
--- POOLS ---
POOL ID STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 0 B 5 0 B 0 138 GiB
tfs 2 79 MiB 91 236 MiB 0.17 46 GiB
data 3 0 B 0 0 B 0 138 GiB
[root@lnhost116 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-17 0.14639 root tfs
-16 0.14639 rack tfs-rack_0
-22 0.04880 host tfs-lnhost116
4 ssd 0.04880 osd.4 up 1.00000 1.00000
-25 0.04880 host tfs-lnhost117
2 ssd 0.04880 osd.2 up 1.00000 1.00000
-28 0.04880 host tfs-lnhost118
0 ssd 0.04880 osd.0 up 1.00000 1.00000

After the OSD went down:
--- POOLS ---
POOL ID STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 0 B 5 0 B 0 207 GiB
tfs 2 79 MiB 91 158 MiB 0.11 70 GiB
data 3 0 B 0 0 B 0 138 GiB
[root@lnhost116 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-17 0.14639 root tfs
-16 0.14639 rack tfs-rack_0
-22 0.04880 host tfs-lnhost116
4 ssd 0.04880 osd.4 up 1.00000 1.00000
-25 0.04880 host tfs-lnhost117
2 ssd 0.04880 osd.2 down 1.00000 1.00000
-28 0.04880 host tfs-lnhost118
0 ssd 0.04880 osd.0 up 1.00000 1.00000

The tfs pool is triple-replicated and uses a CRUSH rule with host as the failure domain. I took one OSD in the tfs pool down, and the pool's MAX AVAIL increased instead of decreasing, which is illogical.

test.txt View (17.2 KB) jianwei zhang, 06/01/2022 10:47 AM

History

#1 Updated by minghang zhao over 2 years ago

My solution is to call a new function, del_down_out_osd(), from PGMap::get_rule_avail() when calculating the pool's avail value. The function removes OSDs that are down or out from the weight map of the CRUSH rule used by the pool.

int64_t PGMap::get_rule_avail(const OSDMap& osdmap, int ruleno) const {
    map<int,float> wm;
    int r = osdmap.crush->get_rule_weight_osd_map(ruleno, &wm);
    if (r < 0) {
        return r;
    }
    if (wm.empty()) {
        return 0;
    }

    del_down_out_osd(osdmap, wm);
    float fratio = osdmap.get_full_ratio();
    int64_t min = -1;
    for (auto p = wm.begin(); p != wm.end(); ++p) {
    ....

void PGMap::del_down_out_osd(const OSDMap &osdmap, map<int,float> &wm) const {
    float weight = 0.0;
    int osd_cnt = 0;
    bool del_flag = false;
    for (auto p = wm.begin(); p != wm.end(); ) {
        osd_cnt = wm.size();
        auto osd_info = osd_stat.find(p->first);
        if (osd_info != osd_stat.end()) {
            if (osd_info->second.statfs.total != 0 && p->second != 0 && (!osdmap.is_up(p->first) || osdmap.is_out(p->first))) {
                dout(5) << " p->first is: " << p->first << " is continue"<< dendl;
                if (osd_cnt > 1) {
                    weight = 1.0 / (osd_cnt - 1);
                } else {
                    weight = 0;
                }
                dout(5) << "erase p->first is: " << p->first << dendl;
                wm.erase(p++);
                del_flag = true;
                continue;
            } else {
                ++p;
            }
        } else {
            ++p;
        }
    }

    for (auto p = wm.begin(); p != wm.end() && del_flag; ++p) {
        p->second = weight;
        dout(10) << "p->first is: " << p->first << "del new weight is: " << weight << dendl;
    }
}
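
For readers who want to try the idea outside the Ceph tree, here is a minimal standalone sketch of the same filtering step. It is not the code from the proposal or from any PR: drop_unavailable_osds and the is_up_and_in callback are hypothetical names that stand in for the osdmap.is_up() / osdmap.is_out() checks, and the osd_stat lookup is left out.

```cpp
#include <cassert>
#include <functional>
#include <map>

// Hypothetical standalone version of the filtering idea: erase every OSD
// that is not "up and in" from the rule weight map, then give the
// remaining OSDs an equal share, as the proposal does.
// is_up_and_in is an assumed callback replacing the OSDMap queries.
inline bool drop_unavailable_osds(std::map<int, float>& wm,
                                  const std::function<bool(int)>& is_up_and_in) {
    assert(is_up_and_in);
    bool erased = false;
    for (auto it = wm.begin(); it != wm.end();) {
        if (it->second != 0.0f && !is_up_and_in(it->first)) {
            it = wm.erase(it);  // erase returns the next valid iterator
            erased = true;
        } else {
            ++it;
        }
    }
    if (erased && !wm.empty()) {
        // Equal redistribution: every remaining OSD gets 1/n.
        const float w = 1.0f / static_cast<float>(wm.size());
        for (auto& kv : wm) {
            kv.second = w;
        }
    }
    return erased;
}
```

Like the proposal, the sketch overwrites the remaining weights with equal shares. That only matches the CRUSH-derived weights when all OSDs have the same weight, as they do in the clusters shown in this report.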

#2 Updated by Neha Ojha over 2 years ago

  • Project changed from Ceph to RADOS

#3 Updated by Neha Ojha over 2 years ago

Would you like to propose a PR?

#4 Updated by jianwei zhang over 1 year ago

ceph v15.2.13

I found the same problem.

1. ceph cluster initial state

# rbd create test/rbd --size 1G

Write 1G of data sequentially
# rbd bench -p test --image rbd --io-size 1M --io-threads 1 --io-total 1G --io-pattern seq --io-type write

[root@ln-ceph-rpm build]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
 -3               0  root myroot                                    
 -1         0.78870  root default                                   
-11         0.19717      host ln-ceph-rpm                           
  6    hdd  0.09859          osd.6             up   1.00000  1.00000
  7    hdd  0.09859          osd.7             up   1.00000  1.00000
 -5         0.19717      host node1                                 
  0    hdd  0.09859          osd.0             up   1.00000  1.00000
  1    hdd  0.09859          osd.1             up   1.00000  1.00000
 -6         0.19717      host node2                                 
  2    hdd  0.09859          osd.2             up   1.00000  1.00000
  3    hdd  0.09859          osd.3             up   1.00000  1.00000
 -9         0.19717      host node3                                 
  4    hdd  0.09859          osd.4             up   1.00000  1.00000
  5    hdd  0.09859          osd.5             up   1.00000  1.00000

[root@ln-ceph-rpm build]# ceph -s
  cluster:
    id:     afec64f8-d9ee-4262-9410-fcf907807e2c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 14m)
    mgr: x(active, since 70m)
    ioa: 0 daemonsno daemons active
    osd: 8 osds: 8 up (since 74s), 8 in (since 74s)

  data:
    pools:   2 pools, 33 pgs
    objects: 261 objects, 1.0 GiB
    usage:   20 GiB used, 788 GiB / 808 GiB avail
    pgs:     33 active+clean

[root@ln-ceph-rpm build]# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 706 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'test' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 708 lfor 0/0/39 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    808 GiB  788 GiB  12 GiB    20 GiB       2.44
TOTAL  808 GiB  788 GiB  12 GiB    20 GiB       2.44

--- POOLS ---
POOL                   ID  PGS  STORED   (DATA)   (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1    1      0 B      0 B     0 B        0      0 B      0 B     0 B      0    260 GiB  N/A            N/A              0         0 B          0 B
test                    2   32  1.0 GiB  1.0 GiB     0 B      261  3.0 GiB  3.0 GiB     0 B   0.38    260 GiB  N/A            N/A            261         0 B          0 B

/// we can see that 
STORED = 1.0G
(DATA) = 1.0G
MAX AVAIL = 260G

2. kill 9 osd.0.pid - OSD.0 DOWN


[root@ln-ceph-rpm build]# ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME           
 -3               0         -      0 B      0 B      0 B      0 B       0 B      0 B     0     0    -          root myroot         
 -1         0.78870         -  808 GiB   20 GiB  3.7 GiB  6.8 MiB   8.0 GiB  788 GiB  2.44  1.00    -          root default        
-11         0.19717         -  202 GiB  4.9 GiB  894 MiB  1.6 MiB   2.0 GiB  197 GiB  2.41  0.99    -              host ln-ceph-rpm
  6    hdd  0.09859   1.00000  101 GiB  2.5 GiB  531 MiB  1.0 MiB  1023 MiB   98 GiB  2.49  1.02   14      up          osd.6       
  7    hdd  0.09859   1.00000  101 GiB  2.4 GiB  363 MiB  596 KiB  1023 MiB   99 GiB  2.33  0.96    9      up          osd.7       
 -5         0.19717         -  202 GiB  4.9 GiB  909 MiB  1.6 MiB   2.0 GiB  197 GiB  2.42  0.99    -              host node1      
  0    hdd  0.09859   1.00000  101 GiB  2.4 GiB  370 MiB  732 KiB  1023 MiB   99 GiB  2.34  0.96    0    down          osd.0       
  1    hdd  0.09859   1.00000  101 GiB  2.5 GiB  539 MiB  921 KiB  1023 MiB   98 GiB  2.50  1.03   16      up          osd.1       
 -6         0.19717         -  202 GiB  4.9 GiB  953 MiB  1.8 MiB   2.0 GiB  197 GiB  2.44  1.00    -              host node2      
  2    hdd  0.09859   1.00000  101 GiB  2.4 GiB  443 MiB  971 KiB  1023 MiB   99 GiB  2.41  0.99   12      up          osd.2       
  3    hdd  0.09859   1.00000  101 GiB  2.5 GiB  511 MiB  903 KiB  1023 MiB   99 GiB  2.47  1.01   14      up          osd.3       
 -9         0.19717         -  202 GiB  5.0 GiB  1.0 GiB  1.7 MiB   2.0 GiB  197 GiB  2.48  1.02    -              host node3      
  4    hdd  0.09859   1.00000  101 GiB  2.6 GiB  567 MiB  868 KiB  1023 MiB   98 GiB  2.53  1.04   13      up          osd.4       
  5    hdd  0.09859   1.00000  101 GiB  2.5 GiB  475 MiB  862 KiB  1023 MiB   99 GiB  2.44  1.00   12      up          osd.5       
                        TOTAL  808 GiB   20 GiB  3.7 GiB  6.8 MiB   8.0 GiB  788 GiB  2.44                                         
MIN/MAX VAR: 0.96/1.04  STDDEV: 0.07

[root@ln-ceph-rpm build]# ceph -s
  cluster:
    id:     afec64f8-d9ee-4262-9410-fcf907807e2c
    health: HEALTH_WARN
            nobackfill flag(s) set
            1 osds down
            Degraded data redundancy: 72/783 objects degraded (9.195%), 8 pgs degraded

  services:
    mon: 3 daemons, quorum a,b,c (age 20m)
    mgr: x(active, since 77m)
    ioa: 0 daemonsno daemons active
    osd: 8 osds: 7 up (since 35s), 8 in (since 7m)
         flags nobackfill

  data:
    pools:   2 pools, 33 pgs
    objects: 261 objects, 1.0 GiB
    usage:   20 GiB used, 788 GiB / 808 GiB avail
    pgs:     72/783 objects degraded (9.195%)
             24 active+clean
             8  active+undersized+degraded
             1  active+undersized

[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    808 GiB  788 GiB  12 GiB    20 GiB       2.44
TOTAL  808 GiB  788 GiB  12 GiB    20 GiB       2.44

--- POOLS ---
POOL                   ID  PGS  STORED   (DATA)   (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1    1      0 B      0 B     0 B        0      0 B      0 B     0 B      0    260 GiB  N/A            N/A              0         0 B          0 B
test                    2   32  1.1 GiB  1.1 GiB     0 B      261  3.0 GiB  3.0 GiB     0 B   0.38    286 GiB  N/A            N/A            261         0 B          0 B

/// we can see that 
STORED = 1.1G     ///increase 1.0G --> 1.1G
(DATA) = 1.1G     ///increase 1.0G --> 1.1G
MAX AVAIL = 286G  ///increase 260G --> 286G

3. kill 9 osd.0.pid - OSD.0 OUT

[root@ln-ceph-rpm build]# ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME           
 -3               0         -      0 B      0 B      0 B      0 B       0 B      0 B     0     0    -          root myroot         
 -1         0.78870         -  707 GiB   17 GiB  3.4 GiB  6.0 MiB   7.0 GiB  690 GiB  2.46  1.00    -          root default        
-11         0.19717         -  202 GiB  4.9 GiB  927 MiB  1.6 MiB   2.0 GiB  197 GiB  2.43  0.99    -              host ln-ceph-rpm
  6    hdd  0.09859   1.00000  101 GiB  2.5 GiB  532 MiB  1.0 MiB  1023 MiB   98 GiB  2.49  1.01   14      up          osd.6       
  7    hdd  0.09859   1.00000  101 GiB  2.4 GiB  395 MiB  596 KiB  1023 MiB   99 GiB  2.36  0.96   11      up          osd.7       
 -5         0.19717         -  101 GiB  2.5 GiB  539 MiB  921 KiB  1023 MiB   98 GiB  2.50  1.02    -              host node1      
  0    hdd  0.09859         0      0 B      0 B      0 B      0 B       0 B      0 B     0     0    0    down          osd.0       
  1    hdd  0.09859   1.00000  101 GiB  2.5 GiB  539 MiB  921 KiB  1023 MiB   98 GiB  2.50  1.02   16      up          osd.1       
 -6         0.19717         -  202 GiB  4.9 GiB  955 MiB  1.8 MiB   2.0 GiB  197 GiB  2.44  0.99    -              host node2      
  2    hdd  0.09859   1.00000  101 GiB  2.4 GiB  443 MiB  971 KiB  1023 MiB   99 GiB  2.41  0.98   12      up          osd.2       
  3    hdd  0.09859   1.00000  101 GiB  2.5 GiB  511 MiB  903 KiB  1023 MiB   99 GiB  2.47  1.01   14      up          osd.3       
 -9         0.19717         -  202 GiB  5.0 GiB  1.0 GiB  1.7 MiB   2.0 GiB  197 GiB  2.48  1.01    -              host node3      
  4    hdd  0.09859   1.00000  101 GiB  2.6 GiB  567 MiB  868 KiB  1023 MiB   98 GiB  2.53  1.03   13      up          osd.4       
  5    hdd  0.09859   1.00000  101 GiB  2.5 GiB  475 MiB  862 KiB  1023 MiB   99 GiB  2.44  0.99   12      up          osd.5       
                        TOTAL  707 GiB   17 GiB  3.4 GiB  6.0 MiB   7.0 GiB  690 GiB  2.46                                         
MIN/MAX VAR: 0.96/1.03  STDDEV: 0.05

[root@ln-ceph-rpm build]# ceph -s
  cluster:
    id:     afec64f8-d9ee-4262-9410-fcf907807e2c
    health: HEALTH_WARN
            nobackfill flag(s) set
            Degraded data redundancy: 64/783 objects degraded (8.174%), 7 pgs degraded

  services:
    mon: 3 daemons, quorum a,b,c (age 27m)
    mgr: x(active, since 83m)
    ioa: 0 daemonsno daemons active
    osd: 8 osds: 7 up (since 6m), 7 in (since 67s); 7 remapped pgs
         flags nobackfill

  data:
    pools:   2 pools, 33 pgs
    objects: 261 objects, 1.0 GiB
    usage:   17 GiB used, 690 GiB / 707 GiB avail
    pgs:     64/783 objects degraded (8.174%)
             26 active+clean
             4  active+undersized+degraded+remapped+backfill_wait
             3  active+undersized+degraded+remapped+backfilling

[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    707 GiB  690 GiB  10 GiB    17 GiB       2.46
TOTAL  707 GiB  690 GiB  10 GiB    17 GiB       2.46

--- POOLS ---
POOL                   ID  PGS  STORED   (DATA)   (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1    1      0 B      0 B     0 B        0      0 B      0 B     0 B      0    260 GiB  N/A            N/A              0         0 B          0 B
test                    2   32  1.1 GiB  1.1 GiB     0 B      261  3.0 GiB  3.0 GiB     0 B   0.39    283 GiB  N/A            N/A            261         0 B          0 B

/// we can see that 
STORED = 1.1G     ///increase 1.0G --> 1.1G
(DATA) = 1.1G     ///increase 1.0G --> 1.1G
MAX AVAIL = 283G  ///increase 260G --> 283G

4. kill 9 osd.0.pid - OSD.0 OUT unset nobackfill --> recovery HEALTH_OK


[root@ln-ceph-rpm build]# ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA      OMAP     META      AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME           
 -3               0         -      0 B      0 B       0 B      0 B       0 B      0 B     0     0    -          root myroot         
 -1         0.78870         -  707 GiB   18 GiB   3.6 GiB  6.0 MiB   7.0 GiB  689 GiB  2.49  1.00    -          root default        
-11         0.19717         -  202 GiB  5.0 GiB  1012 MiB  1.6 MiB   2.0 GiB  197 GiB  2.47  0.99    -              host ln-ceph-rpm
  6    hdd  0.09859   1.00000  101 GiB  2.6 GiB   616 MiB  1.0 MiB  1023 MiB   98 GiB  2.58  1.03   16      up          osd.6       
  7    hdd  0.09859   1.00000  101 GiB  2.4 GiB   396 MiB  596 KiB  1023 MiB   99 GiB  2.36  0.95   11      up          osd.7       
 -5         0.19717         -  101 GiB  2.6 GiB   588 MiB  921 KiB  1023 MiB   98 GiB  2.55  1.02    -              host node1      
  0    hdd  0.09859         0      0 B      0 B       0 B      0 B       0 B      0 B     0     0    0    down          osd.0       
  1    hdd  0.09859   1.00000  101 GiB  2.6 GiB   588 MiB  921 KiB  1023 MiB   98 GiB  2.55  1.02   18      up          osd.1       
 -6         0.19717         -  202 GiB  5.0 GiB   1.0 GiB  1.8 MiB   2.0 GiB  197 GiB  2.48  0.99    -              host node2      
  2    hdd  0.09859   1.00000  101 GiB  2.5 GiB   516 MiB  971 KiB  1023 MiB   98 GiB  2.48  0.99   14      up          osd.2       
  3    hdd  0.09859   1.00000  101 GiB  2.5 GiB   512 MiB  903 KiB  1023 MiB   99 GiB  2.48  0.99   14      up          osd.3       
 -9         0.19717         -  202 GiB  5.1 GiB   1.1 GiB  1.7 MiB   2.0 GiB  197 GiB  2.51  1.01    -              host node3      
  4    hdd  0.09859   1.00000  101 GiB  2.6 GiB   568 MiB  868 KiB  1023 MiB   98 GiB  2.53  1.01   13      up          osd.4       
  5    hdd  0.09859   1.00000  101 GiB  2.5 GiB   520 MiB  862 KiB  1023 MiB   98 GiB  2.48  1.00   13      up          osd.5       
                        TOTAL  707 GiB   18 GiB   3.6 GiB  6.0 MiB   7.0 GiB  689 GiB  2.49                                         
MIN/MAX VAR: 0.95/1.03  STDDEV: 0.06

[root@ln-ceph-rpm build]# ceph -s
  cluster:
    id:     afec64f8-d9ee-4262-9410-fcf907807e2c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 30m)
    mgr: x(active, since 87m)
    ioa: 0 daemonsno daemons active
    osd: 8 osds: 7 up (since 10m), 7 in (since 4m)

  data:
    pools:   2 pools, 33 pgs
    objects: 261 objects, 1.0 GiB
    usage:   18 GiB used, 689 GiB / 707 GiB avail
    pgs:     33 active+clean

[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    707 GiB  689 GiB  11 GiB    18 GiB       2.49
TOTAL  707 GiB  689 GiB  11 GiB    18 GiB       2.49

--- POOLS ---
POOL                   ID  PGS  STORED   (DATA)   (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1    1      0 B      0 B     0 B        0      0 B      0 B     0 B      0    260 GiB  N/A            N/A              0         0 B          0 B
test                    2   32  1.1 GiB  1.1 GiB     0 B      261  3.3 GiB  3.3 GiB     0 B   0.42    260 GiB  N/A            N/A            261         0 B          0 B

/// we can see that 
STORED = 1.1G     ///increase 1.0G --> 1.1G
(DATA) = 1.1G     ///increase 1.0G --> 1.1G
MAX AVAIL = 260G  ///unchanged 260G --> 260G

#5 Updated by jianwei zhang over 1 year ago

Problem1 step1 vs step2:
1. ceph cluster initial state
STORED = 1.0G
(DATA) = 1.0G
MAX AVAIL = 260G
2. kill 9 osd.0.pid - OSD.0 DOWN
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 286G ///increase 260G --> 286G

Problem1 step1 vs step3:
1. ceph cluster initial state
STORED = 1.0G
(DATA) = 1.0G
MAX AVAIL = 260G
3. kill 9 osd.0.pid - OSD.0 OUT
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 283G ///increase 260G --> 283G

We wrote only 1G of rbd data to the test pool, which keeps 3 copies.
But ceph df detail shows that STORED, (DATA) and MAX AVAIL all increased.
I don't think this is correct.
The values in the degraded state should be the same as under HEALTH_OK.

Problem2 step1 vs step4:
1. ceph cluster initial state
STORED = 1.0G
(DATA) = 1.0G
MAX AVAIL = 260G
4. kill 9 osd.0.pid - OSD.0 OUT unset nobackfill --> recovery HEALTH_OK
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 260G ///unchanged 260G --> 260G

ceph df detail shows that STORED and (DATA) both increased.
I don't think this is correct, because we only have 1G of data.
The values should be the same as in the initial HEALTH_OK state.

MAX AVAIL = 260G is also not correct: osd.0 is already out, so I don't think it should be counted.

#6 Updated by jianwei zhang over 1 year ago

For the MAX AVAIL field, I think the down or out osd.0 should be excluded.

int64_t PGMap::get_rule_avail(const OSDMap &osdmap, int ruleno) const {
    map<int, float> wm;
    int r = osdmap.crush->get_rule_weight_osd_map(ruleno, &wm);
    // gdb dump of wm at this point (std::map with 8 elements):
    //   [0] = 0.125   <- osd.0 is down or out, but still in the cluster
    //   [1] = 0.125
    //   [2] = 0.125
    //   [3] = 0.125
    //   [4] = 0.125
    //   [5] = 0.125
    //   [6] = 0.125
    //   [7] = 0.125
    if (r < 0) {
        return r;
    }
    if (wm.empty()) {
        return 0;
    }

    float fratio = osdmap.get_full_ratio();

    int64_t min = -1;
    for (auto p = wm.begin(); p != wm.end(); ++p) {
        auto osd_info = osd_stat.find(p->first);
        if (osd_info != osd_stat.end()) {
            if (osd_info->second.statfs.total == 0 || p->second == 0) {
                // osd must be out, hence its stats have been zeroed
                // (unless we somehow managed to have a disk with size 0...)
                //
                // (p->second == 0), if osd weight is 0, no need to
                // calculate proj below.
                continue;
            }
            double unusable = (double)osd_info->second.statfs.kb() * (1.0 - fratio);
            double avail = std::max(0.0, (double)osd_info->second.statfs.kb_avail() - unusable);
            avail *= 1024.0;
            int64_t proj = (int64_t)(avail / (double)p->second);  ///will cause a deviation in this calculation
            if (min < 0 || proj < min) {
                min = proj;
            }
        } else {
            if (osdmap.is_up(p->first)) {
                // This is a level 4 rather than an error, because we might have
                // only just started, and not received the first stats message yet.
                dout(4) << "OSD " << p->first << " is up, but has no stats" << dendl;
            }
        }
    }
    return min;
}
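
The effect of the stale weight can be checked with a small standalone sketch of the projection loop above. OsdSpace and projected_rule_avail are hypothetical names, and the sketch assumes equally sized OSDs with the same free space. With eight entries of weight 1/8 each OSD's free space is multiplied by 8; with seven entries of weight 1/7 it is multiplied by 7. Scaling the healthy 260 GiB by 7/8 gives about 227.5 GiB, which is consistent with the 227 GiB that the thread reports once osd.0 is removed from the CRUSH map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for the statfs fields used by get_rule_avail().
struct OsdSpace {
    double total_bytes;
    double avail_bytes;
};

// Sketch of the projection: for each OSD in the weight map, project its
// usable free space to the whole rule (avail / weight) and keep the minimum.
inline int64_t projected_rule_avail(const std::map<int, float>& wm,
                                    const std::map<int, OsdSpace>& stats,
                                    double full_ratio) {
    assert(full_ratio >= 0.0 && full_ratio <= 1.0);
    if (wm.empty()) {
        return 0;
    }
    int64_t min = -1;
    for (const auto& kv : wm) {
        auto it = stats.find(kv.first);
        if (it == stats.end() || it->second.total_bytes == 0 || kv.second == 0) {
            continue;  // no stats, zeroed stats, or zero weight: skip
        }
        // Space above the full ratio cannot be used.
        double unusable = it->second.total_bytes * (1.0 - full_ratio);
        double avail = std::max(0.0, it->second.avail_bytes - unusable);
        int64_t proj = static_cast<int64_t>(avail / kv.second);
        if (min < 0 || proj < min) {
            min = proj;
        }
    }
    return min;
}
```

The result is a raw figure; the pool-level MAX AVAIL shown by ceph df is this value divided by the pool's raw_used_rate (3 for the pools in this report).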

#7 Updated by jianwei zhang over 1 year ago

For the ceph df detail command, I don't think raw_used_rate should be adjusted for degraded objects:

    if (sum.num_object_copies > 0) {
        raw_used_rate *= (float)(sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies;
    }

void PGMapDigest::dump_object_stat_sum(TextTable &tbl, ceph::Formatter *f, const pool_stat_t &pool_stat, uint64_t avail,
                                               float raw_used_rate, bool verbose, bool per_pool, bool per_pool_omap, const pg_pool_t *pool) 
{
    const object_stat_sum_t &sum = pool_stat.stats.sum;
    const store_statfs_t statfs = pool_stat.store_stats;

    if (sum.num_object_copies > 0) {
        raw_used_rate *= (float)(sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies;
    }

    uint64_t used_data_bytes = pool_stat.get_allocated_data_bytes(per_pool);
    uint64_t used_omap_bytes = pool_stat.get_allocated_omap_bytes(per_pool_omap);
    uint64_t used_bytes = used_data_bytes + used_omap_bytes;

    float used = 0.0;
    // note avail passed in is raw_avail, calc raw_used here.
    if (avail) {
        used = used_bytes;
        used /= used + avail;
    } else if (used_bytes) {
        used = 1.0;
    }
    auto avail_res = raw_used_rate ? avail / raw_used_rate : 0;
    // an approximation for actually stored user data
    auto stored_data_normalized = pool_stat.get_user_data_bytes(raw_used_rate, per_pool);
    auto stored_omap_normalized = pool_stat.get_user_omap_bytes(raw_used_rate, per_pool_omap);
    auto stored_normalized = stored_data_normalized + stored_omap_normalized;
    // same, amplified by replication or EC
    auto stored_raw = stored_normalized * raw_used_rate;
    if (f) {
        f->dump_int("stored", stored_normalized);
        if (verbose) {
            f->dump_int("stored_data", stored_data_normalized);
            f->dump_int("stored_omap", stored_omap_normalized);
        }
        f->dump_int("objects", sum.num_objects);
        f->dump_int("kb_used", shift_round_up(used_bytes, 10));
        f->dump_int("bytes_used", used_bytes);
        if (verbose) {
            f->dump_int("data_bytes_used", used_data_bytes);
            f->dump_int("omap_bytes_used", used_omap_bytes);
        }
        f->dump_float("percent_used", used);
        f->dump_unsigned("max_avail", avail_res);
        if (verbose) {
            f->dump_int("quota_objects", pool->quota_max_objects);
            f->dump_int("quota_bytes", pool->quota_max_bytes);
            f->dump_int("dirty", sum.num_objects_dirty);
            f->dump_int("rd", sum.num_rd);
            f->dump_int("rd_bytes", sum.num_rd_kb * 1024ull);
            f->dump_int("wr", sum.num_wr);
            f->dump_int("wr_bytes", sum.num_wr_kb * 1024ull);
            f->dump_int("compress_bytes_used", statfs.data_compressed_allocated);
            f->dump_int("compress_under_bytes", statfs.data_compressed_original);
            // Stored by user amplified by replication
            f->dump_int("stored_raw", stored_raw);
            f->dump_unsigned("avail_raw", avail);
        }
    } else {
        tbl << stringify(byte_u_t(stored_normalized));
        if (verbose) {
            tbl << stringify(byte_u_t(stored_data_normalized));
            tbl << stringify(byte_u_t(stored_omap_normalized));
        }
        tbl << stringify(si_u_t(sum.num_objects));
        tbl << stringify(byte_u_t(used_bytes));
        if (verbose) {
            tbl << stringify(byte_u_t(used_data_bytes));
            tbl << stringify(byte_u_t(used_omap_bytes));
        }
        tbl << percentify(used * 100);
        tbl << stringify(byte_u_t(avail_res));
        if (verbose) {
            if (pool->quota_max_objects == 0)
                tbl << "N/A";
            else
                tbl << stringify(si_u_t(pool->quota_max_objects));

            if (pool->quota_max_bytes == 0)
                tbl << "N/A";
            else
                tbl << stringify(byte_u_t(pool->quota_max_bytes));

            tbl << stringify(si_u_t(sum.num_objects_dirty)) << stringify(byte_u_t(statfs.data_compressed_allocated))
                << stringify(byte_u_t(statfs.data_compressed_original));
        }
    }
}
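
The numbers in this thread can be reproduced from the two expressions quoted above. The sketch below is a worked check, not Ceph code: adjusted_raw_used_rate and max_avail are hypothetical helper names, double is used where the original uses float, and the raw avail is assumed to be 260 GiB x 3 = 780 GiB, back-computed from the healthy MAX AVAIL. With 72 of 783 copies degraded the rate drops from 3 to about 2.72, so 780 / 2.72 gives about 286 GiB (step 2). With 64 of 783 degraded it gives about 283 GiB (step 3).

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the adjustment in dump_object_stat_sum(): the replication factor
// is scaled by the fraction of object copies that are not degraded.
inline double adjusted_raw_used_rate(double raw_used_rate,
                                     int64_t num_object_copies,
                                     int64_t num_objects_degraded) {
    assert(num_objects_degraded <= num_object_copies);
    if (num_object_copies > 0) {
        raw_used_rate *= static_cast<double>(num_object_copies - num_objects_degraded) /
                         static_cast<double>(num_object_copies);
    }
    return raw_used_rate;
}

// Mirrors: avail_res = raw_used_rate ? avail / raw_used_rate : 0
inline double max_avail(double raw_avail, double raw_used_rate) {
    return raw_used_rate != 0.0 ? raw_avail / raw_used_rate : 0.0;
}
```

For example, max_avail(780.0, adjusted_raw_used_rate(3.0, 783, 72)) evaluates to roughly 286.3. The smaller divisor would also explain why STORED rises from 1.0 GiB to 1.1 GiB while USED stays at 3.0 GiB, assuming STORED is derived by dividing the stored raw bytes by the same rate.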

#8 Updated by jianwei zhang over 1 year ago

for Problem2 step1 vs step4:

osd.0 is already out and recovery has completed (HEALTH_OK), but STORED/(DATA) increased from 1.0G to 1.1G.

The user data is only 1G, so I think this is also a problem.

Problem2 step1 vs step4:
1. ceph cluster initial state
STORED = 1.0G
(DATA) = 1.0G
4. kill 9 osd.0.pid - OSD.0 OUT unset nobackfill --> recovery HEALTH_OK
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G

#9 Updated by jianwei zhang over 1 year ago

5. remove out osd.0

[root@ln-ceph-rpm build]# ceph osd rm osd.0
removed osd.0

[root@ln-ceph-rpm build]# ceph osd crush remove osd.0
removed item id 0 name 'osd.0' from crush map

[root@ln-ceph-rpm build]# ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME           
 -3               0         -      0 B      0 B      0 B      0 B       0 B      0 B     0     0    -          root myroot         
 -1         0.69011         -  707 GiB   18 GiB  3.6 GiB  6.0 MiB   7.0 GiB  689 GiB  2.49  1.00    -          root default        
-11         0.19717         -  202 GiB  5.1 GiB  1.1 GiB  1.6 MiB   2.0 GiB  197 GiB  2.51  1.01    -              host ln-ceph-rpm
  6    hdd  0.09859   1.00000  101 GiB  2.6 GiB  621 MiB  1.0 MiB  1023 MiB   98 GiB  2.58  1.03   17      up          osd.6       
  7    hdd  0.09859   1.00000  101 GiB  2.5 GiB  481 MiB  596 KiB  1023 MiB   99 GiB  2.45  0.98   12      up          osd.7       
 -5         0.09859         -  101 GiB  2.4 GiB  421 MiB  921 KiB  1023 MiB   99 GiB  2.39  0.96    -              host node1      
  1    hdd  0.09859   1.00000  101 GiB  2.4 GiB  421 MiB  921 KiB  1023 MiB   99 GiB  2.39  0.96   12      up          osd.1       
 -6         0.19717         -  202 GiB  5.0 GiB  1.0 GiB  1.8 MiB   2.0 GiB  197 GiB  2.50  1.00    -              host node2      
  2    hdd  0.09859   1.00000  101 GiB  2.5 GiB  517 MiB  971 KiB  1023 MiB   98 GiB  2.48  0.99   15      up          osd.2       
  3    hdd  0.09859   1.00000  101 GiB  2.5 GiB  549 MiB  903 KiB  1023 MiB   98 GiB  2.51  1.01   15      up          osd.3       
 -9         0.19717         -  202 GiB  5.1 GiB  1.1 GiB  1.7 MiB   2.0 GiB  197 GiB  2.53  1.01    -              host node3      
  4    hdd  0.09859   1.00000  101 GiB  2.6 GiB  613 MiB  868 KiB  1023 MiB   98 GiB  2.57  1.03   15      up          osd.4       
  5    hdd  0.09859   1.00000  101 GiB  2.5 GiB  521 MiB  862 KiB  1023 MiB   98 GiB  2.48  1.00   13      up          osd.5       
                        TOTAL  707 GiB   18 GiB  3.6 GiB  6.0 MiB   7.0 GiB  689 GiB  2.49                                         
MIN/MAX VAR: 0.96/1.03  STDDEV: 0.06

[root@ln-ceph-rpm build]# ceph -s
  cluster:
    id:     afec64f8-d9ee-4262-9410-fcf907807e2c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 18m)
    mgr: x(active, since 2h)
    ioa: 0 daemonsno daemons active
    osd: 7 osds: 7 up (since 55m), 7 in (since 49m)

  data:
    pools:   2 pools, 33 pgs
    objects: 261 objects, 1.0 GiB
    usage:   18 GiB used, 689 GiB / 707 GiB avail
    pgs:     33 active+clean

[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    707 GiB  689 GiB  11 GiB    18 GiB       2.49
TOTAL  707 GiB  689 GiB  11 GiB    18 GiB       2.49

--- POOLS ---
POOL                   ID  PGS  STORED   (DATA)   (OMAP)  OBJECTS  USED     (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1    1      0 B      0 B     0 B        0      0 B      0 B     0 B      0    227 GiB  N/A            N/A              0         0 B          0 B
test                    2   32  1.0 GiB  1.0 GiB     0 B      261  3.0 GiB  3.0 GiB     0 B   0.44    227 GiB  N/A            N/A            261         0 B          0 B

/// we can see that 
STORED = 1.0G     ///decrease 1.1G --> 1.0G
(DATA) = 1.0G     ///decrease 1.1G --> 1.0G
MAX AVAIL = 227G  ///decrease 260G --> 227G

#10 Updated by jianwei zhang over 1 year ago

step4 vs step5:
4. kill 9 osd.0.pid - OSD.0 OUT unset nobackfill --> recovery HEALTH_OK
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 260G ///unchanged 260G --> 260G
5. remove out osd.0
STORED = 1.0G ///decrease 1.1G --> 1.0G
(DATA) = 1.0G ///decrease 1.1G --> 1.0G
MAX AVAIL = 227G ///decrease 260G --> 227G

After osd.0 is removed, the values become correct.

#11 Updated by jianwei zhang over 1 year ago

The original reason for raising this issue is that testers (users) are confused about why MAX AVAIL increases rather than decreases after an OSD goes down (or out).

#12 Updated by jianwei zhang over 1 year ago

4. kill 9 osd.0.pid - OSD.0 OUT unset nobackfill --> recovery HEALTH_OK
     STORED = 1.1G     ///increase 1.0G --> 1.1G
     (DATA) = 1.1G     ///increase 1.0G --> 1.1G
     MAX AVAIL = 260G  ///equal 260G --> 260G
     USED = 3.3G       ///increase 3.0G --> 3.3 G

The cluster has returned to HEALTH_OK

Why are STORED/(DATA) = 1.1G and USED = 3.3G?

This happens because osd.0 is only marked out: pg_pool_sum (mempool::pgmap::unordered_map<int32_t, pool_stat_t>) does not subtract the space that the pool used on osd.0, which is tracked per { <pool, osd> : store_statfs_t }.

Only after ceph osd rm osd.0 && ceph osd crush rm osd.0 has been executed, and osd.0 has been deleted from the osdmap/crushmap, is the space used by the pool on osd.0 subtracted from the pool.

#14 Updated by jianwei zhang over 1 year ago

jianwei zhang wrote:

https://github.com/ceph/ceph/pull/46478

test result

#15 Updated by Radoslaw Zarzynski over 1 year ago

  • Pull request ID set to 46478

#16 Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Fix Under Review

#17 Updated by Radoslaw Zarzynski over 1 year ago

  • Backport set to octopus,quincy
