Bug #52969 (open)
Using the "ceph df" command, pool MAX AVAIL increases when there are degraded objects in the pool
Description
Before taking the OSD down:
--- POOLS ---
POOL ID STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 0 B 5 0 B 0 138 GiB
tfs 2 79 MiB 91 236 MiB 0.17 46 GiB
data 3 0 B 0 0 B 0 138 GiB
[root@lnhost116 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-17 0.14639 root tfs
-16 0.14639 rack tfs-rack_0
-22 0.04880 host tfs-lnhost116
4 ssd 0.04880 osd.4 up 1.00000 1.00000
-25 0.04880 host tfs-lnhost117
2 ssd 0.04880 osd.2 up 1.00000 1.00000
-28 0.04880 host tfs-lnhost118
0 ssd 0.04880 osd.0 up 1.00000 1.00000
After taking the OSD down:
--- POOLS ---
POOL ID STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 0 B 5 0 B 0 207 GiB
tfs 2 79 MiB 91 158 MiB 0.11 70 GiB
data 3 0 B 0 0 B 0 138 GiB
[root@lnhost116 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-17 0.14639 root tfs
-16 0.14639 rack tfs-rack_0
-22 0.04880 host tfs-lnhost116
4 ssd 0.04880 osd.4 up 1.00000 1.00000
-25 0.04880 host tfs-lnhost117
2 ssd 0.04880 osd.2 down 1.00000 1.00000
-28 0.04880 host tfs-lnhost118
0 ssd 0.04880 osd.0 up 1.00000 1.00000
The tfs pool is 3-replica with host as the CRUSH failure domain. I took one OSD in the tfs pool down, but the pool's MAX AVAIL increased, which is illogical.
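For reference, the before/after numbers are consistent with the per-pool stats being scaled by the degraded-copy fraction (this is what the comments below end up analyzing): with one of three hosts down, one third of all object copies are degraded. A standalone arithmetic check, not Ceph code:

#include <cstdio>

int main() {
  // One of three hosts down => k = (copies - degraded) / copies = 2/3.
  // MAX AVAIL scales by 1/k, USED scales by k.
  double k = 2.0 / 3.0;
  std::printf("device_health_metrics MAX AVAIL: %.0f GiB\n", 138 / k);  // 207
  std::printf("tfs MAX AVAIL: %.0f GiB\n", 46 / k);   // 69 (shown as 70, byte-count rounding)
  std::printf("tfs USED: %.0f MiB\n", 236 * k);       // ~157 (shown as 158)
}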
Updated by minghang zhao over 2 years ago
My proposed fix is to call a new function, del_down_out_osd(), from PGMap::get_rule_avail() when computing a pool's avail value. The function removes OSDs that are in the down or out state from the CRUSH-rule weight map backing the pool.
int64_t PGMap::get_rule_avail(const OSDMap& osdmap, int ruleno) const
{
  map<int,float> wm;
  int r = osdmap.crush->get_rule_weight_osd_map(ruleno, &wm);
  if (r < 0) {
    return r;
  }
  if (wm.empty()) {
    return 0;
  }
  // New: drop down/out OSDs from the weight map before computing avail.
  del_down_out_osd(osdmap, wm);
  float fratio = osdmap.get_full_ratio();
  int64_t min = -1;
  for (auto p = wm.begin(); p != wm.end(); ++p) {
  ....
void PGMap::del_down_out_osd(const OSDMap &osdmap, map<int,float> &wm) const
{
  float weight = 0.0;
  int osd_cnt = 0;
  bool del_flag = false;
  for (auto p = wm.begin(); p != wm.end(); ) {
    osd_cnt = wm.size();
    auto osd_info = osd_stat.find(p->first);
    if (osd_info != osd_stat.end()) {
      if (osd_info->second.statfs.total != 0 && p->second != 0 &&
          (!osdmap.is_up(p->first) || osdmap.is_out(p->first))) {
        dout(5) << " p->first is: " << p->first << " is continue" << dendl;
        // Weight each surviving OSD will get once this one is erased.
        if (osd_cnt > 1) {
          weight = 1.0 / (osd_cnt - 1);
        } else {
          weight = 0;
        }
        dout(5) << "erase p->first is: " << p->first << dendl;
        wm.erase(p++);
        del_flag = true;
        continue;
      } else {
        ++p;
      }
    } else {
      ++p;
    }
  }
  // If anything was erased, redistribute the weight evenly across survivors.
  for (auto p = wm.begin(); p != wm.end() && del_flag; ++p) {
    p->second = weight;
    dout(10) << "p->first is: " << p->first << " new weight is: " << weight << dendl;
  }
}
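As a toy illustration of the renormalization (standalone, not Ceph code): with the 3-OSD tfs pool from the description and one OSD down, the two survivors would end up with weight 1/2 each:

#include <cstdio>
#include <map>

int main() {
  std::map<int, float> wm = {{0, 1.0f/3}, {2, 1.0f/3}, {4, 1.0f/3}};
  wm.erase(2);                    // the down/out OSD is dropped
  float w = 1.0f / wm.size();     // survivors share the weight evenly
  for (auto &[osd, weight] : wm)
    weight = w;
  for (auto &[osd, weight] : wm)
    std::printf("osd.%d weight %.3f\n", osd, weight);  // 0.500 each
}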
Updated by Neha Ojha over 2 years ago
minghang zhao wrote:
My proposed fix is to call a new function, del_down_out_osd(), from PGMap::get_rule_avail() when computing a pool's avail value. The function removes OSDs that are in the down or out state from the CRUSH-rule weight map backing the pool.
[...]
Would you like to propose a PR?
Updated by jianwei zhang almost 2 years ago
I hit the same problem on ceph v15.2.13.
1. ceph cluster initial state
# rbd create test/rbd --size 1G
Write 1G of data sequentially
# rbd bench -p test --image rbd --io-size 1M --io-threads 1 --io-total 1G --io-pattern seq --io-type write
[root@ln-ceph-rpm build]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-3 0 root myroot
-1 0.78870 root default
-11 0.19717 host ln-ceph-rpm
6 hdd 0.09859 osd.6 up 1.00000 1.00000
7 hdd 0.09859 osd.7 up 1.00000 1.00000
-5 0.19717 host node1
0 hdd 0.09859 osd.0 up 1.00000 1.00000
1 hdd 0.09859 osd.1 up 1.00000 1.00000
-6 0.19717 host node2
2 hdd 0.09859 osd.2 up 1.00000 1.00000
3 hdd 0.09859 osd.3 up 1.00000 1.00000
-9 0.19717 host node3
4 hdd 0.09859 osd.4 up 1.00000 1.00000
5 hdd 0.09859 osd.5 up 1.00000 1.00000
[root@ln-ceph-rpm build]# ceph -s
cluster:
id: afec64f8-d9ee-4262-9410-fcf907807e2c
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 14m)
mgr: x(active, since 70m)
ioa: 0 daemons (no daemons active)
osd: 8 osds: 8 up (since 74s), 8 in (since 74s)
data:
pools: 2 pools, 33 pgs
objects: 261 objects, 1.0 GiB
usage: 20 GiB used, 788 GiB / 808 GiB avail
pgs: 33 active+clean
[root@ln-ceph-rpm build]# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 706 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'test' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 708 lfor 0/0/39 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 808 GiB 788 GiB 12 GiB 20 GiB 2.44
TOTAL 808 GiB 788 GiB 12 GiB 20 GiB 2.44
--- POOLS ---
POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
device_health_metrics 1 1 0 B 0 B 0 B 0 0 B 0 B 0 B 0 260 GiB N/A N/A 0 0 B 0 B
test 2 32 1.0 GiB 1.0 GiB 0 B 261 3.0 GiB 3.0 GiB 0 B 0.38 260 GiB N/A N/A 261 0 B 0 B
/// we can see that
STORED = 1.0G
(DATA) = 1.0G
MAX AVAIL = 260G
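Two sanity checks on this baseline (standalone arithmetic, not Ceph code; assumes the default 4 MiB RBD object size): 1 GiB splits into 256 data objects, which together with a handful of RBD metadata objects gives the 261 objects reported, i.e. 261 x 3 = 783 object copies (the denominator of the degraded percentages below); and the healthy MAX AVAIL is close to raw AVAIL divided by the replication factor:

#include <cstdio>

int main() {
  long data_objects = (1L << 30) / (4L << 20);            // 1 GiB / 4 MiB = 256
  std::printf("data objects: %ld\n", data_objects);
  std::printf("object copies: %d\n", 261 * 3);            // 783
  // Full-ratio headroom and the min-over-OSDs rule in get_rule_avail()
  // account for the small gap down to the reported 260 GiB.
  std::printf("raw avail / 3 = %.1f GiB\n", 788.0 / 3);   // ~262.7
}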
2. kill -9 osd.0.pid - OSD.0 DOWN
[root@ln-ceph-rpm build]# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-3 0 - 0 B 0 B 0 B 0 B 0 B 0 B 0 0 - root myroot
-1 0.78870 - 808 GiB 20 GiB 3.7 GiB 6.8 MiB 8.0 GiB 788 GiB 2.44 1.00 - root default
-11 0.19717 - 202 GiB 4.9 GiB 894 MiB 1.6 MiB 2.0 GiB 197 GiB 2.41 0.99 - host ln-ceph-rpm
6 hdd 0.09859 1.00000 101 GiB 2.5 GiB 531 MiB 1.0 MiB 1023 MiB 98 GiB 2.49 1.02 14 up osd.6
7 hdd 0.09859 1.00000 101 GiB 2.4 GiB 363 MiB 596 KiB 1023 MiB 99 GiB 2.33 0.96 9 up osd.7
-5 0.19717 - 202 GiB 4.9 GiB 909 MiB 1.6 MiB 2.0 GiB 197 GiB 2.42 0.99 - host node1
0 hdd 0.09859 1.00000 101 GiB 2.4 GiB 370 MiB 732 KiB 1023 MiB 99 GiB 2.34 0.96 0 down osd.0
1 hdd 0.09859 1.00000 101 GiB 2.5 GiB 539 MiB 921 KiB 1023 MiB 98 GiB 2.50 1.03 16 up osd.1
-6 0.19717 - 202 GiB 4.9 GiB 953 MiB 1.8 MiB 2.0 GiB 197 GiB 2.44 1.00 - host node2
2 hdd 0.09859 1.00000 101 GiB 2.4 GiB 443 MiB 971 KiB 1023 MiB 99 GiB 2.41 0.99 12 up osd.2
3 hdd 0.09859 1.00000 101 GiB 2.5 GiB 511 MiB 903 KiB 1023 MiB 99 GiB 2.47 1.01 14 up osd.3
-9 0.19717 - 202 GiB 5.0 GiB 1.0 GiB 1.7 MiB 2.0 GiB 197 GiB 2.48 1.02 - host node3
4 hdd 0.09859 1.00000 101 GiB 2.6 GiB 567 MiB 868 KiB 1023 MiB 98 GiB 2.53 1.04 13 up osd.4
5 hdd 0.09859 1.00000 101 GiB 2.5 GiB 475 MiB 862 KiB 1023 MiB 99 GiB 2.44 1.00 12 up osd.5
TOTAL 808 GiB 20 GiB 3.7 GiB 6.8 MiB 8.0 GiB 788 GiB 2.44
MIN/MAX VAR: 0.96/1.04 STDDEV: 0.07
[root@ln-ceph-rpm build]# ceph -s
cluster:
id: afec64f8-d9ee-4262-9410-fcf907807e2c
health: HEALTH_WARN
nobackfill flag(s) set
1 osds down
Degraded data redundancy: 72/783 objects degraded (9.195%), 8 pgs degraded
services:
mon: 3 daemons, quorum a,b,c (age 20m)
mgr: x(active, since 77m)
ioa: 0 daemons (no daemons active)
osd: 8 osds: 7 up (since 35s), 8 in (since 7m)
flags nobackfill
data:
pools: 2 pools, 33 pgs
objects: 261 objects, 1.0 GiB
usage: 20 GiB used, 788 GiB / 808 GiB avail
pgs: 72/783 objects degraded (9.195%)
24 active+clean
8 active+undersized+degraded
1 active+undersized
[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 808 GiB 788 GiB 12 GiB 20 GiB 2.44
TOTAL 808 GiB 788 GiB 12 GiB 20 GiB 2.44
--- POOLS ---
POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
device_health_metrics 1 1 0 B 0 B 0 B 0 0 B 0 B 0 B 0 260 GiB N/A N/A 0 0 B 0 B
test 2 32 1.1 GiB 1.1 GiB 0 B 261 3.0 GiB 3.0 GiB 0 B 0.38 286 GiB N/A N/A 261 0 B 0 B
/// we can see that
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 286G ///increase 260G --> 286G
3. kill -9 osd.0.pid - OSD.0 OUT
[root@ln-ceph-rpm build]# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-3 0 - 0 B 0 B 0 B 0 B 0 B 0 B 0 0 - root myroot
-1 0.78870 - 707 GiB 17 GiB 3.4 GiB 6.0 MiB 7.0 GiB 690 GiB 2.46 1.00 - root default
-11 0.19717 - 202 GiB 4.9 GiB 927 MiB 1.6 MiB 2.0 GiB 197 GiB 2.43 0.99 - host ln-ceph-rpm
6 hdd 0.09859 1.00000 101 GiB 2.5 GiB 532 MiB 1.0 MiB 1023 MiB 98 GiB 2.49 1.01 14 up osd.6
7 hdd 0.09859 1.00000 101 GiB 2.4 GiB 395 MiB 596 KiB 1023 MiB 99 GiB 2.36 0.96 11 up osd.7
-5 0.19717 - 101 GiB 2.5 GiB 539 MiB 921 KiB 1023 MiB 98 GiB 2.50 1.02 - host node1
0 hdd 0.09859 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down osd.0
1 hdd 0.09859 1.00000 101 GiB 2.5 GiB 539 MiB 921 KiB 1023 MiB 98 GiB 2.50 1.02 16 up osd.1
-6 0.19717 - 202 GiB 4.9 GiB 955 MiB 1.8 MiB 2.0 GiB 197 GiB 2.44 0.99 - host node2
2 hdd 0.09859 1.00000 101 GiB 2.4 GiB 443 MiB 971 KiB 1023 MiB 99 GiB 2.41 0.98 12 up osd.2
3 hdd 0.09859 1.00000 101 GiB 2.5 GiB 511 MiB 903 KiB 1023 MiB 99 GiB 2.47 1.01 14 up osd.3
-9 0.19717 - 202 GiB 5.0 GiB 1.0 GiB 1.7 MiB 2.0 GiB 197 GiB 2.48 1.01 - host node3
4 hdd 0.09859 1.00000 101 GiB 2.6 GiB 567 MiB 868 KiB 1023 MiB 98 GiB 2.53 1.03 13 up osd.4
5 hdd 0.09859 1.00000 101 GiB 2.5 GiB 475 MiB 862 KiB 1023 MiB 99 GiB 2.44 0.99 12 up osd.5
TOTAL 707 GiB 17 GiB 3.4 GiB 6.0 MiB 7.0 GiB 690 GiB 2.46
MIN/MAX VAR: 0.96/1.03 STDDEV: 0.05
[root@ln-ceph-rpm build]# ceph -s
cluster:
id: afec64f8-d9ee-4262-9410-fcf907807e2c
health: HEALTH_WARN
nobackfill flag(s) set
Degraded data redundancy: 64/783 objects degraded (8.174%), 7 pgs degraded
services:
mon: 3 daemons, quorum a,b,c (age 27m)
mgr: x(active, since 83m)
ioa: 0 daemons (no daemons active)
osd: 8 osds: 7 up (since 6m), 7 in (since 67s); 7 remapped pgs
flags nobackfill
data:
pools: 2 pools, 33 pgs
objects: 261 objects, 1.0 GiB
usage: 17 GiB used, 690 GiB / 707 GiB avail
pgs: 64/783 objects degraded (8.174%)
26 active+clean
4 active+undersized+degraded+remapped+backfill_wait
3 active+undersized+degraded+remapped+backfilling
[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 707 GiB 690 GiB 10 GiB 17 GiB 2.46
TOTAL 707 GiB 690 GiB 10 GiB 17 GiB 2.46
--- POOLS ---
POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
device_health_metrics 1 1 0 B 0 B 0 B 0 0 B 0 B 0 B 0 260 GiB N/A N/A 0 0 B 0 B
test 2 32 1.1 GiB 1.1 GiB 0 B 261 3.0 GiB 3.0 GiB 0 B 0.39 283 GiB N/A N/A 261 0 B 0 B
/// we can see that
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 283G ///increase 260G --> 283G
4. kill -9 osd.0.pid - OSD.0 OUT, unset nobackfill --> recovery to HEALTH_OK
[root@ln-ceph-rpm build]# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-3 0 - 0 B 0 B 0 B 0 B 0 B 0 B 0 0 - root myroot
-1 0.78870 - 707 GiB 18 GiB 3.6 GiB 6.0 MiB 7.0 GiB 689 GiB 2.49 1.00 - root default
-11 0.19717 - 202 GiB 5.0 GiB 1012 MiB 1.6 MiB 2.0 GiB 197 GiB 2.47 0.99 - host ln-ceph-rpm
6 hdd 0.09859 1.00000 101 GiB 2.6 GiB 616 MiB 1.0 MiB 1023 MiB 98 GiB 2.58 1.03 16 up osd.6
7 hdd 0.09859 1.00000 101 GiB 2.4 GiB 396 MiB 596 KiB 1023 MiB 99 GiB 2.36 0.95 11 up osd.7
-5 0.19717 - 101 GiB 2.6 GiB 588 MiB 921 KiB 1023 MiB 98 GiB 2.55 1.02 - host node1
0 hdd 0.09859 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down osd.0
1 hdd 0.09859 1.00000 101 GiB 2.6 GiB 588 MiB 921 KiB 1023 MiB 98 GiB 2.55 1.02 18 up osd.1
-6 0.19717 - 202 GiB 5.0 GiB 1.0 GiB 1.8 MiB 2.0 GiB 197 GiB 2.48 0.99 - host node2
2 hdd 0.09859 1.00000 101 GiB 2.5 GiB 516 MiB 971 KiB 1023 MiB 98 GiB 2.48 0.99 14 up osd.2
3 hdd 0.09859 1.00000 101 GiB 2.5 GiB 512 MiB 903 KiB 1023 MiB 99 GiB 2.48 0.99 14 up osd.3
-9 0.19717 - 202 GiB 5.1 GiB 1.1 GiB 1.7 MiB 2.0 GiB 197 GiB 2.51 1.01 - host node3
4 hdd 0.09859 1.00000 101 GiB 2.6 GiB 568 MiB 868 KiB 1023 MiB 98 GiB 2.53 1.01 13 up osd.4
5 hdd 0.09859 1.00000 101 GiB 2.5 GiB 520 MiB 862 KiB 1023 MiB 98 GiB 2.48 1.00 13 up osd.5
TOTAL 707 GiB 18 GiB 3.6 GiB 6.0 MiB 7.0 GiB 689 GiB 2.49
MIN/MAX VAR: 0.95/1.03 STDDEV: 0.06
[root@ln-ceph-rpm build]# ceph -s
cluster:
id: afec64f8-d9ee-4262-9410-fcf907807e2c
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 30m)
mgr: x(active, since 87m)
ioa: 0 daemons (no daemons active)
osd: 8 osds: 7 up (since 10m), 7 in (since 4m)
data:
pools: 2 pools, 33 pgs
objects: 261 objects, 1.0 GiB
usage: 18 GiB used, 689 GiB / 707 GiB avail
pgs: 33 active+clean
[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 707 GiB 689 GiB 11 GiB 18 GiB 2.49
TOTAL 707 GiB 689 GiB 11 GiB 18 GiB 2.49
--- POOLS ---
POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
device_health_metrics 1 1 0 B 0 B 0 B 0 0 B 0 B 0 B 0 260 GiB N/A N/A 0 0 B 0 B
test 2 32 1.1 GiB 1.1 GiB 0 B 261 3.3 GiB 3.3 GiB 0 B 0.42 260 GiB N/A N/A 261 0 B 0 B
/// we can see that
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 260G ///unchanged 260G --> 260G
Updated by jianwei zhang almost 2 years ago
Problem1 step1 vs step2:
1. ceph cluster initial state
STORED = 1.0G
(DATA) = 1.0G
MAX AVAIL = 260G
2. kill -9 osd.0.pid - OSD.0 DOWN
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 286G ///increase 260G --> 286G
Problem1 step1 vs step3:
1. ceph cluster initial state
STORED = 1.0G
(DATA) = 1.0G
MAX AVAIL = 260G
3. kill -9 osd.0.pid - OSD.0 OUT
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 283G ///increase 260G --> 283G
We wrote 1G via rbd into the 3-replica test pool, but ceph df detail shows STORED/(DATA)/MAX AVAIL all increased.
I don't think this is correct: the degraded cluster should report the same values as in HEALTH_OK.
Problem2 step1 vs step4:
1. ceph cluster initial state
STORED = 1.0G
(DATA) = 1.0G
MAX AVAIL = 260G
4. kill -9 osd.0.pid - OSD.0 OUT, unset nobackfill --> recovery to HEALTH_OK
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 260G ///unchanged 260G --> 260G
ceph df detail shows that STORED/(DATA) increased.
I don't think this is correct, because we only have 1G of user data; after recovery to HEALTH_OK the values should match the initial state.
MAX AVAIL = 260G is not correct either: osd.0 is already out, so I don't think we should still count osd.0.
Updated by jianwei zhang almost 2 years ago
For the MAX AVAIL field, I think the down or out osd.0 should be excluded:
int64_t PGMap::get_rule_avail(const OSDMap &osdmap, int ruleno) const
{
  map<int, float> wm;
  int r = osdmap.crush->get_rule_weight_osd_map(ruleno, &wm);
  // gdb: wm as returned by get_rule_weight_osd_map():
  //   $295 = std::map with 8 elements = {
  //     [0] = 0.125,   /// down or out osd.0, still in the cluster
  //     [1] = 0.125,
  //     [2] = 0.125,
  //     [3] = 0.125,
  //     [4] = 0.125,
  //     [5] = 0.125,
  //     [6] = 0.125,
  //     [7] = 0.125
  //   }
  if (r < 0) {
    return r;
  }
  if (wm.empty()) {
    return 0;
  }
  float fratio = osdmap.get_full_ratio();
  int64_t min = -1;
  for (auto p = wm.begin(); p != wm.end(); ++p) {
    auto osd_info = osd_stat.find(p->first);
    if (osd_info != osd_stat.end()) {
      if (osd_info->second.statfs.total == 0 || p->second == 0) {
        // osd must be out, hence its stats have been zeroed
        // (unless we somehow managed to have a disk with size 0...)
        //
        // (p->second == 0), if osd weight is 0, no need to
        // calculate proj below.
        continue;
      }
      double unusable = (double)osd_info->second.statfs.kb() * (1.0 - fratio);
      double avail = std::max(0.0, (double)osd_info->second.statfs.kb_avail() - unusable);
      avail *= 1024.0;
      int64_t proj = (int64_t)(avail / (double)p->second);  /// will cause a deviation in this calculation
      if (min < 0 || proj < min) {
        min = proj;
      }
    } else {
      if (osdmap.is_up(p->first)) {
        // This is a level 4 rather than an error, because we might have
        // only just started, and not received the first stats message yet.
        dout(4) << "OSD " << p->first << " is up, but has no stats" << dendl;
      }
    }
  }
  return min;
}
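As I read the deviation flagged above: once osd.0 is out, its statfs is zeroed and it is skipped, but wm still holds 8 entries of 0.125, so each survivor's proj divides by 1/8 even though only 7 OSDs can actually take data. A toy calculation (standalone, not Ceph code; the ~97.5 GiB effective per-OSD avail is an assumption, back-computed from the healthy MAX AVAIL as 260 GiB x 3 / 8):

#include <cstdio>

int main() {
  double avail = 97.5;   // assumed effective avail per OSD, GiB
  std::printf("proj with stale weight 1/8: %.1f GiB\n", avail / (1.0 / 8));  // 780.0
  std::printf("proj with 7 live OSDs:      %.1f GiB\n", avail / (1.0 / 7));  // 682.5
}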
Updated by jianwei zhang almost 2 years ago
for ceph df detail commands
I don't think raw_used_rate should be adjusted:
if (sum.num_object_copies > 0) {
raw_used_rate *= (float)(sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies;
}
void PGMapDigest::dump_object_stat_sum(TextTable &tbl, ceph::Formatter *f, const pool_stat_t &pool_stat, uint64_t avail,
                                       float raw_used_rate, bool verbose, bool per_pool, bool per_pool_omap, const pg_pool_t *pool)
{
  const object_stat_sum_t &sum = pool_stat.stats.sum;
  const store_statfs_t statfs = pool_stat.store_stats;
  // Scale raw_used_rate by the fraction of object copies that are not degraded.
  if (sum.num_object_copies > 0) {
    raw_used_rate *= (float)(sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies;
  }
  uint64_t used_data_bytes = pool_stat.get_allocated_data_bytes(per_pool);
  uint64_t used_omap_bytes = pool_stat.get_allocated_omap_bytes(per_pool_omap);
  uint64_t used_bytes = used_data_bytes + used_omap_bytes;
  float used = 0.0;
  // note avail passed in is raw_avail, calc raw_used here.
  if (avail) {
    used = used_bytes;
    used /= used + avail;
  } else if (used_bytes) {
    used = 1.0;
  }
  auto avail_res = raw_used_rate ? avail / raw_used_rate : 0;
  // an approximation for actually stored user data
  auto stored_data_normalized = pool_stat.get_user_data_bytes(raw_used_rate, per_pool);
  auto stored_omap_normalized = pool_stat.get_user_omap_bytes(raw_used_rate, per_pool_omap);
  auto stored_normalized = stored_data_normalized + stored_omap_normalized;
  // same, amplified by replication or EC
  auto stored_raw = stored_normalized * raw_used_rate;
  if (f) {
    f->dump_int("stored", stored_normalized);
    if (verbose) {
      f->dump_int("stored_data", stored_data_normalized);
      f->dump_int("stored_omap", stored_omap_normalized);
    }
    f->dump_int("objects", sum.num_objects);
    f->dump_int("kb_used", shift_round_up(used_bytes, 10));
    f->dump_int("bytes_used", used_bytes);
    if (verbose) {
      f->dump_int("data_bytes_used", used_data_bytes);
      f->dump_int("omap_bytes_used", used_omap_bytes);
    }
    f->dump_float("percent_used", used);
    f->dump_unsigned("max_avail", avail_res);
    if (verbose) {
      f->dump_int("quota_objects", pool->quota_max_objects);
      f->dump_int("quota_bytes", pool->quota_max_bytes);
      f->dump_int("dirty", sum.num_objects_dirty);
      f->dump_int("rd", sum.num_rd);
      f->dump_int("rd_bytes", sum.num_rd_kb * 1024ull);
      f->dump_int("wr", sum.num_wr);
      f->dump_int("wr_bytes", sum.num_wr_kb * 1024ull);
      f->dump_int("compress_bytes_used", statfs.data_compressed_allocated);
      f->dump_int("compress_under_bytes", statfs.data_compressed_original);
      // Stored by user amplified by replication
      f->dump_int("stored_raw", stored_raw);
      f->dump_unsigned("avail_raw", avail);
    }
  } else {
    tbl << stringify(byte_u_t(stored_normalized));
    if (verbose) {
      tbl << stringify(byte_u_t(stored_data_normalized));
      tbl << stringify(byte_u_t(stored_omap_normalized));
    }
    tbl << stringify(si_u_t(sum.num_objects));
    tbl << stringify(byte_u_t(used_bytes));
    if (verbose) {
      tbl << stringify(byte_u_t(used_data_bytes));
      tbl << stringify(byte_u_t(used_omap_bytes));
    }
    tbl << percentify(used * 100);
    tbl << stringify(byte_u_t(avail_res));
    if (verbose) {
      if (pool->quota_max_objects == 0)
        tbl << "N/A";
      else
        tbl << stringify(si_u_t(pool->quota_max_objects));
      if (pool->quota_max_bytes == 0)
        tbl << "N/A";
      else
        tbl << stringify(byte_u_t(pool->quota_max_bytes));
      tbl << stringify(si_u_t(sum.num_objects_dirty)) << stringify(byte_u_t(statfs.data_compressed_allocated))
          << stringify(byte_u_t(statfs.data_compressed_original));
    }
  }
}
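Plugging the observed numbers into this adjustment reproduces steps 2 and 3 exactly, which supports reading the STORED/MAX AVAIL inflation as coming from this factor (standalone arithmetic, not Ceph code):

#include <cstdio>

int main() {
  const double copies = 783;                  // 261 objects x 3 replicas
  const double degraded_cases[] = {72, 64};   // step 2 (down), step 3 (out)
  for (double degraded : degraded_cases) {
    double k = (copies - degraded) / copies;  // raw_used_rate *= k
    // MAX AVAIL = avail / (3 * k) and STORED = stored_raw / (3 * k),
    // so both scale by 1/k relative to the healthy 260 GiB / 1.0 GiB.
    std::printf("MAX AVAIL %.0f GiB, STORED %.2f GiB\n", 260 / k, 1.0 / k);
  }
  // step 2: MAX AVAIL 286 GiB, STORED 1.10 GiB
  // step 3: MAX AVAIL 283 GiB, STORED 1.09 GiB (displayed as 1.1 GiB)
}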
Updated by jianwei zhang almost 2 years ago
for Problem2 step1 vs step4:
osd.0 already out and recovery complete HEALTH_OK, but STORED/(DATA) 1.0G increase to 1.1G
user data only 1G, I think this is also problem
Problem2 step1 vs step4:
1. ceph cluster initial state
STORED = 1.0G
(DATA) = 1.0G
4. kill -9 osd.0.pid - OSD.0 OUT, unset nobackfill --> recovery to HEALTH_OK
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
Updated by jianwei zhang almost 2 years ago
5. remove out osd.0
[root@ln-ceph-rpm build]# ceph osd rm osd.0
removed osd.0
[root@ln-ceph-rpm build]# ceph osd crush remove osd.0
removed item id 0 name 'osd.0' from crush map
[root@ln-ceph-rpm build]# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-3 0 - 0 B 0 B 0 B 0 B 0 B 0 B 0 0 - root myroot
-1 0.69011 - 707 GiB 18 GiB 3.6 GiB 6.0 MiB 7.0 GiB 689 GiB 2.49 1.00 - root default
-11 0.19717 - 202 GiB 5.1 GiB 1.1 GiB 1.6 MiB 2.0 GiB 197 GiB 2.51 1.01 - host ln-ceph-rpm
6 hdd 0.09859 1.00000 101 GiB 2.6 GiB 621 MiB 1.0 MiB 1023 MiB 98 GiB 2.58 1.03 17 up osd.6
7 hdd 0.09859 1.00000 101 GiB 2.5 GiB 481 MiB 596 KiB 1023 MiB 99 GiB 2.45 0.98 12 up osd.7
-5 0.09859 - 101 GiB 2.4 GiB 421 MiB 921 KiB 1023 MiB 99 GiB 2.39 0.96 - host node1
1 hdd 0.09859 1.00000 101 GiB 2.4 GiB 421 MiB 921 KiB 1023 MiB 99 GiB 2.39 0.96 12 up osd.1
-6 0.19717 - 202 GiB 5.0 GiB 1.0 GiB 1.8 MiB 2.0 GiB 197 GiB 2.50 1.00 - host node2
2 hdd 0.09859 1.00000 101 GiB 2.5 GiB 517 MiB 971 KiB 1023 MiB 98 GiB 2.48 0.99 15 up osd.2
3 hdd 0.09859 1.00000 101 GiB 2.5 GiB 549 MiB 903 KiB 1023 MiB 98 GiB 2.51 1.01 15 up osd.3
-9 0.19717 - 202 GiB 5.1 GiB 1.1 GiB 1.7 MiB 2.0 GiB 197 GiB 2.53 1.01 - host node3
4 hdd 0.09859 1.00000 101 GiB 2.6 GiB 613 MiB 868 KiB 1023 MiB 98 GiB 2.57 1.03 15 up osd.4
5 hdd 0.09859 1.00000 101 GiB 2.5 GiB 521 MiB 862 KiB 1023 MiB 98 GiB 2.48 1.00 13 up osd.5
TOTAL 707 GiB 18 GiB 3.6 GiB 6.0 MiB 7.0 GiB 689 GiB 2.49
MIN/MAX VAR: 0.96/1.03 STDDEV: 0.06
[root@ln-ceph-rpm build]# ceph -s
cluster:
id: afec64f8-d9ee-4262-9410-fcf907807e2c
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 18m)
mgr: x(active, since 2h)
ioa: 0 daemons (no daemons active)
osd: 7 osds: 7 up (since 55m), 7 in (since 49m)
data:
pools: 2 pools, 33 pgs
objects: 261 objects, 1.0 GiB
usage: 18 GiB used, 689 GiB / 707 GiB avail
pgs: 33 active+clean
[root@ln-ceph-rpm build]# ceph df detail
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 707 GiB 689 GiB 11 GiB 18 GiB 2.49
TOTAL 707 GiB 689 GiB 11 GiB 18 GiB 2.49
--- POOLS ---
POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
device_health_metrics 1 1 0 B 0 B 0 B 0 0 B 0 B 0 B 0 227 GiB N/A N/A 0 0 B 0 B
test 2 32 1.0 GiB 1.0 GiB 0 B 261 3.0 GiB 3.0 GiB 0 B 0.44 227 GiB N/A N/A 261 0 B 0 B
/// we can see that
STORED = 1.0G ///decrease 1.1G --> 1.0G
(DATA) = 1.0G ///decrease 1.1G --> 1.0G
MAX AVAIL = 227G ///decrease 260G --> 227G
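The drop to 227 GiB is consistent with the weight map finally shrinking to 7 OSDs (see the get_rule_avail() analysis above; same assumed ~97.5 GiB effective per-OSD avail):

#include <cstdio>

int main() {
  double avail = 97.5;   // assumed effective avail per OSD, GiB
  std::printf("MAX AVAIL = %.1f GiB\n", avail * 7 / 3);  // ~227.5
}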
Updated by jianwei zhang almost 2 years ago
step4 vs step5:
4. kill -9 osd.0.pid - OSD.0 OUT, unset nobackfill --> recovery to HEALTH_OK
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 260G ///unchanged 260G --> 260G
5. remove out osd.0
STORED = 1.0G ///decrease 1.1G --> 1.0G
(DATA) = 1.0G ///decrease 1.1G --> 1.0G
MAX AVAIL = 227G ///decrease 260G --> 227G
The values become correct.
Updated by jianwei zhang almost 2 years ago
The original motivation for raising this issue is that testers (users) are confused: why does MAX AVAIL increase, rather than decrease, after an OSD goes down (or out)?
Updated by jianwei zhang almost 2 years ago
4. kill -9 osd.0.pid - OSD.0 OUT, unset nobackfill --> recovery to HEALTH_OK
STORED = 1.1G ///increase 1.0G --> 1.1G
(DATA) = 1.1G ///increase 1.0G --> 1.1G
MAX AVAIL = 260G ///unchanged 260G --> 260G
USED = 3.3G ///increase 3.0G --> 3.3G
The cluster has returned to HEALTH_OK, so why is STORED/(DATA) = 1.1G and USED = 3.3G?
This is caused by the out osd.0: pg_pool_sum (mempool::pgmap::unordered_map<int32_t, pool_stat_t>, aggregated from { <pool, osd> : store_statfs_t } entries) does not subtract the space still accounted to osd.0.
Only after ceph osd rm osd.0 && ceph osd crush rm osd.0 is executed and osd.0 is deleted from the osdmap/crushmap is the space the pool used on osd.0 subtracted from the pool stats.
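A rough sketch of that accounting as I understand it (the toy types and sum_pool_stats are hypothetical simplifications for illustration, not the actual mgr code):

#include <map>
#include <utility>

struct toy_store_statfs { long stored = 0; };   // stand-in for store_statfs_t
struct toy_pool_stat    { long stored = 0; };   // stand-in for pool_stat_t

// Per-(pool, osd) statfs entries are summed into the pool totals. An OSD
// that is out but not yet removed keeps its entry, so its bytes still
// count; only "ceph osd rm" + "ceph osd crush rm" drop the entry.
void sum_pool_stats(
    const std::map<std::pair<int, int>, toy_store_statfs> &per_pool_osd,
    std::map<int, toy_pool_stat> &pg_pool_sum)
{
  for (const auto &[key, statfs] : per_pool_osd)
    pg_pool_sum[key.first].stored += statfs.stored;  // key.first = pool id
}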
Updated by Radoslaw Zarzynski almost 2 years ago
- Status changed from New to Fix Under Review
Updated by Radoslaw Zarzynski almost 2 years ago
- Backport set to octopus,quincy