Bug #58664
The per-pool 'STORED' metric of ceph df is wrong when some OSDs are down.
Description
Hi.
I've seen this issue with Ceph 16.2.11, as shipped in the Debian repositories.
I built a small test cluster composed of 3 (virtual) machines with 2 OSDs each. I set up a CephFS and wrote 1 MiB of data to a replicated pool:
Nominal behaviour
ceph df output:
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED    RAW USED  %RAW USED
hdd     36 GiB  36 GiB   61 MiB    61 MiB       0.17
TOTAL   36 GiB  36 GiB   61 MiB    61 MiB       0.17

--- POOLS ---
POOL                   ID  PGS  STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics   1    1     0 B        0      0 B      0     11 GiB
cephfs_meta            11   32  17 KiB       23  132 KiB      0     17 GiB
cephfs_rep_0           12   32   1 MiB        1    3 MiB      0     17 GiB
ceph osd tree output (for information):
ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT  PRI-AFF
-1         0.03534  root default
-3         0.01178      host deb11-ceph-1
 0    hdd  0.00589          osd.0               up   1.00000  1.00000
 4    hdd  0.00589          osd.4               up   1.00000  1.00000
-7         0.01178      host deb11-ceph-2
 1    hdd  0.00589          osd.1               up   1.00000  1.00000
 3    hdd  0.00589          osd.3               up   1.00000  1.00000
-5         0.01178      host deb11-ceph-3
 2    hdd  0.00589          osd.2               up   1.00000  1.00000
 5    hdd  0.00589          osd.5               up   1.00000  1.00000
ceph -s output (for information):
  cluster:
    id:     5edbfb48-7070-4d0a-a240-e000364453c2
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum sto-ceph-1,sto-ceph-2,sto-ceph-3 (age 43s)
    mgr: deb11-ceph-3(active, since 64m), standbys: deb11-ceph-1, deb11-ceph-2
    mds: 1/1 daemons up, 2 standby
    osd: 6 osds: 6 up (since 40s), 6 in (since 64m)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 65 pgs
    objects: 24 objects, 1.0 MiB
    usage:   62 MiB used, 36 GiB / 36 GiB avail
    pgs:     65 active+clean
Faulty behaviour
Then, if I shut down one server, the STORED value increases.
ceph df output:
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED    RAW USED  %RAW USED
hdd     36 GiB  36 GiB   61 MiB    61 MiB       0.17
TOTAL   36 GiB  36 GiB   61 MiB    61 MiB       0.17

--- POOLS ---
POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics   1    1      0 B        0      0 B      0     11 GiB
cephfs_meta            11   32   17 KiB       23  132 KiB      0     17 GiB
cephfs_rep_0           12   32  1.5 MiB        1    3 MiB      0     17 GiB
The STORED value for pool cephfs_rep_0 is wrong here: it reports 1.5 MiB even though only 1 MiB was written.
ceph osd tree output (for information):
ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT  PRI-AFF
-1         0.03534  root default
-3         0.01178      host deb11-ceph-1
 0    hdd  0.00589          osd.0               up   1.00000  1.00000
 4    hdd  0.00589          osd.4               up   1.00000  1.00000
-7         0.01178      host deb11-ceph-2
 1    hdd  0.00589          osd.1             down   1.00000  1.00000
 3    hdd  0.00589          osd.3             down   1.00000  1.00000
-5         0.01178      host deb11-ceph-3
 2    hdd  0.00589          osd.2               up   1.00000  1.00000
 5    hdd  0.00589          osd.5               up   1.00000  1.00000
ceph -s output (for information):
  cluster:
    id:     5edbfb48-7070-4d0a-a240-e000364453c2
    health: HEALTH_WARN
            1/3 mons down, quorum sto-ceph-1,sto-ceph-3
            2 osds down
            1 host (2 osds) down
            Degraded data redundancy: 24/72 objects degraded (33.333%), 15 pgs degraded, 65 pgs undersized

  services:
    mon: 3 daemons, quorum sto-ceph-1,sto-ceph-3 (age 2m), out of quorum: sto-ceph-2
    mgr: deb11-ceph-3(active, since 68m), standbys: deb11-ceph-1
    mds: 1/1 daemons up, 1 standby
    osd: 6 osds: 4 up (since 2m), 6 in (since 68m)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 65 pgs
    objects: 24 objects, 1.0 MiB
    usage:   63 MiB used, 36 GiB / 36 GiB avail
    pgs:     24/72 objects degraded (33.333%)
             50 active+undersized
Analysis
As I understand it, the STORED value is the actual amount of data stored by the user in the pool, regardless of replication, erasure coding, or OSD failures. It should not be adjusted when an OSD fails.
I guess the issue stems from the file src/mon/PGMap.cc, in the method PGMapDigest::dump_object_stat_sum.
Basically, raw_used_rate (used for the STORED calculation) is adjusted by this code fragment:
if (sum.num_object_copies > 0) {
  raw_used_rate *= (float)(sum.num_object_copies - sum.num_objects_degraded) /
                   sum.num_object_copies;
}
Maybe it shouldn't?
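Plugging in the numbers from this report makes the effect concrete. This is a sketch, under the assumption that STORED is derived roughly as the pool's raw USED bytes divided by raw_used_rate; the real derivation in PGMap.cc is more involved, but the arithmetic reproduces the faulty 1.5 MiB exactly:

```python
# Sketch of the effect of the quoted adjustment, using this report's numbers.
# Assumption: STORED ~= raw USED bytes / raw_used_rate.
num_object_copies = 3       # 1 object in cephfs_rep_0 x size=3
num_objects_degraded = 1    # one copy lost with the downed host
raw_used = 3 * 1024 * 1024  # USED column for cephfs_rep_0: 3 MiB

raw_used_rate = 3.0         # replicated pool, size=3
if num_object_copies > 0:   # the fragment quoted above
    raw_used_rate *= (num_object_copies - num_objects_degraded) / num_object_copies

stored = raw_used / raw_used_rate
print(raw_used_rate, stored / (1024 * 1024))  # ~2.0 and ~1.5 (MiB)
```

With a third of the copies degraded, raw_used_rate drops from 3 to 2, so the unchanged 3 MiB of raw usage is divided by 2 instead of 3, inflating STORED from 1 MiB to the observed 1.5 MiB.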
HOW TO REPRODUCE:
- install a 3-node cluster with two OSDs per node.
- create a simple CephFS (pools default to size=3):
ceph osd pool create cephfs_meta 32 replicated
ceph osd pool create cephfs_rep_0 32 replicated
ceph fs new cephfs cephfs_meta cephfs_rep_0
mount -t ceph :/ /mnt/ceph -o name=admin
cd /mnt/ceph && dd if=/dev/zero of=file-1m count=1024 bs=1024
Then:
- look at ceph df output
- shut down one server
- wait for df stats to be updated (may take a few minutes)
- look at ceph df output again
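Instead of eyeballing the table between the two steps, the per-pool value can be pulled out of `ceph df --format json`. A sketch follows; the embedded JSON is a trimmed, hypothetical sample of that output's shape (in practice, parse the real command's stdout):

```python
import json

# Hypothetical, trimmed sample of `ceph df --format json` output
# (assumption: only the fields needed here are kept).
sample = '''
{"pools": [
  {"name": "cephfs_meta",  "stats": {"stored": 17408}},
  {"name": "cephfs_rep_0", "stats": {"stored": 1048576}}
]}
'''

pools = json.loads(sample)["pools"]
stored = {p["name"]: p["stats"]["stored"] for p in pools}
# Per this report, the value should stay at 1 MiB even with OSDs down.
print(stored["cephfs_rep_0"])
```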
History
#1 Updated by Ilya Dryomov about 1 year ago
- Target version deleted (v16.2.11)