Bug #24512 (open)
Raw used space leak
Description
Hello
I'm testing a setup of CephFS over an EC pool with 21 data + 3 coding chunks ([EC_]stripe_unit of 16k).
All OSDs are bluestore, each on one HDD together with its WAL & DB.
I'm experiencing unexpected raw space usage (more than 200% instead of the theoretical 24/21 = 114%).
I've run a lot of copy tests trying to identify the issue.
First, the initial situation (see file "global_stats.txt"): the global stats show 15T of raw space used, but the two pools hold only 7.8T and 1G respectively.
From what I dug up in the docs and the mailing list, space can be lost to:
- WAL+DB: expecting at most 1.5GB/OSD on 80 OSDs: 120GB (observed after deletion of all cephfs data: 91GB)
- unfilled "[pool_]obj_stripe" stripes: max possible loss of 240k files * 1344kB = 315GB
=> we can expect up to 435GB of extra usage, but 7207GB are observed! (arithmetic sketched just below)
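To make that budget explicit, here is the arithmetic as a small Python sketch (all numbers are the ones above; the 1.5GB/OSD WAL+DB footprint is my own estimate):

    # Worst-case "lost space" budget, using the numbers above.
    osds, wal_db_per_osd_gb = 80, 1.5
    files, obj_stripe_kb = 240_000, 1344

    wal_db_gb = osds * wal_db_per_osd_gb          # 120 GB (91 GB observed after
                                                  # deleting all cephfs data)
    padding_gb = files * obj_stripe_kb / 1024**2  # ~308 GB (the ~315 GB above,
                                                  # up to unit rounding)
    budget_gb = wal_db_gb + padding_gb

    print(budget_gb)          # ~428 GB expected at most
    print(7207 / budget_gb)   # observed raw excess is ~17x that budget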
What I'm expecting from EC:
- unfilled EC stripes, with a granularity of 21*[EC_]stripe_unit = 21*16kB = 336kB
=> no loss possible with a [pool_]obj_stripe of 1344kB = 4*336kB, as each full pool stripe is stored as 4 full EC stripes
=> remark: with the default 4MB object size, we would have 4096/336 = 12.19 => 13/12.19 * 24/21 = 121.9% (instead of 114.3%, an increase of ~7%); see the sketch below
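To make that remark concrete, here is a small model of the padding overhead, assuming the only loss is rounding each object up to a whole number of EC stripes:

    import math

    k, m = 21, 3
    stripe_unit_kb = 16
    stripe_width_kb = k * stripe_unit_kb   # 336 kB of data per EC stripe

    def raw_ratio(object_kb):
        """Raw/logical ratio if each object is padded to whole EC stripes."""
        stripes = math.ceil(object_kb / stripe_width_kb)
        return stripes * stripe_width_kb / object_kb * (k + m) / k

    print(raw_ratio(1344))   # 1.143 -> no padding loss, since 1344 = 4 * 336
    print(raw_ratio(4096))   # 1.219 -> the 121.9% figure for default 4MB objects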
What I'm expecting from bluestore:
- some additional DB overhead (keys, indexes, checksums)... but mostly negligible for large data?
- some block alignment optimizations causing fragmentation?
Second, I ran a bunch of tests with cp, rsync and dd (see copy_tests.txt).
It can be seen that when copying in packets of 1M (which is what cp does), the usage is more than 200% of the original files' size. That drops to 130% when the packet size equals the object size, and climbs to 450% when it is small (128k).
I don't know the origin (fragmentation?), but EC loses a large part of its purpose here: saving space.
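One guess, which is purely an assumption on my part (I have not checked the bluestore code): each packet reaches the OSDs as a separate append, and every append allocates at least one allocation unit on each of the 24 shards; bluestore_min_alloc_size_hdd defaults to 64kB. A crude model under those assumptions:

    import math

    k, m = 21, 3
    min_alloc_kb = 64   # bluestore_min_alloc_size_hdd default

    def raw_ratio(packet_kb, object_kb=1344):
        """Raw/logical ratio if each packet becomes one append, and each append
        costs one min_alloc extent per shard (valid for packets <= k*min_alloc)."""
        appends = math.ceil(object_kb / packet_kb)
        return appends * min_alloc_kb * (k + m) / object_kb

    print(raw_ratio(1024))   # ~2.29 -> in the ballpark of the >200% seen with cp
    print(raw_ratio(1344))   # ~1.14 -> close to the ~130% seen at object size

For 128k packets this model predicts much more than the observed 450%, so something (the client writeback cache coalescing small writes?) must be softening the effect, but the trend matches.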
1) Is this the expected behavior? (apart from the fact that you don't recommend such a design for EC) What did I miss?
2) If it is due to bluestore fragmentation, would it be possible to design a "defragmenter" in the future to reclaim the unused space?
Thanks