Bug #24512
Raw used space leak
Status: Open, 0% done
Description
Hello
I'm testing a setup of CephFS over an EC pool with 21 data + 3 coding chunks ([EC_]stripe_unit of 16k).
All OSDs are bluestore, each on one HDD together with its WAL & DB.
I'm experiencing unexpected raw space usage (more than 200% instead of the theoretical 24/21 = 114%).
I've run a lot of copy tests trying to identify the issue.
Firstly, the initial situation (see file "global_stats.txt"): I have 15T used globally, but the two pools hold only 7.8T and 1G respectively.
From what I dug up in the docs and mailing list, space can be lost to:
- WAL+DB: expecting max 1.5GB/OSD on 80 OSDs: 120GB (observed after deletion of all cephfs data: 91GB)
- unfilled stripes of min. "[pool_]obj_stripe"; max possible loss: 240k files * 1344kB = 315GB
=> we can expect up to 435GB of extra usage, but 7207GB are observed!
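As a quick sanity check of the budget above (a sketch using the figures already quoted; I read "240k files" as 240*1024, which reproduces the 315GB number):

```python
wal_db_gb = 80 * 1.5                      # max ~1.5 GB WAL+DB per OSD, 80 OSDs
stripe_gb = 240 * 1024 * 1344 / 1024**2   # 240k files * one partial 1344 kB stripe, in GB
print(wal_db_gb + stripe_gb)              # 435.0 GB at most, vs 7207 GB observed
```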
What I'm expecting from EC:
- unfilled EC stripes of min. 21*[EC_]stripe_unit = 21*16kB = 336kB
=> no loss possible with a [pool_]obj_stripe of 1344kB = 4*336kB, as each full pool_stripe is stored as 4 full EC_stripes
=> rem: with the base 4MB object size, we should have 4096/336 = 12.19 => 13/12.19 * 24/21 = 121.9% (instead of 114.3%, an increase of ~7%)
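The rounding argument above can be worked through explicitly (a small check using the same numbers: each 4MB object must be padded to a whole number of 336kB EC stripes):

```python
import math

su_kb, k, m = 16, 21, 3
ec_stripe = k * su_kb                          # 336 kB per full EC stripe
obj_kb = 4096                                  # default 4 MB object size
full = obj_kb / ec_stripe                      # 12.19 EC stripes per object
ratio = math.ceil(full) / full * (k + m) / k   # pad to 13 stripes, then add parity
print(f"{ratio:.1%}")                          # 121.9%
```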
What I'm expecting from bluestore:
- some additional DB stuff (keys, indexes, checksums)... but mostly negligible for large data?
- some block alignment optimizations causing fragmentation?
Secondly, I ran a bunch of tests with cp, rsync and dd (see copy_tests.txt).
It can be seen that when copying in blocks of 1M (which is what cp does), the usage is more than 200% of the original files. It decreases to 130% when the block size equals the object size, and increases up to 450% when the blocks are small (128k).
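The trend is at least qualitatively reproduced by a toy model I sketched (this is NOT the real bluestore allocator, just an assumption: each write lands as one extent per EC shard, and every extent is rounded up to the 64 KiB HDD allocation unit). The exact numbers differ, but amplification falls the same way as the write size grows:

```python
import math

K, M = 21, 3            # data / coding chunks
SU = 16 * 1024          # [EC_]stripe_unit: 16 KiB
ALLOC = 64 * 1024       # bluestore min_alloc_size, HDD default

def raw_usage(file_size, write_size):
    """Raw bytes allocated when a file is written in write_size pieces.
    Toy model: one extent per EC shard per write, each rounded up to ALLOC."""
    raw = 0
    for off in range(0, file_size, write_size):
        w = min(write_size, file_size - off)
        n_su = math.ceil(w / SU)            # data stripe_units in this write
        full, extra = divmod(n_su, K)       # spread round-robin over K shards
        for shard in range(K):
            su_here = full + (1 if shard < extra else 0)
            if su_here:
                raw += math.ceil(su_here * SU / ALLOC) * ALLOC
        rows = full + (1 if extra else 0)   # stripe rows touched => parity writes
        raw += M * math.ceil(rows * SU / ALLOC) * ALLOC
    return raw

fsz = 64 * 1024 * 1024
for ws in (128 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    print(f"{ws // 1024:5d} KiB writes -> {raw_usage(fsz, ws) / fsz:.2f}x raw")
```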
I don't know the origin (fragmentation?), but EC loses a large part of its purpose here: saving space.
1) Is this the expected behavior (apart from the fact that such an EC design isn't recommended)? What did I miss?
2) If it is due to bluestore fragmentation, would it be possible to design a "defragmenter" in the future to reclaim the unused space?
Thanks
Files
Updated by Thomas De Maet almost 6 years ago
sorry, wrong ceph version: 12.2.5-407 (luminous stable)
I'm still very interested in any answer. If I try filestore, can I gain space at the cost of some performance?
Thanks
Updated by Igor Fedotov almost 6 years ago
Would you share performance counter dumps for several (3-5) OSDs, preferably from different nodes? And the corresponding 'ceph osd df' output...
Updated by Thomas De Maet almost 6 years ago
- File osd_0_asok osd_0_asok added
- File osd_30_asok osd_30_asok added
- File osd_53_asok osd_53_asok added
- File osd_77_asok osd_77_asok added
- File osd_df osd_df added
Here they are, from 3 hosts (the mapping to hosts is in the df tree)!
Thanks!
Updated by Igor Fedotov almost 6 years ago
I checked the 'stored' vs. 'allocated' counters under the bluestore section. 'stored' is the actual amount written to bluestore from the upper level, while 'allocated' includes the above plus the overhead caused by allocation granularity. In all the logs the allocation overhead is approx. 40%, which means that stored objects are highly fragmented and the data fragments aren't aligned with the bluestore allocation unit (64K for HDD by default).
Unfortunately I'm not aware of the scheme cephfs uses to store its content, but it looks like the issue is in that scheme or its tuning. IMO this is neither a bluestore issue nor the stats miscalculation we observed before.
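The granularity effect is easy to illustrate (a minimal sketch; 64 KiB is the default bluestore min_alloc_size for HDD, and the overhead function here is just round-up division, not a real allocator):

```python
import math

MIN_ALLOC = 64 * 1024   # bluestore min_alloc_size, HDD default

def alloc_overhead(stored_bytes):
    """allocated/stored ratio for a single extent of the given size."""
    allocated = math.ceil(stored_bytes / MIN_ALLOC) * MIN_ALLOC
    return allocated / stored_bytes

print(alloc_overhead(16 * 1024))   # 4.0: a lone 16 KiB EC chunk fills a 64 KiB unit
print(alloc_overhead(48 * 1024))   # ~1.33 for a 48 KiB fragment
```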
Updated by Igor Fedotov almost 6 years ago
- Project changed from bluestore to CephFS
Updated by Thomas De Maet almost 6 years ago
- File fs_pool_osd.txt fs_pool_osd.txt added
some additional info:
- mounted with 'mount -t ceph' default config, but:
--- 2 active MDS servers
--- RAM per OSD decreased to 512MB (half the default)
- the fs was emptied, then filled again with rsync
- additional info about fs/EC/params in the linked file
The uncommon thing is the large EC pool of 24 chunks. The parameter jerasure-per-chunk-alignment is false in the EC profile. Reading #8475, this parameter seems to belong to EC methods other than reed_sol_van, but is it possibly related?
Thanks
Updated by Thomas De Maet almost 6 years ago
Unfortunately, I now have to use the disks for production...
Here are the last tests with smaller pools, which behave much better:
13+3, defaults: theory 1.231x, real 1.307x
16+3, defaults: theory 1.188x, real 1.220x
16+3, 64M objects, 1M stripe: real 1.215x (on a 1GbE network, input flow increased by ~60Mb/s)
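The "theory" values in the list above are just (k+m)/k; a quick check:

```python
for k, m in ((21, 3), (13, 3), (16, 3)):
    print(f"{k}+{m}: theory {(k + m) / k:.3f}x")
# 21+3: theory 1.143x
# 13+3: theory 1.231x
# 16+3: theory 1.188x
```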
I finally ended up with the last option for my non-critical data.
I believe it would be good for Ceph to find where the original issue comes from. Once fixed, I suspect the used space could decrease for smaller pools as well.
Thanks
Updated by Patrick Donnelly about 5 years ago
- Target version deleted (v12.2.5)
- Start date deleted (06/13/2018)
- Affected Versions deleted (v10.2.5)
- ceph-qa-suite deleted (ceph-ansible)