Bug #24512
Raw used space leak
Status: Open, 0% done
Description
Hello
I'm testing a setup of CephFS over an EC pool with 21 data + 3 coding chunks ([EC_]stripe_unit of 16k).
All OSDs are bluestore, each on one HDD together with its WAL & DB.
I'm experiencing unexpected raw space usage (more than 200% instead of the theoretical 24/21 = 114%).
I've run a lot of copy tests trying to identify the issue.
Firstly, the initial situation (see file "global_stats.txt"): I have 15T used globally, but the two pools hold only 7.8T and 1G respectively.
From what I dug up in the docs and mailing list, space can be lost to:
- WAL+DB: expecting max 1.5GB/OSD on 80 OSDs: 120GB (observed after deletion of all cephfs data: 91GB)
- unfilled stripes of min. "[pool_]obj_stripe"; max possible loss: 240k files * 1344kB = 315GB
=> we can expect up to 435GB of extra usage, but 7207GB are observed!
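As a quick sanity check of the budget above (a sketch using the figures already quoted; I read "240k files" as 240*1024, which reproduces the 315GB number):

```python
wal_db_gb = 80 * 1.5                      # max ~1.5 GB WAL+DB per OSD, 80 OSDs
stripe_gb = 240 * 1024 * 1344 / 1024**2   # 240k files * one partial 1344 kB stripe, in GB
print(wal_db_gb + stripe_gb)              # 435.0 GB at most, vs 7207 GB observed
```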
What I'm expecting from EC:
- unfilled EC stripes of min. 21*[EC_]stripe_unit = 21*16kB = 336kB
=> no loss possible with a [pool_]obj_stripe of 1344kB = 4*336kB, as each full pool_stripe is stored as 4 full EC_stripes
=> rem: with the base 4MB object size, we should have 4096/336 = 12.19 => 13/12.19 * 24/21 = 121.9% (instead of 114.3%, an increase of ~7%)
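The rounding argument above can be worked through explicitly (a small check using the same numbers: each 4MB object must be padded to a whole number of 336kB EC stripes):

```python
import math

su_kb, k, m = 16, 21, 3
ec_stripe = k * su_kb                          # 336 kB per full EC stripe
obj_kb = 4096                                  # default 4 MB object size
full = obj_kb / ec_stripe                      # 12.19 EC stripes per object
ratio = math.ceil(full) / full * (k + m) / k   # pad to 13 stripes, then add parity
print(f"{ratio:.1%}")                          # 121.9%
```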
What I'm expecting from bluestore:
- some additional DB stuff (keys, indexes, checksums)... but mostly negligible for large data?
- some block alignment optimizations causing fragmentation?
Secondly, I ran a bunch of tests with cp, rsync and dd (see copy_tests.txt).
It can be seen that when copying in blocks of 1M (which is what cp does), the usage is more than 200% of the original files. It decreases to 130% when the block size equals the object size, and increases up to 450% when the blocks are small (128k).
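The trend is at least qualitatively reproduced by a toy model I sketched (this is NOT the real bluestore allocator, just an assumption: each write lands as one extent per EC shard, and every extent is rounded up to the 64 KiB HDD allocation unit). The exact numbers differ, but amplification falls the same way as the write size grows:

```python
import math

K, M = 21, 3            # data / coding chunks
SU = 16 * 1024          # [EC_]stripe_unit: 16 KiB
ALLOC = 64 * 1024       # bluestore min_alloc_size, HDD default

def raw_usage(file_size, write_size):
    """Raw bytes allocated when a file is written in write_size pieces.
    Toy model: one extent per EC shard per write, each rounded up to ALLOC."""
    raw = 0
    for off in range(0, file_size, write_size):
        w = min(write_size, file_size - off)
        n_su = math.ceil(w / SU)            # data stripe_units in this write
        full, extra = divmod(n_su, K)       # spread round-robin over K shards
        for shard in range(K):
            su_here = full + (1 if shard < extra else 0)
            if su_here:
                raw += math.ceil(su_here * SU / ALLOC) * ALLOC
        rows = full + (1 if extra else 0)   # stripe rows touched => parity writes
        raw += M * math.ceil(rows * SU / ALLOC) * ALLOC
    return raw

fsz = 64 * 1024 * 1024
for ws in (128 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    print(f"{ws // 1024:5d} KiB writes -> {raw_usage(fsz, ws) / fsz:.2f}x raw")
```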
I don't know the origin (fragmentation?), but EC loses a large part of its purpose here: saving space.
1) Is this the expected behavior (apart from the fact that such an EC design isn't recommended)? What did I miss?
2) If it is due to bluestore fragmentation, would it be possible to design a "defragmenter" in the future to reclaim the unused space?
Thanks
Files
Updated by Thomas De Maet almost 6 years ago
sorry, wrong ceph version: 12.2.5-407 (luminous stable)
I'm still very interested in any answer. If I try filestore, can I gain space at the cost of some performance?
Thanks
Updated by Igor Fedotov almost 6 years ago
Would you share performance counter dumps for several (3-5) OSDs, preferably from different nodes? And the corresponding 'ceph osd df' output...
Updated by Thomas De Maet almost 6 years ago
- File osd_0_asok osd_0_asok added
- File osd_30_asok osd_30_asok added
- File osd_53_asok osd_53_asok added
- File osd_77_asok osd_77_asok added
- File osd_df osd_df added
Here they are, from 3 hosts (the mapping to hosts is in the df tree)!
Thanks!
Updated by Igor Fedotov almost 6 years ago
I checked the 'stored' vs. 'allocated' counters under the bluestore section. 'stored' is the actual amount written to bluestore from the upper level, while 'allocated' includes the above plus the overhead caused by allocation granularity. In all the logs the allocation overhead is approx. 40%, which means that stored objects are highly fragmented and the data fragments aren't aligned with the bluestore allocation unit (64K for HDD by default).
Unfortunately I'm not aware of the scheme cephfs uses to store its content, but it looks like the issue is in that scheme or its tuning. IMO this is neither a bluestore issue nor the stats miscalculation we observed before.
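The granularity effect is easy to illustrate (a minimal sketch; 64 KiB is the default bluestore min_alloc_size for HDD, and the overhead function here is just round-up division, not a real allocator):

```python
import math

MIN_ALLOC = 64 * 1024   # bluestore min_alloc_size, HDD default

def alloc_overhead(stored_bytes):
    """allocated/stored ratio for a single extent of the given size."""
    allocated = math.ceil(stored_bytes / MIN_ALLOC) * MIN_ALLOC
    return allocated / stored_bytes

print(alloc_overhead(16 * 1024))   # 4.0: a lone 16 KiB EC chunk fills a 64 KiB unit
print(alloc_overhead(48 * 1024))   # ~1.33 for a 48 KiB fragment
```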
Updated by Igor Fedotov almost 6 years ago
- Project changed from bluestore to CephFS
Updated by Thomas De Maet almost 6 years ago
- File fs_pool_osd.txt fs_pool_osd.txt added
some additional info:
- mounted with 'mount -t ceph' default config, but:
--- 2 active MDS servers
--- RAM per OSD decreased to 512MB (half the default)
- the fs was emptied, then filled again with rsync
- additional info about fs/EC/params in the linked file
The uncommon thing is the large EC pool of 24 chunks. The parameter jerasure-per-chunk-alignment is false in the EC profile. Reading #8475, this parameter seems to belong to EC methods other than reed_sol_van, but is it possibly related?
Thanks
Updated by Thomas De Maet almost 6 years ago
Unfortunately, I now have to use the disks for production...
Here are the last tests with smaller pools, which behave much better:
13+3, defaults: theory 1.231x, real 1.307x
16+3, defaults: theory 1.188x, real 1.220x
16+3, 64M objects, 1M stripe: real 1.215x (on a 1GbE network, input flow increased by ~60Mb/s)
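The "theory" values in the list above are just (k+m)/k; a quick check:

```python
for k, m in ((21, 3), (13, 3), (16, 3)):
    print(f"{k}+{m}: theory {(k + m) / k:.3f}x")
# 21+3: theory 1.143x
# 13+3: theory 1.231x
# 16+3: theory 1.188x
```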
I finally ended up with the last option for my non-critical data.
I believe it would be good for Ceph to find where the original issue comes from. Once fixed, I suspect the used space could decrease for smaller pools as well.
Thanks
Updated by Patrick Donnelly about 5 years ago
- Target version deleted (v12.2.5)
- Start date deleted (06/13/2018)
- Affected Versions deleted (v10.2.5)
- ceph-qa-suite deleted (ceph-ansible)