Bug #24512


Raw used space leak

Added by Thomas De Maet almost 6 years ago. Updated about 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello

I'm testing a setup of cephfs over an EC pool with 21 data + 3 coding chunks ([EC_]stripe_unit of 16k).
All OSDs are bluestore, each on a single HDD with WAL & DB colocated.

I'm experiencing unexpectedly high raw space usage (more than 200% instead of the theoretical 24/21 = 114%).

I've run a lot of copy tests trying to identify the issue.

First, the initial situation (see file "global_stats.txt"): I have 15T used globally, but the two pools hold only 7.8T and 1G respectively.

From what I dug up in the docs and the mailing list, space can be lost to:
- WAL+DB: expecting max 1.5GB/osd on 80 osds: 120GB (observed after deletion of all cephfs data: 91GB)
- unfilled stripes of min. "[pool_]obj_stripe", max loss possible: 240k files * 1344kB = 315GB

=> we can expect up to 435GB of extra raw usage (tallied below), but 7207GB are observed!
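Just to make that tally explicit, here is a quick back-of-the-envelope sketch in Python. It only restates the figures above (80 OSDs, 240k files, 1344kB object stripe); nothing in it is measured:

```python
# Rough tally of raw space that could legitimately be "lost", using only the
# figures quoted above; this is bookkeeping, not a measurement.
osds = 80
wal_db_gb = osds * 1.5                 # max ~1.5GB WAL+DB per OSD -> 120GB
files = 240_000
obj_stripe_kb = 1344                   # [pool_]obj_stripe
tail_gb = files * obj_stripe_kb / 1e6  # worst case: one unfilled stripe per file, ~320GB

print(f"expected extra: ~{wal_db_gb + tail_gb:.0f} GB, observed: 7207 GB")
```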

What I'm expecting from EC:
- unfilled EC stripes of min. 21*[EC_]stripe_unit = 21*16kB = 336kB
=> no loss possible if [pool_]obj_stripe is 1344kB = 4*336kB, as each full pool_stripe is stored on 4 full EC_stripes
=> rem: with the default 4MB objects, we would have 4096/336 = 12.19 => 13/12.19 * 24/21 = 121.9% (instead of 114.3%, i.e. an increase of ~7%; sketched below)
What I'm expecting from bluestore:
- some additional DB overhead (keys, indexes, checksums)... but mostly negligible for large data?
- some block alignment optimizations causing fragmentation?
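Here is a small sketch of that stripe-rounding arithmetic (21+3 with a 16kB EC stripe_unit, so a 336kB full EC stripe, against the default 4MB object size):

```python
import math

k, m = 21, 3                      # data + coding chunks
ec_stripe_kb = k * 16             # 21 * 16kB = 336kB per full EC stripe
obj_kb = 4096                     # default 4MB RADOS object

ideal = (k + m) / k               # 24/21 ~= 1.143 (114.3%)
stripes = obj_kb / ec_stripe_kb   # 4096/336 ~= 12.19
overhead = math.ceil(stripes) / stripes * ideal   # 13/12.19 * 24/21 ~= 1.219

# With a 1344kB object stripe (= 4 * 336kB) every object is a whole number of
# EC stripes, so this particular rounding loss disappears entirely.
print(f"ideal {ideal:.3f}x, with 4MB objects {overhead:.3f}x")
```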

Second, I ran a bunch of tests with cp, rsync and dd (see copy_tests.txt).

It can be seen that when copying with writes of 1M (what cp does), the usage is more than 200% of the original file size. That decreases to about 130% when the write size equals the object size, and increases to about 450% when it is small (128k).

I don't know what the origin is (fragmentation?), but EC loses a large part of its purpose here: saving space.
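If fragmentation / allocation granularity is indeed the cause, a hypothetical worst case gives the right order of magnitude: with a 16kB EC stripe_unit and bluestore's default 64kB allocation unit on HDD, every EC chunk could end up occupying a whole allocation unit. This is only an illustrative bound, not a claim about what bluestore actually does:

```python
# Hypothetical worst case: every 16kB EC chunk lands in its own 64kB
# bluestore allocation unit (default min_alloc_size for HDD).
stripe_unit_kb, alloc_kb = 16, 64
k, m = 21, 3
worst = (alloc_kb / stripe_unit_kb) * (k + m) / k
print(f"worst-case raw usage: {worst:.2f}x")  # ~4.57x, the ballpark of the ~450% seen with 128k writes
```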

1) Is this the expected behavior (apart from the fact that such a wide EC layout is not a recommended design)? What did I miss?

2) If it is due to bluestore fragmentation, would it be possible to design a "defragmenter" in the future to get back the unused space?

Thanks


Files

global_stats.txt (1015 Bytes) Thomas De Maet, 06/13/2018 01:39 PM
copy_tests.txt (2.13 KB) Thomas De Maet, 06/13/2018 01:39 PM
osd_0_asok (26.7 KB) Thomas De Maet, 06/19/2018 12:25 PM
osd_30_asok (26.7 KB) Thomas De Maet, 06/19/2018 12:25 PM
osd_53_asok (26.4 KB) Thomas De Maet, 06/19/2018 12:25 PM
osd_77_asok (26.6 KB) Thomas De Maet, 06/19/2018 12:25 PM
osd_df (8.28 KB) Thomas De Maet, 06/19/2018 12:25 PM
fs_pool_osd.txt (46.3 KB) Thomas De Maet, 06/20/2018 07:51 AM
Actions #1

Updated by Thomas De Maet almost 6 years ago

sorry, wrong ceph version: 12.2.5-407 (luminous stable)

I'm still very interested in any answer. If I try filestore, can I gain space at the cost of some performance?

Thanks

Actions #2

Updated by Igor Fedotov almost 6 years ago

Would you share performance counter dumps for several (3-5) OSDs, preferably from different nodes? And the corresponding 'ceph osd df' output...
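For reference, one way to collect what is being asked for. This is only a sketch: each perf dump has to be taken on the host that owns that OSD's admin socket, and the OSD ids below are just the ones later attached to this ticket:

```python
import subprocess

def perf_dump(osd_id: int) -> str:
    # Dump the OSD's performance counters via its local admin socket.
    return subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"], text=True)

for osd_id in (0, 30, 53, 77):              # example OSDs, ideally spread across hosts
    with open(f"osd_{osd_id}_asok", "w") as f:
        f.write(perf_dump(osd_id))

# Cluster-wide per-OSD utilisation, grouped by host.
with open("osd_df", "w") as f:
    f.write(subprocess.check_output(["ceph", "osd", "df", "tree"], text=True))
```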

Actions #3

Updated by Thomas De Maet almost 6 years ago

Here they are, from 3 hosts (the mapping to hosts is in the osd df tree)!

Thanks!

Actions #4

Updated by Igor Fedotov almost 6 years ago

I checked the 'stored' vs. 'allocated' counters under the bluestore section. 'stored' is the actual amount of data written to bluestore from the upper level, while 'allocated' includes the above plus the overhead caused by allocation granularity. For all the logs the allocation overhead is approx. 40%, which means that stored objects are highly fragmented and the data in the fragments isn't aligned with the bluestore allocation unit (64K for HDD by default).
Unfortunately I'm not aware of the scheme cephfs uses to store its content, but it looks like the issue is in that scheme or its tuning. IMO this is neither a bluestore-related issue nor the stats miscalculation we have observed before.
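For anyone wanting to repeat that check on the attached dumps, a small sketch; it assumes the counters appear as bluestore_allocated and bluestore_stored under the "bluestore" section of the perf dump, as in Luminous:

```python
import json
import sys

# Compare bluestore's 'allocated' vs 'stored' counters from a perf dump file
# (e.g. one of the osd_*_asok attachments).
with open(sys.argv[1]) as f:
    bs = json.load(f)["bluestore"]

stored = bs["bluestore_stored"]        # bytes handed to bluestore by the OSD
allocated = bs["bluestore_allocated"]  # bytes actually allocated on disk
print(f"stored {stored / 2**30:.1f} GiB, allocated {allocated / 2**30:.1f} GiB, "
      f"overhead {(allocated / stored - 1) * 100:.0f}%")
```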

Actions #5

Updated by Igor Fedotov almost 6 years ago

  • Project changed from bluestore to CephFS
Actions #6

Updated by Thomas De Maet almost 6 years ago

Some additional info:
- mounted with 'mount -t ceph'
- default config except:
--- 2 active MDS servers
--- RAM per OSD decreased to 512MB (half the default)
- the fs was emptied then filled once again with rsync
- additional info about fs/EC/params in the attached file

The uncommon thing is the large EC pool of 24 chunks. The parameter jerasure-per-chunk-alignment is false in the EC profile. Reading #8475, this parameter seems to apply to EC methods other than reed_sol_van, but could it be related?

Thanks

Actions #7

Updated by Thomas De Maet almost 6 years ago

Unfortunately, I now have to use the disks for production...

Here are the final tests with smaller pools, which behave much better (theory values computed in the sketch below):

13+3, defaults: theory 1.231x, real 1.307x
16+3, defaults: theory 1.188x, real 1.220x
16+3, 64M objs, 1M stripe: real 1.215x (on a 1GbE network, input flow increased by ~60Mb/s)
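For completeness, a tiny check that the "theory" column is just the raw (k+m)/k expansion of each profile:

```python
for k, m, real in [(13, 3, 1.307), (16, 3, 1.220)]:
    theory = (k + m) / k
    print(f"{k}+{m}: theory {theory:.3f}x, real {real:.3f}x "
          f"({(real / theory - 1) * 100:+.1f}% vs theory)")
```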

I finally ended up with the last option for my non-critical data.

I believe it would be a good thing for Ceph to track down where the original issue came from. Once fixed, I suspect the used space could decrease for smaller pools as well.

Thanks

Actions #8

Updated by Patrick Donnelly about 5 years ago

  • Target version deleted (v12.2.5)
  • Start date deleted (06/13/2018)
  • Affected Versions deleted (v10.2.5)
  • ceph-qa-suite deleted (ceph-ansible)