Bug #7593

closed

Disk saturation during PG folder splitting

Added by Guang Yang about 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This bug is a follow-up to issue 7207 (http://tracker.ceph.com/issues/7207). With the lock contention issue fixed, we are still seeing a non-trivial latency increase when a PG starts folder splitting. We observed that disk utilization is above 90% during that time.

In order to improve the situation and mitigate the impact of folder splitting, there are a couple of options worth exploring:
1) Throttle the folder splitting concurrency per OSD. This can be achieved by providing a configurable "max concurrent splittings" value: when a folder needs to be split, the OSD checks the current concurrency; if it is below the maximum allowed, it goes ahead with the split, otherwise it waits for the next round (a rough sketch follows below).

2) Pre-create all folders (up to 6 or 7 levels, depending on the estimated number of objects to be put into the system) at pool / PG creation time. This way we can completely avoid the runtime folder splitting overhead; however, as Greg commented, people may be bad at estimating the number of objects to upload (what actually matters is the object size). Another potential improvement on top of this approach: once we have a fixed level, we can avoid the HashIndex::lookup effort of ::stat'ing each path component (a couple of positive lookups and one negative), though I do not yet have any numbers on how much benefit to expect from that.
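Below is a minimal sketch of the throttle in option 1. Everything here is hypothetical: the SplitThrottle class and the max_concurrent_splits limit do not exist in the code today, they only illustrate the proposed check-before-split behaviour.

```cpp
// Hypothetical per-OSD split throttle (sketch only, not existing Ceph code).
// max_concurrent_splits stands in for the proposed "max concurrent
// splittings" configurable.
#include <mutex>

class SplitThrottle {
  std::mutex lock;
  unsigned in_progress = 0;
  const unsigned max_concurrent_splits;

public:
  explicit SplitThrottle(unsigned max) : max_concurrent_splits(max) {}

  // Returns true if the caller may split now; otherwise the caller skips
  // the split and retries on the next pass (the folder simply stays over
  // threshold until a slot frees up).
  bool try_start_split() {
    std::lock_guard<std::mutex> l(lock);
    if (in_progress >= max_concurrent_splits)
      return false;
    ++in_progress;
    return true;
  }

  // Called when a split finishes, freeing a slot for the next folder.
  void finish_split() {
    std::lock_guard<std::mutex> l(lock);
    --in_progress;
  }
};
```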

Please help review the problem as well as the proposals above (or suggest better approaches I may have missed). Thanks.

Actions #1

Updated by Sage Weil about 10 years ago

  • Status changed from New to 4

It is difficult to delay splitting without breaking the semantics. The simplest way to throttle right now is to increase pg_num slowly so that there aren't multiple PGs per OSD splitting at once... would that work for your case?

Actions #2

Updated by Guang Yang about 10 years ago

Sage Weil wrote:

It is difficult to delay splitting without breaking the semantics.

Can you elaborate a little bit on this one?

The simplest way to throttle right now is to increase pg_num slowly so that there aren't multiple PGs per OSD splitting at once... would that work for your case?

Do you mean increasing pg_num at runtime, or setting a larger pg_num up front when creating the pool? Changing pg_num / pgp_num triggers data migration, which in turn also increases latency.

Sage,
What do you think of the second approach, pre-creating all folders, so that we:
1) Never split / merge folders.
2) Bypass HashIndex::_lookup and locate the file by calculating its path directly (a rough sketch follows below); I would expect this to improve file lookup performance, especially on hosts with a tight RAM / disk ratio.
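A rough sketch of that direct lookup, under the assumption of a fixed, known depth. The directory naming and nibble ordering below are simplified for illustration and do not reproduce the real FileStore layout:

```cpp
// Sketch only: with a fixed number of levels, the leaf directory for an
// object can be computed straight from its hash, with no per-component
// ::stat. Naming/ordering is simplified, not the actual HashIndex layout.
#include <cstdint>
#include <string>

std::string leaf_dir_for(uint32_t hash, int fixed_levels) {
  std::string path;
  for (int level = 0; level < fixed_levels; ++level) {
    // one hex nibble of the hash per directory level
    int nibble = (hash >> (4 * level)) & 0xf;
    path += "/DIR_";
    path += "0123456789ABCDEF"[nibble];
  }
  return path;  // e.g. "/DIR_3/DIR_A" when the low nibbles are 0x3 and 0xA
}
```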

We would make this configurable so that it is only used for special use cases like ours, where:
1) GET latency is important
2) Stable latency is important

What do you think?

Actions #3

Updated by Guang Yang about 10 years ago

Hi Sage,
If we would like to make the following changes:
1. Bring in a new configuration flag which can be used to disable folder splitting / merging completely.
2. If the above flag is on, hack the file lookup to locate the file in the very first layer instead of walking through the path (by ::stat).

Is the above change acceptable (or would you prefer that we only make change 1)?

Thanks,
Guang

Actions #4

Updated by Sage Weil about 10 years ago

At a high level, sure, if you know ahead of time how many objects per PG you expect you can pre-hash the PG directories. The problem is that, in general, people don't know how many objects their pool will hold. You would need to know both

  • number of osds, and thus number of pgs for the pool
  • number of objects in the pool (total capacity / avg object size?)

to get to the number of objects per PG and not ever split PGs.

If you did know all that, it seems like you could put hints in the pool metadata that the OSD could pass to the FileStore to pre-hash things...
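To make the arithmetic concrete, a minimal sketch of how a pre-hash depth could fall out of those two inputs. The split_threshold parameter (maximum objects per leaf directory before a split would be needed) is an assumed value here, not an actual config name:

```cpp
// Sketch only: how many 16-way levels a PG's directory tree would need so
// that no leaf directory ever exceeds split_threshold objects, given the
// pool-wide object estimate and pg_num from the bullets above.
#include <cstdint>

int prehash_levels(uint64_t expected_pool_objects,
                   uint64_t pg_num,
                   uint64_t split_threshold) {
  uint64_t objects_per_pg = expected_pool_objects / pg_num;
  int levels = 0;
  uint64_t leaves = 1;
  while (objects_per_pg > leaves * split_threshold) {
    leaves *= 16;  // each split fans a directory out into 16 children
    ++levels;
  }
  return levels;
}
```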

Actions #5

Updated by Guang Yang about 10 years ago

Thanks Sage very much for the comments.

To begin with, I propose a change here - https://github.com/ceph/ceph/pull/1444

With this change, we can build a standalone tool that creates the folders when provisioning the cluster for our needs (it may not be general enough, e.g. it does not consider PG splitting, etc.), and we will consider a more elegant fix later.

Please help to review the patch.

Actions #6

Updated by Sage Weil about 10 years ago

  • Status changed from 4 to Resolved
Actions #7

Updated by Guang Yang almost 10 years ago

Sage Weil wrote:

At a high level, sure, if you know ahead of time how many objects per PG you expect you can pre-hash the PG directories. The problem is that, in general, people don't know how many objects their pool will hold. You would need to know both

  • number of osds, and thus number of pgs for the pool
  • number of objects in the pool (total capacity / avg object size?)

to get to the number of objects per PG and not ever split PGs.

If you did know all that, it seems like you could put hints in the pool metadata that the OSD could pass to the FileStore to pre-hash things...

Hi Sage,
I have developed a simple standalone tool to pre-hash PG folders right after pool creation, and I am wondering whether it would be good to merge it back into Ceph. The interface, as you mentioned, could be:
1. When creating the pool, provide a new configuration named 'pre_hash_pg_sub_folders' to let the user specify how many folders to pre-create. (An alternative is to calculate this on the user's behalf from information such as the number of objects per split, the total number of PGs, and the target object count for the pool, but I think it is better to delegate the calculation to the user if he/she understands the underlying impact.) Since each split results in 16x folders, the math is not hard most of the time for a user who can roughly estimate how many files will go into the pool (a worked example follows this list).
2. We force the option to only take effect when the user creates the pool; he/she should not be able to change it on the fly.
3. If the user does not choose a power-of-two PG number, the first split may result in a different number of sub-dirs per PG, so we may end up splitting to different levels.
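A worked example with made-up numbers, only to put figures on the 16x-per-split estimate in item 1 (the 320-objects-per-leaf limit is an assumption, not a real config value):

```cpp
// Made-up figures, purely to illustrate the estimate; not real defaults.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t pool_objects   = 1000000000;  // target objects for the pool
  const uint64_t pg_num         = 8192;
  const uint64_t per_leaf_limit = 320;         // assumed split threshold

  uint64_t per_pg = pool_objects / pg_num;     // ~122k objects per PG
  int levels = 0;
  uint64_t leaves = 1;
  while (per_pg > leaves * per_leaf_limit) {
    leaves *= 16;
    ++levels;
  }
  // Prints: pre-create 3 levels (4096 leaves, ~29 objects per leaf)
  printf("pre-create %d levels (%llu leaves, ~%llu objects per leaf)\n",
         levels, (unsigned long long)leaves,
         (unsigned long long)(per_pg / leaves));
  return 0;
}
```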

Does that sound like a reasonable change? If yes, I will prepare a pull request.

Thanks!
Guang

Actions #8

Updated by Sage Weil almost 10 years ago

Guang Yang wrote:

Sage Weil wrote:

At a high level, sure, if you know ahead of time how many objects per PG you expect you can pre-hash the PG directories. The problem is that, in general, people don't know how many objects their pool will hold. You would need to know both

  • number of osds, and thus number of pgs for the pool
  • number of objects in the pool (total capacity / avg object size?)

to get to the number of objects per PG and not ever split PGs.

If you did know all that, it seems like you could put hints in the pool metadata that the OSD could pass to the FileStore to pre-hash things...

Hi Sage,
I have a simple standalone tool developed to pre-hash PG folders right after creating the pool, I am thinking if it is good to merge it back to Ceph, the interface, as you mentioned, can be:
1. When creating the pool, provide a new configuration named 'pre_hash_pg_sub_folders' to let the user specify how many folders to pre-create. (An alternative is to calculate this on the user's behalf from information such as the number of objects per split, the total number of PGs, and the target object count for the pool, but I think it is better to delegate the calculation to the user if he/she understands the underlying impact.) Since each split results in 16x folders, the math is not hard most of the time for a user who can roughly estimate how many files will go into the pool.

If this is a filestore_* type config option, I think something can easily be merged in, but I'm worried that it will be difficult to use in most environments, since there are often different pools with different expected data sets.

I wonder if a more general way to accomplish this is to have a pool property that is something like "expected num objects." The OSD can then infer how many objects are expected per pg and pass that down through the ObjectStore interface to FileStore, which can respond by pre-hashing the necessary amount.

2. We force the option to only take effect when the user creates the pool; he/she should not be able to change it on the fly.
3. If the user does not choose a power-of-two PG number, the first split may result in a different number of sub-dirs per PG, so we may end up splitting to different levels.

A pool property would smooth over this a bit, because the double-size PGs would get a value twice as large. IIRC there is a helper method somewhere in osd_types.h that does a divide_by_pg_num operation compensating for pg_num values that are not a power of 2 (a rough sketch of that compensation follows below).

Would this accomplish what you want?
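A rough sketch of that per-PG compensation, assuming the usual stable-mod placement of hashes onto PGs; the actual helper in osd_types.h is not reproduced here, this is only the idea:

```cpp
// Sketch only: derive an "expected objects for this PG" figure from the
// pool-wide expectation. When pg_num is not a power of two, the PGs that
// cover two hash buckets get twice the share of the others.
#include <cstdint>

uint64_t expected_objects_for_pg(uint64_t expected_pool_objects,
                                 uint32_t pg_num, uint32_t pg_id) {
  uint32_t p = 1;                 // smallest power of two >= pg_num
  while (p < pg_num)
    p <<= 1;
  // With stable-mod placement, PG ids in [pg_num - p/2, p/2) receive two
  // hash buckets each; every other PG receives one.
  bool double_sized = (pg_id >= pg_num - p / 2) && (pg_id < p / 2);
  return expected_pool_objects / p * (double_sized ? 2 : 1);
}
```

For example, with pg_num = 12 the next power of two is 16, so PGs 4-7 are the double-size ones and would get twice the per-PG expectation of PGs 0-3 and 8-11.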

Actions #9

Updated by Guang Yang almost 10 years ago

Pull request as per comment 8 - https://github.com/ceph/ceph/pull/2031
