Bug #7207 (closed)

Lock contention in filestore I/O (FileStore::lfn_open) during filestore folder splitting causes cluster-wide performance degradation

Added by Guang Yang over 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There is lock contention in filestore I/O (FileStore::lfn_open, https://github.com/guangyy/ceph/blob/master/src/os/FileStore.cc#L198). While holding the lock, the thread (either an op thread or a filestore thread) checks whether it is time to do folder splitting (at most 320 files within a folder by default, https://github.com/guangyy/ceph/blob/master/src/os/FileStore.cc#L230). If so, the thread performs the folder split, which can take up to several seconds; during this period, all threads trying to do filestore I/O (especially the op threads, which are on the critical path) hang waiting for the lock.
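
For illustration only, a minimal C++ sketch of the pattern described above; the types and names are invented, and the real logic lives in FileStore::lfn_open and the surrounding index code:

    #include <mutex>

    // A minimal, hypothetical sketch of the contended pattern (not the real
    // FileStore/HashIndex code; the types and names are invented for illustration).
    struct IndexSketch {
      std::mutex access_lock;     // per-collection index lock
      int objects_in_dir = 0;     // files currently in the leaf directory
      int split_threshold = 320;  // 320 objects per folder by default, per the description above

      void split_dir() {
        // Renames hundreds of files into new sub-directories; on a loaded
        // disk this can take several seconds.
      }
    };

    // Every filestore I/O goes through something like this; while one thread
    // is inside split_dir(), every other caller blocks on access_lock.
    int lfn_open_sketch(IndexSketch &index, bool create) {
      std::lock_guard<std::mutex> guard(index.access_lock);
      int fd = 0;  // stand-in for looking up / creating the object's file
      if (create && ++index.objects_in_dir > index.split_threshold)
        index.split_dir();
      return fd;
    }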

After running for some time, there are millions of folders needing to be split (from level 5 to level 6), and each split can hang OSD ops for several seconds (due to the lock contention). This degrades performance across the entire cluster, as we observed from 'dump_historic_ops': 1) all op threads (doing filestore I/O) hang waiting for the lock; 2) ops wait a long time in the op work queue because no op thread picks them up (due to 1); 3) ops wait for sub-ops, which may be experiencing either 1 or 2.

We consider this a lock contention issue, since the splitting of a single sub-folder (an OSD manages tens of thousands of such folders) can block the entire OSD's I/O.

A typical event flow for such a hung op looks like the following (note that the latency is between reached_pg and started):

    { "rmw_flags": 4,
      "received_at": "2014-01-22 03:43:33.775424",
      "age": "158.326674",
      "duration": "4.003504",
      "flag_point": "waiting for sub ops",
      "client_info": { "client": "client.94353",
                       "tid": 30248979 },
      "events": [
        { "time": "2014-01-22 03:43:33.778359", "event": "waiting_for_osdmap" },
        { "time": "2014-01-22 03:43:33.778640", "event": "reached_pg" },
        { "time": "2014-01-22 03:43:36.302657", "event": "started" },
        { "time": "2014-01-22 03:43:36.303010", "event": "started" },
        { "time": "2014-01-22 03:43:36.309889", "event": "waiting for subops from [112,200]" },
        { "time": "2014-01-22 03:43:36.311360", "event": "commit_queued_for_journal_write" },
        { "time": "2014-01-22 03:43:36.311680", "event": "write_thread_in_journal_buffer" },
        { "time": "2014-01-22 03:43:36.312144", "event": "journaled_completion_queued" },
        { "time": "2014-01-22 03:43:36.312200", "event": "op_commit" },
        { "time": "2014-01-22 03:43:36.337536", "event": "sub_op_commit_rec" },
        { "time": "2014-01-22 03:43:36.482597", "event": "op_applied" },
        { "time": "2014-01-22 03:43:37.778550", "event": "sub_op_commit_rec" },
        { "time": "2014-01-22 03:43:37.778928", "event": "done" } ] }

Actions #1

Updated by Sage Weil over 10 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Ian Colle over 10 years ago

  • Assignee set to Samuel Just
Actions #3

Updated by Samuel Just over 10 years ago

  • Estimated time set to 16:00 h
Actions #4

Updated by Samuel Just over 10 years ago

3849ce2fa428dcfab57d6a7fdde1870f1448a33f I think

Actions #5

Updated by Greg Farnum over 10 years ago

  • Status changed from New to 7
  • Assignee changed from Samuel Just to Greg Farnum

As he noted, there's already a patch that gets rid of this lock over the affected area (although it also needs the immediately-following bug fix patch). I've pushed it to wip-7207 so we can run it through the rados suite; I'm not expecting any problems.

Actions #6

Updated by Guang Yang over 10 years ago

Thanks for looking at this!

In terms of folder splitting, I am wondering if people are interested in an alternative solution: pre-create all sub-folders (e.g. level 5, level 6) when provisioning a new cluster, after estimating the number of files that would be put into a pool. With this approach, we can avoid the (unplanned) folder splitting, which can bring long-tail latency (even after we fix the lock contention issue, since we still need to protect the operation's atomicity).
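
To illustrate the kind of estimate this would require, here is a back-of-the-envelope sketch (not an existing Ceph tool; it assumes each extra hash level multiplies the number of leaf folders by 16 and uses the ~320-object default threshold mentioned in the description):

    #include <cmath>
    #include <cstdio>

    // Hypothetical estimate: how many hash-prefix levels would need to be
    // pre-created so that no leaf folder exceeds the split threshold.
    int levels_needed(double expected_objects_per_pg, double split_threshold) {
      double leaves = expected_objects_per_pg / split_threshold;
      if (leaves <= 1.0)
        return 0;
      return static_cast<int>(std::ceil(std::log(leaves) / std::log(16.0)));
    }

    int main() {
      // e.g. 2 million objects per PG with the default ~320-object threshold
      std::printf("%d levels needed\n", levels_needed(2e6, 320));  // prints 4
      return 0;
    }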

Actions #7

Updated by Greg Farnum about 10 years ago

Hmm, the problem with that is:
1) people are really bad at estimating how much stuff they're going to put into a storage system, and if they're wrong it'll be expensive.
2) You can remove the cost of the splits from runtime, but it's a tradeoff because you're adding a bunch of extra metadata the OSDs need to cache/lookup for the entire lifetime of the cluster whenever they do an IO.

It might be worth exploring, but I don't know that it's worth it until we've got more data on when we should actually be doing the splits to begin with.

Actions #8

Updated by Greg Farnum about 10 years ago

  • Status changed from 7 to Resolved

This is in master in 55ab35ba9c1b3a1c4fc0abe5539c6bc320409371. It passed a full rados suite run.

Apparently I accidentally put it in yesterday...whoops. :(

Actions #9

Updated by Guang Yang about 10 years ago

Greg Farnum wrote:

Hmm, the problem with that is:
1) people are really bad at estimating how much stuff they're going to put into a storage system, and if they're wrong it'll be expensive.
2) You can remove the cost of the splits from runtime, but it's a tradeoff because you're adding a bunch of extra metadata the OSDs need to cache/lookup for the entire lifetime of the cluster whenever they do an IO.

It might be worth exploring, but I don't know that it's worth it until we've got more data on when we should actually be doing the splits to begin with.

After checking the fix, I think there is another potential improvement to the splitting: only trigger the splitting from within a background thread (e.g. a filestore op thread), not from within an op thread, which is on the critical path. If two op threads happen to trigger splitting at the same time (which is likely, since objects are evenly distributed across different folders), the entire OSD would block for a couple of seconds.

Actions #10

Updated by kaifeng yao about 10 years ago

I see Guang's concern is that all threads within a thread pool might be working on splitting, as most of the folders at the same level are close to splitting at the same time. Perhaps we can stagger the folder splits by PG name/replica id/folder name.
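
As a rough sketch of that staggering idea (not existing Ceph code; it just derives a deterministic per-PG offset so folders in different PGs do not all hit the threshold at the same object count):

    #include <cstdint>
    #include <functional>
    #include <string>

    // Hypothetical: jitter the split threshold per PG so splits are spread
    // out over time instead of happening cluster-wide at once.
    int staggered_split_threshold(const std::string &pg_name,
                                  int base_threshold = 320,
                                  int max_jitter = 64) {
      std::uint64_t h = std::hash<std::string>{}(pg_name);
      return base_threshold +
             static_cast<int>(h % static_cast<std::uint64_t>(max_jitter));
    }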

Another concern is that multiple threads might split the same folder and cause a reentrancy problem, but I am not sure whether that is a real problem.

Actions #11

Updated by Greg Farnum about 10 years ago

Guang Yang wrote:

After checking the fix, I think there is another potential improvement to the splitting: only trigger the splitting from within a background thread (e.g. a filestore op thread), not from within an op thread, which is on the critical path. If two op threads happen to trigger splitting at the same time (which is likely, since objects are evenly distributed across different folders), the entire OSD would block for a couple of seconds.

That I find to be a pretty unlikely problem — random balances aren't going to result in that happening very often. Maybe if we observe it happening in practice we can investigate solutions (a background thread won't work without being very complicated, we need strong consistency about where things are located). Somehow staggering it would work better (and we already serialize based on the PG, so we don't need to worry about colliding splits if we do it in-line).

Actions #12

Updated by Guang Yang about 10 years ago

> That I find to be a pretty unlikely problem — random balances aren't going to result in that happening very often. Maybe if we observe it happening in practice we can investigate solutions (a background thread won't work without being very complicated, we need strong consistency about where things are located). Somehow staggering it would work better (and we already serialize based on the PG, so we don't need to worry about colliding splits if we do it in-line).

Thanks Greg! We can port the fix and test it on the cluster that is trying to do splitting (from level 5 to level 6).

By triggering it from within a background thread (filestore thread), I mean that FileStore::lfn_open can be called either from an op thread (e.g. for a read op) or from a filestore thread. If it is called from an op thread that happens to trigger the splitting, that op thread would work on the split for a couple of seconds (6 seconds in our testing with 320 files to split), and the throughput of the OSD drops by half during that period. Even worse, if two op threads hang the same way, the entire OSD could block for a couple of seconds.

An alternative is to only trigger (and do) the splitting in the background filestore thread, which does not block ops anyway (is there any other concern?).

After checking the code, one potential small change to achieve this is to check the #create# parameter within FileStore::lfn_open and only trigger folder splitting when #create# is true.

Actions #13

Updated by Greg Farnum about 10 years ago

Well, we already only do folder split/merge when creating or removing objects, which isn't going to happen on a read. ;) I hadn't considered that doing the split would block other ops on different PGs from progressing; I don't remember how the internal interfaces are constructed for the work queues but it's possible we could stick anything doing a split onto a callback and release the op thread to work on other operations.
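
For what it's worth, a hypothetical shape of that callback idea (this is not the actual FileStore/OSD work-queue API, and it sets aside the consistency concerns about in-flight lookups; it only shows the general structure of deferring the split off the op thread):

    #include <functional>
    #include <mutex>
    #include <queue>

    // Hypothetical sketch of deferring splits to a background "finisher" so
    // the op thread can return immediately; not the real work-queue API.
    class SplitFinisher {
      std::mutex lock;
      std::queue<std::function<void()>> pending;

    public:
      // Called by the op thread instead of doing the split in-line.
      void queue_split(std::function<void()> split_fn) {
        std::lock_guard<std::mutex> l(lock);
        pending.push(std::move(split_fn));
      }

      // A dedicated background thread drains the queue, so op threads never
      // spend seconds renaming files.
      void run_one() {
        std::function<void()> fn;
        {
          std::lock_guard<std::mutex> l(lock);
          if (pending.empty())
            return;
          fn = std::move(pending.front());
          pending.pop();
        }
        fn();
      }
    };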

Actions #14

Updated by Guang Yang about 10 years ago

Greg Farnum wrote:

Well, we already only do folder split/merge when creating or removing objects,

Oh right, it checks the #create# parameter within the function (sorry for missing it) before checking if it is a good time to do splitting. So I assume the splitting will never happen in an op thread (am I right?), only in a filestore thread?

Actions #15

Updated by Guang Yang about 10 years ago

Another change I would suggest is to add some logs before and after folder splitting so that it is easier to check when it happens and how long it takes. Does that make sense?
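
A sketch of the kind of instrumentation being suggested (timing only; the logging call here is a stand-in, not the exact dout usage in FileStore):

    #include <chrono>
    #include <iostream>

    // Hypothetical instrumentation: log when a folder split starts and how
    // long it took, so slow splits show up in the OSD log.
    // Usage (hypothetical): timed_split("DIR_A/DIR_3", [&] { index.split_dir(); });
    template <typename SplitFn>
    void timed_split(const char *dir_name, SplitFn do_split) {
      std::clog << "starting split of " << dir_name << std::endl;  // stand-in for dout
      auto start = std::chrono::steady_clock::now();
      do_split();
      auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
          std::chrono::steady_clock::now() - start);
      std::clog << "finished split of " << dir_name << " in " << ms.count()
                << " ms" << std::endl;
    }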

Actions #16

Updated by Greg Farnum about 10 years ago

Which op threads and filestore threads are you talking about? The splitting happens whenever the filestore actually does the create that triggers it.

Actions #17

Updated by Guang Yang about 10 years ago

Greg Farnum wrote:

Which op threads and filestore threads are you talking about? The splitting happens whenever the filestore actually does the create that triggers it.

By #op thread#, I mean the thread handling messages (opWQ); by filestore thread, I mean the filestore op thread (filestoreWQ).

That is correct; I was wrong that this path (folder splitting) could be triggered and done within an op thread, which would potentially block other message processing. As long as it only happens from within a background filestore thread, we can go ahead and test the fix.

Thanks Greg!

Actions #18

Updated by Guang Yang about 10 years ago

Hi Greg,
I got a chance to test the fix in our cluster (the leaf folders have around 1000 files each), and there are a few observations:
1) Disk saturation when folders start splitting; the util is 100% and it can last for minutes.
2) We start seeing client message throttling.
3) Latency increases (up to tens of seconds).

Things could be worse if we increase the max-objects-per-folder config.
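
For context, the max-objects-per-folder limit referred to here is derived from two filestore options; with the defaults of the time it works out to the 320 mentioned in the description (treat the exact expression as an assumption about how the threshold is computed):

    #include <cstdio>
    #include <cstdlib>

    int main() {
      // Assumed split threshold: 16 * abs(filestore_merge_threshold)
      //                             * filestore_split_multiple
      int filestore_merge_threshold = 10;  // assumed default
      int filestore_split_multiple = 2;    // assumed default
      int split_threshold =
          16 * std::abs(filestore_merge_threshold) * filestore_split_multiple;
      std::printf("split threshold: %d objects per folder\n", split_threshold);  // 320
      return 0;
    }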

Actions #19

Updated by Guang Yang about 10 years ago

Hello Greg,
Can we backport this fix to dumpling?

Actions #20

Updated by Greg Farnum about 10 years ago

As a performance patch that only impacts splitting, I don't think this is really appropriate for a backport.
