Bug #9202 (closed)

Performance degradation during recovering and backfilling

Added by Zhi Zhang over 9 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From recent tests and analysis, we find that slow requests during recovery and backfill mainly fall into two patterns.

1. An op waits a long time in the op_wq with no op thread available to process it. These requests are reported with 'no flag points reached'.

2. An op thread has picked up an op to process but is stuck waiting for the pg lock. This can be seen in dump_historic_ops output, in the gap between the "waiting_for_osdmap" and "reached_pg" events.

{ "description": "osd_op(client.48687.0:41007649
default.45832.251_709e27f683b5a2160634c45ec79bf260
[] 22.b4445a81 ondisk+write e15711)",
"received_at": "2014-08-20 03:05:20.324103",
"age": "229.608474",
"duration": "141.505679",
"type_data": [
"started", { "client": "client.48687",
"tid": 41007649},
[ { "time": "2014-08-20 03:05:20.370175",
"event": "waiting_for_osdmap"}, { "time": "2014-08-20 03:07:41.829093",
"event": "reached_pg"}, { "time": "2014-08-20 03:07:41.829486",
"event": "started"}, { "time": "2014-08-20 03:07:41.829536",
"event": "started"}, { "time": "2014-08-20 03:07:41.829782",
"event": "done"}]]},

The two patterns above are related to some extent. During backfill there are pg scan ops. When an op thread starts to process a pg scan op, it holds the pg lock until the op finishes, and processing a pg scan op can take a long time.

As the logs below show, it takes more than a minute to finish this op.
--------------------------------------------------------------------------
2014-08-20 05:40:10.614246 7f552081a700 10 osd.148 15817 dequeue_op 0x30a634a0 prio 127 cost 0 latency 7.181219 pg_scan(get_digest 22.79cs0
d79c//head//22-0//0//22 e 15817/15817) v2 pg pg[22.79cs0( v 15817'23973(12435'20642,15817'23973] lb 858ec79c/default.45834.877_c218526299dbf02a43375aafca843965/head//22 local-les=0 n=19260 ec=11132 les/c 14898/14893 14896/14896/14264)
[148,76,62,137,152,19,38,74,129,54,71]/[37,76,62,137,152,19,38,74,129,54,71]
r=-1 lpr=14896 pi=11132-14895/25 luod=0'0 crt=15815'23972 active+remapped]
...
2014-08-20 05:41:23.071930 7f552081a700 10 osd.148 15817 dequeue_op 0x30a634a0 finish
--------------------------------------------------------------------------

During this time, if another op thread tries to process any op belonging to this pg, it first tries to take the pg lock and then blocks, because Ceph uses a blocking mutex lock here. This is pattern 2.

When this happens, not only can the current op not be processed, but the blocked op thread also cannot pick up other ops from the op_wq. As a result fewer op threads are available, and many ops sitting in the op_wq turn into slow requests. This is pattern 1.
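
To make the interaction concrete, below is a minimal standalone C++ sketch. It is not Ceph code: the function names are made up and plain std::mutex stands in for the per-PG lock. It shows how one long-running scan holding the pg lock stalls every op thread that then takes that lock with a blocking lock().
--------------------------------------------------------------------------
// Simplified model of pattern 2 feeding pattern 1: while a long pg scan
// op holds the pg lock, every op thread that blocks on that lock can
// serve nothing else.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex pg_lock;  // stands in for the per-PG lock in the OSD

void pg_scan_op() {
    std::lock_guard<std::mutex> l(pg_lock);                 // scan holds the pg lock...
    std::this_thread::sleep_for(std::chrono::seconds(5));   // ...for a long time
}

void client_op(int id) {
    auto start = std::chrono::steady_clock::now();
    std::lock_guard<std::mutex> l(pg_lock);                 // blocking lock: the op thread is stuck here
    auto waited = std::chrono::duration_cast<std::chrono::seconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << "op " << id << " reached_pg after " << waited.count() << "s\n";
}

int main() {
    std::thread scan(pg_scan_op);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::thread t1(client_op, 1), t2(client_op, 2);          // both op threads pile up on the lock
    scan.join(); t1.join(); t2.join();
}
--------------------------------------------------------------------------
Both client_op threads sit blocked on pg_lock for the full duration of pg_scan_op, just as the real op threads sit between "waiting_for_osdmap" and "reached_pg" in the dump above.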

To mitigate the performance degradation we can temporarily tune some backfill/recovery parameters and increase the number of op threads, but recovery then takes several times longer than before.

For a longer-term solution, we could investigate the following approaches to see whether they help (a sketch of the first idea follows the list).
1. Use mutex try_lock instead of a blocking mutex lock?
2. Reduce the pg scan op's processing time by narrowing its scope or splitting it into many smaller ops?
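
As a rough illustration of idea 1, here is another minimal standalone C++ sketch. Again it is not Ceph code: op_wq, pg_lock, and process_one are invented stand-ins. The op thread tries the pg lock and, if the lock is busy, requeues the op so the thread stays free for other work.
--------------------------------------------------------------------------
#include <deque>
#include <mutex>

struct Op { int id; };

std::mutex pg_lock;        // stands in for the per-PG lock
std::mutex wq_lock;        // protects the work queue below
std::deque<Op> op_wq;      // stands in for the OSD op work queue

void process_one() {
    Op op;
    {
        std::lock_guard<std::mutex> wl(wq_lock);
        if (op_wq.empty()) return;
        op = op_wq.front();
        op_wq.pop_front();
    }
    if (pg_lock.try_lock()) {
        // ... handle the op while holding the pg lock ...
        pg_lock.unlock();
    } else {
        // pg is busy (e.g. a long pg scan): requeue and keep the thread free
        std::lock_guard<std::mutex> wl(wq_lock);
        op_wq.push_back(op);
    }
}

int main() {
    op_wq.push_back({1});
    op_wq.push_back({2});
    process_one();   // handles op 1 (the pg lock is free here)
    process_one();   // handles op 2
}
--------------------------------------------------------------------------
A real change along these lines would also need to preserve per-PG op ordering and avoid busy-requeueing against a PG that stays locked for a long time; the sketch only shows the locking idea.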
