Bug #9523

closed

Both op threads and dispatcher threads could be stuck at acquiring the budget of filestore ops during peering stage

Added by Zhi Zhang over 9 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
FileStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When an OSD is rejoining and peering, we still see some slow requests and performance degradation for about 5 to 10 minutes.

Most of the slow requests reported on other OSDs are due to waiting for the ECSubWriteReply op from the rejoined OSD.

The rejoined OSD takes a long time to process ECSubWrite ops, mainly because of 1) fewer available op threads; 2) threads stuck acquiring the budget of filestore ops.

[1] During the OSD rejoining and peering stage there are some typical ops, such as PG_REMOVE, PG_NOTIFY, PG_QUERY, PG_LOG, etc. These ops are pre-processed by dispatcher threads before being enqueued to the corresponding work queue. While pre-processing them, the dispatcher threads issue some filestore transactions, so they need to acquire the budget of filestore ops.

[2] After pre-processing, PG_REMOVE ops are enqueued to remove_wq, which is processed by the disk thread pool (1 thread), while PG_NOTIFY, PG_QUERY and PG_LOG ops are enqueued to peering_wq, which is processed by the op thread pool (8 threads). All of these involve filestore transactions, so they also need to acquire the budget of filestore ops.

[3] When a PG is active it can serve client ops and sub ops, which also need op threads to process them and need to acquire the budget of filestore ops for their filestore transactions.

So from [2], during the peering stage there are lots of PG_NOTIFY, PG_QUERY and PG_LOG ops, which occupy most of the op threads, leaving fewer op threads available to process ECSubWrite ops. From the OSD log, there is only 1 op thread processing ECSubWrite ops for about 30 seconds.
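
To make the starvation effect concrete, here is a toy model (not Ceph code; the 8 threads and the 10 ms per item are arbitrary illustrative numbers): a fixed pool of op threads shared by peering work and client work, where a flood of peering items delays a later client item by roughly backlog / threads * per-item time.
--------------------------------------------
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::function<void()>> q;
  bool done = false;

  // 8 workers, like the op thread pool described above.
  std::vector<std::thread> workers;
  for (int i = 0; i < 8; ++i) {
    workers.emplace_back([&] {
      for (;;) {
        std::function<void()> job;
        {
          std::unique_lock<std::mutex> l(m);
          cv.wait(l, [&] { return done || !q.empty(); });
          if (q.empty()) return;
          job = std::move(q.front());
          q.pop_front();
        }
        job();
      }
    });
  }

  auto start = std::chrono::steady_clock::now();
  {
    std::lock_guard<std::mutex> l(m);
    // 800 peering-style items, each holding an op thread for ~10 ms ...
    for (int i = 0; i < 800; ++i)
      q.emplace_back([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
      });
    // ... then one client sub op arrives and has to wait for the backlog.
    q.emplace_back([&] {
      auto waited = std::chrono::steady_clock::now() - start;
      std::cout << "client op picked up after "
                << std::chrono::duration_cast<std::chrono::milliseconds>(waited).count()
                << " ms\n";  // roughly 800 / 8 * 10 ms = ~1000 ms
    });
  }
  cv.notify_all();

  {
    std::lock_guard<std::mutex> l(m);
    done = true;
  }
  cv.notify_all();
  for (auto &t : workers) t.join();
}
--------------------------------------------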

And from [1] to [3], there are lots of filestore transactions that need to acquire the budget of filestore ops, so it is easy to hit the filestore max ops throttle (200 here; the default value is 50).
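
Conceptually, the budget behaves like a counting throttle: any thread that queues a filestore transaction first takes a unit of budget and blocks until the number of in-flight ops drops back under the configured maximum. A simplified sketch of the behaviour behind the "waiting 201 > 200 ops" lines below (not the actual FileStore/Throttle code):
--------------------------------------------
#include <condition_variable>
#include <mutex>

// Simplified stand-in for the filestore op budget; both dispatcher threads
// and op threads block in take() when the budget is exhausted.
class OpBudget {
  std::mutex m;
  std::condition_variable cv;
  int in_flight = 0;
  const int max_ops;
 public:
  explicit OpBudget(int max) : max_ops(max) {}

  // Called before queuing a transaction; blocks while in_flight >= max_ops.
  void take() {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return in_flight < max_ops; });
    ++in_flight;
  }

  // Called when the transaction completes, waking one blocked waiter.
  void put() {
    std::lock_guard<std::mutex> l(m);
    --in_flight;
    cv.notify_one();
  }
};
--------------------------------------------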

From the OSD log, the dispatcher thread is stuck for 6 seconds. While this is happening, no new ops can be dispatched or processed.
--------------------------------------------
2014-09-17 07:30:41.200056 7f116bc98700 20 osd.166 2637 _dispatch 0x4ec43600 pg_log(4.3fcs10 epoch 2635 log log((1438'14004,2631'17004], crt=2630'17002) query_epoch 2635) v4
2014-09-17 07:30:41.200083 7f116bc98700 7 osd.166 2637 handle_pg_log pg_log(4.3fcs10 epoch 2635 log log((1438'14004,2631'17004], crt=2630'17002) query_epoch 2635) v4 from osd.3
...
2014-09-17 07:30:41.200771 7f116bc98700 5 osd.166 pg_epoch: 2631 pg[4.3fcs10( empty lb 0//0//-1 local-les=0 n=0 ec=179 les/c 2607/2576 2621/2621/179) [3,157,65,58,104,38,89,149,162,17,60] r=-1 lpr=2631 pi=179-2620/15 crt=0'0 inactive] exit Start 0.000027 0 0.000000
2014-09-17 07:30:41.200786 7f116bc98700 5 osd.166 pg_epoch: 2631 pg[4.3fcs10( empty lb 0//0//-1 local-les=0 n=0 ec=179 les/c 2607/2576 2621/2621/179) [3,157,65,58,104,38,89,149,162,17,60] r=-1 lpr=2631 pi=179-2620/15 crt=0'0 inactive] enter Started/Stray
2014-09-17 07:30:41.200802 7f116bc98700 10 log is not dirty
2014-09-17 07:30:41.200810 7f116bc98700 5 filestore(/home/y/var/lib/ceph/osd/ceph-166) queue_transactions existing osr(4.3fcs10 0x33ecfe0)/0x33ecfe0
2014-09-17 07:30:41.200814 7f116bc98700 2 filestore(/home/y/var/lib/ceph/osd/ceph-166) waiting 201 > 200 ops || 80612996 > 104857600
...
2014-09-17 07:30:47.216151 7f116bc98700 5 filestore(/home/y/var/lib/ceph/osd/ceph-166) queue_transactions (writeahead) 5638911 0x3d10fc0
2014-09-17 07:30:47.216260 7f116bc98700 10 osd.166 2637 pg[4.3fcs10( empty lb 0//0//-1 local-les=0 n=0 ec=179 les/c 2607/2576 2621/2621/179) [3,157,65,58,104,38,89,149,162,17,60] r=-1 lpr=2631 pi=179-2620/15 crt=0'0 inactive] is new (resurrected)
--------------------------------------------

From the OSD log, the op thread processing peering events from peering_wq is stuck many times and is slowed down by this throttle; it takes about 25 seconds to finish.
--------------------------------------------
2014-09-17 07:30:24.342099 7f1163c88700 10 osd.166 pg_epoch: 2633 pg[8.34( empty lb 0//0//-1 local-les=0 n=0 ec=273 les/c 2623/2623 2632/2633/2633) [166,113,63]/[15,113,63] r=-1 lpr=2633 pi=273-2632/18 crt=0'0 remapped NOTIFY] handle_peering_event: epoch_sent: 2633 epoch_requested: 2633 MQuery from 15 query_epoch 2633 query: query(info 0'0)
2014-09-17 07:30:24.342129 7f1163c88700 10 osd.166 pg_epoch: 2633 pg[8.34( empty lb 0//0//-1 local-les=0 n=0 ec=273 les/c 2623/2623 2632/2633/2633) [166,113,63]/[15,113,63] r=-1 lpr=2633 pi=273-2632/18 crt=0'0 remapped NOTIFY] sending info
...
2014-09-17 07:30:30.467957 7f1163c88700 10 osd.166 pg_epoch: 2637 pg[11.74( empty lb 0//0//-1 local-les=0 n=0 ec=279 les/c 2596/2577 2632/2634/279) [147,166,142]/[147,142,158] r=-1 lpr=2634 pi=279-2633/17 crt=0'0 remapped NOTIFY] handle_peering_event: epoch_sent: 2635 epoch_requested: 2635 MLogRec from 147
...
2014-09-17 07:30:30.468080 7f1163c88700 10 osd.166 pg_epoch: 2637 pg[11.74( empty lb 0//0//-1 local-les=0 n=0 ec=279 les/c 2596/2577 2632/2634/279) [147,166,142]/[147,142,158] r=-1 lpr=2634 pi=279-2633/17 crt=0'0 remapped] state<Started/ReplicaActive>: Activate Finished
2014-09-17 07:30:30.468144 7f1163c88700 10 log is not dirty
2014-09-17 07:30:30.468150 7f1163c88700 10 -- 10.193.207.112:6851/78206 get_connection osd.142 10.193.207.110:6841/19776 existing 0x3c1cf400
2014-09-17 07:30:30.468162 7f1163c88700 10 -- 10.193.207.112:6851/78206 get_connection osd.147 10.193.207.111:6846/98479 existing 0x3c90aa00
2014-09-17 07:30:30.468170 7f1163c88700 10 -- 10.193.207.112:6851/78206 get_connection osd.158 10.193.207.110:6846/21283 existing 0x3c90bb80
2014-09-17 07:30:30.468179 7f1163c88700 5 filestore(/home/y/var/lib/ceph/osd/ceph-166) queue_transactions existing osr(11.74 0x3c049170)/0x3c049170
2014-09-17 07:30:30.468184 7f1163c88700 2 filestore(/home/y/var/lib/ceph/osd/ceph-166) waiting 201 > 200 ops || 104313537 > 104857600
...
2014-09-17 07:30:51.892579 7f1163c88700 5 filestore(/home/y/var/lib/ceph/osd/ceph-166) queue_transactions existing osr(4.363s6 0x33ecd80)/0x33ecd80
2014-09-17 07:30:51.892585 7f1163c88700 2 filestore(/home/y/var/lib/ceph/osd/ceph-166) waiting 201 > 200 ops || 37065678 > 104857600
...
2014-09-17 07:30:51.936238 7f1163c88700 5 filestore(/home/y/var/lib/ceph/osd/ceph-166) queue_transactions (writeahead) 5639113 0x39af080
--------------------------------------------

Increasing the following parameters can mitigate this issue. However, depending on the workload and the number of PGs per OSD, the problem may still last for a few minutes, and these parameters need to be tuned accordingly.

filestore queue max ops = 500
filestore queue max bytes = 204857600
filestore_queue_committing_max_bytes = 204857600
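
For reference, these can also be applied to running OSDs without a restart, e.g. via injectargs (the values above are just our starting point, not a recommendation):

ceph tell osd.* injectargs '--filestore_queue_max_ops 500 --filestore_queue_max_bytes 204857600 --filestore_queue_committing_max_bytes 204857600'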

Actions #1

Updated by Zhi Zhang over 9 years ago

  • Subject changed from Short-time peformance downgradation after OSD is rejoining and peering to Both op threads and dispatcher threads could be stuck at acquiring the budget of filestore ops during peering stage
Actions #2

Updated by Guang Yang over 9 years ago

There seem to be two problems here:
  1. Dispatcher thread hang due to filestore throttling
  2. Op thread hang due to filestore throttling

As for filestore throttling in general, there was a mailing list thread discussing whether we should increase those throttling bytes/ops - http://www.spinics.net/lists/ceph-devel/msg20984.html

As for the dispatcher thread: when handling a peering event, it might need to create the PG (on the dispatcher thread), which needs to queue a filestore transaction:

handle_pg_peering_evt(...)
   - dispatch_context(...)
       - store->queue_transaction()

I am wondering if it makes sense to add a new parameter named should_take_filestore_budget to dispatch_context and queue_transaction; callers running from within the dispatcher thread would pass false.
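
Roughly something like this (illustrative signatures only, not the actual Ceph prototypes, which take more arguments):
--------------------------------------------
// Illustrative only: a hypothetical flag threaded through both calls.
class Sequencer;
class Transaction;
struct RecoveryCtx;

void dispatch_context(RecoveryCtx &ctx,
                      bool should_take_filestore_budget = true);

int queue_transaction(Sequencer *osr, Transaction *t,
                      bool should_take_filestore_budget = true);

// A caller running on the dispatcher thread would pass false, so the
// transaction is queued without blocking on the filestore ops/bytes
// throttle; op threads keep the default of true.
--------------------------------------------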

Thoughts?

Actions #3

Updated by Guang Yang over 9 years ago

I am wondering if it makes sense to add a new parameter named should_take_filestore_budget to dispatch_context and queue_transaction; callers running from within the dispatcher thread would pass false.

Created a pull request - https://github.com/ceph/ceph/pull/3042

Actions #4

Updated by Sage Weil over 9 years ago

  • Assignee set to Guang Yang
Actions #5

Updated by Loïc Dachary about 9 years ago

Did you manage to reproduce the problem?

Actions #6

Updated by Guang Yang about 9 years ago

Loic Dachary wrote:

Did you manage to reproduce the problem?

We increased the throttle to overcome this problem, but I will take a look and investigate whether the fix makes sense or not. Thanks!

Actions #7

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Status changed from New to Closed
  • Assignee deleted (Guang Yang)
  • Component(RADOS) FileStore added

Based on the PR discussion it seems the diagnosed issue wasn't the cause of the slowness. Closing since it hasn't (knowingly) come up again and stuff has changed since then.
