Bug #58529

osd: very slow recovery due to delayed push reply messages

Added by Samuel Just 2 months ago. Updated about 2 months ago.

Fix Under Review


3 - minor


I took a look at the logs for PG 114.d6 attached to this tracker. The cost of a push reply is calculated to be over
80_M! I think this is calculated by compute_cost(). The base cost is 8_M, which indicates that multiple replies are
being sent as part of a single MOSDPGPushReply. The corresponding qos_cost is 436, as shown below for a PGRecoveryMsg.
This is quite high and, in general, the higher the cost, the longer an item stays in the mclock queue. Each such item
stays in the mclock queue for a few seconds (4-6 secs), depending on the osd capacity and the allocations set for
recovery ops. Shown below is one such log for a push reply item:

2023-01-19T20:59:49.909+0000 7f2f370cb700 20 osd.128 op_wq(4) _process OpSchedulerItem(114.d6 PGRecoveryMsg(op=MOSDPGPushReply(114.d6 10682931/10682929 [PushReplyOp(114:6b6e3b2f:::1001e1b90fb.00000000:head),PushReplyOp(114:6b6e3bff:::10011ed22d6.00000000:head),PushReplyOp(114:6b6e3d00:::10028d57803.00000000:head),PushReplyOp(114:6b6e3d4c:::10028375b85.00000000:head),PushReplyOp(114:6b6e3e50:::10028a7159b.00000000:head)]) v3) class_id 0 qos_cost 436 cost 83891080 e10682931) queued

The push reply object maintains a vector of PushReplyOp (replies). This vector is used to compute the overall cost which
adds up to over 80_M.

void compute_cost(CephContext *cct) {
  cost = 0;
  // Sum the cost of every PushReplyOp carried by this message.
  for (auto i = replies.begin(); i != replies.end(); ++i) {
    cost += i->cost(cct);
  }
}
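
As a rough illustration of how five PushReplyOp entries in one message add up to the logged cost, the summation can be modeled standalone. This is only a sketch: `ReplyOpModel`, `aggregate_cost` and `logged_example_cost` are hypothetical names, not Ceph code, and the per-reply figure of 16778216 is inferred by dividing the logged total (83891080) by the five replies, assuming every reply has the same cost.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for PushReplyOp: it carries only a precomputed cost.
struct ReplyOpModel {
  std::uint64_t op_cost;
  std::uint64_t cost() const { return op_cost; }
};

// Mirrors the summation in compute_cost(): the message cost is the sum of
// the costs of all replies it carries.
inline std::uint64_t aggregate_cost(const std::vector<ReplyOpModel>& replies) {
  std::uint64_t total = 0;
  for (const auto& reply : replies) {
    total += reply.cost();
  }
  return total;
}

// Assumption: the five replies in the logged message have equal cost, so the
// per-reply cost is taken to be 83891080 / 5 = 16778216.
inline std::uint64_t logged_example_cost() {
  const std::uint64_t per_reply = 83891080ULL / 5;
  return aggregate_cost(std::vector<ReplyOpModel>(5, ReplyOpModel{per_reply}));
}
```

Under the same assumption, a message carrying a single reply would cost 16778216, one fifth of the value seen in the log.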

Looking further, I also see:

 2023-01-19T21:00:38.358+0000 7f2f370cb700 10 osd.128 10682935 _maybe_queue_recovery starting 5, recovery_ops_reserved 0 -> 5

The above indicates that `osd_recovery_max_single_start` is set to 5 (default: 1). This is probably what increases the
cost of push reply ops, since the costs of that many PushReplyOp entries are aggregated into a single MOSDPGPushReply item.

As an immediate solution, `osd_recovery_max_single_start` can be set back to 1 and this should help the push reply ops
to be scheduled faster.

Longer term we have a few alternatives:

  1. Prevent modification of `osd_recovery_max_single_start`, similar to the way we prevent modification of the various sleep options with mclock enabled.
  2. Audit the costs for the various background ops and modify them to work well with mclock, while of course ensuring backward compatibility with the 'wpq' scheduler.
  3. Treat reply ops as higher priority and put them into a higher priority queue (currently being implemented). The higher priority queue is not managed by mclock, and therefore these ops can be completed faster.
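
The third alternative can be pictured with a minimal two-tier queue. This is a sketch under stated assumptions, not the implementation under review: `Item` and `TwoTierQueue` are hypothetical names, and the lower tier simply dequeues the lowest cost first as a crude stand-in for mclock scheduling.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical work item; this is not Ceph's OpSchedulerItem.
struct Item {
  std::string name;
  std::uint64_t cost;
  bool is_reply;
};

// Two tiers: replies go to a FIFO that is always drained first, everything
// else goes to a cost-ordered tier (a simplified stand-in for mclock).
class TwoTierQueue {
 public:
  void enqueue(Item item) {
    if (item.is_reply) {
      high_.push_back(std::move(item));
    } else {
      by_cost_.push(std::move(item));
    }
  }

  // Returns the next item, or std::nullopt when both tiers are empty.
  std::optional<Item> dequeue() {
    if (!high_.empty()) {
      Item item = std::move(high_.front());
      high_.pop_front();
      return item;
    }
    if (!by_cost_.empty()) {
      Item item = by_cost_.top();
      by_cost_.pop();
      return item;
    }
    return std::nullopt;
  }

 private:
  // Orders the lower tier so that the cheapest item is on top.
  struct CostGreater {
    bool operator()(const Item& a, const Item& b) const {
      return a.cost > b.cost;
    }
  };

  std::deque<Item> high_;
  std::priority_queue<Item, std::vector<Item>, CostGreater> by_cost_;
};
```

In this model a push reply is dequeued ahead of queued recovery work regardless of its cost, which is the effect the higher priority queue is meant to have.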

The above can be discussed and the best approach adopted.

Related issues

Related to Infrastructure - Bug #58498: ceph: pgs stuck backfilling New


#1 Updated by Samuel Just 2 months ago

  • Related to Bug #58498: ceph: pgs stuck backfilling added

#2 Updated by Samuel Just 2 months ago

I've opened this bug to track the slow backfill behavior, which appears to be unrelated to the original hung backfill behavior.

osd_recovery_max_single_start = 5 is a reasonable value. The mclock/cost machinery should handle it correctly.

The specific message I am seeing being delayed is a push reply. It doesn't actually make sense to throttle replies at all. The IO work has already been done, so delaying the reply simply increases the amount of time the object being recovered is unavailable to client IO.

#3 Updated by Neha Ojha 2 months ago

  • Project changed from Ceph to RADOS

#4 Updated by Sridhar Seshasayee about 2 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49975
