Bug #58529

closed

osd: very slow recovery due to delayed push reply messages

Added by Samuel Just over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
backport_processed
Backport:
quincy, reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I took a look at the logs for pg 114.d6 attached to this tracker. The cost for the push replies is calculated to be over
80_M! I think this is computed by compute_cost(). Since the base cost is 8_M, this indicates multiple replies being sent
as part of a single MOSDPGPushReply. The corresponding qos_cost is determined to be 436, as shown below for a PGRecoveryMsg.
This is quite high, and in general, the higher the cost, the longer an item stays in the mclock queue. Each item stays in
the mclock queue for a few seconds (4-6 secs) depending on the OSD capacity and the allocations set for recovery ops.
Shown below is one such log for a push reply item:

2023-01-19T20:59:49.909+0000 7f2f370cb700 20 osd.128 op_wq(4) _process OpSchedulerItem(114.d6 PGRecoveryMsg(op=MOSDPGPushReply(114.d6 10682931/10682929 [PushReplyOp(114:6b6e3b2f:::1001e1b90fb.00000000:head),PushReplyOp(114:6b6e3bff:::10011ed22d6.00000000:head),PushReplyOp(114:6b6e3d00:::10028d57803.00000000:head),PushReplyOp(114:6b6e3d4c:::10028375b85.00000000:head),PushReplyOp(114:6b6e3e50:::10028a7159b.00000000:head)]) v3) class_id 0 qos_cost 436 cost 83891080 e10682931) queued

The push reply object maintains a vector of PushReplyOp (replies). This vector is used to compute the overall cost, which
adds up to over 80_M.

// MOSDPGPushReply::compute_cost(): the message cost is the sum of the costs
// of all PushReplyOps batched into this reply message.
void compute_cost(CephContext *cct) {
  cost = 0;
  for (auto i = replies.begin(); i != replies.end(); ++i) {
    cost += i->cost(cct);
  }
}
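
For illustration only, here is a self-contained sketch (FakePushReplyOp is a hypothetical stand-in, and the 8_M base cost is taken from the description above, not from the actual PushReplyOp::cost() implementation) showing how the aggregate message cost grows linearly with the number of replies batched into one MOSDPGPushReply:

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for PushReplyOp: models only the ~8_M base cost
// mentioned above, not the real per-op cost formula.
struct FakePushReplyOp {
  uint64_t cost() const { return 8ull << 20; }  // 8_M
};

// Mirrors the compute_cost() loop above: the message cost is the sum of the
// reply costs, so batching more recovery starts per message multiplies the
// cost charged to the mclock item.
uint64_t compute_cost(const std::vector<FakePushReplyOp>& replies) {
  uint64_t cost = 0;
  for (const auto& r : replies) {
    cost += r.cost();
  }
  return cost;
}

int main() {
  for (int n : {1, 5, 10}) {
    std::vector<FakePushReplyOp> replies(n);
    std::cout << n << " replies -> cost " << compute_cost(replies) << "\n";
  }
}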

Looking further, I also see:

 2023-01-19T21:00:38.358+0000 7f2f370cb700 10 osd.128 10682935 _maybe_queue_recovery starting 5, recovery_ops_reserved 0 -> 5

The above indicates that `osd_recovery_max_single_start` is set to 5 (default: 1), which is probably what inflates the
cost of the push reply ops, since that many PushReplyOp costs are aggregated into a single MOSDPGPushReply item.

As an immediate workaround, `osd_recovery_max_single_start` can be set back to 1, which should help the push reply ops
get scheduled faster.
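
For reference, assuming the cluster uses the centralized config database (and that the option was changed there rather than via a local ceph.conf override), the value can be inspected and reverted with the standard config commands:

ceph config get osd osd_recovery_max_single_start
ceph config set osd osd_recovery_max_single_start 1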

Longer term we have a few alternatives:

  1. Prevent modification of `osd_recovery_max_single_start` similar to the way we prevent modification of the various sleep options with mclock enabled.
  2. Audit the costs of the various background ops and modify them to work well with mclock, while of course preserving backward compatibility with the 'wpq' scheduler.
  3. Treat reply ops as higher priority and put them into a higher-priority queue (currently being implemented; see the sketch below). The higher-priority queue is not managed by mclock, so these ops can be completed faster.

The above can be discussed and the best approach adopted.
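
For alternative 3, a minimal sketch of the idea (hypothetical names and types, not the actual Ceph scheduler code): reply ops bypass the mclock-managed queue and are drained from a strict-priority queue first, so completing an in-flight recovery is never throttled behind newly admitted work.

#include <deque>
#include <utility>

// Hypothetical queue item; only the flag relevant to the sketch is modeled.
struct Item {
  bool is_recovery_reply = false;
  // ... payload ...
};

// Hypothetical two-tier scheduler: a strict-priority queue for replies that
// is drained before the (stand-in) mclock-managed queue.
class TwoTierScheduler {
  std::deque<Item> high_priority;   // not cost-throttled, drained first
  std::deque<Item> mclock_managed;  // stand-in for the mclock queue

public:
  void enqueue(Item item) {
    if (item.is_recovery_reply) {
      // The IO work is already done on the pushing side; just acknowledge.
      high_priority.push_back(std::move(item));
    } else {
      mclock_managed.push_back(std::move(item));
    }
  }

  bool dequeue(Item &out) {
    auto &q = !high_priority.empty() ? high_priority : mclock_managed;
    if (q.empty()) {
      return false;
    }
    out = std::move(q.front());
    q.pop_front();
    return true;
  }
};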


Related issues 4 (2 open, 2 closed)

Related to Infrastructure - Bug #58498: ceph: pgs stuck backfilling (New)

Has duplicate RADOS - Bug #62872: ceph osd_max_backfills default value is 1000 (New)

Copied to RADOS - Backport #62131: quincy: osd: very slow recovery due to delayed push reply messages (Resolved, Sridhar Seshasayee)
Copied to RADOS - Backport #62132: reef: osd: very slow recovery due to delayed push reply messages (Resolved, Sridhar Seshasayee)
#1

Updated by Samuel Just over 1 year ago

  • Related to Bug #58498: ceph: pgs stuck backfilling added
#2

Updated by Samuel Just over 1 year ago

I've opened this bug to track the slow backfill behavior from https://tracker.ceph.com/issues/58498, which appears to be unrelated to the original hung backfill behavior.

osd_recovery_max_single_start = 5 is a reasonable value. The mclock/cost machinery should handle it correctly.

The specific message I see being delayed is a push reply. It doesn't actually make sense to throttle replies at all: the IO work has already been done, and delaying the reply simply increases the amount of time the object being recovered is unavailable to client IO.

#3

Updated by Neha Ojha over 1 year ago

  • Project changed from Ceph to RADOS
#4

Updated by Sridhar Seshasayee over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49975
#5

Updated by Sridhar Seshasayee 10 months ago

  • Status changed from Fix Under Review to Pending Backport
#6

Updated by Sridhar Seshasayee 10 months ago

  • Backport set to quincy, reef
#7

Updated by Backport Bot 10 months ago

  • Copied to Backport #62131: quincy: osd: very slow recovery due to delayed push reply messages added
#8

Updated by Backport Bot 10 months ago

  • Copied to Backport #62132: reef: osd: very slow recovery due to delayed push reply messages added
#9

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
#10

Updated by Sridhar Seshasayee 10 months ago

  • Status changed from Pending Backport to Resolved
#11

Updated by Radoslaw Zarzynski 8 months ago

  • Has duplicate Bug #62872: ceph osd_max_backfills default value is 1000 added