Project

General

Profile

Actions

Bug #62529

open

PrimaryLogPG::log_op_stats uses `now` time vs op end time when calculating op latency

Added by Michael Kidd 8 months ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
low-hanging-fruit
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  • With `debug_osd` 15 or higher, a log line called `log_op_stats` is output, which ends with a reported latency value `lat`.
  • This line can be output several 10's of milliseconds after the op actually completes.
  • However, the `PrimaryLogPG::log_op_stats` uses `ceph_clock_now` as the time for calculating the op duration (lat) instead of the time when the op completed.
  • This can result in an arbitrarily higher latency value being reported.

https://github.com/ceph/ceph/blob/main/src/osd/PrimaryLogPG.cc#L4384-L4387

Actions #1

Updated by Radoslaw Zarzynski 8 months ago

IIUC the linked method is being called asynchronously, after the op got completed.
However, the debug there claims the op has the latency calculated in the following way:

  const utime_t now = ceph_clock_now();

  const utime_t latency = now - m->get_recv_stamp();

now should be actually the op's completion time.

Actions #2

Updated by Radoslaw Zarzynski 8 months ago

  • Tags set to low-hanging-fruit
Actions #3

Updated by Laura Flores 8 months ago

  • Translation missing: en.field_tag_list set to low-hanging-fruit
Actions #4

Updated by Laura Flores 8 months ago

I'll bring this to Grace Hopper Open Source Day!

Actions

Also available in: Atom PDF