Bug #61338


rgw: qactive perf counter may leak on errors

Added by Matt Benjamin 11 months ago. Updated 3 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of my customers ran some benchmark tests on their Ceph cluster. As part of their tests they tried to issue a large number of requests using the following s3bench script: https://github.com/shpaz/s3bench/blob/master/s3bench.py

They used a simple automation that kills all running processes (using the `kill` command) that run this script. After killing all the processes, they noticed that the `ceph_rgw_qactive` metric reported by the mgr stayed as high as before, even though all operations against the radosgw had stopped.

ASK:

They would like an explanation of this metric and why it stays high. Their Grafana dashboard uses this metric for the `rgw connections` graph.


Related issues 1 (1 open, 0 closed)

Is duplicate of rgw - Bug #48358: rgw: qlen and qactive perf counters leak (New, Mark Kogan)

Actions #1

Updated by Casey Bodley 11 months ago

perf counters don't really have user-facing documentation outside of the description that the admin socket 'perf schema' command provides. the descriptions for qlen and qactive come from https://github.com/ceph/ceph/blob/2727096/src/rgw/rgw_perf_counters.cc#L28-L29

they're both described in terms of a "request queue", which i assume comes from the thread-pool/work-queue model of the old fcgi frontend. the beast and civetweb frontends started supporting these counters in https://github.com/ceph/ceph/pull/20842. for both frontends, those counters are decremented on ClientIO::complete_request(). if there are error paths that don't call that method, then the counters would leak

Actions #2

Updated by Casey Bodley 11 months ago

  • Subject changed from rgw: explanation of the ceph_rgw_qactive parameter to rgw: qactive perf counter may leak on errors
Actions #3

Updated by Milind Verma 11 months ago

The cluster that is having those issues logs a lot of errors in the complete_request() function with the descriptions "bad file descriptor" and "connection closed by peer". After a while the cluster shows very high latency and the number of those errors increases. Could the leak be caused by these problems in the cluster?

Actions #4

Updated by Wout van Heeswijk 4 months ago

We have found a cluster that looks like it has this problem.

What can I provide here to facilitate the debugging?

Actions #5

Updated by Casey Bodley 3 months ago

  • Is duplicate of Bug #48358: rgw: qlen and qactive perf counters leak added
Actions #6

Updated by Casey Bodley 3 months ago

  • Status changed from New to Duplicate