Bug #61338: rgw: qactive perf counter may leak on errors - rgw - Ceph

Bug #61338

closed

rgw: qactive perf counter may leak on errors

Added by Matt Benjamin 11 months ago. Updated 3 months ago.

Status: Duplicate

Priority: Normal

Severity: 3 - minor

Description

One of my customers ran some benchmark tests on their Ceph cluster. As part of these tests, they issued a large number of requests using the following s3bench script: https://github.com/shpaz/s3bench/blob/master/s3bench.py

They used simple automation that kills (with the `kill` command) every running process that executes this script. After killing all the processes, they noticed that the `ceph_rgw_qactive` metric provided by the mgr stayed as high as it had been, even though all operations against radosgw had stopped.

ASK:

They would like an explanation of this metric and of why it stays high. The Grafana dashboard uses this metric for its `rgw connections` graph.

Related issues 1 (1 open — 0 closed)

Updated by Casey Bodley 11 months ago

perf counters don't really have user-facing documentation outside of the description that the admin socket 'perf schema' command provides. the descriptions for qlen and qactive come from https://github.com/ceph/ceph/blob/2727096/src/rgw/rgw_perf_counters.cc#L28-L29

they're both described in terms of a "request queue", which i assume comes from the thread-pool/work-queue model of the old fcgi frontend. the beast and civetweb frontends started supporting these counters in https://github.com/ceph/ceph/pull/20842. for both frontends, those counters are decremented on ClientIO::complete_request(). if there are error paths that don't call that method, then the counters would leak
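The leak pattern described above can be sketched in a few lines. This is a toy model in Python, not rgw's C++ code: `Gauge`, `handle_leaky`, `active_request`, `handle_guarded`, and `run` are hypothetical names made up for illustration, and it is only an assumption that the real error paths have this shape. The point is just that a gauge decremented on the success path alone drifts upward whenever a request fails, while a scoped guard keeps it balanced.

```python
import threading
from contextlib import contextmanager


class Gauge:
    """Toy stand-in for a perf counter such as qactive (hypothetical, not Ceph's API)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self):
        with self._lock:
            self._value += 1

    def dec(self):
        with self._lock:
            self._value -= 1

    @property
    def value(self):
        with self._lock:
            return self._value


def handle_leaky(qactive, fail):
    """Decrement only on the success path: an error skips dec() and leaks."""
    qactive.inc()
    if fail:
        raise ConnectionError("connection closed by peer")
    qactive.dec()  # plays the role of complete_request() in this sketch


@contextmanager
def active_request(qactive):
    """Pair every inc() with a dec(), whichever way the request ends."""
    qactive.inc()
    try:
        yield
    finally:
        qactive.dec()


def handle_guarded(qactive, fail):
    """Same request, but the decrement is tied to leaving the scope."""
    with active_request(qactive):
        if fail:
            raise ConnectionError("connection closed by peer")


def run(handler, outcomes):
    """Run one request per outcome flag and return the gauge value afterwards."""
    qactive = Gauge()
    for fail in outcomes:
        try:
            handler(qactive, fail)
        except ConnectionError:
            pass  # the client went away; the server keeps serving
    return qactive.value
```

With two failed requests out of four, `run(handle_leaky, [False, True, True, False])` leaves the gauge at 2 even though nothing is in flight, while `run(handle_guarded, [False, True, True, False])` returns it to 0.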

Updated by Casey Bodley 11 months ago

Subject changed from rgw: explanation of the ceph_rgw_qactive parameter to rgw: qactive perf counter may leak on errors

Updated by Milind Verma 11 months ago

The cluster that has these issues logs a lot of errors in the complete_request() function with the descriptions "bad file descriptor" and "connection closed by peer". After a while, the cluster shows huge latency and the number of those errors increases. Could the leak be caused by these problems in the cluster?
