Bug #61338
closed
rgw: qactive perf counter may leak on errors
Description
One of my customers ran some benchmark tests on their Ceph cluster. As part of these tests they issued a large number of requests using the following s3bench script: https://github.com/shpaz/s3bench/blob/master/s3bench.py
They used a simple automation that kills all running processes (using the `kill` command) that run this script. After killing all the processes, they noticed that the `ceph_rgw_qactive` metric provided by the mgr stayed as high as it was, although all operations against the radosgw had stopped.
ASK:
They would like an explanation of this metric and of why it stays high. Their Grafana dashboard uses this metric for the `rgw connections` graph.
Updated by Casey Bodley 11 months ago
perf counters don't really have user-facing documentation outside of the description that the admin socket 'perf schema' command provides. the descriptions for qlen and qactive come from https://github.com/ceph/ceph/blob/2727096/src/rgw/rgw_perf_counters.cc#L28-L29
they're both described in terms of a "request queue", which i assume comes from the thread-pool/work-queue model of the old fcgi frontend. the beast and civetweb frontends started supporting these counters in https://github.com/ceph/ceph/pull/20842. for both frontends, those counters are decremented on ClientIO::complete_request(). if there are error paths that don't call that method, then the counters would leak
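To illustrate the kind of leak described above, here is a minimal C++ sketch. It is not the actual RGW code: `qactive`, `handle_leaky`, `handle_guarded` and `ActiveGuard` are hypothetical names, and a plain `std::atomic<int>` stands in for the perf counter. It only shows how a decrement that sits on the success path is skipped when an error interrupts the request, and how tying the decrement to a scope guard would avoid that.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical stand-in for the qactive perf counter (not the real RGW type).
std::atomic<int> qactive{0};

// Leaky pattern: the decrement is only reached when the request completes
// normally, loosely mirroring a counter that is decremented in a single
// "complete" step.
void handle_leaky(bool fail) {
  ++qactive;
  if (fail) {
    // Assumed error path, e.g. the client went away mid-request.
    throw std::runtime_error("connection closed by peer");
  }
  --qactive;  // skipped whenever the exception above is thrown
}

// Scope guard: increments on construction, decrements on destruction,
// so the counter is restored on every exit path, including exceptions.
class ActiveGuard {
  std::atomic<int>& counter;
public:
  explicit ActiveGuard(std::atomic<int>& c) : counter(c) { ++counter; }
  ~ActiveGuard() { --counter; }
  ActiveGuard(const ActiveGuard&) = delete;
  ActiveGuard& operator=(const ActiveGuard&) = delete;
};

void handle_guarded(bool fail) {
  ActiveGuard guard(qactive);
  if (fail) {
    throw std::runtime_error("connection closed by peer");
  }
  // Normal completion: the guard's destructor runs here as well.
}
```

In this sketch, each failed call to `handle_leaky` leaves `qactive` one higher than before, while `handle_guarded` leaves it unchanged whether or not the request fails. Whether the real frontends can use a guard like this depends on how request lifetime is managed there, which this example does not model.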
Updated by Casey Bodley 11 months ago
- Subject changed from rgw: explanation of the ceph_rgw_qactive parameter to rgw: qactive perf counter may leak on errors
Updated by Milind Verma 11 months ago
The cluster that has these issues logs a lot of errors from the complete_request() function, with the descriptions "bad file descriptor" and "connection closed by peer". After a while the cluster shows very high latency and the number of these errors increases. Could the leak be caused by these problems in the cluster?
Updated by Wout van Heeswijk 4 months ago
We have found a cluster that looks like it has this problem.
What can I provide here to facilitate the debugging?
Updated by Casey Bodley 3 months ago
- Is duplicate of Bug #48358: rgw: qlen and qactive perf counters leak added