Bug #61338
closed
rgw: qactive perf counter may leak on errors
Added by Matt Benjamin 12 months ago.
Updated 3 months ago.
Description
One of my customers ran some benchmark tests on their Ceph cluster. As part of those tests they tried to issue a large number of requests using the following s3bench script: https://github.com/shpaz/s3bench/blob/master/s3bench.py
They used a simple automation that kills all running processes executing this script (using the `kill` command). After killing all the processes they noticed that the `ceph_rgw_qactive` metric provided by the mgr stayed as high as it was, although all operations against the radosgw had stopped.
ASK:
They would like an explanation of this metric and of why it stays high. Their Grafana dashboard plots this metric as the `rgw connections` graph.
Related issues
1 (1 open — 0 closed)
perf counters don't really have user-facing documentation outside of the description that the admin socket 'perf schema' command provides. the descriptions for qlen and qactive come from https://github.com/ceph/ceph/blob/2727096/src/rgw/rgw_perf_counters.cc#L28-L29
they're both described in terms of a "request queue", which i assume comes from the thread-pool/work-queue model of the old fcgi frontend. the beast and civetweb frontends started supporting these counters in https://github.com/ceph/ceph/pull/20842. for both frontends, those counters are decremented on ClientIO::complete_request(). if there are error paths that don't call that method, then the counters would leak
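To illustrate the suspected mechanism, here is a toy Python model, not the actual RGW code. `Counters`, `ClientIO`, `process_leaky`, `process_guarded` and `active_request` are hypothetical names invented for this sketch; only the idea that the decrement lives in `complete_request()` comes from the comment above. It shows how an early error return that skips `complete_request()` leaves `qactive` permanently elevated, and how tying the decrement to scope exit would avoid that.

```python
import contextlib


class Counters:
    """Toy stand-in for the rgw perf counters (hypothetical, not the Ceph API)."""

    def __init__(self):
        self.qactive = 0


class ClientIO:
    """Toy client object whose complete_request() performs the decrement."""

    def __init__(self, counters):
        self.counters = counters

    def complete_request(self):
        self.counters.qactive -= 1


def process_leaky(counters, fail):
    counters.qactive += 1
    client = ClientIO(counters)
    if fail:
        # Early error return: complete_request() never runs, so the
        # increment above is never undone.
        return -1
    client.complete_request()
    return 0


@contextlib.contextmanager
def active_request(counters):
    """Increment on entry and always decrement on exit, even on errors."""
    counters.qactive += 1
    try:
        yield
    finally:
        counters.qactive -= 1


def process_guarded(counters, fail):
    with active_request(counters):
        if fail:
            return -1
        return 0


leaky = Counters()
guarded = Counters()
for i in range(5):
    failed = (i % 2 == 0)  # requests 0, 2 and 4 hit the error path
    process_leaky(leaky, failed)
    process_guarded(guarded, failed)

print(leaky.qactive, guarded.qactive)
```

In this model every failed request adds one to the leaked count, which would match a gauge that only ever drifts upward on a cluster seeing many client disconnects. Whether the real frontends behave this way depends on which error paths actually bypass `complete_request()`.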
- Subject changed from rgw: explanation of the ceph_rgw_qactive parameter to rgw: qactive perf counter may leak on errors
The cluster that is having those issues logs a lot of errors in the complete_request() function with the descriptions "bad file descriptor" and "connection closed by peer". After a while the cluster shows huge latency and the number of those errors increases. Could the leak be caused by these problems in the cluster?
We have found a cluster that looks like it has this problem.
What can I provide here to facilitate the debugging?
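One thing that might help is a series of `qlen`/`qactive` readings taken while the gateway is known to be idle. The sketch below assumes the JSON printed by the radosgw admin socket `perf dump` command has a top-level `rgw` section containing `qlen` and `qactive`; that layout may differ between releases, so treat it as an assumption. The function names and the sample values are made up for illustration.

```python
import json


def read_rgw_queue_counters(perf_dump_text):
    """Extract qlen and qactive from perf dump JSON text.

    Assumes a top-level "rgw" section; missing keys come back as None.
    """
    dump = json.loads(perf_dump_text)
    rgw = dump.get("rgw", {})
    return {name: rgw.get(name) for name in ("qlen", "qactive")}


def looks_leaked(idle_samples, idle_threshold=0):
    """True if every sample taken with no client traffic stays above the threshold."""
    return len(idle_samples) > 0 and min(idle_samples) > idle_threshold


# Made-up sample standing in for real perf dump output.
sample = '{"rgw": {"req": 100, "qlen": 0, "qactive": 37}}'
counters = read_rgw_queue_counters(sample)
print(counters)
print(looks_leaked([37, 37, 37]))
```

A value that stays constant and non-zero across several idle samples would be consistent with a leaked counter rather than real in-flight requests.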
- Is duplicate of Bug #48358: rgw: qlen and qactive perf counters leak added
- Status changed from New to Duplicate