Bug #48358: rgw: qlen and qactive perf counters leak - rgw - Ceph

Bug #48358

open

rgw: qlen and qactive perf counters leak

Added by Dan van der Ster over 3 years ago. Updated 7 days ago.

Status: New
Priority: High
Assignee: Mark Kogan
Severity: 3 - minor
Affected Versions: Ceph - v14.2.11

Description

In our environment the rgw qlen and qactive perf counters seem to trend slowly upwards. See the plot attached.
I suspect there is a case where the client IO is completed without the qlen/qactive counters getting decremented.

For context, we are trying to see if rgw_max_concurrent_requests can be tuned down to limit the peak rgw memory usage. So we want to monitor how many existing concurrent IOs we have in prod, but clearly this qlen counter isn't reliable for that. We'll send a separate PR to expose the throttle `outstanding_requests` values in a new perf counter to solve this separately, but maybe the qlen leak is obvious to someone?

Files

Screenshot-20201125134722-782x407.png (56.4 KB), Dan van der Ster, 11/25/2020 12:47 PM
Ceph - RGW metrics - Grafana 2021-05-11 11-27-50.png (226 KB), Aleksandr Rudenko, 05/11/2021 08:43 AM

Related issues 1 (0 open — 1 closed)

Updated by Dan van der Ster over 3 years ago

We'll send a separate PR to expose the throttle `outstanding_requests` values in a new perf counter

https://github.com/ceph/ceph/pull/38283

Updated by Mark Kogan over 3 years ago

Assignee set to Mark Kogan

Updated by Aleksandr Rudenko almost 3 years ago

File Ceph - RGW metrics - Grafana 2021-05-11 11-27-50.png added

We can see the same behavior on 14.2.15.

Updated by Casey Bodley 3 months ago

Has duplicate Bug #61338: rgw: qactive perf counter may leak on errors added

Updated by Casey Bodley 3 months ago

Priority changed from Normal to High

i'm hearing reports that when these counters leak, rgw performance also degrades significantly until the process restarts. this is probably due to leaks of the counter associated with rgw_max_concurrent_requests

to decrement the perf counters, we rely on a call to ClientIO::complete_request(): https://github.com/ceph/ceph/blob/f4758e5/src/rgw/rgw_asio_client.cc#L97-L100

for rgw_max_concurrent_requests, we rely on a similar hook in SimpleThrottler::request_complete(): https://github.com/ceph/ceph/blob/f4758e5/src/rgw/rgw_dmclock_async_scheduler.h#L188-L193

certain types of errors fail to call either function

raising priority since this affects more than just the output of metrics
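
to illustrate the failure mode described above, here is a minimal standalone sketch, not Ceph code. every name in it (`Counters`, `ActiveRequestGuard`, `handle_request_leaky`, `handle_request_guarded`, `leaked_after_failures`) is hypothetical. it assumes the leak comes from a decrement that is only reached on the success path, and shows how tying the decrement to a scope guard's destructor would cover error paths as well:

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the qlen/qactive perf counters.
struct Counters {
  std::atomic<int> qactive{0};
};

// Scope guard: increments on construction and decrements on destruction,
// so the decrement also runs on early return and during stack unwinding.
class ActiveRequestGuard {
 public:
  explicit ActiveRequestGuard(std::atomic<int>& counter) : counter_(counter) {
    ++counter_;
  }
  ~ActiveRequestGuard() { --counter_; }
  ActiveRequestGuard(const ActiveRequestGuard&) = delete;
  ActiveRequestGuard& operator=(const ActiveRequestGuard&) = delete;

 private:
  std::atomic<int>& counter_;
};

// Leaky pattern: the decrement (standing in for a completion hook)
// is skipped whenever the request fails before reaching it.
void handle_request_leaky(Counters& c, bool fail) {
  ++c.qactive;
  if (fail) {
    throw std::runtime_error("simulated I/O error");
  }
  --c.qactive;
}

// Guarded pattern: the destructor decrements on every exit path.
void handle_request_guarded(Counters& c, bool fail) {
  ActiveRequestGuard guard(c.qactive);
  if (fail) {
    throw std::runtime_error("simulated I/O error");
  }
}

// Runs n failing requests through the handler and returns what is
// left in the counter afterwards (0 means nothing leaked).
template <typename Handler>
int leaked_after_failures(Handler handler, int n) {
  Counters c;
  for (int i = 0; i < n; ++i) {
    try {
      handler(c, true);
    } catch (const std::runtime_error&) {
      // The error is swallowed, as a connection handler might do.
    }
  }
  return c.qactive.load();
}

// Small self-check of the two patterns.
inline void check_demo() {
  assert(leaked_after_failures(handle_request_leaky, 1) == 1);
  assert(leaked_after_failures(handle_request_guarded, 1) == 0);
}
```

in this toy model, each failed request through the leaky handler leaves the counter one higher, while the guarded handler always returns it to zero. whether the real fix should be a guard like this or an explicit call on the error paths depends on how ClientIO and the throttler are structured, which this sketch does not model.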

Updated by Casey Bodley 3 months ago

from Andrea Bolzonella:

From my analysis, whenever an error is raised in rgw_rest.cc (line 630 in 18.2.1), the connection is closed but qlen is not decremented.

  try {
    RESTFUL_IO(s)->complete_header();
  } catch (rgw::io::Exception& e) {
    ldpp_dout(s, 0) << "ERROR: RESTFUL_IO(s)->complete_header() returned err=" 
             << e.what() << dendl;
  }

Updated by Andrea Bolzonella 7 days ago

Is there any progress on this ticket?
We still have a performance issue when active connections get high.
