Bug #41511

closed

civetweb threads using 100% of CPU

Added by Vladimir Brik over 4 years ago. Updated over 4 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, the radosgw process on one of those machines starts consuming 100% of multiple CPU cores (I've seen between 1 and 5), seemingly indefinitely, even though the affected machine is not being used for data transfers (nothing in the radosgw logs, a couple of KB/s of network traffic).

This can affect any of our rados gateways, but I haven't seen more than two affected at the same time (probably because the radosgw processes have been restarted frequently of late).

I don't see anything obvious in the logs. perf top reports that the CPU time is being spent in the radosgw shared object, in the symbol get_obj_data::flush, which, if I interpret things correctly, is called from a symbol with a long name that contains the substring "boost9intrusive9list_impl".
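
To confirm which radosgw threads are actually spinning before attaching a profiler, per-thread CPU time can be sampled from /proc. The following is only a minimal sketch, assuming a Linux host and permission to read /proc/<pid>/task; the helper names (parse_stat_line, thread_ticks, busiest_threads) are hypothetical and not part of Ceph or perf.

```python
import os
import time


def parse_stat_line(line):
    """Return (tid, comm, utime + stime in clock ticks) for one
    /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    tid = int(line[:left].strip())
    comm = line[left + 1:right]
    rest = line[right + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return tid, comm, utime + stime


def thread_ticks(pid):
    """Map tid -> (comm, cumulative CPU ticks) for every thread of pid."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                t, comm, total = parse_stat_line(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listdir() and open()
        ticks[t] = (comm, total)
    return ticks


def busiest_threads(pid, interval=5.0, top=10):
    """Sample twice, interval seconds apart, and return the top threads
    as (percent of one core, tid, comm), busiest first."""
    before = thread_ticks(pid)
    time.sleep(interval)
    after = thread_ticks(pid)
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    usage = []
    for tid, (comm, total) in after.items():
        if tid in before:
            delta = total - before[tid][1]
            usage.append((100.0 * delta / hz / interval, tid, comm))
    return sorted(usage, reverse=True)[:top]
```

Calling busiest_threads(pid) with the radosgw PID lists the thread IDs that accumulated the most CPU time during the interval; those IDs can then be matched against the threads shown by perf or gdb.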

This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s ssl_certificate=/etc/ceph/rgw.crt error_log_file=/var/log/ceph/civetweb.error.log

(The error log file named above does not exist.)

Here are the fruits of my attempts to capture the call graph using perf and gdbpmp:
https://icecube.wisc.edu/~vbrik/perf.data
https://icecube.wisc.edu/~vbrik/gdbpmp.data


Related issues 1 (0 open, 1 closed)

Is duplicate of rgw - Backport #39660: nautilus: rgw: Segfault during request processing (Resolved)