Bug #15915

rgw command is consuming all the cpu time

Added by Russell Islam over 1 year ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
Start date:
05/17/2016
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Release:
jewel
Needs Doc:
No

Description

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
949 ceph 20 0 2429384 45380 11292 S 99.9 4.5 55:58.86 radosgw
1 root 20 0 41368 3860 2352 S 0.0 0.4 0:00.58 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:01.48 ksoftirqd/0

out.png (149 KB) Jiaying Ren, 05/24/2016 09:21 AM

rgw.log.log - RGW logs (--debug-rgw=20) (520 KB) Benoit Petit, 06/13/2016 10:06 AM


Related issues

Related to rgw - Bug #16695: radosgw Consumes too much CPU time to synchronize metadata or data between multisite Resolved 07/15/2016
Related to rgw - Bug #17052: unittest_http_manager times out Resolved 08/17/2016

History

#1 Updated by Russell Islam over 1 year ago

Above output is from top command.

#2 Updated by Russell Islam over 1 year ago

More info:
After configuring a multisite object gateway, radosgw uses almost 100% CPU while syncing is going on.

#3 Updated by Yehuda Sadeh over 1 year ago

what version are you using?

#4 Updated by Russell Islam over 1 year ago

Latest version: Jewel 10.2.1

#5 Updated by Russell Islam over 1 year ago

More info: it also takes a long time to stop the service.

systemctl stop ceph-radosgw@

#6 Updated by Nathan Cutler over 1 year ago

In another ticket (#15907) there is a situation where the old sysvinit script is getting run, I think because the user ran systemctl start ceph (which has the unintended effect of running /etc/init.d/ceph via systemd-sysvinit). Maybe something similar is happening here.

You could check ps aux | grep ceph for lines like the one described in http://tracker.ceph.com/issues/15907#note-2

Is the behavior different if you use systemctl start ceph-radosgw.target to start RGW?

#7 Updated by Nathan Cutler over 1 year ago

And use systemctl stop ceph-radosgw.target to stop, of course.

#8 Updated by Russell Islam over 1 year ago

I started the daemon with systemctl start ceph-radosgw.target. Still almost 100% of the cpu is occupied by radosgw.

ps aux | grep ceph

root 4950 0.0 0.7 158912 7336 ? Ss May18 0:54 python /usr/sbin/ceph-create-keys --cluster ceph --id ceph-us-west-1
ceph 5265 0.0 4.0 356888 41184 ? Ssl May18 0:06 /usr/bin/ceph-mon -f --cluster ceph --id ceph-us-west --setuser ceph --setgroup ceph
ceph 6390 0.0 7.3 893100 74616 ? Ssl May18 0:30 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph 11114 95.4 3.4 2437208 34872 ? Ssl 09:59 1:25 /usr/bin/radosgw -f --cluster ceph --name client.rgw.ceph-us-west --setuser ceph --setgroup ceph
root 11849 0.0 0.0 112632 948 pts/0 R+ 10:01 0:00 grep --color=auto ceph

Question: What is the difference between "systemctl start ceph-radosgw.target" and "systemctl start ceph-radosgw@rgw."?
Do we need both of them?
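To sketch the relationship being asked about: a systemd target is a grouping unit, so `ceph-radosgw.target` exists to start or stop all enabled `ceph-radosgw@<name>` instance units at once, while the `@` unit addresses a single gateway instance. The instance name below is illustrative:

```shell
# Start/stop a single gateway instance (instance name is illustrative):
systemctl start ceph-radosgw@rgw.ceph-us-west.service

# Start/stop every enabled gateway instance on this host together:
systemctl start ceph-radosgw.target

# Show which instance units the target pulls in:
systemctl list-dependencies ceph-radosgw.target
```

So both exist for different scopes; on a host with a single instance, starting either should end up starting the same daemon.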

#9 Updated by Russell Islam over 1 year ago

Could anyone confirm if this is normal behavior?

#10 Updated by Jiaying Ren over 1 year ago

Hi Yehuda,

I've seen the same issue. My environment:

[mikulely@localhost src]$ uname -a
Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[mikulely@localhost src]$ ceph -v
ceph version 10.0.0-7743-g10f9a1d (10f9a1d1b38b8aeac029cb7332ee67fc8e80eb6e)

My setup steps:

[mikulely@localhost src]$ pwd
/home/mikulely/ceph/src
[mikulely@localhost src]$ python test/rgw/test_multi.py --num-zones 2

After this setup, ceph-radosgw is at over 160% CPU according to htop.

After encountering this, I enabled the oprofile option and recompiled; the profiling result is attached. Is there anything I can do to help the investigation?

#11 Updated by Russell Islam over 1 year ago

If this is not a bug, better close it.

#12 Updated by Casey Bodley over 1 year ago

Jiaying Ren wrote:

After encountering this, I enabled the oprofile option and recompiled; the profiling result is attached. Is there anything I can do to help the investigation?

Thanks for the profile data. If you're still able to reproduce this, could you turn on --debug-rgw=20 and see what shows up in the radosgw logs? If we're spinning somewhere, it will probably be spamming the logs with repeated output. That output should help us narrow down the cause.
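The requested logging level can be enabled in a few ways; a sketch, with `<name>` as a placeholder for the actual gateway instance name:

```shell
# Option 1: restart the daemon with the flag on the command line:
radosgw -f --cluster ceph --name client.rgw.<name> --debug-rgw=20

# Option 2: raise it on the running daemon via the admin socket,
# without a restart:
ceph daemon client.rgw.<name> config set debug_rgw 20

# Option 3: persist it in ceph.conf under the instance section:
#   [client.rgw.<name>]
#   debug rgw = 20
```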

#13 Updated by Benoit Petit over 1 year ago

Hi,

I face exactly the same problem. I have two rgw in multisite with the following characteristics:

CentOS Linux release 7.1.1503 (Core)
Linux cephrgw-lab-01-ber 3.10.0-229.el7.x86_64
Running radosgw with the following command: /usr/bin/radosgw -d --cluster ceph --debug_ms 5 --name client.rgw.cephrgw-lab-01-ber --setuser ceph --setgroup ceph --debug-rgw=20 > rgw.log 2>&1
ceph version: ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)

I've attached the logs (--debug-rgw=20).

Please tell me if I have to open another ticket. (And sorry if I should have.)

Thanks for your time.

#14 Updated by Benoit Petit over 1 year ago

It's better with the log file

#15 Updated by Orit Wasserman over 1 year ago

  • Assignee set to Casey Bodley

#16 Updated by Benoit Petit over 1 year ago

Just in case it could help, I've attached a performance record captured with perf record (perf version 3.10.0-327.18.2.el7.x86_64.debug on centos 7) on the radosgw pid. It can be read with perf report -i perf.data.

Thanks,
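For anyone reproducing this capture, a minimal perf workflow along the lines described above (sampling duration and pid lookup are illustrative):

```shell
# Capture a CPU profile of the running radosgw process for 30 seconds.
# -g records call graphs so the report can show where cycles are spent.
perf record -g -p "$(pidof radosgw)" -- sleep 30

# Inspect the recorded samples (reads ./perf.data by default):
perf report -i perf.data
```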

#17 Updated by Benoit Petit over 1 year ago

Hm, nope. I can't upload it, as it's hard to get a record smaller than 2 MB, and I get a "request entity too large" error as soon as my attachment exceeds 1 MB.

Here it is: https://framadrop.org/r/xdOZIgBRxA#0CvhDDDOw1nFXjc6lRw89jf5A099pPpNItFkGIg/JdE=

Thanks

#18 Updated by Russell Islam over 1 year ago

This still occurs in version 10.2.2. Can we get an update on this?

#19 Updated by Casey Bodley over 1 year ago

Russell Islam wrote:

This still occurs in version 10.2.2. Can we get an update on this?

We've still been unable to reproduce this in testing, though we have seen issues with older versions of libcurl; can you provide the version you're running? (curl --version)

#20 Updated by Russell Islam over 1 year ago

[root@ceph-client7 ceph-config]# curl --version
curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.19.1 Basic ECC zlib/1.2.7 libidn/1.28 libssh2/1.4.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz

#21 Updated by Casey Bodley over 1 year ago

Russell Islam wrote:

[root@ceph-client7 ceph-config]# curl --version
curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.19.1 Basic ECC zlib/1.2.7 libidn/1.28 libssh2/1.4.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz

Thank you. 7.29 is the version we had some downstream issues with in RHEL. We make heavy use of curl_multi_wait(), and 7.29 is missing some fixes that were leading to deadlocks in our case. Would you be willing to test with a more recent version of curl? If not, I can set up a centos vm and give it a try.
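As a quick check before upgrading, the installed curl version can be compared against 7.43, the version later reported in this thread as resolving the issue. The helper below is a generic sketch (the function name is made up for illustration; it relies on GNU `sort -V` for version ordering):

```shell
#!/bin/sh
# Print "older" if the first dotted version string sorts below the
# second, otherwise "ok". Uses sort -V (GNU coreutils).
curl_version_check() {
    installed="$1"; required="$2"
    if [ "$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n1)" = "$installed" ] \
       && [ "$installed" != "$required" ]; then
        echo "older"
    else
        echo "ok"
    fi
}

# Example: extract the local curl version and compare it against 7.43.
if command -v curl >/dev/null 2>&1; then
    ver="$(curl --version | awk 'NR==1 {print $2}')"
    curl_version_check "$ver" "7.43"
fi
```

With the 7.29.0 build quoted above, this would report "older", matching the versions discussed in this thread.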

#22 Updated by Russell Islam over 1 year ago

Thanks for the update. I will test this issue with a later version of curl and keep you posted here.

#23 Updated by Russell Islam over 1 year ago

Tested with a later version of curl, 7.43 in my case. That got rid of the issue.

#24 Updated by Casey Bodley over 1 year ago

Russell Islam wrote:

Tested with a later version of curl, 7.43 in my case. That got rid of the issue.

Good to know, thank you very much for testing. That means we can reproduce by running against older versions of libcurl to get to the bottom of this.

#25 Updated by Russell Islam over 1 year ago

Good to know, thank you very much for testing. That means we can reproduce by running against older versions of libcurl to get to the bottom of this.

Yes. You are right.

#26 Updated by Casey Bodley about 1 year ago

  • Related to Bug #16695: radosgw Consumes too much CPU time to synchronize metadata or data between multisite added

#27 Updated by Casey Bodley about 1 year ago

  • Related to Bug #17052: unittest_http_manager times out added

#28 Updated by Casey Bodley 7 months ago

  • Status changed from New to Resolved
