Bug #15915

rgw command is consuming all the cpu time

Added by Russell Islam over 1 year ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
Start date:
05/17/2016
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Release:
jewel
Needs Doc:
No

Description

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
949 ceph 20 0 2429384 45380 11292 S 99.9 4.5 55:58.86 radosgw
1 root 20 0 41368 3860 2352 S 0.0 0.4 0:00.58 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:01.48 ksoftirqd/0

out.png (149 KB) Jiaying Ren, 05/24/2016 09:21 AM

rgw.log.log - RGW logs (--debug-rgw=20) (520 KB) Benoit Petit, 06/13/2016 10:06 AM


Related issues

Related to rgw - Bug #16695: radosgw Consumes too much CPU time to synchronize metadata or data between multisite Resolved 07/15/2016
Related to rgw - Bug #17052: unittest_http_manager times out Resolved 08/17/2016

History

#1 Updated by Russell Islam over 1 year ago

Above output is from top command.

#2 Updated by Russell Islam over 1 year ago

More info:
After configuring a multisite object gateway, radosgw uses almost 100% CPU while syncing is going on.

#3 Updated by Yehuda Sadeh over 1 year ago

what version are you using?

#4 Updated by Russell Islam over 1 year ago

Latest version: Jewel 10.2.1

#5 Updated by Russell Islam over 1 year ago

More info: it also takes a long time to stop the service.

systemctl stop ceph-radosgw@

#6 Updated by Nathan Cutler over 1 year ago

In another ticket (#15907) there is a situation where the old sysvinit script is getting run, I think because the user ran systemctl start ceph (which has the unintended effect of running /etc/init.d/ceph via systemd-sysvinit). Maybe something similar is happening here.

You could check ps aux | grep ceph for lines like the one described in http://tracker.ceph.com/issues/15907#note-2

Is the behavior different if you use systemctl start ceph-radosgw.target to start RGW?

#7 Updated by Nathan Cutler over 1 year ago

And use systemctl stop ceph-radosgw.target to stop, of course.

#8 Updated by Russell Islam over 1 year ago

I started the daemon with systemctl start ceph-radosgw.target. Still almost 100% of the cpu is occupied by radosgw.

ps aux | grep ceph

root 4950 0.0 0.7 158912 7336 ? Ss May18 0:54 python /usr/sbin/ceph-create-keys --cluster ceph --id ceph-us-west-1
ceph 5265 0.0 4.0 356888 41184 ? Ssl May18 0:06 /usr/bin/ceph-mon -f --cluster ceph --id ceph-us-west --setuser ceph --setgroup ceph
ceph 6390 0.0 7.3 893100 74616 ? Ssl May18 0:30 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph 11114 95.4 3.4 2437208 34872 ? Ssl 09:59 1:25 /usr/bin/radosgw -f --cluster ceph --name client.rgw.ceph-us-west --setuser ceph --setgroup ceph
root 11849 0.0 0.0 112632 948 pts/0 R+ 10:01 0:00 grep --color=auto ceph

Question: What is the difference between "systemctl start ceph-radosgw.target" and "systemctl start ceph-radosgw@rgw."?
Do we need both of them?
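To sketch the relationship being asked about: a systemd target is a grouping unit, so `ceph-radosgw.target` exists to start or stop all enabled `ceph-radosgw@<name>` instance units at once, while the `@` unit addresses a single gateway instance. The instance name below is illustrative:

```shell
# Start/stop a single gateway instance (instance name is illustrative):
systemctl start ceph-radosgw@rgw.ceph-us-west.service

# Start/stop every enabled gateway instance on this host together:
systemctl start ceph-radosgw.target

# Show which instance units the target pulls in:
systemctl list-dependencies ceph-radosgw.target
```

So both exist for different scopes; on a host with a single instance, starting either should end up starting the same daemon.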

#9 Updated by Russell Islam over 1 year ago

Could anyone confirm if this is normal behavior?

#10 Updated by Jiaying Ren over 1 year ago

Hi Yehuda,

I've seen the same issue. My environment:

[mikulely@localhost src]$ uname -a
Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[mikulely@localhost src]$ ceph -v
ceph version 10.0.0-7743-g10f9a1d (10f9a1d1b38b8aeac029cb7332ee67fc8e80eb6e)

My setup steps:

[mikulely@localhost src]$ pwd
/home/mikulely/ceph/src
[mikulely@localhost src]$ python test/rgw/test_multi.py --num-zones 2

After this setup, ceph-radosgw is at over 160% CPU according to htop.

After encountering this, I enabled the oprofile option and recompiled; the profiling result is attached. Is there anything I can do to help the investigation?

#11 Updated by Russell Islam over 1 year ago

If this is not a bug, better close it.

#12 Updated by Casey Bodley over 1 year ago

Jiaying Ren wrote:

After encountering this, I enabled the oprofile option and recompiled; the profiling result is attached. Is there anything I can do to help the investigation?

Thanks for the profile data. If you're still able to reproduce this, could you turn on --debug-rgw=20 and see what shows up in the radosgw logs? If we're spinning somewhere, it will probably be spamming the logs with repeated output. That output should help us narrow down the cause.
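The requested logging level can be enabled in a few ways; a sketch, with `<name>` as a placeholder for the actual gateway instance name:

```shell
# Option 1: restart the daemon with the flag on the command line:
radosgw -f --cluster ceph --name client.rgw.<name> --debug-rgw=20

# Option 2: raise it on the running daemon via the admin socket,
# without a restart:
ceph daemon client.rgw.<name> config set debug_rgw 20

# Option 3: persist it in ceph.conf under the instance section:
#   [client.rgw.<name>]
#   debug rgw = 20
```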

#13 Updated by Benoit Petit over 1 year ago

Hi,

I face exactly the same problem. I have two rgw in multisite with the following characteristics:

CentOS Linux release 7.1.1503 (Core)
Linux cephrgw-lab-01-ber 3.10.0-229.el7.x86_64
Running radosgw with the following command: /usr/bin/radosgw -d --cluster ceph --debug_ms 5 --name client.rgw.cephrgw-lab-01-ber --setuser ceph --setgroup ceph --debug-rgw=20 > rgw.log 2>&1
ceph version: ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)

I've attached the logs (--debug-rgw=20).

Please tell me if I have to open another ticket. (And sorry if I should have.)

Thanks for your time.

#14 Updated by Benoit Petit over 1 year ago

It's better with the log file

#15 Updated by Orit Wasserman over 1 year ago

  • Assignee set to Casey Bodley

#16 Updated by Benoit Petit over 1 year ago

Just in case it could help, I've attached a performance record captured with perf record (perf version 3.10.0-327.18.2.el7.x86_64.debug on centos 7) on the radosgw pid. It can be read with perf report -i perf.data.

Thanks,
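For anyone reproducing this capture, a minimal perf workflow along the lines described above (sampling duration and pid lookup are illustrative):

```shell
# Capture a CPU profile of the running radosgw process for 30 seconds.
# -g records call graphs so the report can show where cycles are spent.
perf record -g -p "$(pidof radosgw)" -- sleep 30

# Inspect the recorded samples (reads ./perf.data by default):
perf report -i perf.data
```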

#17 Updated by Benoit Petit over 1 year ago

Hm, nope. I can't upload it, as it's hard to get a record smaller than 2 MB, and I get a "request entity too large" error as soon as my attachment exceeds 1 MB.

Here it is: https://framadrop.org/r/xdOZIgBRxA#0CvhDDDOw1nFXjc6lRw89jf5A099pPpNItFkGIg/JdE=

Thanks

#18 Updated by Russell Islam over 1 year ago

This still occurs in version 10.2.2. Can we get an update on this?

#19 Updated by Casey Bodley over 1 year ago

Russell Islam wrote:

This still occurs in version 10.2.2. Can we get an update on this?

We've still been unable to reproduce this in testing, though we have seen issues with older versions of libcurl; can you provide the version you're running? (curl --version)

#20 Updated by Russell Islam over 1 year ago

[root@ceph-client7 ceph-config]# curl --version
curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.19.1 Basic ECC zlib/1.2.7 libidn/1.28 libssh2/1.4.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz

#21 Updated by Casey Bodley over 1 year ago

Russell Islam wrote:

[root@ceph-client7 ceph-config]# curl --version
curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.19.1 Basic ECC zlib/1.2.7 libidn/1.28 libssh2/1.4.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz

Thank you. 7.29 is the version we had some downstream issues with in RHEL. We make heavy use of curl_multi_wait(), and 7.29 is missing some fixes that were leading to deadlocks in our case. Would you be willing to test with a more recent version of curl? If not, I can set up a centos vm and give it a try.
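As a quick check before upgrading, the installed curl version can be compared against 7.43, the version later reported in this thread as resolving the issue. The helper below is a generic sketch (the function name is made up for illustration; it relies on GNU `sort -V` for version ordering):

```shell
#!/bin/sh
# Print "older" if the first dotted version string sorts below the
# second, otherwise "ok". Uses sort -V (GNU coreutils).
curl_version_check() {
    installed="$1"; required="$2"
    if [ "$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n1)" = "$installed" ] \
       && [ "$installed" != "$required" ]; then
        echo "older"
    else
        echo "ok"
    fi
}

# Example: extract the local curl version and compare it against 7.43.
if command -v curl >/dev/null 2>&1; then
    ver="$(curl --version | awk 'NR==1 {print $2}')"
    curl_version_check "$ver" "7.43"
fi
```

With the 7.29.0 build quoted above, this would report "older", matching the versions discussed in this thread.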

#22 Updated by Russell Islam over 1 year ago

Thanks for the update. I will test this issue with a later version of curl and keep you posted here.

#23 Updated by Russell Islam over 1 year ago

Tested with a later version of curl, 7.43 in my case. That got rid of the issue.

#24 Updated by Casey Bodley over 1 year ago

Russell Islam wrote:

Tested with a later version of curl, 7.43 in my case. That got rid of the issue.

Good to know, thank you very much for testing. That means we can reproduce by running against older versions of libcurl to get to the bottom of this.

#25 Updated by Russell Islam over 1 year ago

Good to know, thank you very much for testing. That means we can reproduce by running against older versions of libcurl to get to the bottom of this.

Yes. You are right.

#26 Updated by Casey Bodley about 1 year ago

  • Related to Bug #16695: radosgw Consumes too much CPU time to synchronize metadata or data between multisite added

#27 Updated by Casey Bodley about 1 year ago

  • Related to Bug #17052: unittest_http_manager times out added

#28 Updated by Casey Bodley 7 months ago

  • Status changed from New to Resolved
