Bug #12405

filestore: syncfs causes high cpu load due to kernel implementation in high-memory boxes

Added by Vimal A.R over 8 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
FileStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Some of the OSD nodes on Firefly (0.80.9-2-g62645d3) are generating a high load average, around 40-80.

Increasing the debug level on the OSDs shows a large number of _share_map_outgoing messages in the log, all on the same epoch.
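
For reference, one common way to bump the debug level on a running OSD is via injectargs; osd.904 is simply the daemon that appears in the log excerpt further down, and the exact 20/20 and 0/5 levels here are illustrative:

~~~
# raise the OSD debug level at runtime
ceph tell osd.904 injectargs '--debug-osd 20/20'

# drop it back down once the logs have been captured
ceph tell osd.904 injectargs '--debug-osd 0/5'
~~~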

From the output of 'sysdig':

~~~
# sysdig -c bottlenecks

112572161) 0.000000000 ceph-osd (15110) > procinfo cpu_usr=1 cpu_sys=0
112980628) 507.220968526 ceph-osd (15110) < futex res=0
121321233) 0.000000000 ceph-osd (16786) > procinfo cpu_usr=0 cpu_sys=2
121694746) 420.638580446 ceph-osd (16786) < futex res=0
120804281) 0.000000000 ceph-osd (9568) > procinfo cpu_usr=10 cpu_sys=2
121055942) 415.173129018 ceph-osd (9568) < futex res=0
119754338) 0.000000000 ceph-osd (8237) > procinfo cpu_usr=0 cpu_sys=2
119763836) 411.159274871 ceph-osd (8237) < futex res=0
39391582) 0.000000000 ceph-osd (15114) > futex addr=31A3371C op=128(FUTEX_PRIVATE_FLAG) val=2101
112980759) 391.079698263 ceph-osd (15114) < futex res=0
119241796) 0.000000000 ceph-osd (15817) > procinfo cpu_usr=0 cpu_sys=2
119750947) 374.850314713 ceph-osd (15817) < futex res=0
119241799) 0.000000000 ceph-osd (15820) > procinfo cpu_usr=0 cpu_sys=1
119750940) 374.850282574 ceph-osd (15820) < futex res=0
121322769) 0.000000000 ceph-osd (12037) > procinfo cpu_usr=0 cpu_sys=2
121697776) 364.788367503 ceph-osd (12037) < futex res=0
119757808) 0.000000000 ceph-osd (15776) > procinfo cpu_usr=1 cpu_sys=2
120026942) 360.564754591 ceph-osd (15776) < futex res=0
232158) 0.000000000 ceph-osd (7214) > futex addr=3370D84 op=393(FUTEX_CLOCK_REALTIME|FUTEX_PRIVATE_FLAG|FUTEX_WAIT_BITSET) val=823
70538638) 356.035740671 ceph-osd (7214) < futex res=-110(ETIMEDOUT)
~~~

The load drops to 0 when the OSD processes are stopped. A large number of ETIMEDOUT errors followed by EAGAIN (a few thousand per poll) can be seen in the 'sysdig' output.
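
A hedged sketch of how sysdig could be pointed at just that futex activity (the filter fields are standard sysdig event filters; adjust the process name if the daemons are named differently):

~~~
# show only futex events from the OSD processes
sysdig proc.name=ceph-osd and evt.type=futex

# restrict further to the failing calls (ETIMEDOUT / EAGAIN)
sysdig "proc.name=ceph-osd and evt.type=futex and evt.failed=true"
~~~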

One of the OSD logs shows the following:

~~~
2015-07-08 19:40:01.257508 7f098eb5d700 20 osd.904 768054 scrub_should_schedule loadavg 117.89 >= max 0.5 = no, load too high
2015-07-08 19:40:01.257557 7f098eb5d700 20 osd.904 768054 sched_scrub load_is_low=0
2015-07-08 19:40:01.257561 7f098eb5d700 20 osd.904 768054 sched_scrub 5.e1ad high load at 2015-06-24 20:17:54.827549: 1.20733e+06 < max (1.2096e+06 seconds)
2015-07-08 19:40:01.257578 7f098eb5d700 20 osd.904 768054 sched_scrub done
~~~

Attaching the logs from the nodes.

20-top.out (129 KB) Vimal A.R, 07/20/2015 07:26 AM

History

#1 Updated by Vimal A.R over 8 years ago

#2 Updated by Vimal A.R over 8 years ago

I'm not able to attach the logs here. The upload fails with "413 Request Entity Too Large", even for files as small as 1.3 MB. Do we have any other option for uploading the data?

Vimal

#3 Updated by huang jun over 8 years ago

You can upload the logs to a site like https://pastebin.mozilla.org/ and then paste the URL here.
We also hit this problem. At first we thought it was related to memory, since we have 24 OSDs on one host with only 48 GB of RAM, but you have 248 GB and still see the same issue.

#4 Updated by Vimal A.R over 8 years ago

Hello,

I've uploaded the logs to the ceph.com dropbox, inside the folder 01458699. I'll get more details on this.

Vimal

#5 Updated by Nathan Cutler over 8 years ago

  • Tracker changed from Tasks to Bug
  • Project changed from Stable releases to Ceph
  • Regression set to No

#6 Updated by Vimal A.R over 8 years ago

An update to this issue:

a) This is for the Congress (DreamObjects) cluster.

b) There weren't any changes in terms of tunables.

c) No recovery process is running at the time the problem occurs.

d) The high load is consistent.

e) This is experienced mostly on the new cephstores (14 * 36 = 504 OSDs, out of a total of 1322).

f) There is no backfilling happening here. These are six-month-old cephstores, and backfilling/recovery completed quite some time ago.

#7 Updated by Sage Weil over 8 years ago

'perf record -ag -F 99 -- sleep 10' sez

~~~
# Overhead        Command         Shared Object
# ........  .............  ....................  ...............................
#
    77.57%       ceph-osd  [kernel.kallsyms]     [k] __ticket_spin_lock
                 |
                 --- __ticket_spin_lock
                    |          
                    |--99.79%-- _raw_spin_lock
                    |          |          
                    |          |--89.00%-- wait_sb_inodes
                    |          |          sync_inodes_sb
                    |          |          __sync_filesystem
                    |          |          sync_filesystem
                    |          |          sys_syncfs
                    |          |          system_call_fastpath
                    |          |          syncfs
                    |          |          
                    |          |--5.79%-- xfs_inode_ag_walk_grab
                    |          |          xfs_inode_ag_walk.isra.8
                    |          |          xfs_inode_ag_iterator
                    |          |          xfs_quiesce_data
                    |          |          xfs_fs_sync_fs
                    |          |          __sync_filesystem
                    |          |          sync_filesystem
                    |          |          sys_syncfs
                    |          |          system_call_fastpath
                    |          |          syncfs
                    |          |          
                    |          |--3.77%-- igrab
                    |          |          xfs_inode_ag_walk_grab
                    |          |          xfs_inode_ag_walk.isra.8
                    |          |          xfs_inode_ag_iterator
                    |          |          xfs_quiesce_data
                    |          |          xfs_fs_sync_fs
                    |          |          __sync_filesystem
                    |          |          sync_filesystem
                    |          |          sys_syncfs
                    |          |          system_call_fastpath
                    |          |          syncfs
                    |          |          
~~~

can you try a newer kernel? my guess is the kernel inode cache is huge and in 3.2 (sooo old!!!) syncfs iterated over that.
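
For context, a rough way to see how big that in-kernel inode cache actually is on one of these nodes (the xfs_inode slab name assumes the OSD filesystems are XFS; reading /proc/slabinfo needs root):

~~~
# total allocated / unused inodes known to the VFS
cat /proc/sys/fs/inode-nr

# per-slab object counts for the XFS inode and dentry caches (XFS assumed)
grep -E '^(xfs_inode|dentry) ' /proc/slabinfo
~~~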

#8 Updated by Sage Weil over 8 years ago

nevermind, new kernels do the same. i think the reason why it is problematic on these nodes is because they have gobs and gobs of ram and a huge number of inodes in cache.

#9 Updated by Sage Weil over 8 years ago

Try this: reboot the nodes, and

~~~
echo 10000 > /proc/sys/vm/vfs_cache_pressure
~~~

That should favor data (page cache) over inodes.
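
A minimal sketch of applying and persisting that tunable, assuming a sysctl.d-based setup (the file name under /etc/sysctl.d/ is just an example):

~~~
# apply immediately (equivalent to the echo above)
sysctl -w vm.vfs_cache_pressure=10000

# persist across reboots (example file name)
echo 'vm.vfs_cache_pressure = 10000' > /etc/sysctl.d/90-vfs-cache-pressure.conf
sysctl --system   # re-read all sysctl configuration files
~~~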

#10 Updated by Sage Weil over 8 years ago

  • Subject changed from OSDs on Firefly generating high CPU load to filestore: syncfs causes high cpu load

#11 Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from filestore: syncfs causes high cpu load to filestore: syncfs causes high cpu load due to kernel implementation in high-memory boxes
  • Category set to Performance/Resource Usage
  • Component(RADOS) FileStore added

#12 Updated by Sage Weil over 6 years ago

  • Status changed from New to Resolved

Newer kernels fix syncfs(2) to use a dirty inode list for this. (The fix isn't in the latest el7 kernel(s) yet, though.)
