Bug #12405

closed

filestore: syncfs causes high cpu load due to kernel implementation in high-memory boxes

Added by Vimal A.R almost 9 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
FileStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Some of the OSD nodes on Firefly (0.80.9-2-g62645d3) are generating a high load average, around 40-80.

Increasing the debug level on the OSDs shows a high number of _share_map_outgoing messages in the log, all for the same epoch.
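
For reference, the debug level can be raised at runtime with something like the following (osd.904 below is only an example id; the exact subsystems and levels may need adjusting):

~~~
# Raise OSD and messenger debug levels on a running OSD via the admin socket / injectargs.
ceph daemon osd.904 config set debug_osd 20/20
ceph tell osd.904 injectargs '--debug-osd 20/20 --debug-ms 1/1'
~~~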

From the output of 'sysdig':

~~~
# sysdig -c bottlenecks

112572161) 0.000000000 ceph-osd (15110) > procinfo cpu_usr=1 cpu_sys=0
112980628) 507.220968526 ceph-osd (15110) < futex res=0
121321233) 0.000000000 ceph-osd (16786) > procinfo cpu_usr=0 cpu_sys=2
121694746) 420.638580446 ceph-osd (16786) < futex res=0
120804281) 0.000000000 ceph-osd (9568) > procinfo cpu_usr=10 cpu_sys=2
121055942) 415.173129018 ceph-osd (9568) < futex res=0
119754338) 0.000000000 ceph-osd (8237) > procinfo cpu_usr=0 cpu_sys=2
119763836) 411.159274871 ceph-osd (8237) < futex res=0
39391582) 0.000000000 ceph-osd (15114) > futex addr=31A3371C op=128(FUTEX_PRIVATE_FLAG) val=2101
112980759) 391.079698263 ceph-osd (15114) < futex res=0
119241796) 0.000000000 ceph-osd (15817) > procinfo cpu_usr=0 cpu_sys=2
119750947) 374.850314713 ceph-osd (15817) < futex res=0
119241799) 0.000000000 ceph-osd (15820) > procinfo cpu_usr=0 cpu_sys=1
119750940) 374.850282574 ceph-osd (15820) < futex res=0
121322769) 0.000000000 ceph-osd (12037) > procinfo cpu_usr=0 cpu_sys=2
121697776) 364.788367503 ceph-osd (12037) < futex res=0
119757808) 0.000000000 ceph-osd (15776) > procinfo cpu_usr=1 cpu_sys=2
120026942) 360.564754591 ceph-osd (15776) < futex res=0
232158) 0.000000000 ceph-osd (7214) > futex addr=3370D84 op=393(FUTEX_CLOCK_REALTIME|FUTEX_PRIVATE_FLAG|FUTEX_WAIT_BITSET) val=823
70538638) 356.035740671 ceph-osd (7214) < futex res=-110(ETIMEDOUT)
~~~

The load drops to 0 when the OSD processes are stopped. A large number of ETIMEDOUT errors (a few thousand per poll), followed by EAGAIN, can be seen in the sysdig output.
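
If it helps to narrow this down further, the futex activity can be isolated with a sysdig filter (proc.name below matches the ceph-osd processes seen in the capture above):

~~~
# Capture only futex events generated by the ceph-osd processes.
sysdig proc.name=ceph-osd and evt.type=futex
~~~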

One of the OSD logs shows the following:

~~~
2015-07-08 19:40:01.257508 7f098eb5d700 20 osd.904 768054 scrub_should_schedule loadavg 117.89 >= max 0.5 = no, load too high
2015-07-08 19:40:01.257557 7f098eb5d700 20 osd.904 768054 sched_scrub load_is_low=0
2015-07-08 19:40:01.257561 7f098eb5d700 20 osd.904 768054 sched_scrub 5.e1ad high load at 2015-06-24 20:17:54.827549: 1.20733e+06 < max (1.2096e+06 seconds)
2015-07-08 19:40:01.257578 7f098eb5d700 20 osd.904 768054 sched_scrub done
~~~
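
The "max 0.5" in that log line appears to be the default osd_scrub_load_threshold; it can be confirmed on the affected OSD via the admin socket (osd.904 here is just the OSD from the log excerpt):

~~~
# Check the scrub load threshold the OSD compares loadavg against.
ceph daemon osd.904 config show | grep osd_scrub_load_threshold
~~~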

Attaching the logs from the nodes.


Files

20-top.out (129 KB) 20-top.out Vimal A.R, 07/20/2015 07:26 AM