Bug #19299

closed

Jewel -> Kraken: OSD boot takes 1+ hours, unusually high CPU

Added by Ben Meekhof about 7 years ago. Updated over 6 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading to Kraken we've had severe problems with OSD startup. Though this ticket mentions bootup specifically, the load conditions described can be triggered on all cluster nodes merely by having a sufficiently high number of OSDs change state from in to out; about 60 out of 600 is enough to destabilize the cluster due to CPU load on all nodes. We had no issues like this under Jewel in the same configuration.

- OSDs starting up do not get marked up/in for 1+ hours
- very high CPU usage; if there are many OSDs on a system, the system is overwhelmed
- other OSD nodes see very high CPU usage, correspondingly higher with the number of OSDs being started, until they are saturated
- 'perf top' shows the kernel spending 50% of its time in '_raw_spin_lock_irqsave'
- 'strace -f -T -c' shows ~80% of time spent in the futex syscall and 12% in restart_syscall

We've tried reducing various tuning parameters to 1 with no effect: ms_async_op_threads, ms_async_max_op_threads, osd_recovery_max_active, osd_op_threads. When osd_op_threads is reduced to zero, the CPU saturation goes away but the OSD never boots (we waited 12 hours).
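For reference, the reduced settings described above might look roughly like the following ceph.conf fragment. This is only a sketch: the ticket does not say whether the values were set in ceph.conf or injected at runtime, and placing all four options under [osd] is an assumption.

```ini
# Hypothetical fragment; section placement is an assumption, not taken from the ticket.
[osd]
ms_async_op_threads = 1
ms_async_max_op_threads = 1
osd_recovery_max_active = 1
osd_op_threads = 1
```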

I have attached the output of strace while this is ongoing, and the output of an OSD log with all debug params turned up to 999. It's not the full output since the start of the boot, but it represents what is ongoing while we wait for the OSD to boot. I tried to keep these to a reasonable length, but if more is needed please let me know.


Files

ceph-osd.log.txt (20.7 KB), Ben Meekhof, 03/17/2017 07:45 PM
strace.txt (23.6 KB), Ben Meekhof, 03/17/2017 07:45 PM
strace-time.txt (1.08 KB), Ben Meekhof, 03/17/2017 07:45 PM
perf-top.txt (1.79 KB), Ben Meekhof, 03/17/2017 07:45 PM
strace-2.txt (69.4 KB), Ben Meekhof, 03/17/2017 08:11 PM