Bug #19299

Jewel -> Kraken: OSD boot takes 1+ hours, unusually high CPU

Added by Ben Meekhof about 7 years ago. Updated over 6 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading to Kraken we've had severe problems with OSD startup. Though this ticket is specifically about boot, the load conditions described can be triggered on all cluster nodes merely by having a sufficiently high number of OSDs change state from in to out; about 60 out of 600 is enough to destabilize the cluster due to CPU load on all nodes. We had no issues like this under Jewel in the same configuration.

- OSDs starting up do not get marked up/in for 1+ hours
- very high CPU usage; if there are many OSDs on a system, the system is overwhelmed
- other OSD nodes also see very high CPU usage, rising with the number of OSDs being started until they are saturated
- 'perf top' shows the kernel spending 50% of its time in '_raw_spin_lock_irqsave'
- strace -f -T -c shows ~80% of the time spent in the futex syscall and 12% in restart_syscall (a sketch of these diagnostics follows this list)
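
For reference, a rough sketch of how the same diagnostics can be gathered on an affected host; the PID selection below is just an example, only the perf and strace invocations themselves match what we ran:

    # pick one busy ceph-osd on this host (example: the oldest ceph-osd process)
    OSD_PID=$(pgrep -o ceph-osd)

    # kernel/userspace hot spots; _raw_spin_lock_irqsave shows up near the top
    perf top -p "$OSD_PID"

    # per-syscall time summary across all threads (-f); interrupt with Ctrl-C
    # to print the summary table (futex and restart_syscall dominate here)
    strace -f -T -c -p "$OSD_PID"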

We've tried reducing various tuning parameters to 1 with no effect: ms_async_op_threads, ms_async_max_op_threads, osd_recovery_max_active, osd_op_threads. When osd_op_threads is reduced to zero, the CPU saturation goes away but the OSD never boots (even after 12 hours).
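
For completeness, the same tuning expressed as a ceph.conf snippet would look roughly like the following (illustrative only; the thread-count options generally require an OSD restart to take effect):

    [osd]
    ms_async_op_threads = 1
    ms_async_max_op_threads = 1
    osd_recovery_max_active = 1
    osd_op_threads = 1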

I have attached the output of strace while this is ongoing, and an OSD log with all debug params turned up to 999. It's not the full output from the start of the boot, but it represents what is going on while we wait for the OSD to boot. I tried to keep these to a reasonable length, but if more is needed please let me know.

ceph-osd.log.txt (20.7 KB) Ben Meekhof, 03/17/2017 07:45 PM

strace.txt (23.6 KB) Ben Meekhof, 03/17/2017 07:45 PM

strace-time.txt (1.08 KB) Ben Meekhof, 03/17/2017 07:45 PM

perf-top.txt (1.79 KB) Ben Meekhof, 03/17/2017 07:45 PM

strace-2.txt (69.4 KB) Ben Meekhof, 03/17/2017 08:11 PM

History

#1 Updated by Sage Weil about 7 years ago

The strace doesn't include child processes. Can you repeat with -f passed to strace?

#2 Updated by Sage Weil about 7 years ago

  • Status changed from New to Need More Info
  • Priority changed from Normal to Urgent

#3 Updated by Ben Meekhof about 7 years ago

As far as I know that was with -f passed. It may be relevant that I grepped out 'madvise' calls because they occur at such a high rate. To be certain, I've attached another trace that definitely has the correct flag, with just a few lines of representative madvise calls included at the top.

#4 Updated by Ben Meekhof almost 7 years ago

It looks like the core of the problem is related to processing increases in the osdmap epoch. Any change requires unusually high CPU resources from each OSD to process, and when a significant number of OSDs change state, the other OSDs become overwhelmed processing the map increments. It's particularly rough on our setup because of the density of OSDs relative to CPUs, but the resource usage far exceeds anything we ever saw with the Jewel release, so I don't think this can be normal. As long as the osdmap doesn't change, everything behaves normally. I'm not really sure what to do about the issue and have reverted to Jewel for the time being.

All of our osd map settings are at the defaults, and I don't believe those changed between the releases.
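
One way to watch how far behind a given OSD is on maps while this is happening (assuming admin-socket access on the OSD host; osd.0 is only an example ID):

    # current cluster osdmap epoch
    ceph osd stat

    # newest_map shows how far this OSD has caught up; the gap between it and
    # the cluster epoch is what it still has to process
    ceph daemon osd.0 status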

#5 Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Performance/Resource Usage

#6 Updated by red ref almost 7 years ago

Same problem here (fresh 12.0.3). We had OSDs behind by more than 5000 maps; it took ~8 hours to get them booted.

Looking into it, I increased osd_map_cache_size to match the number of osdmaps the OSDs were behind (which increases memory usage too). That solved the problem.
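
As a sketch of that workaround in ceph.conf (the 6000 here is just an example large enough to cover the >5000-map backlog above; a larger cache means noticeably more memory per OSD):

    [osd]
    # raise the map cache so it covers the backlog, then restart the OSDs
    osd_map_cache_size = 6000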

Digging further (from the logs), I spotted a behavior difference:
- on the laggy side, each get_map is separated by an OSD "tick_without_osd_lock" and doesn't seem to be batched on the network in packs of 100. Parallel processing (working with PGs) is nevertheless done after 100 loops.
- on the other side, it works well (no tick, maps are batched)

I tried to understand this part of the code, but I'm missing some clues.

#7 Updated by Greg Farnum almost 7 years ago

  • Status changed from Need More Info to New
  • Priority changed from Urgent to High

Ping Sage: you've got that subprocess strace data now.

#8 Updated by Sage Weil over 6 years ago

  • Status changed from New to Can't reproduce

Note that this code is slated to get replaced in mimic.

#9 Updated by Ben Meekhof over 6 years ago

We did not have any similar issues updating from Jewel to Luminous. As far as I'm concerned there's no reason to keep this open.
