Bug #19299
closed
Jewel -> Kraken: OSD boot takes 1+ hours, unusually high CPU
Description
Since upgrading to Kraken we've had severe problems with OSD startup. Although this ticket is about bootup specifically, the load conditions described can be triggered on all cluster nodes merely by having a sufficiently high number of OSDs change state from in to out: about 60 out of 600 is enough to destabilize the cluster through CPU load on all nodes. We had no issues like this under Jewel with the same configuration.
- OSDs that are starting up do not get marked up/in for 1+ hours
- very high CPU usage; if the system hosts many OSDs, it is overwhelmed
- other OSD nodes also see very high CPU usage, rising with the number of OSDs being started until they are saturated
- 'perf top' shows the kernel spending 50% of its time in '_raw_spin_lock_irqsave'
- 'strace -f -T -c' shows ~80% of time spent in the futex syscall and 12% of time in restart_syscall
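For anyone who wants to compare per-syscall time shares across OSDs, here is a minimal sketch that parses the summary table printed by 'strace -c'. It assumes the usual column layout ("% time" first, syscall name last). The function name parse_strace_summary and the sample numbers are hypothetical: the sample is illustrative only and is not taken from the strace output attached to this ticket.

```python
def parse_strace_summary(text):
    """Return {syscall: percent_of_time} from `strace -c` summary text."""
    shares = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            # Data rows start with the numeric "% time" column.
            percent = float(fields[0])
        except ValueError:
            continue  # header, separator, or unrelated output
        name = fields[-1]  # syscall name is the last column
        if name == "total":
            continue  # skip the summary row
        shares[name] = percent
    return shares


# Illustrative numbers only; NOT the data attached to this ticket.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.00    8.000000         400     20000      5000 futex
 12.00    1.200000        6000       200           restart_syscall
  8.00    0.800000          80     10000           read
------ ----------- ----------- --------- --------- ----------------
100.00   10.000000                 30200      5000 total
"""

shares = parse_strace_summary(SAMPLE)
for name, percent in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {percent:6.2f}%")
```

Reading the first and last whitespace-separated fields keeps the parser working for rows where the "errors" column is blank.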
We've tried reducing various tuning parameters to 1 with no effect: ms_async_op_threads, ms_async_max_op_threads, osd_recovery_max_active, osd_op_threads. When osd_op_threads is reduced to zero, the CPU saturation goes away but the OSD never boots (we waited 12 hours).
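To make the tested settings easy to reproduce, the following sketch renders them as a ceph.conf-style section. The option names and the value 1 come from the paragraph above. Placing them under [global], and the helper name render_section, are assumptions: the ticket does not say how or where the settings were applied.

```python
# Option names as listed in the ticket; the value 1 mirrors "reducing ... to 1".
TUNING_TRIED = {
    "ms_async_op_threads": 1,
    "ms_async_max_op_threads": 1,
    "osd_recovery_max_active": 1,
    "osd_op_threads": 1,
}


def render_section(section, options):
    """Render options as an INI-style section (hypothetical helper)."""
    lines = [f"[{section}]"]
    lines += [f"{key} = {value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


# The [global] placement is an assumption, not something stated in the ticket.
print(render_section("global", TUNING_TRIED))
```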
I have attached the strace output captured while this is ongoing, and an OSD log with all debug params turned up to 999. The log does not cover the boot from the start, but it represents what is going on while we wait for the OSD to boot. I tried to keep these to a reasonable length, but if more is needed please let me know.
Files