Bug #18049
closed
timeout during ceph-disk trigger due to /var/lock/ceph-disk flock contention
Description
ceph-disk@.service handles udev device events by calling "ceph-disk trigger", which in turn handles service restart for the corresponding device.
Invocations of "ceph-disk trigger" are mutually exclusive: each call first takes an flock on /var/lock/ceph-disk. The flock behaviour was added in commit f0a47578c7c4521d7cf50e9419620ddb629736f5 to address http://tracker.ceph.com/issues/13160.
The 120-second timeout was later added in commit bed1a5cc05a9880b91fc9ac8d8a959efe3b3d512 to address http://tracker.ceph.com/issues/16580.
On systems with many OSDs, the "ceph-disk trigger" calls made during startup contend heavily for the /var/lock/ceph-disk flock, which can cause some services to trip the 120-second timeout.
Given that the intention of the flock was to restrict concurrent invocations for a single device, it should be sufficient to use the device path for the flock. This will allow "ceph-disk trigger" events for different devices to run concurrently, greatly reducing the likelihood of service timeout.
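The per-device locking idea can be illustrated with a minimal Python sketch. This is not the ceph-disk or ceph-disk@.service implementation. The helper names (lock_path_for, try_lock, release) and the "ceph-disk-<basename>" lock file naming are assumptions made for illustration only, and the demo uses a temporary directory rather than /var/lock.

```python
import fcntl
import os
import tempfile


def lock_path_for(device, lock_dir):
    """Return a per-device lock file path (hypothetical naming scheme)."""
    return os.path.join(lock_dir, "ceph-disk-" + os.path.basename(device))


def try_lock(path):
    """Try to take an exclusive, non-blocking flock on path.

    Returns the open file descriptor on success, or None if another
    open file description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release(fd):
    """Drop the flock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()  # stand-in for /var/lock in this demo

    sdb = try_lock(lock_path_for("/dev/sdb1", lock_dir))
    sdc = try_lock(lock_path_for("/dev/sdc1", lock_dir))
    again = try_lock(lock_path_for("/dev/sdb1", lock_dir))

    print("sdb1 locked:", sdb is not None)
    print("sdc1 locked concurrently:", sdc is not None)
    print("second sdb1 attempt blocked:", again is None)

    release(sdb)
    release(sdc)
```

With a single shared lock file, the second call would have to wait for the first regardless of device. With one lock file per device, only triggers for the same device serialize, which is the behaviour the description argues is sufficient.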
Updated by Loïc Dachary over 7 years ago
- Status changed from New to 7
- Priority changed from Normal to Urgent
- Backport set to jewel
Updated by Loïc Dachary over 7 years ago
- Status changed from 7 to Pending Backport
Updated by Loïc Dachary over 7 years ago
- Copied to Backport #18060: jewel: timeout during ceph-disk trigger due to /var/lock/ceph-disk flock contention added
Updated by Loïc Dachary about 7 years ago
Updated by Nathan Cutler about 7 years ago
- Status changed from Pending Backport to Resolved