Bug #18049: timeout during ceph-disk trigger due to /var/lock/ceph-disk flock contention - Ceph - Ceph

Bug #18049

closed

timeout during ceph-disk trigger due to /var/lock/ceph-disk flock contention

Added by David Disseldorp over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Disseldorp
Category:
Target version:
% Done:
Source:
Tags:
Backport: jewel
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph-disk@.service handles udev device events by calling "ceph-disk trigger", which in turn handles service restart for the corresponding device.

"ceph-disk trigger" invocations are mutually exclusive: each call first takes an flock on /var/lock/ceph-disk. The flock behaviour was added in f0a47578c7c4521d7cf50e9419620ddb629736f5 to address http://tracker.ceph.com/issues/13160.

A 120-second timeout was later added in bed1a5cc05a9880b91fc9ac8d8a959efe3b3d512 to address http://tracker.ceph.com/issues/16580.

On systems with many OSDs, the "ceph-disk trigger" calls issued during startup contend heavily for the /var/lock/ceph-disk flock, which can cause some services to hit the 120-second timeout.
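To make the contention concrete, here is a rough back-of-the-envelope sketch. It assumes the timeout covers both the wait for the lock and the trigger run itself, and that every trigger takes the same time; the function names and the per-trigger durations are hypothetical and not measured on a real cluster.

```python
def worst_case_wait(n_devices, seconds_per_trigger):
    """Seconds the last trigger spends waiting for a single global lock.

    With one exclusive lock the triggers run one at a time, so the last
    caller waits for all the others to finish first.
    """
    return (n_devices - 1) * seconds_per_trigger


def exceeds_timeout(n_devices, seconds_per_trigger, timeout=120):
    """True if the last trigger (wait + its own run) overruns the timeout."""
    total = worst_case_wait(n_devices, seconds_per_trigger) + seconds_per_trigger
    return total > timeout


# Hypothetical numbers: 3 seconds per trigger.
print(exceeds_timeout(24, 3))  # 24 * 3 = 72 seconds
print(exceeds_timeout(60, 3))  # 60 * 3 = 180 seconds
```

Under these assumptions the serialized total grows linearly with the number of devices, so a sufficiently large node crosses the 120-second limit even when each individual trigger is quick.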

Given that the intention of the flock was to restrict concurrent invocations for a single device, it should be sufficient to base the flock on the device path instead of a single shared lock file. This will allow "ceph-disk trigger" events for different devices to run concurrently, greatly reducing the likelihood of service timeout.
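The per-device idea can be sketched in Python as follows. This is only an illustration of the locking scheme, not the actual fix: the lock file naming (`ceph-disk-` plus the device basename) and the helper names `lock_path_for` and `device_lock` are assumptions made for the example. It relies on `fcntl.flock`, so it is Unix-only.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


def lock_path_for(device, lock_dir="/var/lock"):
    # Hypothetical naming scheme: one lock file per device basename.
    return os.path.join(lock_dir, "ceph-disk-" + os.path.basename(device))


@contextmanager
def device_lock(device, lock_dir="/var/lock", blocking=True):
    """Hold an exclusive flock that is specific to one device."""
    path = lock_path_for(device, lock_dir)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(fd, flags)  # raises BlockingIOError if non-blocking and busy
        yield path
    finally:
        os.close(fd)  # closing the descriptor releases the flock


def demo():
    """Return (different_devices_ok, same_device_ok) for nested lock attempts."""
    with tempfile.TemporaryDirectory() as lock_dir:
        with device_lock("/dev/sdb1", lock_dir, blocking=False):
            # A different device uses a different lock file: no contention.
            with device_lock("/dev/sdc1", lock_dir, blocking=False):
                different_ok = True
            # The same device is still serialized.
            try:
                with device_lock("/dev/sdb1", lock_dir, blocking=False):
                    same_ok = True
            except BlockingIOError:
                same_ok = False
    return different_ok, same_ok
```

In this sketch, triggers for `/dev/sdb1` and `/dev/sdc1` proceed at the same time, while a second trigger for `/dev/sdb1` still has to wait (or fails immediately in non-blocking mode), which preserves the per-device exclusion that the original flock was meant to provide.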

Related issues 1 (0 open — 1 closed)