Bug #17689

upstart: race condition starting ceph-all jobs depending on network

Added by Billy Olsen about 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

As of bug 5248, the ceph-all upstart job has the following start definition:

start on runlevel [2345]

which allows it to run even if networking services are not all the way up. This allows for a race condition in which a service (e.g. ceph-mon) attempts to bind to a network that is not yet up, causing the service to fail to start.

Commit 7e08ed1bf154f5556b3c4e49f937c1575bf992b8 removed from ceph-all.conf the directive to wait for net-device-up where IFACE!=lo, because devices on Mellanox cards were not coming up in time. Rather than using the net-device-up event (which is fired for each device), I think the best approach would be to use the static-network-up meta event, which triggers after all the stanzas in /etc/network/interfaces and /etc/network/interfaces.d/*.conf have been processed. That allows networking to be started before any service that depends on a network interface attempts to start.

The following works on my systems:

start on runlevel [2345] and static-network-up
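
For context, here is a rough sketch of how the proposed condition might sit in an upstart job file. Only the start condition comes from this report; the file path, the description, and the stop stanza are assumptions for illustration and may not match the ceph-all.conf that is actually shipped.

```
# /etc/init/ceph-all.conf (path and surrounding stanzas are assumed, not copied from the package)
description "Ceph"

# Wait for both the runlevel change and the static-network-up meta event,
# which is emitted once the interfaces configured in /etc/network/interfaces
# have been processed.
start on runlevel [2345] and static-network-up

# Assumed stop condition, shown only to make the sketch complete.
stop on runlevel [!2345]
```

With a combined condition like this, upstart should hold the job back until both events have been seen, so a daemon such as ceph-mon would not try to bind before the statically configured interfaces are up.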

Note: a work-around is to use the post-up directive in the appropriate network stanza after the device is brought up to restart the appropriate service(s) (e.g. ceph-mon-all, ceph-all, etc).
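
A sketch of that work-around, assuming a hypothetical interface named eth0 configured via DHCP (neither detail comes from this report), could look like this in /etc/network/interfaces:

```
# Hypothetical stanza: the interface name and addressing method are placeholders.
auto eth0
iface eth0 inet dhcp
    # Once the device is up, restart the monitor jobs through upstart.
    # "|| true" keeps ifup from reporting a failure if the job is not running yet.
    post-up initctl restart ceph-mon-all || true
```

The same post-up line could name ceph-all or another job instead, depending on which services need the interface.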

History

#1 Updated by Loïc Dachary about 6 years ago

  • Target version deleted (v10.2.4)

#3 Updated by Ken Dreyer over 5 years ago

  • Status changed from New to Resolved