Bug #17689
closed
upstart: race condition starting ceph-all jobs depending on network
Description
As of bug 5248, the ceph-all upstart job has the following start definition:
start on runlevel [2345]
which allows it to run before networking services are fully up. This creates a race condition in which a service (e.g. ceph-mon) attempts to bind to a network that is not yet up, so the service fails to start.
Commit 7e08ed1bf154f5556b3c4e49f937c1575bf992b8 removed from ceph-all.conf the directive to wait for net-device-up where IFACE!=lo, because devices on Mellanox cards were not coming up in time. Rather than using the net-device-up event (which is fired for each device), I think the best approach would be to use the static-network-up meta event, which triggers after all the stanzas in /etc/network/interfaces and /etc/network/interfaces.d/*.conf have been processed. That allows networking to be started before any service that depends on a network interface attempts to start.
The following works on my systems:
start on runlevel [2345] and static-network-up
Note: a work-around is to use the post-up directive in the appropriate network stanza to restart the affected service(s) (e.g. ceph-mon-all or ceph-all) after the device is brought up.
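For illustration, the work-around might look roughly like the stanza below in /etc/network/interfaces. This is only a sketch: the interface name (eth0) and the dhcp method are assumptions, not taken from any affected system, and the job to restart depends on which services bind to that interface.

```
# Hypothetical stanza; eth0 and dhcp are placeholders for the real device and method.
auto eth0
iface eth0 inet dhcp
    # Runs once the interface is up. "|| true" keeps a failed restart
    # (for example, if the job is not running yet) from failing ifup itself.
    post-up restart ceph-all || true
```

Because ifup treats a failing post-up command as a failure to configure the interface, the trailing "|| true" is a cautious choice here and can be dropped if a failed restart should be surfaced instead.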