Bug #58241
Systemd unit file TimeoutStartSec of 120 seconds is too low
Description
The systemd unit file that cephadm uses for all of its managed containers currently sets TimeoutStartSec=120, which is too low.
When a busy host restarts and all of its containers start at the same time, startup can take longer than that. For example, it took 140 seconds on a host with dual E5-2630 v3 CPUs, 64 GB RAM and slow system disks. The timeout should be increased substantially, perhaps to 180-200 seconds.
120 seconds is not enough for a host with slow system disks that runs all services, including RGW, plus 12 OSDs.
Worse, if a container exceeds the 120-second start limit, systemd terminates it and retries the start later. Even when a later restart by systemd brings the container up within the 120-second limit, cephadm does not detect that container as running (a separate issue will be created).
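Until the default changes, one way to try a longer limit on an affected host is a systemd drop-in for the cephadm template unit. The following is a minimal Python sketch. It assumes the daemons run as instances of a `ceph-<fsid>@.service` template unit under `/etc/systemd/system`. The drop-in file name `timeout.conf`, the helper names and the value 200 (the top of the range suggested above) are illustrative choices and not part of cephadm. It is a local workaround and does not show how cephadm itself sets the value.

```python
from pathlib import Path

# Assumed location of the cephadm-generated unit files.
DEFAULT_UNIT_DIR = Path("/etc/systemd/system")


def render_dropin(timeout_start_sec: int) -> str:
    """Return drop-in contents that override TimeoutStartSec for a service."""
    if timeout_start_sec <= 0:
        raise ValueError("TimeoutStartSec must be a positive number of seconds")
    return f"[Service]\nTimeoutStartSec={timeout_start_sec}\n"


def dropin_path(fsid: str,
                unit_dir: Path = DEFAULT_UNIT_DIR,
                name: str = "timeout.conf") -> Path:
    """Path of a drop-in for the template unit ceph-<fsid>@.service.

    systemd also reads drop-ins from the template's ".d" directory for every
    instance, so one file covers all daemons (OSDs, RGW, ...) of that cluster.
    The file name is a hypothetical choice; any "*.conf" name works.
    """
    return unit_dir / f"ceph-{fsid}@.service.d" / name


def write_dropin(fsid: str, timeout_start_sec: int,
                 unit_dir: Path = DEFAULT_UNIT_DIR) -> Path:
    """Create the drop-in directory if needed and write the override file."""
    path = dropin_path(fsid, unit_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_dropin(timeout_start_sec))
    return path


if __name__ == "__main__":
    # Replace the placeholder with the cluster fsid (requires root to write).
    print(dropin_path("<fsid>"))
    print(render_dropin(200), end="")
```

After calling `write_dropin(...)` with the real fsid, run `systemctl daemon-reload` so systemd picks up the override. `systemctl cat` on one of the daemon units then shows the merged unit, including the drop-in.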
Updated by Redouane Kachach Elhichou over 1 year ago
- Assignee set to Redouane Kachach Elhichou
Updated by Redouane Kachach Elhichou over 1 year ago
- Status changed from New to In Progress
Updated by Redouane Kachach Elhichou over 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 49728
Updated by Redouane Kachach Elhichou over 1 year ago
- Related to Bug #58242: cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd added
Updated by Adam King about 1 year ago
- Status changed from Fix Under Review to Resolved
Since this shares a PR with https://tracker.ceph.com/issues/58242, I'm marking this one resolved and will track backports in https://tracker.ceph.com/issues/58242.