Bug #18740

random OSDs fail to start after reboot with systemd

Added by Alexey Sheplyakov 8 months ago. Updated 29 days ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 01/31/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

After a reboot, random OSDs (2-4 of the 18) fail to start; which ones fail varies from boot to boot.
The problematic OSDs can be started manually just fine (with ceph-disk activate-lockbox /dev/sdX3).

Environment: Ubuntu 16.04
Hardware: HP ProLiant SL4540 Gen8, 18 HDDs, 4 SSDs

Note: applying https://github.com/ceph/ceph/pull/12210/commits/0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a does NOT help.
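
A quick way to find the affected units after boot, together with the manual recovery mentioned above (a sketch; the partition name is taken from the log below):

# list ceph-disk activation units that failed during boot
systemctl list-units --state=failed 'ceph-disk@*'
# re-run the activation by hand for one partition (same command as in the description)
sudo ceph-disk --verbose activate-lockbox /dev/sdm3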

sudo journalctl | grep sdm3

Jan 30 21:18:15 ceph-001 systemd[1]: Starting Ceph disk activation: /dev/sdm3...
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdm3', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f6b776dd668>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
Jan 30 21:18:16 ceph-001 sh[4071]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /sbin/blkid -o udev -p /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: trigger /dev/sdm3 parttype fb3aabf9-d25f-47cc-bf5e-721d1816496b uuid 00000000-0000-0000-0000-000000000000
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /usr/sbin/ceph-disk --verbose activate-lockbox /dev/sdm3
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Main process exited, code=exited, status=124/n/a
Jan 30 21:20:15 ceph-001 systemd[1]: Failed to start Ceph disk activation: /dev/sdm3.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Unit entered failed state.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Failed with result 'exit-code'.

Increasing the timeout in ceph-disk@.service to 900 seconds fixes the problem.
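
For context: status=124 in the journal above is the exit status timeout(1) returns when the wrapped command runs past its limit, so the activation was killed for exceeding the unit's timeout rather than failing on its own, which is why raising the timeout helps. One way to raise it without editing the packaged unit is a drop-in override; the ExecStart below is a sketch reconstructed from the log above, so copy the real line from systemctl cat ceph-disk@.service and only change the timeout value:

# /etc/systemd/system/ceph-disk@.service.d/timeout.conf  (hypothetical drop-in)
[Service]
# the empty ExecStart= clears the packaged command before replacing it
ExecStart=
ExecStart=/bin/sh -c 'timeout 900 /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

Run systemctl daemon-reload afterwards so the override takes effect on the next trigger.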


Related issues

Copied to Ceph - Backport #19910: jewel: random OSDs fail to start after reboot with systemd (Resolved)

History

#1 Updated by Nathan Cutler 8 months ago

#2 Updated by Alexey Sheplyakov 8 months ago

> Did you apply https://github.com/ceph/ceph/pull/12147 as well?

Yes.

#4 Updated by David Disseldorp 8 months ago

I'd be in favour of just dropping the timeout altogether (i.e. revert bed1a5cc05a9880b91fc9ac8d8a959efe3b3d512), as it can leave the OSD device in an unknown (mounted/unmounted) state when triggered.

In addition to 0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a, the ceph-disk activate_lock should also be made granular.
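
To illustrate the "granular" point (this is not the actual patch): with a single global lock, one slow activation serializes, and can time out, every ceph-disk@ instance queued behind it; a per-device lock keyed on the device name avoids that. A sketch at the unit level, assuming the lock is taken with flock:

# hypothetical per-device lock instead of one lock shared by all ceph-disk@ instances
ExecStart=/bin/sh -c 'timeout 900 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

The activate_lock David refers to lives inside ceph-disk itself, so the same idea would have to be applied there as well.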

#5 Updated by Alexey Sheplyakov 8 months ago

I think having no timeout at all is also bad: one wants to be notified when an OSD takes "too long" to start.
There's no universally good definition of "too long", though, so there should be a way to adjust it (and a sane default).
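
The shape this could take: keep the default in the unit as an Environment= variable and let operators override it with a drop-in, so the timeout stays in place but is adjustable (a sketch; the variable name CEPH_DISK_TIMEOUT is illustrative):

# in ceph-disk@.service: a default that can be overridden
[Service]
Environment=CEPH_DISK_TIMEOUT=120
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

# operator override, e.g. /etc/systemd/system/ceph-disk@.service.d/timeout.conf
[Service]
Environment=CEPH_DISK_TIMEOUT=900

The shell started by /bin/sh -c expands $CEPH_DISK_TIMEOUT at run time from the environment systemd passes in, so adjusting the value never requires touching the ExecStart line.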

#6 Updated by Ken Dreyer 7 months ago

  • Status changed from New to In Progress

#7 Updated by Kefu Chai 6 months ago

  • Status changed from In Progress to Resolved

#8 Updated by Alexey Sheplyakov 4 months ago

  • Status changed from Resolved to Pending Backport

#9 Updated by Alexey Sheplyakov 4 months ago

  • Copied to Backport #19910: jewel: random OSDs fail to start after reboot with systemd added

#10 Updated by Nathan Cutler 4 months ago

  • Backport set to jewel

#11 Updated by Nathan Cutler 29 days ago

  • Status changed from Pending Backport to Resolved
