Bug #18740: random OSDs fail to start after reboot with systemd

Added by Alexey Sheplyakov over 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a reboot, random OSDs (2 to 4 of the 18) fail to start.
The problematic OSDs can be started manually (with ceph-disk activate-lockbox /dev/sdX3) just fine.

Environment: Ubuntu 16.04
Hardware: HP ProLiant SL4540 Gen8, 18 HDDs, 4 SSDs

Note: applying https://github.com/ceph/ceph/pull/12210/commits/0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a does NOT help.
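For reference, a minimal sketch of the manual recovery described above (the device name is an example from this host; adjust it to whichever OSDs actually failed):

sudo systemctl --failed --no-legend | grep ceph-disk    # list the ceph-disk activation units that failed on boot
sudo ceph-disk --verbose activate-lockbox /dev/sdm3     # activate one of the affected OSDs by hand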

sudo journalctl | grep sdm3

Jan 30 21:18:15 ceph-001 systemd[1]: Starting Ceph disk activation: /dev/sdm3...
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdm3', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f6b776dd668>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
Jan 30 21:18:16 ceph-001 sh[4071]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /sbin/blkid -o udev -p /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: trigger /dev/sdm3 parttype fb3aabf9-d25f-47cc-bf5e-721d1816496b uuid 00000000-0000-0000-0000-000000000000
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /usr/sbin/ceph-disk --verbose activate-lockbox /dev/sdm3
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Main process exited, code=exited, status=124/n/a
Jan 30 21:20:15 ceph-001 systemd[1]: Failed to start Ceph disk activation: /dev/sdm3.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Unit entered failed state.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Failed with result 'exit-code'.

Increasing the timeout in ceph-disk@.service to 900 seconds fixes the problem (exit status 124 is what timeout(1) reports when the wrapped command is killed for running too long, so the activation is simply not finishing within the default timeout).
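For reference, a possible way to apply that workaround without editing the packaged unit file is a systemd drop-in, sketched below. The ExecStart line here is an assumption about what the stock jewel unit looks like (the status=124 above shows ceph-disk is wrapped in timeout(1)); copy the real line from systemctl cat ceph-disk@.service and only raise the timeout value to 900:

sudo systemctl cat ceph-disk@.service                   # note the packaged ExecStart line
sudo mkdir -p /etc/systemd/system/ceph-disk@.service.d
sudo tee /etc/systemd/system/ceph-disk@.service.d/timeout.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/bin/sh -c 'timeout 900 flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
EOF
sudo systemctl daemon-reload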


Related issues (1): 0 open, 1 closed

Copied to Ceph - Backport #19910: jewel: random OSDs fail to start after reboot with systemd (Resolved, Alexey Sheplyakov)