Bug #18740 (closed): random OSDs fail to start after reboot with systemd

Added by Alexey Sheplyakov about 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a reboot random OSDs (2 -- 4 of 18) fail to start.
The problematic OSDs can be started manually (with ceph-disk activate-lockbox /dev/sdX3) just fine.
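A sketch of that manual recovery on an affected host (the device name is simply the example taken from the journal below):

systemctl --failed 'ceph-disk@*'                      # list the disk activations that ended up in a failed state
sudo ceph-disk --verbose activate-lockbox /dev/sdm3   # re-run the activation by hand for the affected partition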

Environment: Ubuntu 16.04
Hardware: HP ProLiant SL4540 Gen8, 18 HDDs, 4 SSDs

Note: applying https://github.com/ceph/ceph/pull/12210/commits/0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a does NOT help.

sudo journalctl | grep sdm3

Jan 30 21:18:15 ceph-001 systemd[1]: Starting Ceph disk activation: /dev/sdm3...
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdm3', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f6b776dd668>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
Jan 30 21:18:16 ceph-001 sh[4071]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /sbin/blkid -o udev -p /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: trigger /dev/sdm3 parttype fb3aabf9-d25f-47cc-bf5e-721d1816496b uuid 00000000-0000-0000-0000-000000000000
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /usr/sbin/ceph-disk --verbose activate-lockbox /dev/sdm3
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Main process exited, code=exited, status=124/n/a
Jan 30 21:20:15 ceph-001 systemd[1]: Failed to start Ceph disk activation: /dev/sdm3.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Unit entered failed state.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Failed with result 'exit-code'.
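Exit status 124 in the last journal lines is what coreutils timeout(1) returns when the wrapped command exceeds its time limit, so ceph-disk did not fail on its own; it was killed for running too long (the unit starts at 21:18 and fails exactly two minutes later). A quick local check of that convention:

timeout 1 sleep 5; echo $?
124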

Increasing the timeout in ceph-disk@.service to 900 seconds fixes the problem.
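For reference, a minimal sketch of applying that workaround as a systemd drop-in rather than by editing the packaged unit. The ExecStart line below is modelled on the jewel-era unit (coreutils timeout wrapping flock and ceph-disk trigger); copy the exact command from the installed /lib/systemd/system/ceph-disk@.service and change only the timeout value:

# /etc/systemd/system/ceph-disk@.service.d/timeout.conf
[Service]
# Clear the packaged ExecStart, then re-add it with a 900 s limit instead of 120 s.
# The flock path is the one used by the jewel-era unit and may differ elsewhere.
ExecStart=
ExecStart=/bin/sh -c 'timeout 900 flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

Run systemctl daemon-reload afterwards so the override is picked up.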


Related issues: 1 (0 open, 1 closed)

Copied to Ceph - Backport #19910: jewel: random OSDs fail to start after reboot with systemd (Resolved, Alexey Sheplyakov)
#1

Updated by Nathan Cutler about 7 years ago

#2

Updated by Alexey Sheplyakov about 7 years ago

> Did you apply https://github.com/ceph/ceph/pull/12147 as well?

Yes.

#4

Updated by David Disseldorp about 7 years ago

I'd be in favour of just dropping the timeout altogether (i.e. reverting bed1a5cc05a9880b91fc9ac8d8a959efe3b3d512), as it can leave the OSD device in an unknown (mounted/unmounted) state when triggered.

In addition to 0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a, the ceph-disk activate_lock should also be made granular.

#5

Updated by Alexey Sheplyakov about 7 years ago

I think having no timeout at all is also bad: one wants to be notified that an OSD took "too long" to start.
There's no universally good definition of "too long", though, so there should be a way to adjust it (and a sane default).
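One possible shape for that, sketched here rather than quoted from the eventual fix, is to move the limit into an Environment= setting with a conservative default; the shell in ExecStart picks it up from the unit's environment at run time, and an administrator can raise it per host with a drop-in. The variable name CEPH_DISK_TIMEOUT and the 300 s default are illustrative:

[Service]
# Default activation timeout in seconds; override per host with a drop-in
# containing only Environment=CEPH_DISK_TIMEOUT=900 (or whatever fits the hardware).
Environment=CEPH_DISK_TIMEOUT=300
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'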

#6

Updated by Ken Dreyer about 7 years ago

  • Status changed from New to In Progress
#7

Updated by Kefu Chai about 7 years ago

  • Status changed from In Progress to Resolved
#8

Updated by Alexey Sheplyakov almost 7 years ago

  • Status changed from Resolved to Pending Backport
#9

Updated by Alexey Sheplyakov almost 7 years ago

  • Copied to Backport #19910: jewel: random OSDs fail to start after reboot with systemd added
#10

Updated by Nathan Cutler almost 7 years ago

  • Backport set to jewel
#11

Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved
