Bug #18740

random OSDs fail to start after reboot with systemd

Added by Alexey Sheplyakov 10 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 01/31/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

After a reboot, random OSDs (2 to 4 of the 18) fail to start.
The problematic OSDs can be started manually just fine (with ceph-disk activate-lockbox /dev/sdX3).

Environment: Ubuntu 16.04
Hardware: HP ProLiant SL4540 Gen8, 18 HDDs, 4 SSDs

Note: applying https://github.com/ceph/ceph/pull/12210/commits/0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a does NOT help.

sudo journalctl | grep sdm3

Jan 30 21:18:15 ceph-001 systemd[1]: Starting Ceph disk activation: /dev/sdm3...
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdm3', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f6b776dd668>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
Jan 30 21:18:16 ceph-001 sh[4071]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /sbin/blkid -o udev -p /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: trigger /dev/sdm3 parttype fb3aabf9-d25f-47cc-bf5e-721d1816496b uuid 00000000-0000-0000-0000-000000000000
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /usr/sbin/ceph-disk --verbose activate-lockbox /dev/sdm3
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Main process exited, code=exited, status=124/n/a
Jan 30 21:20:15 ceph-001 systemd[1]: Failed to start Ceph disk activation: /dev/sdm3.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Unit entered failed state.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Failed with result 'exit-code'.

Increasing the timeout in ceph-disk@.service to 900 seconds fixes the problem.
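
For reference, status=124 in the journal above is the exit code of the coreutils timeout command, and the two minutes between "Starting" and "Failed to start" match the timeout the unit wraps ceph-disk trigger in. A minimal sketch of a systemd drop-in that raises the timeout to 900 seconds follows; the ExecStart line is assumed to match the installed ceph-disk@.service (copy it from the installed unit and change only the timeout value if it differs):

# /etc/systemd/system/ceph-disk@.service.d/timeout.conf
[Service]
# clear the original ExecStart, then re-add it with a longer timeout
ExecStart=
ExecStart=/bin/sh -c 'timeout 900 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

Afterwards, reload units and re-trigger the failed instance, e.g.:

sudo systemctl daemon-reload
sudo systemctl restart ceph-disk@dev-sdm3.service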


Related issues

Copied to Ceph - Backport #19910: jewel: random OSDs fail to start after reboot with systemd Resolved

History

#1 Updated by Nathan Cutler 10 months ago

#2 Updated by Alexey Sheplyakov 10 months ago

> Did you apply https://github.com/ceph/ceph/pull/12147 as well?

Yes.

#4 Updated by David Disseldorp 10 months ago

I'd be in favour of just dropping the timeout altogether (i.e. reverting bed1a5cc05a9880b91fc9ac8d8a959efe3b3d512), as it can leave the OSD device in an unknown (mounted/unmounted) state when triggered.

In addition to 0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a, the ceph-disk activate_lock should also be made granular.

#5 Updated by Alexey Sheplyakov 10 months ago

I think having no timeout at all is also bad: one wants to be notified when an OSD takes "too long" to start.
There's no universally good definition of "too long", though, so there should be a way to adjust it (and a sane default).
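
One way to make the timeout adjustable while keeping a default (a sketch only, not necessarily the change that was eventually merged) is to move the value into an Environment= variable in ceph-disk@.service, so operators can override it with a drop-in instead of editing the unit:

# ceph-disk@.service (illustrative excerpt; the variable name and default value are assumptions)
[Service]
Environment=CEPH_DISK_TIMEOUT=900
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

# per-host override, e.g. /etc/systemd/system/ceph-disk@.service.d/timeout.conf
[Service]
Environment=CEPH_DISK_TIMEOUT=1800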

#6 Updated by Ken Dreyer 9 months ago

  • Status changed from New to In Progress

#7 Updated by Kefu Chai 8 months ago

  • Status changed from In Progress to Resolved

#8 Updated by Alexey Sheplyakov 6 months ago

  • Status changed from Resolved to Pending Backport

#9 Updated by Alexey Sheplyakov 6 months ago

  • Copied to Backport #19910: jewel: random OSDs fail to start after reboot with systemd added

#10 Updated by Nathan Cutler 6 months ago

  • Backport set to jewel

#11 Updated by Nathan Cutler 3 months ago

  • Status changed from Pending Backport to Resolved
