Bug #18740

random OSDs fail to start after reboot with systemd

Added by Alexey Sheplyakov 8 months ago. Updated 29 days ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 01/31/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

After a reboot, random OSDs (2-4 of the 18) fail to start; which ones fail varies from boot to boot.
The problematic OSDs can be started manually just fine (with ceph-disk activate-lockbox /dev/sdX3).

Environment: Ubuntu 16.04
Hardware: HP ProLiant SL4540 Gen8, 18 HDDs, 4 SSDs

Note: applying https://github.com/ceph/ceph/pull/12210/commits/0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a does NOT help.
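
A quick way to find the affected units after boot, together with the manual recovery mentioned above (a sketch; the partition name is taken from the log below):

# list ceph-disk activation units that failed during boot
systemctl list-units --state=failed 'ceph-disk@*'
# re-run the activation by hand for one partition (same command as in the description)
sudo ceph-disk --verbose activate-lockbox /dev/sdm3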

sudo journalctl | grep sdm3

Jan 30 21:18:15 ceph-001 systemd[1]: Starting Ceph disk activation: /dev/sdm3...
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdm3', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f6b776dd668>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
Jan 30 21:18:16 ceph-001 sh[4071]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /sbin/blkid -o udev -p /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: trigger /dev/sdm3 parttype fb3aabf9-d25f-47cc-bf5e-721d1816496b uuid 00000000-0000-0000-0000-000000000000
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /usr/sbin/ceph-disk --verbose activate-lockbox /dev/sdm3
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Main process exited, code=exited, status=124/n/a
Jan 30 21:20:15 ceph-001 systemd[1]: Failed to start Ceph disk activation: /dev/sdm3.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Unit entered failed state.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Failed with result 'exit-code'.

Increasing the timeout in ceph-disk@.service to 900 seconds fixes the problem.
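
For context: status=124 in the journal above is the exit status timeout(1) returns when the wrapped command runs past its limit, so the activation was killed for exceeding the unit's timeout rather than failing on its own, which is why raising the timeout helps. One way to raise it without editing the packaged unit is a drop-in override; the ExecStart below is a sketch reconstructed from the log above, so copy the real line from systemctl cat ceph-disk@.service and only change the timeout value:

# /etc/systemd/system/ceph-disk@.service.d/timeout.conf  (hypothetical drop-in)
[Service]
# the empty ExecStart= clears the packaged command before replacing it
ExecStart=
ExecStart=/bin/sh -c 'timeout 900 /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

Run systemctl daemon-reload afterwards so the override takes effect on the next trigger.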


Related issues

Copied to Ceph - Backport #19910: jewel: random OSDs fail to start after reboot with systemd (Resolved)

History

#1 Updated by Nathan Cutler 8 months ago

#2 Updated by Alexey Sheplyakov 8 months ago

> Did you apply https://github.com/ceph/ceph/pull/12147 as well?

Yes.

#4 Updated by David Disseldorp 8 months ago

I'd be in favour of just dropping the timeout altogether (i.e. revert bed1a5cc05a9880b91fc9ac8d8a959efe3b3d512), as it can leave the OSD device in an unknown (mounted/unmounted) state when triggered.

In addition to 0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a, the ceph-disk activate_lock should also be made granular.
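
To illustrate the "granular" point (this is not the actual patch): with a single global lock, one slow activation serializes, and can time out, every ceph-disk@ instance queued behind it; a per-device lock keyed on the device name avoids that. A sketch at the unit level, assuming the lock is taken with flock:

# hypothetical per-device lock instead of one lock shared by all ceph-disk@ instances
ExecStart=/bin/sh -c 'timeout 900 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

The activate_lock David refers to lives inside ceph-disk itself, so the same idea would have to be applied there as well.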

#5 Updated by Alexey Sheplyakov 8 months ago

I think having no timeout at all is also bad: one wants to be notified when an OSD takes "too long" to start.
There's no universally good definition of "too long", though, so there should be a way to adjust it (and a sane default).
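
The shape this could take: keep the default in the unit as an Environment= variable and let operators override it with a drop-in, so the timeout stays in place but is adjustable (a sketch; the variable name CEPH_DISK_TIMEOUT is illustrative):

# in ceph-disk@.service: a default that can be overridden
[Service]
Environment=CEPH_DISK_TIMEOUT=120
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

# operator override, e.g. /etc/systemd/system/ceph-disk@.service.d/timeout.conf
[Service]
Environment=CEPH_DISK_TIMEOUT=900

The shell started by /bin/sh -c expands $CEPH_DISK_TIMEOUT at run time from the environment systemd passes in, so adjusting the value never requires touching the ExecStart line.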

#6 Updated by Ken Dreyer 7 months ago

  • Status changed from New to In Progress

#7 Updated by Kefu Chai 6 months ago

  • Status changed from In Progress to Resolved

#8 Updated by Alexey Sheplyakov 4 months ago

  • Status changed from Resolved to Pending Backport

#9 Updated by Alexey Sheplyakov 4 months ago

  • Copied to Backport #19910: jewel: random OSDs fail to start after reboot with systemd added

#10 Updated by Nathan Cutler 4 months ago

  • Backport set to jewel

#11 Updated by Nathan Cutler 29 days ago

  • Status changed from Pending Backport to Resolved
