Bug #13940: OSDs fail to start on reboot with dmcrypt/luks - Ceph - Ceph

Bug #13940

closed

OSDs fail to start on reboot with dmcrypt/luks

Added by Aaron Bassett over 8 years ago. Updated about 8 years ago.

Status:

Won't Fix

Priority:

High

Assignee:

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

v0.94.3

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Linux smr-r1-r1-head2 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 14.04.3 LTS

I have 60 OSDs per host. They were created with ceph-disk activate --dmcrypt and use LUKS. When I reboot a host, about 10 OSDs come up and the rest fail to start. The ones that fail all have a /dev/mapper/temporary-cryptsetup-NNNN entry for their journal and data partitions. The symptoms all match this mailing list issue: http://www.spinics.net/lists/ceph-devel/msg25281.html. I ended up fixing it the same way: closing the temporary LUKS devices (luksClose), decrypting the partitions again under the right names, and then running start ceph-osd-all.
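
The manual recovery described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the reporter's actual script: it assumes the "right name" for each mapping is the partition UUID, that the keys live in /etc/ceph/dmcrypt-keys as <uuid>.luks.key, and that the caller has already worked out which device and UUID sit behind each temporary-cryptsetup-NNNN entry. The function names and the example values are hypothetical. It defaults to a dry run that only prints the commands.

```python
import os
import subprocess

# Assumption: key directory and ".luks.key" suffix; adjust to the local setup.
KEY_DIR = "/etc/ceph/dmcrypt-keys"


def plan_recovery(stale, key_dir=KEY_DIR):
    """Build the commands that would repair each stale mapping.

    `stale` is a list of (temp_name, device, part_uuid) tuples, e.g.
    ("temporary-cryptsetup-1234", "/dev/sdb1", "<partition uuid>").
    Assumption: the mapping should be reopened under the partition UUID.
    """
    commands = []
    for temp_name, device, part_uuid in stale:
        key_file = os.path.join(key_dir, part_uuid + ".luks.key")
        # Close the leftover temporary mapping so the device is free again.
        commands.append(["cryptsetup", "luksClose", temp_name])
        # Reopen the LUKS partition under the expected name.
        commands.append(
            ["cryptsetup", "--key-file", key_file, "luksOpen", device, part_uuid]
        )
    if commands:
        # Upstart job named in the report; starts all OSDs on the host.
        commands.append(["start", "ceph-osd-all"])
    return commands


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    example = [
        ("temporary-cryptsetup-1234", "/dev/sdb1",
         "11111111-2222-3333-4444-555555555555"),
    ]
    run(plan_recovery(example))
```

Keeping the planning step separate from execution makes it possible to review the exact cryptsetup calls on one host before running them across all 60 OSDs.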

Updated by Loïc Dachary about 8 years ago

Status changed from New to 12
Priority changed from Normal to High
Release set to hammer

Updated by Loïc Dachary about 8 years ago

This is most probably a timeout: individual udev actions take too long and then abort or fail (I don't know exactly what happens when a udev action takes a long time to complete). This cannot happen in infernalis, but the modifications are extensive and not easy to backport. The general idea is to not do any work when ceph-disk is called from udev. Instead, ceph-disk trigger is called and launches a systemd/upstart action in the background.
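
The "do no work inside udev" idea can be illustrated with a small sketch. This is not the ceph-disk implementation: the unit name example-disk-activate@.service and both function names are hypothetical, and using the device basename as the unit instance is a simplification that avoids systemd path escaping. The only real interface relied on is systemctl's --no-block option, which queues the job and returns without waiting for it to finish.

```python
import os
import subprocess


def build_trigger_command(dev, unit_template="example-disk-activate@{}.service"):
    """Return a command that hands activation of `dev` to systemd.

    Hypothetical unit name; the instance is the device basename (e.g. "sdb1").
    """
    instance = os.path.basename(dev)
    # --no-block: enqueue the start job and return immediately.
    return ["systemctl", "--no-block", "start", unit_template.format(instance)]


def on_udev_event(dev, dry_run=True):
    """What a udev-invoked helper would do: delegate, then return at once."""
    cmd = build_trigger_command(dev)
    if not dry_run:
        # The slow work (decrypt, mount, start the OSD) runs in the unit,
        # outside the udev event handler and its timeout.
        subprocess.check_call(cmd)
    return cmd


if __name__ == "__main__":
    print(" ".join(on_udev_event("/dev/sdb1")))
```

With this shape the udev handler finishes in the time it takes to queue a job, so the duration of the actual activation no longer counts against the udev event timeout.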

Updated by Dan Mick about 8 years ago

Submitter responded in email (which bounced):

Ok thanks for the update. We are evaluating Infernalis for our next deployment, and I have a script that cleans up from the broken state on boot, so we can limp along as is if this needs to be WONTFIX.
