Bug #15559

closed

osds do not start after boot

Added by Ruben Kerkhof almost 8 years ago. Updated over 7 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a reboot, my OSDs are not started.
I have one hard disk per OSD, with the journal on a partition of /dev/sda, which is an SSD.
/dev/sda also contains an LVM volume group for the root filesystem. The OS is CentOS 7.2, Ceph version 10.1.2.

I enabled debug logging in udev and found this (the rest of the output is in the attached log):

Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[645]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda4' [812] exit with return code 1
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[633]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda11'(err) ' load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[651]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda9'(err) 'Traceback (most recent call last):'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[654]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdk1'(err) ' load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[628]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdj1'(err) ' load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[647]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda6'(err) ' File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4906, in main'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[627]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdf1'(err) ' setup_statedir(args.statedir)'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[636]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdl1'(err) ' setup_statedir(args.statedir)'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[623]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdb1'(err) 'Traceback (most recent call last):'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[629]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdh1'(err) ' os.mkdir(STATEDIR)'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[625]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sde1'(err) ' setup_statedir(args.statedir)'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[632]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda10'(err) ' File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4366, in setup_statedir'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[646]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda5'(err) 'OSError: [Errno 2] No such file or directory: '/var/lib/ceph''
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[641]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda14'(err) 'OSError: [Errno 2] No such file or directory: '/var/lib/ceph''
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[639]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda13'(err) 'OSError: [Errno 2] No such file or directory: '/var/lib/ceph''

I'll attach the full debug log.
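
For reference, the tracebacks point at setup_statedir() in /usr/lib/python2.7/site-packages/ceph_disk/main.py. Below is a minimal sketch of that failing step, reconstructed from the traceback above; it is simplified for illustration and is not the actual ceph-disk code.

import os

# Default state directory used by ceph-disk.
STATEDIR = '/var/lib/ceph'

def setup_statedir(dir):
    # Simplified sketch of the step the traceback points at: if the real
    # root (or /var) filesystem is not mounted yet -- for example because
    # udev fires from the initramfs before LVM has assembled its volumes --
    # then /var/lib does not exist and os.mkdir() raises
    # OSError: [Errno 2] No such file or directory: '/var/lib/ceph'.
    global STATEDIR
    STATEDIR = dir
    if not os.path.exists(STATEDIR):
        os.mkdir(STATEDIR)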

It's hard to tell from the enormous amount of output, but what I think is happening is:
- udev runs from the initramfs
- at that point LVM has not finished scanning for VGs/PVs, so the root filesystem is not available yet
- ceph-disk then tries to create /var/lib/ceph on the root filesystem, which fails (a quick check for this condition is sketched below)
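
To confirm that hypothesis, a diagnostic along these lines could be logged from the udev-triggered path. This is purely illustrative and not part of ceph-disk; the 'ceph-disk-race-check' identifier is made up.

import os
import syslog

def log_statedir_race(statedir='/var/lib/ceph'):
    # Illustrative diagnostic only: record whether the state directory's
    # parent exists at the moment the udev trigger fires. If /var/lib is
    # missing, the real root (or /var) filesystem has not been mounted yet
    # and setup_statedir() will fail exactly as in the traceback above.
    parent = os.path.dirname(statedir)  # /var/lib
    syslog.openlog('ceph-disk-race-check')
    syslog.syslog('%s exists: %s, %s exists: %s' % (
        parent, os.path.isdir(parent), statedir, os.path.isdir(statedir)))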

I can't trigger this on every reboot, however; I suspect it depends on how quickly LVM assembles its devices.

As a side note, what makes debugging this hard is that those Python tracebacks only show up with udev in debug mode.


Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #17889: ceph-disk: ceph-disk@.service races with ceph-osd@.service (Resolved, Loïc Dachary, 11/14/2016)

Actions #1

Updated by Loïc Dachary over 7 years ago

  • Is duplicate of Bug #17889: ceph-disk: ceph-disk@.service races with ceph-osd@.service added
Actions #2

Updated by Loïc Dachary over 7 years ago

  • Status changed from New to Duplicate
Actions #3

Updated by Loïc Dachary over 7 years ago

@Ruben Kerkhof I'm trying to reproduce the problem but haven't been able to so far. Can you? (Side note: I'm convinced this is real, but since it's a race, having a reliable way to reproduce it, even if it requires 10 reboots, would help immensely.)

Actions #4

Updated by Ruben Kerkhof over 7 years ago

Loïc Dachary wrote:

@Ruben Kerkhof I'm trying to reproduce the problem but haven't been able to so far. Can you? (Side note: I'm convinced this is real, but since it's a race, having a reliable way to reproduce it, even if it requires 10 reboots, would help immensely.)

Hi Loic,

I've since taken LVM out of the equation on both my clusters and haven't seen this anymore.
One thing I do recall, though, is that around the time I logged this issue we also had hardware problems: our journal/boot SSD responded extremely slowly. So it might be that the race is more easily triggered on slow hardware.

Another thing was that /var was on a separate LV (see the check sketched below).
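
With /var on its own logical volume, /var/lib/ceph can still be missing even after the root filesystem is up. A quick, purely illustrative check (not part of ceph-disk) to tell the two cases apart:

import os

# Illustrative diagnostic only: if /var is its own mount point and has not
# been mounted yet, /var/lib/ceph will be missing even though the root
# filesystem is already available.
print('/var is a separate mount point: %s' % os.path.ismount('/var'))
print('/var/lib/ceph exists: %s' % os.path.isdir('/var/lib/ceph'))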
