Bug #15559

closed

osds do not start after boot

Added by Ruben Kerkhof almost 8 years ago. Updated over 7 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a reboot, my OSDs are not started.
I have one hard disk per OSD, with the journal on a partition of /dev/sda, which is an SSD.
/dev/sda also contains an LVM volume group for the root filesystem. The OS is CentOS 7.2, Ceph version 10.1.2.

I enabled debug logging in udev and found this (the rest of the output is in the attached log):

Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[645]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda4' [812] exit with return code 1
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[633]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda11'(err) ' load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[651]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda9'(err) 'Traceback (most recent call last):'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[654]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdk1'(err) ' load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[628]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdj1'(err) ' load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[647]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda6'(err) ' File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4906, in main'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[627]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdf1'(err) ' setup_statedir(args.statedir)'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[636]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdl1'(err) ' setup_statedir(args.statedir)'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[623]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdb1'(err) 'Traceback (most recent call last):'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[629]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sdh1'(err) ' os.mkdir(STATEDIR)'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[625]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sde1'(err) ' setup_statedir(args.statedir)'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[632]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda10'(err) ' File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4366, in setup_statedir'
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[646]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda5'(err) 'OSError: [Errno 2] No such file or directory: '/var/lib/ceph''
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[641]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda14'(err) 'OSError: [Errno 2] No such file or directory: '/var/lib/ceph''
Apr 20 20:53:06 ams1-pod11-ceph18.tilaa.nl systemd-udevd[639]: '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/sda13'(err) 'OSError: [Errno 2] No such file or directory: '/var/lib/ceph''

I'll attach the full debug log.
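
For reference, the tracebacks point at setup_statedir() in /usr/lib/python2.7/site-packages/ceph_disk/main.py. Below is a minimal sketch of that failing step, reconstructed from the traceback above; it is simplified for illustration and is not the actual ceph-disk code.

import os

# Default state directory used by ceph-disk.
STATEDIR = '/var/lib/ceph'

def setup_statedir(dir):
    # Simplified sketch of the step the traceback points at: if the real
    # root (or /var) filesystem is not mounted yet -- for example because
    # udev fires from the initramfs before LVM has assembled its volumes --
    # then /var/lib does not exist and os.mkdir() raises
    # OSError: [Errno 2] No such file or directory: '/var/lib/ceph'.
    global STATEDIR
    STATEDIR = dir
    if not os.path.exists(STATEDIR):
        os.mkdir(STATEDIR)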

It's hard to tell from the enormous amount of output, but what I think is happening is:
- udev runs from the initramfs
- at that point LVM has not finished scanning for VGs/PVs, so the root filesystem is not available yet
- ceph-disk then tries to create /var/lib/ceph on the root filesystem, which fails (a quick check for this condition is sketched below)
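
To confirm that hypothesis, a diagnostic along these lines could be logged from the udev-triggered path. This is purely illustrative and not part of ceph-disk; the 'ceph-disk-race-check' identifier is made up.

import os
import syslog

def log_statedir_race(statedir='/var/lib/ceph'):
    # Illustrative diagnostic only: record whether the state directory's
    # parent exists at the moment the udev trigger fires. If /var/lib is
    # missing, the real root (or /var) filesystem has not been mounted yet
    # and setup_statedir() will fail exactly as in the traceback above.
    parent = os.path.dirname(statedir)  # /var/lib
    syslog.openlog('ceph-disk-race-check')
    syslog.syslog('%s exists: %s, %s exists: %s' % (
        parent, os.path.isdir(parent), statedir, os.path.isdir(statedir)))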

I can't trigger this on every reboot, however; I suspect it depends on how quickly LVM assembles its devices.

As a side note, what makes debugging this hard is that those Python tracebacks only show up with udev in debug mode.


Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #17889: ceph-disk: ceph-disk@.service races with ceph-osd@.service (Resolved, Loïc Dachary, 11/14/2016)

Actions #1

Updated by Loïc Dachary over 7 years ago

  • Is duplicate of Bug #17889: ceph-disk: ceph-disk@.service races with ceph-osd@.service added
Actions #2

Updated by Loïc Dachary over 7 years ago

  • Status changed from New to Duplicate
Actions #3

Updated by Loïc Dachary over 7 years ago

@Ruben Kerkhof I'm trying to reproduce the problem but haven't been able to so far. Can you? (Side note: I'm convinced this is real, but since it's a race, having a reliable way to reproduce it, even if it requires 10 reboots, would help immensely.)

Actions #4

Updated by Ruben Kerkhof over 7 years ago

Loïc Dachary wrote:

@Ruben Kerkhof I'm trying to reproduce the problem but haven't been able to so far. Can you? (Side note: I'm convinced this is real, but since it's a race, having a reliable way to reproduce it, even if it requires 10 reboots, would help immensely.)

Hi Loic,

I've since taken LVM out of the equation on both my clusters and haven't seen this anymore.
One thing I do recall, though, is that around the time I logged this issue we also had hardware problems: our journal/boot SSD responded extremely slowly. So it might be that the race is more easily triggered on slow hardware.

Another thing was that /var was on a separate LV (see the check sketched below).
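
With /var on its own logical volume, /var/lib/ceph can still be missing even after the root filesystem is up. A quick, purely illustrative check (not part of ceph-disk) to tell the two cases apart:

import os

# Illustrative diagnostic only: if /var is its own mount point and has not
# been mounted yet, /var/lib/ceph will be missing even though the root
# filesystem is already available.
print('/var is a separate mount point: %s' % os.path.ismount('/var'))
print('/var/lib/ceph exists: %s' % os.path.isdir('/var/lib/ceph'))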
