Bug #5194 (closed)

udev does not start osd after reboot on wheezy or el6 or fedora

Added by Robert Sander almost 11 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: ceph-deploy
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph-deploy creates a partition with a filesystem (XFS by default) and mounts it to /var/lib/ceph/osd/<clustername>-<id>.

This mount is not added to /etc/fstab, so it does not persist across reboots.
After a reboot the init.d script does not start the OSD, as there is no data in /var/lib/ceph/osd/<clustername>-<id>.

ceph-deploy should add an entry to /etc/fstab. At the very least, the documentation at http://ceph.com/docs/master/rados/deployment/ceph-deploy-osd/ should mention this.
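For illustration, an fstab entry of the kind requested might look like the following (the device path, cluster name, and OSD id are assumed for the example):

# /etc/fstab -- mount the OSD data partition at boot (sketch)
/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  defaults,noatime  0  0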


Files

syslog (192 KB) Robert Sander, 06/13/2013 08:40 AM
syslog (191 KB) Robert Sander, 06/14/2013 01:09 PM
Actions #1

Updated by Ian Colle almost 11 years ago

  • Assignee set to Anonymous
  • Priority changed from Normal to Urgent
Actions #2

Updated by Robert Sander almost 11 years ago

Something like

grep osd/<clustername>-<id> /proc/mounts >> /etc/fstab

could work after the OSD filesystem has been mounted.
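A slightly fuller variant of that one-liner (a sketch; the mount options are assumptions) could pull the device, mount point, and filesystem type out of /proc/mounts:

# Append any mounted OSD filesystems to /etc/fstab (sketch)
grep '/var/lib/ceph/osd/' /proc/mounts \
  | awk '{ print $1, $2, $3, "defaults,noatime", 0, 0 }' >> /etc/fstab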

Actions #3

Updated by Sage Weil almost 11 years ago

udev should trigger 'ceph-disk activate' after the reboot to bring the osd back up; no fstab entry should be necessary (provided GPT partitions are being used)
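For reference, the hook is a udev rule keyed on the GPT partition type GUID. A simplified sketch of the kind of rule shipped as 95-ceph-osd.rules (the exact rule text varies by Ceph version):

# Match Ceph OSD data partitions by GPT type GUID and activate them (sketch)
ACTION=="add", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  RUN+="/usr/sbin/ceph-disk-activate --mount /dev/$name"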

Actions #4

Updated by Robert Sander almost 11 years ago

Then a component is missing on my test system (Debian 7 wheezy).

After rebooting the filesystem is not mounted when not in /etc/fstab.

There is nothing returned from "grep -r ceph /etc/udev".
There is a /lib/udev/rules.d/95-ceph-osd.rules; when I link that into /etc/udev/rules.d, udev still does not recognize the filesystem.
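(When testing a rule by hand, something like the following can show whether udev picks it up; a sketch, with the device name assumed:)

udevadm control --reload-rules            # re-read rules.d
udevadm trigger --subsystem-match=block   # replay add events for block devices
udevadm test $(udevadm info -q path -n /dev/sdb1)   # dry-run rule processing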

Actions #5

Updated by Sage Weil almost 11 years ago

what happens if you do 'ceph-disk-activate /dev/sdb1' (or whatever the xfs partition is)? what about 'partprobe /dev/sdb' (or whatever the disk device is)?

is it a gpt partition that ceph-deploy created, or did you partition the disk yourself? the udev stuff will only trigger based on GPT partition labels...
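(Spelled out, with the device names assumed:)

ceph-disk-activate --mount /dev/sdb1   # mount the data partition and start the OSD
partprobe /dev/sdb                     # re-read the partition table; udev add events fire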

Actions #6

Updated by Sage Weil almost 11 years ago

  • Subject changed from ceph-deploy osd create does not add fstab entry to udev does not start osd after reboot on wheezy
  • Status changed from New to Need More Info

can you confirm whether 'partprobe /dev/...' will start the osd?

Actions #7

Updated by Sage Weil almost 11 years ago

  • Assignee deleted (Anonymous)
Actions #8

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High
Actions #9

Updated by Robert Sander almost 11 years ago

Sage Weil wrote:

what happens if you do 'ceph-disk-activate /dev/sdb1' (or whatever the xfs partition is)? what about 'partprobe /dev/sdb' (or whatever the disk device is)?

root@ceph03-test:~# cat /proc/partitions
major minor  #blocks  name

   8        0   10485760 sda
   8        1     248832 sda1
   8        2          1 sda2
   8        5   10233856 sda5
   8       16   16777216 sdb
   8       17   16776175 sdb1
   8       32    1048576 sdc
   8       33    1047552 sdc1
   8       48   16777216 sdd
   8       49   16776175 sdd1
   8       64    1048576 sde
   8       65    1047552 sde1

"partprobe /dev/sdb" and "partprobe /dev/sdd" mounts the filesystems.

is it a gpt partition that ceph-deploy created, or did you partition the disk yourself? the udev stuff will only trigger based on GPT partition labels...

Both have been created with ceph-deploy:

Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name       Flags
 1      1049kB  17.2GB  17.2GB  xfs          ceph data

Actions #10

Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Urgent

We need to gather some udev logs to diagnose this... can you change the log level in /etc/udev/udev.conf to 'debug', restart the udevd daemon (service udevd restart?), reproduce the problem, and then attach... probably /var/log/syslog, or /var/log/daemon.log? (not sure where the udev output goes!)
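(For concreteness, the steps might look like this on wheezy; the service name and config key are assumptions:)

# /etc/udev/udev.conf -- raise the log verbosity
udev_log="debug"

service udev restart   # or reboot, so early-boot events are logged too
# reproduce the problem, then collect /var/log/syslog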

Thanks!

Upgrading this to urgent since we've seen similar things on other distros as well.

Actions #11

Updated by Robert Sander almost 11 years ago

Hi,

attached is /var/log/syslog after booting the machine with udev debug level logging.

The filesystems have not been mounted automatically.

I issued "partprobe /dev/sdb" at 17:34:56 and "partprobe /dev/sdd" at 17:35:39.

Actions #12

Updated by Sage Weil almost 11 years ago

  • Assignee set to Sage Weil

I see it starting osd.5 and osd.2:

Jun 13 17:35:39 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1'(out) '=== osd.2 === '
Jun 13 17:35:40 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1'(out) 'create-or-move updated item id 2 name 'osd.2' weight 0.02 at location {host=ceph03-test,root=default} to crush map'
Jun 13 17:35:40 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1'(out) 'Starting Ceph osd.2 on ceph03-test...'
Jun 13 17:35:40 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1'(out) 'starting osd.2 at :/0 osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal'
Jun 13 17:35:40 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1' [3057] exit with return code 0

...


i take it the processes are no longer running? can you look in the /var/log/ceph/ceph-osd.[25].log logs to see what happens? (and/or attach them)

Thanks!

Actions #13

Updated by Robert Sander almost 11 years ago

Hi Sage,

this was a clean reboot of the cluster node.

As the filesystems have not been mounted automatically no OSD has been started.

They get started by udev as soon as I issue the partprobe commands.

Actions #14

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to 7
Actions #15

Updated by Sage Weil almost 11 years ago

  • Status changed from 7 to Need More Info

Hi Robert,

Can you grab

https://github.com/ceph/ceph/blob/master/src/ceph-disk and copy it to /usr/sbin
https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules for /lib/udev/rules.d

and see if the problem is resolved? I think I've squashed all the issues...
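(One way to install them; the raw.githubusercontent.com form of those blob URLs is an assumption:)

wget -O /usr/sbin/ceph-disk \
  https://raw.githubusercontent.com/ceph/ceph/master/src/ceph-disk
chmod +x /usr/sbin/ceph-disk
wget -O /lib/udev/rules.d/95-ceph-osd.rules \
  https://raw.githubusercontent.com/ceph/ceph/master/udev/95-ceph-osd.rules
udevadm control --reload-rules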

thanks!

Actions #16

Updated by Robert Sander almost 11 years ago

Sage Weil wrote:

Can you grab

https://github.com/ceph/ceph/blob/master/src/ceph-disk and copy it to /usr/sbin
https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules for /lib/udev/rules.d

and see if the problem is resolved?

Hi Sage,

I am sorry but I still have to run partprobe manually after a reboot.

Actions #17

Updated by Sage Weil almost 11 years ago

Can you generate and attach a udev log after the reboot? Actually, ideally,

- reboot
- note the time
- run partprobe

and send the log along so i can tell what activity resulted from partprobe (though i guess it'll be obvious).

Thanks! (Also, if you're on irc right now that'd be quicker to debug this... #ceph on irc.oftc.net)

Actions #18

Updated by Robert Sander almost 11 years ago

Hi Sage,

attached is the current syslog.

I started "partprobe /dev/sdb" at Jun 14 21:57:06 and "partprobe /dev/sdd" at Jun 14 21:58:10. Before that there seems to be no udev activity except for the vmhgfs and vmsync modules.

Could it be that udev already runs in the initrd context and we do not see that output in the syslog?

Sorry for not joining IRC, but I am busy with other things.

Actions #19

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to In Progress

thanks - i now see the problem (and can reproduce it here, yay!). testing a fix

Actions #20

Updated by Sage Weil almost 11 years ago

  • Subject changed from udev does not start osd after reboot on wheezy to udev does not start osd after reboot on wheezy or el6 or fedora
Actions #21

Updated by Sage Weil almost 11 years ago

rhel seems to be working, fedora18 is acting very strange.

Actions #22

Updated by Sage Weil almost 11 years ago

update:

  • wheezy is working well.
  • fedora is failing only because the mon doesn't start on boot. see #5369
  • rhel needs to be retested.
Actions #23

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Fix Under Review

now works on rhel, centos, wheezy, precise. f18 still has the mon start issue.

Actions #24

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions #25

Updated by René Pavlík almost 8 years ago

I want to point out a regression of this on Debian Jessie. As a temporary workaround I placed the corresponding lines in /etc/fstab to mount the XFS partitions at boot. After that everything works and the OSD daemons are started automatically. I tried partprobe without any success.
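(For reference, a UUID-pinned form of that workaround; the device and OSD id are assumed for the example:)

blkid -s UUID -o value /dev/sdb1   # print the filesystem UUID
# then in /etc/fstab, keyed by UUID rather than device name:
# UUID=<uuid-from-blkid>  /var/lib/ceph/osd/ceph-2  xfs  defaults,noatime  0  0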

I can generate the logs if needed.

Thanks. Rene

Actions #26

Updated by Nathan Cutler almost 8 years ago

René, the Debian Jessie issue is known and is being addressed by http://tracker.ceph.com/issues/16351

Actions #27

Updated by René Pavlík almost 8 years ago

ok, Nathan, thanks for the link.
