Bug #5194 (closed)

udev does not start osd after reboot on wheezy or el6 or fedora

Added by Robert Sander almost 11 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: ceph-deploy
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph-deploy creates a partition with a filesystem (XFS by default) and mounts it to /var/lib/ceph/osd/<clustername>-<id>.

This mount is not added to /etc/fstab, so it does not persist across reboots.
After a reboot the init.d script does not start the OSD, as there is no data in /var/lib/ceph/osd/<clustername>-<id>.

ceph-deploy should add an entry to /etc/fstab. At the very least, the documentation at http://ceph.com/docs/master/rados/deployment/ceph-deploy-osd/ should mention this.
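For illustration, an fstab entry of the kind requested might look like the following (the device path, cluster name, and OSD id are assumed for the example):

# /etc/fstab -- mount the OSD data partition at boot (sketch)
/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  defaults,noatime  0  0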


Files

syslog (192 KB) Robert Sander, 06/13/2013 08:40 AM
syslog (191 KB) Robert Sander, 06/14/2013 01:09 PM
Actions #1

Updated by Ian Colle almost 11 years ago

  • Assignee set to Anonymous
  • Priority changed from Normal to Urgent
Actions #2

Updated by Robert Sander almost 11 years ago

Something like

grep osd/<clustername>-<id> /proc/mounts >> /etc/fstab

could work after the OSD filesystem has been mounted.
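A slightly fuller variant of that one-liner (a sketch; the mount options are assumptions) could pull the device, mount point, and filesystem type out of /proc/mounts:

# Append any mounted OSD filesystems to /etc/fstab (sketch)
grep '/var/lib/ceph/osd/' /proc/mounts \
  | awk '{ print $1, $2, $3, "defaults,noatime", 0, 0 }' >> /etc/fstab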

Actions #3

Updated by Sage Weil almost 11 years ago

udev should trigger 'ceph-disk activate' after the reboot to bring the osd back up; no fstab entry should be necessary (provided GPT partitions are being used)
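For reference, the hook is a udev rule keyed on the GPT partition type GUID. A simplified sketch of the kind of rule shipped as 95-ceph-osd.rules (the exact rule text varies by Ceph version):

# Match Ceph OSD data partitions by GPT type GUID and activate them (sketch)
ACTION=="add", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  RUN+="/usr/sbin/ceph-disk-activate --mount /dev/$name"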

Actions #4

Updated by Robert Sander almost 11 years ago

Then a component is missing on my test system (Debian 7 wheezy).

After rebooting the filesystem is not mounted when not in /etc/fstab.

There is nothing returned from "grep -r ceph /etc/udev".
There is a /lib/udev/rules.d/95-ceph-osd.rules; when I link that into /etc/udev/rules.d, udev still does not recognize the filesystem.
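(When testing a rule by hand, something like the following can show whether udev picks it up; a sketch, with the device name assumed:)

udevadm control --reload-rules            # re-read rules.d
udevadm trigger --subsystem-match=block   # replay add events for block devices
udevadm test $(udevadm info -q path -n /dev/sdb1)   # dry-run rule processing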

Actions #5

Updated by Sage Weil almost 11 years ago

what happens if you do 'ceph-disk-activate /dev/sdb1' (or whatever the xfs partition is)? what about 'partprobe /dev/sdb' (or whatever the disk device is)?

is it a gpt partition that ceph-deploy created, or did you partition the disk yourself? the udev stuff will only trigger based on GPT partition labels...
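(Spelled out, with the device names assumed:)

ceph-disk-activate --mount /dev/sdb1   # mount the data partition and start the OSD
partprobe /dev/sdb                     # re-read the partition table; udev add events fire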

Actions #6

Updated by Sage Weil almost 11 years ago

  • Subject changed from ceph-deploy osd create does not add fstab entry to udev does not start osd after reboot on wheezy
  • Status changed from New to Need More Info

can you confirm whether 'partprobe /dev/...' will start the osd?

Actions #7

Updated by Sage Weil almost 11 years ago

  • Assignee deleted (Anonymous)
Actions #8

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High
Actions #9

Updated by Robert Sander almost 11 years ago

Sage Weil wrote:

what happens if you do 'ceph-disk-activate /dev/sdb1' (or whatever the xfs partition is)? what about 'partprobe /dev/sdb' (or whatever the disk device is)?

root@ceph03-test:~# cat /proc/partitions
major minor  #blocks  name

   8        0   10485760 sda
   8        1     248832 sda1
   8        2          1 sda2
   8        5   10233856 sda5
   8       16   16777216 sdb
   8       17   16776175 sdb1
   8       32    1048576 sdc
   8       33    1047552 sdc1
   8       48   16777216 sdd
   8       49   16776175 sdd1
   8       64    1048576 sde
   8       65    1047552 sde1

"partprobe /dev/sdb" and "partprobe /dev/sdd" mounts the filesystems.

is it a gpt partition that ceph-deploy created, or did you partition the disk yourself? the udev stuff will only trigger based on GPT partition labels...

Both have been created with ceph-deploy:

Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name       Flags
 1      1049kB  17.2GB  17.2GB  xfs          ceph data

Actions #10

Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Urgent

We need to gather some udev logs to diagnose this... can you change the log level in /etc/udev/udev.conf to 'debug', restart the udevd daemon (service udevd restart?), reproduce the problem, and then attach... probably /var/log/syslog, or /var/log/daemon.log? (not sure where the udev output goes!)
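(For concreteness, the steps might look like this on wheezy; the service name and config key are assumptions:)

# /etc/udev/udev.conf -- raise the log verbosity
udev_log="debug"

service udev restart   # or reboot, so early-boot events are logged too
# reproduce the problem, then collect /var/log/syslog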

Thanks!

Upgrading this to urgent since we've seen similar things on other distros as well.

Actions #11

Updated by Robert Sander almost 11 years ago

Hi,

attached is /var/log/syslog after booting the machine with udev debug level logging.

The filesystems have not been mounted automatically.

I issued "partprobe /dev/sdb" at 17:34:56 and "partprobe /dev/sdd" at 17:35:39.

Actions #12

Updated by Sage Weil almost 11 years ago

  • Assignee set to Sage Weil

I see it starting osd.5 and osd.2:

Jun 13 17:35:39 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1'(out) '=== osd.2 === '
Jun 13 17:35:40 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1'(out) 'create-or-move updated item id 2 name 'osd.2' weight 0.02 at location {host=ceph03-test,root=default} to crush map'
Jun 13 17:35:40 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1'(out) 'Starting Ceph osd.2 on ceph03-test...'
Jun 13 17:35:40 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1'(out) 'starting osd.2 at :/0 osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal'
Jun 13 17:35:40 ceph03-test udevd[469]: '/usr/sbin/ceph-disk-activate --mount /dev/sdd1' [3057] exit with return code 0

...


i take it the processes are no longer running? can you look in the /var/log/ceph/ceph-osd.[25].log logs to see what happens? (and/or attach them)

Thanks!

Actions #13

Updated by Robert Sander almost 11 years ago

Hi Sage,

this was a clean reboot of the cluster node.

As the filesystems have not been mounted automatically no OSD has been started.

They get started by udev as soon as I issue the partprobe commands.

Actions #14

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to 7
Actions #15

Updated by Sage Weil almost 11 years ago

  • Status changed from 7 to Need More Info

Hi Robert,

Can you grab

https://github.com/ceph/ceph/blob/master/src/ceph-disk and copy it to /usr/sbin
https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules for /lib/udev/rules.d

and see if the problem is resolved? I think I've squashed all the issues...
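(One way to install them; the raw.githubusercontent.com form of those blob URLs is an assumption:)

wget -O /usr/sbin/ceph-disk \
  https://raw.githubusercontent.com/ceph/ceph/master/src/ceph-disk
chmod +x /usr/sbin/ceph-disk
wget -O /lib/udev/rules.d/95-ceph-osd.rules \
  https://raw.githubusercontent.com/ceph/ceph/master/udev/95-ceph-osd.rules
udevadm control --reload-rules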

thanks!

Actions #16

Updated by Robert Sander almost 11 years ago

Sage Weil wrote:

Can you grab

https://github.com/ceph/ceph/blob/master/src/ceph-disk and copy it to /usr/sbin
https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules for /lib/udev/rules.d

and see if the problem is resolved?

Hi Sage,

I am sorry but I still have to run partprobe manually after a reboot.

Actions #17

Updated by Sage Weil almost 11 years ago

Can you generate and attach a udev log after the reboot? Actually, ideally,

- reboot
- note the time
- run partprobe

and send the log along so i can tell what activity resulted from partprobe (though i guess it'll be obvious).

Thanks! (Also, if you're on irc right now that'd be quicker to debug this... #ceph on irc.oftc.net)

Actions #18

Updated by Robert Sander almost 11 years ago

Hi Sage,

attached is the current syslog.

I started "partprobe /dev/sdb" at Jun 14 21:57:06 and "partprobe /dev/sdd" at Jun 14 21:58:10. Before that there seems to be no udev activity except for the vmhgfs and vmsync modules.

Could it be that udev already runs in the initrd context and we do not see that output in the syslog?

Sorry for not joining IRC, but I am busy with other things.

Actions #19

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to In Progress

thanks - i now see the problem (and can reproduce it here, yay!). testing a fix

Actions #20

Updated by Sage Weil almost 11 years ago

  • Subject changed from udev does not start osd after reboot on wheezy to udev does not start osd after reboot on wheezy or el6 or fedora
Actions #21

Updated by Sage Weil almost 11 years ago

rhel seems to be working, fedora18 is acting very strange.

Actions #22

Updated by Sage Weil almost 11 years ago

update:

  • wheezy is working well.
  • fedora is failing only because the mon doesn't start on boot. see #5369
  • rhel needs to be retested.
Actions #23

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Fix Under Review

now works on rhel, centos, wheezy, precise. f18 still has the mon start issue.

Actions #24

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions #25

Updated by René Pavlík almost 8 years ago

I want to point out a regression of this on Debian Jessie. As a temporary workaround I placed the corresponding lines in /etc/fstab to mount the XFS partitions at boot. After that everything works and the OSD daemons are started automatically. I tried partprobe without any success.
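(For reference, a UUID-pinned form of that workaround; the device and OSD id are assumed for the example:)

blkid -s UUID -o value /dev/sdb1   # print the filesystem UUID
# then in /etc/fstab, keyed by UUID rather than device name:
# UUID=<uuid-from-blkid>  /var/lib/ceph/osd/ceph-2  xfs  defaults,noatime  0  0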

I can generate the logs if needed.

Thanks. Rene

Actions #26

Updated by Nathan Cutler almost 8 years ago

René, the Debian Jessie issue is known and is being addressed by http://tracker.ceph.com/issues/16351

Actions #27

Updated by René Pavlík almost 8 years ago

ok, Nathan, thanks for the link.
