Bug #6592

closed

3.8 kernel + /dev/cciss/c0d1 + precise : fail to show in /dev/disk/by-partuuid

Added by Loïc Dachary over 10 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
ceph-disk
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After ceph-deploy osd create, the OSD fails to start because its journal cannot be found:

ceph-deploy osd create bm0301:/dev/cciss/c0d{1,2,3,4,5,6} bm0302:/dev/cciss/c0d{1,2,3,4,5,6} bm0303:/dev/sd{b,c,d,e,f,g}

root@bm0301:~# ls -l /var/lib/ceph/osd/ceph-0/journal
lrwxrwxrwx 1 root root 58 Oct 18 16:13 /var/lib/ceph/osd/ceph-0/journal -> /dev/disk/by-partuuid/763ec75e-ae9b-4729-84c9-2dfcb7f0697f
root@bm0301:~# ls -l /dev/disk/by-partuuid/763ec75e-ae9b-4729-84c9-2dfcb7f0697f
ls: cannot access /dev/disk/by-partuuid/763ec75e-ae9b-4729-84c9-2dfcb7f0697f: No such file or directory

This is because the journal partition is missing from /dev/disk/by-partuuid:
root@ops-bm0301:/dev/disk# ls -l by-partuuid            
total 0
lrwxrwxrwx 1 root root 18 Oct 18 16:14 0aad873e-048a-4501-b0d6-fe8619cc7a07 -> ../../cciss/c0d2p1
lrwxrwxrwx 1 root root 18 Oct 18 16:43 256c419c-f05e-46e8-a5d9-e096c656fac7 -> ../../cciss/c0d1p1
lrwxrwxrwx 1 root root 18 Oct 18 16:14 8aa4cf3b-2767-40ab-8f30-199451e5ed4e -> ../../cciss/c0d4p1
lrwxrwxrwx 1 root root 18 Oct 18 16:14 970b4ed5-78f4-4052-a008-2f2f659d9443 -> ../../cciss/c0d6p1
lrwxrwxrwx 1 root root 18 Oct 18 16:14 a9a6972c-f9ac-4d26-96ba-49c4692ce428 -> ../../cciss/c0d3p1
lrwxrwxrwx 1 root root 18 Oct 18 16:14 c164d530-2875-4a52-982a-54b8bc5ded9a -> ../../cciss/c0d5p1

although sgdisk lists both partitions:
sgdisk --print /dev/cciss/c0d1
Number  Start (sector)    End (sector)  Size       Code  Name
   1         2099200       585871930   278.4 GiB   FFFF  ceph data
   2            2048         2097152   1023.0 MiB  FFFF  ceph journal
and
sgdisk --info=2 /dev/cciss/c0d1  
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
First sector: 2048 (at 1024.0 KiB)
Last sector: 2097152 (at 1024.0 MiB)
Partition size: 2095105 sectors (1023.0 MiB)
Attribute flags: 0000000000000000
Partition name: 'ceph journal'

The problem is resolved after a reboot. It does not happen with /dev/sdX devices.
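
The check done by hand above (read the partition unique GUID from sgdisk, then look for the matching lowercase name in /dev/disk/by-partuuid) can be sketched as a small script. This is a minimal sketch, not part of ceph-disk: the function name partuuid_link and the SAMPLE text are illustrative, and the sample is copied from the sgdisk --info=2 output in this ticket.

```python
import os
import posixpath
import re

# Hypothetical helper (not part of ceph-disk): derive the symlink path that
# udev is expected to create for a GPT partition from `sgdisk --info` output.
def partuuid_link(sgdisk_info, base="/dev/disk/by-partuuid"):
    match = re.search(
        r"^Partition unique GUID:\s*([0-9A-Fa-f-]{36})\s*$",
        sgdisk_info,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Partition unique GUID' line found")
    # The by-partuuid entries listed in this ticket are lowercase.
    return posixpath.join(base, match.group(1).lower())

# Output of `sgdisk --info=2 /dev/cciss/c0d1` as pasted in the description.
SAMPLE = """\
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
First sector: 2048 (at 1024.0 KiB)
Last sector: 2097152 (at 1024.0 MiB)
Partition size: 2095105 sectors (1023.0 MiB)
Attribute flags: 0000000000000000
Partition name: 'ceph journal'
"""

if __name__ == "__main__":
    link = partuuid_link(SAMPLE)
    # lexists() reports the symlink itself, even if its target is gone.
    state = "present" if os.path.lexists(link) else "missing"
    print(link, state)
```

On the affected machine this would report the journal link as missing while the same check for partition 1 would report it as present.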

Actions #1

Updated by Loïc Dachary over 10 years ago

The full chat log about research on this problem:

<loicd> http://pastealacon.com/33335 ( /dev/sdc1) looks exactly like http://pastealacon.com/33336 ( /dev/cciss/c0d1p1 ) modulo the disk / machine name
<loicd> root@ops-bm0303:/dev/disk# ls -l by-partuuid/ | grep sdc is http://paste.ubuntu.com/6257716/
<loicd> root@ops-bm0301:/dev/disk# ls -l by-partuuid is http://pastealacon.com/33337
<loicd> however
<loicd> sgdisk --print /dev/cciss/c0d1
<loicd> Number  Start (sector)    End (sector)  Size       Code  Name
<loicd>    1         2099200       585871930   278.4 GiB   FFFF  ceph data
<loicd>    2            2048         2097152   1023.0 MiB  FFFF  ceph journal
<loicd> and
<loicd> sgdisk --info=2 /dev/cciss/c0d1  
<loicd> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
<loicd> Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
<loicd> First sector: 2048 (at 1024.0 KiB)
<loicd> Last sector: 2097152 (at 1024.0 MiB)
<loicd> Partition size: 2095105 sectors (1023.0 MiB)
<loicd> Attribute flags: 0000000000000000
<loicd> Partition name: 'ceph journal'
<loicd> root@ops-bm0301:/dev/disk# grep -i 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /lib/udev/*/*  
<loicd> /lib/udev/rules.d/95-ceph-osd.rules:  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
-*- loicd feels he is getting closer
<loicd> alfredodeza: why would this partition not show in /dev/disk/by-partuuid ? 
-*- loicd trying a partprobe
<loicd> no change
<loicd> how can a partition with what seems to be a valid Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F not show in  /dev/disk/by-partuuid
<yanzheng> what does blkid say
-*- loicd check
<loicd> yanzheng: I'm not familiar with blkid, what do you suggest I try ? I tried blkid -U 256C419C-F05E-46E8-A5D9-E096C656FAC7
<loicd> blkid -U 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
<loicd>  blkid -U 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D
<loicd> and none output anything 
<yanzheng> no parameter
<loicd> blkid /dev/cciss/c0d1p1
<loicd> /dev/cciss/c0d1p1: UUID="f6dcdc07-6874-41a9-a309-b92f1037cc49" TYPE="xfs" 
<loicd>  blkid /dev/cciss/c0d1p
<loicd>  blkid /dev/cciss/c0d1p2
<loicd> shows nothing at all
<loicd> at last something that does not seem right yanzheng :-)
<loicd> that would mean that ceph-deploy failed to set the uuid
<yanzheng> i guess the kernel code matches the blkid code
<yanzheng> no idea
<loicd> which is different from the GUID code which is also different from the unique GUID :-D
<loicd> yanzheng: that's a lead, thanks for the tip
-*- loicd digs some more :-)
-*- loicd uuid overflows :-D
<ccourtaut> uuidception!
<yanzheng> ceph journal is not fs, I guess the kernel driver does not recognize it
<ccourtaut> root@bm0303:~# blkid /dev/sdc1
<ccourtaut> /dev/sdc1: UUID="1f64a285-48cf-4ced-95c3-6dec5c369024" TYPE="xfs" 
<ccourtaut> root@bm0303:~# blkid /dev/sdc2
<ccourtaut> root@bm0303:~# 
<ccourtaut> loicd: ^
<loicd> yes
<loicd> so it's normal that the journal partition has no uuid found by blkid, that's not where the problem comes from
<loicd> because the /dev/sdc* shows in partuuid just fine
<ccourtaut> loicd: the real problem seems that the journal partition does not appear in /dev/...
<ccourtaut> :/
<loicd> yes, as listed above 
-*- loicd reading http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/2793
<loicd> and browsing the ceph-deploy git logs
<loicd> absolutely nothing with cciss
-*- loicd suggests a reboot...
<loicd> ccourtaut: we have two machines with the problem, we can reboot one to see if it fixes the problem in which case we will know that it's a notification problem and not a configuration problem
<loicd> and then try to figure out the notification problem on the remaining machine
<loicd> so, the osd are up after a reboot
<loicd> it's definitely a notification problem
<ccourtaut> it seems to indeed
<loicd> yanzheng: how would you suggest we ask udev/the kernel to rescan/reconsider the situation of /dev/cciss/c0d1 as if it was a new disk entirely ? I know very little and my question is probably not formulated correctly but I hope you see the idea :-)
<loicd> we will keep the machine that has the notification problem untouched for now, to try to figure out the cause of the problem
<glzhao> loicd: "udevadm trigger [options]" maybe could help you, and you can also write udev rules in /etc/udev/rules.d/
<loicd> glzhao: thanks :-) do you know what option will ask something like "forget everything about this device and do it again as if boot just occurred" ... sort of ;-)
<glzhao> loicd: sorry :-( , I couldn't remember it clearly
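
The question left open in the chat (how to make udev process the device again as if it had just appeared) might be approached as sketched below. This is an untested assumption, not a confirmed fix: the ticket was closed before anyone could try it on the cciss hardware. The helper names udev_sysname and retrigger_commands are illustrative. The script only prints the commands; it does not run them. The one detail it encodes is that sysfs names replace "/" with "!", so /dev/cciss/c0d1p2 is known to udev as cciss!c0d1p2.

```python
import shlex

# Hypothetical helper: map a device node to its sysfs name.
# The kernel replaces "/" in device names with "!" under /sys.
def udev_sysname(devnode):
    prefix = "/dev/"
    if not devnode.startswith(prefix):
        raise ValueError("expected a path under /dev: %r" % devnode)
    return devnode[len(prefix):].replace("/", "!")

# Hypothetical helper: build (but do not run) the commands that replay an
# "add" event for the given block devices and then wait for udev to finish.
def retrigger_commands(devnodes):
    trigger = ["udevadm", "trigger", "--action=add", "--subsystem-match=block"]
    trigger += ["--sysname-match=" + udev_sysname(d) for d in devnodes]
    return [trigger, ["udevadm", "settle"]]

if __name__ == "__main__":
    nodes = ["/dev/cciss/c0d1", "/dev/cciss/c0d1p1", "/dev/cciss/c0d1p2"]
    for argv in retrigger_commands(nodes):
        # Quote each argument so "!" is safe to paste into an interactive shell.
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Replaying an "add" event also re-runs any other rule that matches the device, including 95-ceph-osd.rules, so it would be prudent to try this on a machine that can tolerate an OSD activation attempt.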

Actions #2

Updated by Loïc Dachary over 10 years ago

  • Description updated (diff)
Actions #3

Updated by Loïc Dachary over 10 years ago

  • Description updated (diff)
Actions #4

Updated by Sage Weil over 10 years ago

'blkid -o udev DEVICE' is what udev is relying on... and that is not outputting the right uuid info?

Actions #5

Updated by Ian Colle over 10 years ago

  • Category changed from ceph-deploy to ceph-disk
  • Priority changed from Normal to High
Actions #6

Updated by Loïc Dachary over 10 years ago

I will try again on Thursday, when I get access to the hardware again.

Actions #7

Updated by Loïc Dachary over 10 years ago

  • Status changed from New to Need More Info
Actions #8

Updated by Loïc Dachary over 10 years ago

  • Status changed from Need More Info to In Progress

blkid -o udev /dev/cciss/c0d1p2 does not return anything. Note, however, that after a reboot the OSDs run fine and find the journal.

root@bm0301:~# blkid -o udev /dev/cciss/c0d1p1
ID_FS_UUID=f6dcdc07-6874-41a9-a309-b92f1037cc49
ID_FS_UUID_ENC=f6dcdc07-6874-41a9-a309-b92f1037cc49
ID_FS_TYPE=xfs
root@bm0301:~# blkid -o udev /dev/cciss/c0d1p2
root@bm0301:~# 
root@bm0301:~# sgdisk --info=1 /dev/cciss/c0d1
Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
Partition unique GUID: 256C419C-F05E-46E8-A5D9-E096C656FAC7
First sector: 2099200 (at 1.0 GiB)
Last sector: 585871930 (at 279.4 GiB)
Partition size: 583772731 sectors (278.4 GiB)
Attribute flags: 0000000000000000
Partition name: 'ceph data'
root@bm0301:~# sgdisk --info=2 /dev/cciss/c0d1
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
First sector: 2048 (at 1024.0 KiB)
Last sector: 2097152 (at 1024.0 MiB)
Partition size: 2095105 sectors (1023.0 MiB)
Attribute flags: 0000000000000000
Partition name: 'ceph journal'
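
To compare the two blkid results above side by side, the KEY=VALUE output can be parsed and checked for the properties of interest. This is a minimal sketch with illustrative names (parse_udev_properties, DATA_PARTITION, JOURNAL_PARTITION). It assumes, without confirmation from this ticket, that the by-partuuid symlink is driven by an ID_PART_ENTRY_UUID property, which blkid normally emits only in low-level probing mode (blkid -p -o udev). Neither output captured here contains that property, so these transcripts alone cannot show whether udev saw the partition UUID.

```python
# Hypothetical helper: turn `blkid -o udev` output into a dictionary.
def parse_udev_properties(output):
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props

# Output for /dev/cciss/c0d1p1 as pasted in this comment.
DATA_PARTITION = """\
ID_FS_UUID=f6dcdc07-6874-41a9-a309-b92f1037cc49
ID_FS_UUID_ENC=f6dcdc07-6874-41a9-a309-b92f1037cc49
ID_FS_TYPE=xfs
"""

# Output for /dev/cciss/c0d1p2 as pasted in this comment: nothing at all.
JOURNAL_PARTITION = ""

if __name__ == "__main__":
    samples = (("c0d1p1", DATA_PARTITION), ("c0d1p2", JOURNAL_PARTITION))
    for name, output in samples:
        props = parse_udev_properties(output)
        for key in ("ID_FS_UUID", "ID_PART_ENTRY_UUID"):
            status = "set" if key in props else "absent"
            print(name, key, status)
```

An empty result for the journal partition is expected in this mode, since the journal carries no filesystem; the chat log shows the same empty output for /dev/sdc2 on a machine where the symlink was created correctly.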

Actions #9

Updated by Ian Colle about 10 years ago

  • Assignee set to Loïc Dachary

Loic - is this still "In Progress"?

Actions #10

Updated by Loïc Dachary about 10 years ago

We did not get to the bottom of this and the hardware is still available. It's cold but not dead ;-)

Actions #11

Updated by Sage Weil over 9 years ago

  • Status changed from In Progress to 12
Actions #12

Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to Can't reproduce

I lost access to the hardware before being able to properly reproduce and diagnose this edge case.
