Bug #6592
closed
3.8 kernel + /dev/cciss/c0d1 + precise : fail to show in /dev/disk/by-partuuid
% Done:
0%
Source:
Community (dev)
Severity:
3 - minor
Description
After ceph-deploy osd create, the OSD fails to start because its journal cannot be found:
ceph-deploy osd create bm0301:/dev/cciss/c0d{1,2,3,4,5,6} bm0302:/dev/cciss/c0d{1,2,3,4,5,6} bm0303:/dev/sd{b,c,d,e,f,g}
root@bm0301:~# ls -l /var/lib/ceph/osd/ceph-0/journal
lrwxrwxrwx 1 root root 58 Oct 18 16:13 /var/lib/ceph/osd/ceph-0/journal -> /dev/disk/by-partuuid/763ec75e-ae9b-4729-84c9-2dfcb7f0697f
root@bm0301:~# ls -l /dev/disk/by-partuuid/763ec75e-ae9b-4729-84c9-2dfcb7f0697f
ls: cannot access /dev/disk/by-partuuid/763ec75e-ae9b-4729-84c9-2dfcb7f0697f: No such file or directory
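The journal symlink is dangling because the journal partition's unique GUID is missing from /dev/disk/by-partuuid, where only the first partition of each disk appears: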
root@ops-bm0301:/dev/disk# ls -l by-partuuid
total 0
lrwxrwxrwx 1 root root 18 Oct 18 16:14 0aad873e-048a-4501-b0d6-fe8619cc7a07 -> ../../cciss/c0d2p1
lrwxrwxrwx 1 root root 18 Oct 18 16:43 256c419c-f05e-46e8-a5d9-e096c656fac7 -> ../../cciss/c0d1p1
lrwxrwxrwx 1 root root 18 Oct 18 16:14 8aa4cf3b-2767-40ab-8f30-199451e5ed4e -> ../../cciss/c0d4p1
lrwxrwxrwx 1 root root 18 Oct 18 16:14 970b4ed5-78f4-4052-a008-2f2f659d9443 -> ../../cciss/c0d6p1
lrwxrwxrwx 1 root root 18 Oct 18 16:14 a9a6972c-f9ac-4d26-96ba-49c4692ce428 -> ../../cciss/c0d3p1
lrwxrwxrwx 1 root root 18 Oct 18 16:14 c164d530-2875-4a52-982a-54b8bc5ded9a -> ../../cciss/c0d5p1
although
sgdisk --print /dev/cciss/c0d1
Number  Start (sector)    End (sector)  Size        Code  Name
   1         2099200       585871930   278.4 GiB   FFFF  ceph data
   2            2048         2097152   1023.0 MiB  FFFF  ceph journal
and
sgdisk --info=2 /dev/cciss/c0d1
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
First sector: 2048 (at 1024.0 KiB)
Last sector: 2097152 (at 1024.0 MiB)
Partition size: 2095105 sectors (1023.0 MiB)
Attribute flags: 0000000000000000
Partition name: 'ceph journal'
The problem is resolved after a reboot. It does not happen with /dev/sdX devices.
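The check performed by hand above (read the partition unique GUID from sgdisk, then look for the matching entry under /dev/disk/by-partuuid) can be sketched in Python. This is only an illustration: the function names are hypothetical, the sample text is the `sgdisk --info=2` output quoted in this report, and the lowercase conversion is an assumption based on the lowercase link names seen in the by-partuuid listing.

```python
import os
import re

# Output of `sgdisk --info=2 /dev/cciss/c0d1` as quoted in this report.
SGDISK_INFO_SAMPLE = """\
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
First sector: 2048 (at 1024.0 KiB)
Last sector: 2097152 (at 1024.0 MiB)
Partition size: 2095105 sectors (1023.0 MiB)
Attribute flags: 0000000000000000
Partition name: 'ceph journal'
"""


def partition_unique_guid(sgdisk_info):
    """Extract the partition unique GUID from `sgdisk --info` text, lowercased."""
    match = re.search(
        r"^Partition unique GUID:\s*([0-9A-Fa-f-]{36})\s*$",
        sgdisk_info,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Partition unique GUID' line found")
    return match.group(1).lower()


def expected_partuuid_link(sgdisk_info, by_partuuid_dir="/dev/disk/by-partuuid"):
    """Path where the by-partuuid symlink for this partition is expected."""
    return os.path.join(by_partuuid_dir, partition_unique_guid(sgdisk_info))


def is_dangling_symlink(path):
    """True if `path` is a symlink whose target does not exist."""
    return os.path.islink(path) and not os.path.exists(path)
```

On the affected host, `is_dangling_symlink("/var/lib/ceph/osd/ceph-0/journal")` would be expected to return True before the reboot, and `os.path.lexists(expected_partuuid_link(...))` to return False.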
Updated by Loïc Dachary over 10 years ago
The full chat log about research on this problem:
<loicd> http://pastealacon.com/33335 ( /dev/sdc1) looks exactly like http://pastealacon.com/33336 ( /dev/cciss/c0d1p1 ) modulo the disk / machine name
<loicd> root@ops-bm0303:/dev/disk# ls -l by-partuuid/ | grep sdc is http://paste.ubuntu.com/6257716/
<loicd> root@ops-bm0301:/dev/disk# ls -l by-partuuid is http://pastealacon.com/33337
<loicd> however
<loicd> sgdisk --print /dev/cciss/c0d1
<loicd> Number Start (sector) End (sector) Size Code Name
<loicd> 1 2099200 585871930 278.4 GiB FFFF ceph data
<loicd> 2 2048 2097152 1023.0 MiB FFFF ceph journal
<loicd> and
<loicd> sgdisk --info=2 /dev/cciss/c0d1
<loicd> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
<loicd> Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
<loicd> First sector: 2048 (at 1024.0 KiB)
<loicd> Last sector: 2097152 (at 1024.0 MiB)
<loicd> Partition size: 2095105 sectors (1023.0 MiB)
<loicd> Attribute flags: 0000000000000000
<loicd> Partition name: 'ceph journal'
<loicd> root@ops-bm0301:/dev/disk# grep -i 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /lib/udev/*/*
<loicd> /lib/udev/rules.d/95-ceph-osd.rules: ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
-*- loicd feels he is getting closer
<loicd> alfredodeza: why would this partition not show in /dev/disk/by-partuuid ?
-*- loicd trying a partprobe
<loicd> no change
<loicd> how can a partition with what seems to be a valid Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F not show in /dev/disk/by-partuuid
<yanzheng> what does blkid say
-*- loicd check
<loicd> yanzheng: I'm not familiar with blkid, what do you suggest I try ? I tried blkid -U 256C419C-F05E-46E8-A5D9-E096C656FAC7
<loicd> blkid -U 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
<loicd> blkid -U 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D
<loicd> and none output anything
<yanzheng> no parameter
<loicd> blkid /dev/cciss/c0d1p1
<loicd> /dev/cciss/c0d1p1: UUID="f6dcdc07-6874-41a9-a309-b92f1037cc49" TYPE="xfs"
<loicd> blkid /dev/cciss/c0d1p
<loicd> blkid /dev/cciss/c0d1p2
<loicd> shows nothing at all
<loicd> at last something that does not seem right yanzheng :-)
<loicd> that would mean that ceph-deploy failed to set the uuid
<yanzheng> i guess the kernel code matches the blkid code
<yanzheng> no idea
<loicd> which is different form the GUID code which is also different from the unique GUID :-D
<loicd> yanzheng: that's a lead, thanks for the tip
-*- loicd digs some more :-)
-*- loicd uuid overflows :-D
<ccourtaut> uuidception!
<yanzheng> ceph journal is not fs, I guess the kernel driver does not recognize it
<ccourtaut> root@bm0303:~# blkid /dev/sdc1
<ccourtaut> /dev/sdc1: UUID="1f64a285-48cf-4ced-95c3-6dec5c369024" TYPE="xfs"
<ccourtaut> root@bm0303:~# blkid /dev/sdc2
<ccourtaut> root@bm0303:~#
<ccourtaut> loicd: ^
<loicd> yes
<loicd> so it's normal that the journal partition has no uuid found by blkid, that's not where the problem comes from
<loicd> because the /dev/sdc* shows in partuuid just fine
<ccourtaut> loicd: the real problem seems that the journal partition does not appear in /dev/...
<ccourtaut> :/
<loicd> yes, as listed above
-*- loicd reading http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/2793
<loicd> and browsing the ceph-deploy git logs
<loicd> absolutely nothing with cciss
-*- loicd suggests a reboot...
<loicd> ccourtaut: we have two machines with the problem, we can reboot one to see if it fixes the problem in which case we will know that it's a notification problem and not a configuration problem
<loicd> and then try to figure out the notification problem on the remaining machine
<loicd> so, the osd are up after a reboot
<loicd> it definitely a notification problem
<ccourtaut> it seems to indeed
<loicd> yanzheng: how would you suggest we ask udev/the kernel to rescan/recondsider the situation of /dev/cciss/c0d1 as if it was a new disk entirely ? I know very little and my question is probably not formulated correctly but I hope you see the idea :-)
<loicd> we will keep the machine that has the notification problem untouched for now, to try to figure out the cause of the problem
<glzhao> loicd: "udevadm trigger [options]" maybe could help you, and you can also write udev rules in /etc/udev/rules.d/
<loicd> glzhao: thanks :-) do you know what option will ask something like "forget everything about this device and do it again as if boot just occured" ... sort of ;-)
<glzhao> loicd: sorry :-( , I couldn't remember it clearly
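The question left open at the end of the log (which `udevadm trigger` options would replay events for one disk) might be approached along the lines of the sketch below. It was not tried on the affected machine and is not a confirmed fix for this bug. It assumes the `--action`, `--subsystem-match` and `--sysname-match` options of `udevadm trigger`, and the kernel convention of replacing `/` with `!` in sysfs names (so /dev/cciss/c0d1 appears as `cciss!c0d1`). The helper names are hypothetical.

```python
import subprocess


def sysfs_name(dev_path):
    """Map a /dev path to its sysfs name, e.g. /dev/cciss/c0d1 -> cciss!c0d1."""
    prefix = "/dev/"
    if not dev_path.startswith(prefix):
        raise ValueError("expected a path under /dev/: %r" % dev_path)
    return dev_path[len(prefix):].replace("/", "!")


def retrigger_command(disk):
    """Build a udevadm command replaying 'add' events for a disk and its partitions."""
    return [
        "udevadm",
        "trigger",
        "--action=add",
        "--subsystem-match=block",
        # The trailing * is meant to also match partitions (c0d1p1, c0d1p2, ...).
        "--sysname-match=%s*" % sysfs_name(disk),
    ]


def retrigger(disk, run=subprocess.run):
    """Replay the events, then wait for udev to finish processing them."""
    run(retrigger_command(disk), check=True)
    run(["udevadm", "settle"], check=True)
```

Calling `retrigger("/dev/cciss/c0d1")` as root and then listing /dev/disk/by-partuuid again would show whether the journal link appears without a reboot.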
Updated by Sage Weil over 10 years ago
'blkid -o udev DEVICE' is what udev is relying on... and that is not outputting the right uuid info?
Updated by Ian Colle over 10 years ago
- Category changed from ceph-deploy to ceph-disk
- Priority changed from Normal to High
Updated by Loïc Dachary over 10 years ago
I will try again on thursday, when I get access to the hardware again.
Updated by Loïc Dachary over 10 years ago
- Status changed from New to Need More Info
Updated by Loïc Dachary over 10 years ago
- Status changed from Need More Info to In Progress
blkid -o udev /dev/cciss/c0d1p2 does not return anything. Note, however, that after a reboot the OSDs are running fine and found the journal.
root@bm0301:~# blkid -o udev /dev/cciss/c0d1p1
ID_FS_UUID=f6dcdc07-6874-41a9-a309-b92f1037cc49
ID_FS_UUID_ENC=f6dcdc07-6874-41a9-a309-b92f1037cc49
ID_FS_TYPE=xfs
root@bm0301:~# blkid -o udev /dev/cciss/c0d1p2
root@bm0301:~#
root@bm0301:~# sgdisk --info=1 /dev/cciss/c0d1
Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
Partition unique GUID: 256C419C-F05E-46E8-A5D9-E096C656FAC7
First sector: 2099200 (at 1.0 GiB)
Last sector: 585871930 (at 279.4 GiB)
Partition size: 583772731 sectors (278.4 GiB)
Attribute flags: 0000000000000000
Partition name: 'ceph data'
root@bm0301:~# sgdisk --info=2 /dev/cciss/c0d1
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 763EC75E-AE9B-4729-84C9-2DFCB7F0697F
First sector: 2048 (at 1024.0 KiB)
Last sector: 2097152 (at 1024.0 MiB)
Partition size: 2095105 sectors (1023.0 MiB)
Attribute flags: 0000000000000000
Partition name: 'ceph journal'
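To compare the two `blkid -o udev` results programmatically, the KEY=VALUE output can be parsed into a dictionary. The sketch below is illustrative only: the function name is hypothetical and the sample is the c0d1p1 output quoted in this report. It shows that the data partition reports filesystem properties while the journal partition reports none, and that ID_FS_UUID is a different identifier from the partition unique GUID printed by sgdisk.

```python
# Output of `blkid -o udev /dev/cciss/c0d1p1` as quoted in this report.
DATA_PARTITION_OUTPUT = """\
ID_FS_UUID=f6dcdc07-6874-41a9-a309-b92f1037cc49
ID_FS_UUID_ENC=f6dcdc07-6874-41a9-a309-b92f1037cc49
ID_FS_TYPE=xfs
"""

# `blkid -o udev /dev/cciss/c0d1p2` printed nothing.
JOURNAL_PARTITION_OUTPUT = ""


def parse_blkid_udev(output):
    """Parse KEY=VALUE lines from `blkid -o udev` into a dict."""
    props = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")
        props[key] = value
    return props


data = parse_blkid_udev(DATA_PARTITION_OUTPUT)
journal = parse_blkid_udev(JOURNAL_PARTITION_OUTPUT)

# The filesystem UUID is not the partition unique GUID reported by sgdisk.
PARTITION_1_UNIQUE_GUID = "256c419c-f05e-46e8-a5d9-e096c656fac7"
fs_uuid_differs_from_partuuid = data["ID_FS_UUID"] != PARTITION_1_UNIQUE_GUID
```

An empty result for the journal partition matches what was seen on /dev/sdc2, where the by-partuuid links were nevertheless created, so the empty output alone does not explain the missing link.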
Updated by Ian Colle about 10 years ago
- Assignee set to Loïc Dachary
Loic - is this still "In Progress"?
Updated by Loïc Dachary about 10 years ago
We did not get to the bottom of this and the hardware is still available. It's cold but not dead ;-)
Updated by Loïc Dachary over 9 years ago
- Status changed from 12 to Can't reproduce
I lost access to the hardware before I could properly reproduce and diagnose this edge case.