Bug #44682
weird osd create failure
Status: Open
% Done: 0%
Description
[2020-03-19 16:36:47,960][ceph_volume.main][INFO ] Running command: ceph-volume lvm prepare --bluestore --data /dev/test_cephadm/test_cephadm.0 --no-systemd
...
[2020-03-19 16:36:48,208][ceph_volume.process][INFO ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new dc4320c1-b70d-4796-94e6-6fbb54aa542e
[2020-03-19 16:36:48,542][ceph_volume.process][INFO ] stdout 0
...
[2020-03-19 16:36:50,859][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid dc4320c1-b70d-4796-94e6-6fbb54aa542e --setuser ceph --setgroup ceph
...
[2020-03-19 16:36:57,385][ceph_volume.main][INFO ] Running command: ceph-volume lvm list --format json
...
[2020-03-19 16:36:57,386][ceph_volume.process][INFO ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -S -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-19 16:36:57,457][ceph_volume.process][INFO ] stdout ceph.block_device=/dev/test_cephadm/test_cephadm.0,ceph.block_uuid=0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=00000000-0000-0000-0000-0000deadbeef,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=dc4320c1-b70d-4796-94e6-6fbb54aa542e,ceph.osd_id=0,ceph.type=block,ceph.vdo=0";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g
[2020-03-19 16:36:57,457][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_1";"lv_1";"vg_nvme";"9kQn2N-GaIz-L8XX-hCt8-X9n6-8TNN-F12Tdl";"89.40g
[2020-03-19 16:36:57,457][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_2";"lv_2";"vg_nvme";"cnGwHA-nB7p-vixh-k1BC-vmGv-Uljz-Dz2vUW";"89.40g
[2020-03-19 16:36:57,457][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_3";"lv_3";"vg_nvme";"dxO7Te-g6Nh-7C7e-ibQy-gts5-95oF-ZlIKfC";"89.40g
[2020-03-19 16:36:57,458][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_4";"lv_4";"vg_nvme";"2SYy28-MhCy-1YBW-ZBSC-OdPZ-53Fc-VIywy2";"89.40g
[2020-03-19 16:36:57,458][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_5";"lv_5";"vg_nvme";"rY3w7d-P651-CW4d-nsEu-l8hG-BmCL-kXtWB7";"14.90g
[2020-03-19 16:36:57,458][ceph_volume.process][INFO ] Running command: /usr/sbin/pvs --no-heading --readonly --separator=";" -S lv_uuid=0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY -o pv_name,pv_tags,pv_uuid,vg_name,lv_uuid
[2020-03-19 16:36:57,509][ceph_volume.process][INFO ] stdout /dev/loop0";"";"gk6LDn-X3ar-pjKd-iUVD-RlJ5-X3c8-vl86Jh";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY
[2020-03-19 16:36:57,510][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_1>
[2020-03-19 16:36:57,510][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_2>
[2020-03-19 16:36:57,510][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_3>
[2020-03-19 16:36:57,510][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_4>
[2020-03-19 16:36:57,511][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_5>
[2020-03-19 16:37:01,918][ceph_volume.main][INFO ] Running command: ceph-volume lvm activate 0 dc4320c1-b70d-4796-94e6-6fbb54aa542e --no-systemd
[2020-03-19 16:37:01,919][ceph_volume.process][INFO ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-19 16:37:01,985][ceph_volume.process][INFO ] stdout ";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO ] stdout ";"/dev/test_cephadm/test_cephadm.1";"test_cephadm.1";"test_cephadm";"93hDpc-g10A-YBd4-RxFg-cSk3-peow-UlWBTo";"<3.00g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_1";"lv_1";"vg_nvme";"9kQn2N-GaIz-L8XX-hCt8-X9n6-8TNN-F12Tdl";"89.40g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_2";"lv_2";"vg_nvme";"cnGwHA-nB7p-vixh-k1BC-vmGv-Uljz-Dz2vUW";"89.40g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_3";"lv_3";"vg_nvme";"dxO7Te-g6Nh-7C7e-ibQy-gts5-95oF-ZlIKfC";"89.40g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_4";"lv_4";"vg_nvme";"2SYy28-MhCy-1YBW-ZBSC-OdPZ-53Fc-VIywy2";"89.40g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_5";"lv_5";"vg_nvme";"rY3w7d-P651-CW4d-nsEu-l8hG-BmCL-kXtWB7";"14.90g
[2020-03-19 16:37:01,987][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 150, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 42, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 341, in main
    self.activate(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 251, in activate
    raise RuntimeError('could not find osd.%s with osd_fsid %s' % (osd_id, osd_fsid))
RuntimeError: could not find osd.0 with osd_fsid dc4320c1-b70d-4796-94e6-6fbb54aa542e
/a/sage-2020-03-19_16:01:13-rados:cephadm-wip-sage3-testing-2020-03-19-0857-distro-basic-smithi/4868079
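The failing step in the traceback above is `lvm activate` matching the requested OSD id and fsid against the `ceph.osd_id`/`ceph.osd_fsid` LV tags. A minimal sketch of that matching logic (function and variable names are illustrative, not ceph-volume's actual implementation) shows why empty `lv_tags` fields produce exactly this RuntimeError:

```python
# Sketch of the tag-matching step that fails above: activate looks for an LV
# whose ceph.osd_id and ceph.osd_fsid tags match the requested OSD.
# Names here are illustrative, not ceph-volume's actual code.

def parse_lv_tags(tag_field):
    """Turn 'ceph.osd_id=0,ceph.osd_fsid=...' into a dict; '' yields {}."""
    tags = {}
    for pair in tag_field.split(','):
        if '=' in pair:
            key, value = pair.split('=', 1)
            tags[key] = value
    return tags

def find_osd_lv(lv_records, osd_id, osd_fsid):
    """Return the first LV record whose tags match, else raise like activate does."""
    for lv in lv_records:
        tags = parse_lv_tags(lv.get('lv_tags', ''))
        if (tags.get('ceph.osd_id') == str(osd_id)
                and tags.get('ceph.osd_fsid') == osd_fsid):
            return lv
    raise RuntimeError('could not find osd.%s with osd_fsid %s'
                       % (osd_id, osd_fsid))
```

With the empty `lv_tags` fields seen in the second `lvs` listing, no record matches and the loop falls through to the same "could not find osd.0" error as in the log.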
Updated by Sage Weil about 4 years ago
- Priority changed from High to Urgent
This reproduces 100% of the time when the test_cephadm workunit runs on Ubuntu:
2020-03-21T13:57:39.849 INFO:tasks.workunit.client.0.smithi039.stderr:+ nfs_rados_pool=nfs-ganesha
2020-03-21T13:57:39.849 INFO:tasks.workunit.client.0.smithi039.stderr:+ sudo /tmp/tmp.l90EJwZuJ4/cephadm --image quay.io/ceph-ci/ceph:octopus shell --fsid 00000000-0000-0000-0000-0000deadbeef --config /tmp/tmp.kFQyFfDNWX/tmp.eS9p6LtV1c --keyring /tmp/tmp.kFQyFfDNWX/tmp.qqq8KuQplb -- ceph osd pool create nfs-ganesha 64
2020-03-21T13:57:42.702 INFO:tasks.workunit.client.0.smithi039.stderr:pool 'nfs-ganesha' created
2020-03-21T13:57:43.423 INFO:tasks.workunit.client.0.smithi039.stderr:+ sudo /tmp/tmp.l90EJwZuJ4/cephadm --image quay.io/ceph-ci/ceph:octopus shell --fsid 00000000-0000-0000-0000-0000deadbeef --config /tmp/tmp.kFQyFfDNWX/tmp.eS9p6LtV1c --keyring /tmp/tmp.kFQyFfDNWX/tmp.qqq8KuQplb -- rados --pool nfs-ganesha --namespace nfs-ns create conf-nfs.a
and then hangs.
The reason is that one OSD did not come up, so the rados pool is not active:
root@smithi039:/# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         0.00290  root default
-3         0.00290      host smithi039
 1    hdd  0.00290          osd.1           up   1.00000  1.00000
 0         0                osd.0         down         0  1.00000
ceph-volume is failing partway through:
[2020-03-21 13:56:10,182][ceph_volume.main][INFO ] Running command: ceph-volume lvm activate 0 db0e6efc-9ed5-4ffb-b942-5bf6b5864286 --no-systemd
[2020-03-21 13:56:10,183][ceph_volume.process][INFO ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-21 13:56:10,254][ceph_volume.process][INFO ] stdout ";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"Z35H8O-SgZQ-KHGH-VX7H-oivI-weww-xkVkdq";"<3.00g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO ] stdout ";"/dev/test_cephadm/test_cephadm.1";"test_cephadm.1";"test_cephadm";"NLCVih-yaPe-1eQQ-uTAI-z3EL-krYn-z2epGF";"<3.00g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_1";"lv_1";"vg_nvme";"vkBYob-062z-L6Mj-D91v-ZJb0-Oz4Q-10ze7b";"89.40g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_2";"lv_2";"vg_nvme";"V5Op4v-PCgX-RwXM-u1fP-hc3T-fDly-r3baVN";"89.40g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_3";"lv_3";"vg_nvme";"xe1QnH-TDSn-iiOh-Hs3J-MlXC-t3eU-q3CyxH";"89.40g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_4";"lv_4";"vg_nvme";"UeIhNY-HkXr-ufJ9-D0wG-5W90-tsGg-0iwWQd";"89.40g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO ] stdout ";"/dev/vg_nvme/lv_5";"lv_5";"vg_nvme";"P9wxjU-jvEv-kAlk-Qfk6-gcKL-33wS-KIRypf";"14.90g
[2020-03-21 13:56:10,256][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 150, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 42, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 341, in main
    self.activate(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 251, in activate
    raise RuntimeError('could not find osd.%s with osd_fsid %s' % (osd_id, osd_fsid))
RuntimeError: could not find osd.0 with osd_fsid db0e6efc-9ed5-4ffb-b942-5bf6b5864286
Updated by Jan Fajerski about 4 years ago
It looks like the lv_tags don't stick or are removed at some point.
[2020-03-19 16:36:48,594][ceph_volume.process][INFO ] Running command: /usr/sbin/lvchange --addtag ceph.osd_fsid=dc4320c1-b70d-4796-94e6-6fbb54aa542e /dev/test_cephadm/test_cephadm.0
[2020-03-19 16:36:48,754][ceph_volume.process][INFO ] stdout Logical volume test_cephadm/test_cephadm.0 changed.
[2020-03-19 16:36:48,754][ceph_volume.process][INFO ] Running command: /usr/sbin/lvchange --addtag ceph.osd_id=0 /dev/test_cephadm/test_cephadm.0
[2020-03-19 16:36:48,946][ceph_volume.process][INFO ] stdout Logical volume test_cephadm/test_cephadm.0 changed.
When lvm list queries the LVs, the tags are there:
[2020-03-19 16:36:57,385][ceph_volume.main][INFO ] Running command: ceph-volume lvm list --format json
[2020-03-19 16:36:57,386][ceph_volume.process][INFO ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -S -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-19 16:36:57,457][ceph_volume.process][INFO ] stdout ceph.block_device=/dev/test_cephadm/test_cephadm.0,ceph.block_uuid=0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=00000000-0000-0000-0000-0000deadbeef,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=dc4320c1-b70d-4796-94e6-6fbb54aa542e,ceph.osd_id=0,ceph.type=block,ceph.vdo=0";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g
But when activate lists the LVs, the tags are gone:
[2020-03-19 16:37:01,918][ceph_volume.main][INFO ] Running command: ceph-volume lvm activate 0 dc4320c1-b70d-4796-94e6-6fbb54aa542e --no-systemd
[2020-03-19 16:37:01,919][ceph_volume.process][INFO ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-19 16:37:01,985][ceph_volume.process][INFO ] stdout ";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO ] stdout ";"/dev/test_cephadm/test_cephadm.1";"test_cephadm.1";"test_cephadm";"93hDpc-g10A-YBd4-RxFg-cSk3-peow-UlWBTo";"<3.00g
There is also an additional LV now. I will look at the test implementation; something might race here and erase our tags.
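The two listings above differ only in the first field of each `lvs` line. As the logged output shows, the fields are separated by the literal three-character sequence `";"` (ceph-volume passes `--separator=";"` with the quotes as part of the argument), so an empty first field means the tags are gone. A small illustrative sketch of parsing such a line, with the field order taken from the `-o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size` option in the logs:

```python
# Parse one line of the `lvs --noheadings --separator=";" -o ...` output
# quoted above. Illustrative sketch only, not ceph-volume's parser.
SEPARATOR = '";"'  # the separator argument includes the literal quotes
FIELDS = ('lv_tags', 'lv_path', 'lv_name', 'vg_name', 'lv_uuid', 'lv_size')

def parse_lvs_line(line):
    return dict(zip(FIELDS, line.strip().split(SEPARATOR)))

# The broken 'activate' listing starts with the separator itself,
# i.e. an empty lv_tags field:
broken = ('";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0'
          '";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g')
record = parse_lvs_line(broken)
```

Here `record['lv_tags']` is the empty string, which is why activate cannot match the OSD even though the LV itself is still present.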
Updated by Sage Weil about 4 years ago
- Status changed from New to Pending Backport
Workaround: https://github.com/ceph/ceph/pull/34145
Updated by Sage Weil about 4 years ago
- Status changed from Pending Backport to In Progress
Updated by Jan Fajerski about 4 years ago
After talking to the LVM folks I ran some tests aimed at LVM state synchronisation and locking. What we currently do (mapping LVM's locking dir, /run/udev and /dev into the container) seems to be enough to facilitate multi-actor LVM access.
I tracked this particular issue down to a version mismatch. This behaviour was only seen with an Ubuntu 18.04 host. Running lvm from an Ubuntu 18.04 container shows the same behaviour, whereas containers with newer lvm binaries (say Ubuntu Focal) behave correctly.
I have not yet identified the lowest lvm version that behaves well. The most annoying thing here is that we implicitly introduce version dependencies between the host OS and the container if we actually need lvm access from the host in real-world scenarios. IIUC we don't, but we should still document this, since careless users can end up in a bad place.
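One way such a version dependency could at least be surfaced is to parse the host's `lvm version` output and warn below a known-good minimum. This is only a hypothetical sketch, not anything ceph-volume or cephadm actually does; in particular the threshold below is a placeholder, since the lowest well-behaved version has not been identified:

```python
import re
import subprocess

# Placeholder threshold: the lowest well-behaved LVM version has not been
# identified, so this tuple is an assumption, not a verified boundary.
MIN_LVM_VERSION = (2, 3, 0)

def parse_lvm_version(output):
    """Extract (major, minor, patch) from `lvm version` output, e.g.
    '  LVM version:     2.02.176(2) (2017-11-03)' -> (2, 2, 176)."""
    match = re.search(r'LVM version:\s+(\d+)\.(\d+)\.(\d+)', output)
    if match is None:
        raise RuntimeError('could not parse `lvm version` output')
    return tuple(int(part) for part in match.groups())

def host_lvm_is_recent_enough():
    """Run `lvm version` on the host and compare against the placeholder."""
    out = subprocess.check_output(['lvm', 'version'], universal_newlines=True)
    return parse_lvm_version(out) >= MIN_LVM_VERSION
```

A check like this would not fix the tag loss, but it would turn a silent host/container mismatch into an explicit warning for the careless-user case mentioned above.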
Updated by Sebastian Wagner over 3 years ago
- Related to Bug #24790: Can not activate data partition on bluestore added
Updated by Jan Fajerski over 3 years ago
- Related to deleted (Bug #24790: Can not activate data partition on bluestore)