
Bug #44682

weird osd create failure

Added by Sage Weil about 4 years ago. Updated about 3 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[2020-03-19 16:36:47,960][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm prepare --bluestore --data /dev/test_cephadm/test_cephadm.0 --no-systemd
...
[2020-03-19 16:36:48,208][ceph_volume.process][INFO  ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new dc4320c1-b70d-4796-94e6-6fbb54aa542e
[2020-03-19 16:36:48,542][ceph_volume.process][INFO  ] stdout 0
...
[2020-03-19 16:36:50,859][ceph_volume.process][INFO  ] Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid dc4320c1-b70d-4796-94e6-6fbb54aa542e --setuser ceph --setgroup ceph
...
[2020-03-19 16:36:57,385][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm list --format json
...
[2020-03-19 16:36:57,386][ceph_volume.process][INFO  ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -S  -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-19 16:36:57,457][ceph_volume.process][INFO  ] stdout ceph.block_device=/dev/test_cephadm/test_cephadm.0,ceph.block_uuid=0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=00000000-0000-0000-0000-0000deadbeef,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=dc4320c1-b70d-4796-94e6-6fbb54aa542e,ceph.osd_id=0,ceph.type=block,ceph.vdo=0";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g
[2020-03-19 16:36:57,457][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_1";"lv_1";"vg_nvme";"9kQn2N-GaIz-L8XX-hCt8-X9n6-8TNN-F12Tdl";"89.40g
[2020-03-19 16:36:57,457][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_2";"lv_2";"vg_nvme";"cnGwHA-nB7p-vixh-k1BC-vmGv-Uljz-Dz2vUW";"89.40g
[2020-03-19 16:36:57,457][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_3";"lv_3";"vg_nvme";"dxO7Te-g6Nh-7C7e-ibQy-gts5-95oF-ZlIKfC";"89.40g
[2020-03-19 16:36:57,458][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_4";"lv_4";"vg_nvme";"2SYy28-MhCy-1YBW-ZBSC-OdPZ-53Fc-VIywy2";"89.40g
[2020-03-19 16:36:57,458][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_5";"lv_5";"vg_nvme";"rY3w7d-P651-CW4d-nsEu-l8hG-BmCL-kXtWB7";"14.90g
[2020-03-19 16:36:57,458][ceph_volume.process][INFO  ] Running command: /usr/sbin/pvs --no-heading --readonly --separator=";" -S lv_uuid=0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY -o pv_name,pv_tags,pv_uuid,vg_name,lv_uuid
[2020-03-19 16:36:57,509][ceph_volume.process][INFO  ] stdout /dev/loop0";"";"gk6LDn-X3ar-pjKd-iUVD-RlJ5-X3c8-vl86Jh";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY
[2020-03-19 16:36:57,510][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_1>
[2020-03-19 16:36:57,510][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_2>
[2020-03-19 16:36:57,510][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_3>
[2020-03-19 16:36:57,510][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_4>
[2020-03-19 16:36:57,511][ceph_volume.api.lvm][WARNING] device is not part of ceph: </dev/vg_nvme/lv_5>
[2020-03-19 16:37:01,918][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm activate 0 dc4320c1-b70d-4796-94e6-6fbb54aa542e --no-systemd
[2020-03-19 16:37:01,919][ceph_volume.process][INFO  ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-19 16:37:01,985][ceph_volume.process][INFO  ] stdout ";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO  ] stdout ";"/dev/test_cephadm/test_cephadm.1";"test_cephadm.1";"test_cephadm";"93hDpc-g10A-YBd4-RxFg-cSk3-peow-UlWBTo";"<3.00g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_1";"lv_1";"vg_nvme";"9kQn2N-GaIz-L8XX-hCt8-X9n6-8TNN-F12Tdl";"89.40g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_2";"lv_2";"vg_nvme";"cnGwHA-nB7p-vixh-k1BC-vmGv-Uljz-Dz2vUW";"89.40g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_3";"lv_3";"vg_nvme";"dxO7Te-g6Nh-7C7e-ibQy-gts5-95oF-ZlIKfC";"89.40g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_4";"lv_4";"vg_nvme";"2SYy28-MhCy-1YBW-ZBSC-OdPZ-53Fc-VIywy2";"89.40g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_5";"lv_5";"vg_nvme";"rY3w7d-P651-CW4d-nsEu-l8hG-BmCL-kXtWB7";"14.90g
[2020-03-19 16:37:01,987][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 150, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 42, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 341, in main
    self.activate(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 251, in activate
    raise RuntimeError('could not find osd.%s with osd_fsid %s' % (osd_id, osd_fsid))
RuntimeError: could not find osd.0 with osd_fsid dc4320c1-b70d-4796-94e6-6fbb54aa542e

/a/sage-2020-03-19_16:01:13-rados:cephadm-wip-sage3-testing-2020-03-19-0857-distro-basic-smithi/4868079
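
For reference, the activation step that fails here matches the ceph.osd_id and ceph.osd_fsid lv_tags against the lvs output and gives up with the RuntimeError above when no LV carries matching tags. A rough illustrative sketch of that kind of lookup (not the actual ceph-volume code; helper names are made up):

# Rough illustrative sketch only -- not the actual ceph-volume code.
# It mimics what `ceph-volume lvm activate <id> <fsid>` needs: parse the
# `lvs -o lv_tags,...` output and find the LV whose ceph.osd_id and
# ceph.osd_fsid tags match. With the tags stripped (empty first column in
# the listing above) nothing matches and activation has to give up.
import subprocess

def list_lvs():
    out = subprocess.check_output(
        ['lvs', '--noheadings', '--readonly', '--separator=";"', '-a',
         '-o', 'lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size'],
        universal_newlines=True)
    result = []
    for line in out.splitlines():
        fields = line.strip().split('";"')
        if len(fields) < 6:
            continue
        tags = dict(t.split('=', 1) for t in fields[0].split(',') if '=' in t)
        result.append({'tags': tags, 'lv_path': fields[1]})
    return result

def find_osd_lv(osd_id, osd_fsid):
    for lv in list_lvs():
        if (lv['tags'].get('ceph.osd_id') == str(osd_id)
                and lv['tags'].get('ceph.osd_fsid') == osd_fsid):
            return lv
    raise RuntimeError('could not find osd.%s with osd_fsid %s' % (osd_id, osd_fsid))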

History

#1 Updated by Sage Weil about 4 years ago

  • Priority changed from High to Urgent

This reproduces 100% of the time when the test_cephadm workunit runs on Ubuntu:

2020-03-21T13:57:39.849 INFO:tasks.workunit.client.0.smithi039.stderr:+ nfs_rados_pool=nfs-ganesha
2020-03-21T13:57:39.849 INFO:tasks.workunit.client.0.smithi039.stderr:+ sudo /tmp/tmp.l90EJwZuJ4/cephadm --image quay.io/ceph-ci/ceph:octopus shell --fsid 00000000-0000-0000-0000-0000deadbeef --config /tmp/tmp.kFQyFfDNWX/tmp.eS9p6LtV1c --keyring /tmp/tmp.kFQyFfDNWX/tmp.qqq8KuQplb -- ceph osd pool create nfs-ganesha 64
2020-03-21T13:57:42.702 INFO:tasks.workunit.client.0.smithi039.stderr:pool 'nfs-ganesha' created
2020-03-21T13:57:43.423 INFO:tasks.workunit.client.0.smithi039.stderr:+ sudo /tmp/tmp.l90EJwZuJ4/cephadm --image quay.io/ceph-ci/ceph:octopus shell --fsid 00000000-0000-0000-0000-0000deadbeef --config /tmp/tmp.kFQyFfDNWX/tmp.eS9p6LtV1c --keyring /tmp/tmp.kFQyFfDNWX/tmp.qqq8KuQplb -- rados --pool nfs-ganesha --namespace nfs-ns create conf-nfs.a

and then hangs.

The reason is that one OSD did not come up, so the rados pool is not active:

root@smithi039:/# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         0.00290  root default                                 
-3         0.00290      host smithi039                           
 1    hdd  0.00290          osd.1           up   1.00000  1.00000
 0               0  osd.0                 down         0  1.00000

ceph-volume is failing partway through:
[2020-03-21 13:56:10,182][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm activate 0 db0e6efc-9ed5-4ffb-b942-5bf6b5864286 --no-systemd
[2020-03-21 13:56:10,183][ceph_volume.process][INFO  ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-21 13:56:10,254][ceph_volume.process][INFO  ] stdout ";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"Z35H8O-SgZQ-KHGH-VX7H-oivI-weww-xkVkdq";"<3.00g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO  ] stdout ";"/dev/test_cephadm/test_cephadm.1";"test_cephadm.1";"test_cephadm";"NLCVih-yaPe-1eQQ-uTAI-z3EL-krYn-z2epGF";"<3.00g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_1";"lv_1";"vg_nvme";"vkBYob-062z-L6Mj-D91v-ZJb0-Oz4Q-10ze7b";"89.40g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_2";"lv_2";"vg_nvme";"V5Op4v-PCgX-RwXM-u1fP-hc3T-fDly-r3baVN";"89.40g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_3";"lv_3";"vg_nvme";"xe1QnH-TDSn-iiOh-Hs3J-MlXC-t3eU-q3CyxH";"89.40g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_4";"lv_4";"vg_nvme";"UeIhNY-HkXr-ufJ9-D0wG-5W90-tsGg-0iwWQd";"89.40g
[2020-03-21 13:56:10,255][ceph_volume.process][INFO  ] stdout ";"/dev/vg_nvme/lv_5";"lv_5";"vg_nvme";"P9wxjU-jvEv-kAlk-Qfk6-gcKL-33wS-KIRypf";"14.90g
[2020-03-21 13:56:10,256][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 150, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 42, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 341, in main
    self.activate(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 251, in activate
    raise RuntimeError('could not find osd.%s with osd_fsid %s' % (osd_id, osd_fsid))
RuntimeError: could not find osd.0 with osd_fsid db0e6efc-9ed5-4ffb-b942-5bf6b5864286

#2 Updated by Jan Fajerski about 4 years ago

It looks like the lv_tags don't stick or are removed at some point.

[2020-03-19 16:36:48,594][ceph_volume.process][INFO  ] Running command: /usr/sbin/lvchange --addtag ceph.osd_fsid=dc4320c1-b70d-4796-94e6-6fbb54aa542e /dev/test_cephadm/test_cephadm.0
[2020-03-19 16:36:48,754][ceph_volume.process][INFO  ] stdout Logical volume test_cephadm/test_cephadm.0 changed.
[2020-03-19 16:36:48,754][ceph_volume.process][INFO  ] Running command: /usr/sbin/lvchange --addtag ceph.osd_id=0 /dev/test_cephadm/test_cephadm.0
[2020-03-19 16:36:48,946][ceph_volume.process][INFO  ] stdout Logical volume test_cephadm/test_cephadm.0 changed.

When lvm list queries the LVs, the tags are there:

[2020-03-19 16:36:57,385][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm list --format json
[2020-03-19 16:36:57,386][ceph_volume.process][INFO  ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -S  -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-19 16:36:57,457][ceph_volume.process][INFO  ] stdout ceph.block_device=/dev/test_cephadm/test_cephadm.0,ceph.block_uuid=0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=00000000-0000-0000-0000-0000deadbeef,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=dc4320c1-b70d-4796-94e6-6fbb54aa542e,ceph.osd_id=0,ceph.type=block,ceph.vdo=0";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g

But when activate lists the LVs, the tags are gone:

[2020-03-19 16:37:01,918][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm activate 0 dc4320c1-b70d-4796-94e6-6fbb54aa542e --no-systemd
[2020-03-19 16:37:01,919][ceph_volume.process][INFO  ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2020-03-19 16:37:01,985][ceph_volume.process][INFO  ] stdout ";"/dev/test_cephadm/test_cephadm.0";"test_cephadm.0";"test_cephadm";"0zyXvS-jZpt-ChrF-mHcy-btRw-9WHq-QBpBTY";"<3.00g
[2020-03-19 16:37:01,986][ceph_volume.process][INFO  ] stdout ";"/dev/test_cephadm/test_cephadm.1";"test_cephadm.1";"test_cephadm";"93hDpc-g10A-YBd4-RxFg-cSk3-peow-UlWBTo";"<3.00g

There is also an additional LV now. I will look at the test implementation; something might be racing here and erasing our tags.
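
To narrow down when exactly the tags vanish, a small polling script along these lines could be run next to the test (a debugging sketch only; the lvs call mirrors the one logged above, the rest is ad hoc):

# Debugging sketch (not part of ceph-volume): poll `lvs` and report the
# moment the ceph.* tags on the suspect LV change or disappear, to catch
# whatever is racing with us. The lvs invocation mirrors the one in the
# log above.
import subprocess
import time

LV_PATH = '/dev/test_cephadm/test_cephadm.0'   # LV under suspicion

def ceph_tags(lv_path):
    out = subprocess.check_output(
        ['lvs', '--noheadings', '--readonly', '--separator=";"', '-a',
         '-o', 'lv_tags,lv_path'], universal_newlines=True)
    for line in out.splitlines():
        fields = line.strip().split('";"')
        if len(fields) >= 2 and fields[1] == lv_path:
            return sorted(t for t in fields[0].split(',') if t.startswith('ceph.'))
    return None   # LV not listed at all

if __name__ == '__main__':
    last = ceph_tags(LV_PATH)
    while True:
        current = ceph_tags(LV_PATH)
        if current != last:
            print('%s: tags changed from %s to %s' % (LV_PATH, last, current))
            last = current
        time.sleep(1)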

#3 Updated by Sage Weil about 4 years ago

  • Status changed from New to Pending Backport

#4 Updated by Sage Weil about 4 years ago

  • Status changed from Pending Backport to In Progress

#5 Updated by Jan Fajerski almost 4 years ago

After talking to LVM folks, I ran some tests aimed at LVM state synchronisation and locking. What we currently do (mapping LVM's locking dir, /run/udev, and /dev into the container) seems to be enough to facilitate multi-actor LVM access.

I tracked this particular issue down to a version mismatch. The behaviour was only seen with an Ubuntu 18.04 host. Running LVM from an Ubuntu 18.04 container shows the same behaviour, whereas containers with newer LVM binaries (say, Ubuntu Focal) behave correctly.

I have not yet identified the lowest LVM version that behaves well. The most annoying thing here is that we implicitly introduce a version dependency between the host OS and the container if (and only if) we actually need LVM access from the host in real-world scenarios. As I understand it we don't, but we should still document this, since careless users can end up in a bad place.
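
As a starting point for that documentation, a quick host-vs-container version comparison could look roughly like this (sketch only; the image name is just the one from the log above, and the exact podman invocation depends on the image's entrypoint):

# Quick check of the LVM version on the host vs. inside the container.
# The image name is the one from the log above; treat it as a placeholder.
# Depending on the image's entrypoint, `podman run` may additionally need
# `--entrypoint lvm` (and then only `version` as the trailing argument).
import subprocess

IMAGE = 'quay.io/ceph-ci/ceph:octopus'   # placeholder

def lvm_version(cmd):
    out = subprocess.check_output(cmd, universal_newlines=True)
    return out.strip().splitlines()[0]

if __name__ == '__main__':
    print('host     :', lvm_version(['lvm', 'version']))
    print('container:', lvm_version(['podman', 'run', '--rm', IMAGE, 'lvm', 'version']))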

#6 Updated by Sebastian Wagner over 3 years ago

  • Related to Bug #24790: Can not activate data partition on bluestore added

#7 Updated by Jan Fajerski over 3 years ago

  • Related to deleted (Bug #24790: Can not activate data partition on bluestore)

#8 Updated by Sage Weil about 3 years ago

  • Priority changed from Urgent to Normal
