Bug #57918

closed

CEPHADM_REFRESH_FAILED: failed to probe daemons or devices

Added by Sake Paulusma over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
quincy,pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Last Friday I successfully upgraded the Ceph cluster from 17.2.3 to 17.2.5 with "ceph orch upgrade start --image localcontainerregistry.local.com:5000/ceph/ceph:v17.2.5-20221017". After some time (about an hour?) I got a health warning: CEPHADM_REFRESH_FAILED: failed to probe daemons or devices. I only use CephFS on the cluster and it is still working correctly.
Checking the running services, everything is up and running: mon, osd and mds. But on the hosts running the mon and mds services I get errors in cephadm.log; see the log lines below.

It looks like cephadm tries to start a container to check something?

Ceph is running on 11 VMs:
  • 3 mon
  • 2 mds
  • 6 osds

The mon and mds hosts have 2 disks: a system disk and a "data" disk. The data disk isn't used, so parted lists /dev/sdb as "unrecognised disk label". The OSD hosts have 3 disks: a system disk and two data disks. I configured an OSD service to use all available disks on the OSD hosts.

On mon nodes I got the following:

2022-10-24 10:31:43,880 7f179e5bfb80 DEBUG --------------------------------------------------------------------------------
cephadm ['gather-facts']
2022-10-24 10:31:44,333 7fc2d52b6b80 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0', 'ceph-volume', '--fsid', '8909ef90-22ea-11ed-8df1-0050569ee1b1', '--', 'inventory', '--format=json-pretty', '--filter-for-batch']
2022-10-24 10:31:44,663 7fc2d52b6b80 INFO Inferring config /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/mon.oqsoel24332/config
2022-10-24 10:31:44,663 7fc2d52b6b80 DEBUG Using specified fsid: 8909ef90-22ea-11ed-8df1-0050569ee1b1
2022-10-24 10:31:45,574 7fc2d52b6b80 INFO Non-zero exit code 1 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0 -e NODE_NAME=monnode2.local.com -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/run/ceph:z -v /var/log/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/log/ceph:z -v /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/selinux:/sys/fs/selinux:ro -v /:/rootfs -v /tmp/ceph-tmp31tx1iy2:/etc/ceph/ceph.conf:z localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0 inventory --format=json-pretty --filter-for-batch
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr Traceback (most recent call last):
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/sbin/ceph-volume", line 11, in <module>
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     self.main(self.argv)
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     return f(*a, **kw)
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     terminal.dispatch(self.mapper, subcommand_args)
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     instance.main()
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/inventory/main.py", line 53, in main
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     with_lsm=self.args.with_lsm))
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 39, in __init__
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     all_devices_vgs = lvm.get_all_devices_vgs()
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in get_all_devices_vgs
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     return [VolumeGroup(**vg) for vg in vgs]
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in <listcomp>
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     return [VolumeGroup(**vg) for vg in vgs]
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 517, in __init__
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr     raise ValueError('VolumeGroup must have a non-empty name')
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr ValueError: VolumeGroup must have a non-empty name

On mds nodes I got the following:

2022-10-24 10:25:18,506 7f613f6fdb80 DEBUG --------------------------------------------------------------------------------
cephadm ['gather-facts']
2022-10-24 10:25:19,047 7fd9b0d92b80 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0', 'ceph-volume', '--fsid', '8909ef90-22ea-11ed-8df1-0050569ee1b1', '--', 'inventory', '--format=json-pretty', '--filter-for-batch']
2022-10-24 10:25:19,388 7fd9b0d92b80 DEBUG Using specified fsid: 8909ef90-22ea-11ed-8df1-0050569ee1b1
2022-10-24 10:25:20,306 7fd9b0d92b80 INFO Non-zero exit code 1 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0 -e NODE_NAME=mdsnode1.local.com -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/run/ceph:z -v /var/log/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/log/ceph:z -v /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/selinux:/sys/fs/selinux:ro -v /:/rootfs localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0 inventory --format=json-pretty --filter-for-batch
2022-10-24 10:25:20,306 7fd9b0d92b80 INFO /bin/podman: stderr Traceback (most recent call last):
2022-10-24 10:25:20,306 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/sbin/ceph-volume", line 11, in <module>
2022-10-24 10:25:20,306 7fd9b0d92b80 INFO /bin/podman: stderr     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2022-10-24 10:25:20,306 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     self.main(self.argv)
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     return f(*a, **kw)
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     terminal.dispatch(self.mapper, subcommand_args)
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     instance.main()
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/inventory/main.py", line 53, in main
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     with_lsm=self.args.with_lsm))
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 39, in __init__
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     all_devices_vgs = lvm.get_all_devices_vgs()
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in get_all_devices_vgs
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     return [VolumeGroup(**vg) for vg in vgs]
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in <listcomp>
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     return [VolumeGroup(**vg) for vg in vgs]
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 517, in __init__
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr     raise ValueError('VolumeGroup must have a non-empty name')
2022-10-24 10:25:20,307 7fd9b0d92b80 INFO /bin/podman: stderr ValueError: VolumeGroup must have a non-empty name

Disk commands output

[cephadm@mdshost2 ~]$ sudo lvs -a
  LV               VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_home          vg_sys -wi-ao---- 256.00m
  lv_opt           vg_sys -wi-ao----   3.00g
  lv_root          vg_sys -wi-ao----   5.00g
  lv_swap          vg_sys -wi-ao----   7.56g
  lv_tmp           vg_sys -wi-ao----   1.00g
  lv_var           vg_sys -wi-ao----  15.00g
  lv_var_log       vg_sys -wi-ao----   5.00g
  lv_var_log_audit vg_sys -wi-ao---- 512.00m

[cephadm@mdshost2 ~]$ sudo vgs -a
  VG     #PV #LV #SN Attr   VSize   VFree
  vg_sys   1   8   0 wz--n- <49.00g 11.68g

[cephadm@mdshost2 ~]$ sudo parted --list
Model: VMware Virtual disk (scsi)
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  1075MB  1074MB  primary  xfs          boot
 2      1075MB  53.7GB  52.6GB  primary               lvm

Error: /dev/sdb: unrecognised disk label
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:
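
Putting the traceback and the disk listing together: /dev/sdb has no partition table and no LVM metadata, so the LVM report that ceph-volume builds its inventory from apparently contains an entry whose vg_name is empty, and constructing a VolumeGroup from that entry raises the ValueError above. Below is a minimal sketch (not ceph-volume code) of that failure mode and of the kind of guard a fix could add; the simplified VolumeGroup class, the sample report data and the filter are illustrative assumptions, not the actual patch.

# Minimal sketch: an LVM report entry with an empty vg_name triggers the
# ValueError shown in the traceback above.

class VolumeGroup:
    """Simplified stand-in for ceph_volume.api.lvm.VolumeGroup."""
    def __init__(self, **kw):
        if not kw.get('vg_name'):
            raise ValueError('VolumeGroup must have a non-empty name')
        for k, v in kw.items():
            setattr(self, k, v)

# Hypothetical report data: /dev/sda2 belongs to vg_sys, while the blank
# /dev/sdb shows up without a volume group name.
report = [
    {'vg_name': 'vg_sys', 'pv_name': '/dev/sda2'},
    {'vg_name': '', 'pv_name': '/dev/sdb'},
]

# Failure mode: the list comprehension dies on the blank-disk entry.
try:
    vgs = [VolumeGroup(**vg) for vg in report]
except ValueError as exc:
    print(f'inventory fails: {exc}')

# The kind of guard a fix could add: skip entries without a vg_name.
vgs = [VolumeGroup(**vg) for vg in report if vg.get('vg_name')]
print([vg.vg_name for vg in vgs])  # ['vg_sys']

With the blank disk removed (or entries without a VG filtered out), the inventory run completes, which matches the workaround in note #1 below.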


Related issues 2 (0 open, 2 closed)

Copied to ceph-volume - Backport #58278: quincy: CEPHADM_REFRESH_FAILED: failed to probe daemons or devices (Resolved, Guillaume Abrioux)
Copied to ceph-volume - Backport #58279: pacific: CEPHADM_REFRESH_FAILED: failed to probe daemons or devices (Resolved, Guillaume Abrioux)
#1

Updated by Sake Paulusma over 1 year ago

Fixed the issue by removing the unused disk, but an empty disk shouldn't be an issue.

#2

Updated by Guillaume Abrioux over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Guillaume Abrioux
  • Priority changed from Normal to High
#3

Updated by Guillaume Abrioux over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to quincy,pacific
  • Pull request ID set to 48707
#4

Updated by Guillaume Abrioux over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
#5

Updated by Guillaume Abrioux over 1 year ago

  • Copied to Backport #58278: quincy: CEPHADM_REFRESH_FAILED: failed to probe daemons or devices added
#6

Updated by Guillaume Abrioux over 1 year ago

  • Copied to Backport #58279: pacific: CEPHADM_REFRESH_FAILED: failed to probe daemons or devices added
#7

Updated by Guillaume Abrioux over 1 year ago

  • Tags set to backport_processed
#8

Updated by Guillaume Abrioux about 1 year ago

  • Status changed from Pending Backport to Resolved