Bug #58700 (open)

cephadm: UPGRADE_REDEPLOY_DAEMON: unit activation timeout on upgraded osds due to change in activation method

Added by Maximilian Matzinger about 1 year ago. Updated 12 months ago.

Status: In Progress
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Upgrading to 16.2.11 on cephadm-managed nodes with slow, LVM-backed devices can fail to upgrade OSDs, because raw activation now takes precedence over LVM activation and can be very slow.

The issue is caused by a change to the daemon's unit.run script, which now uses ceph-volume activate ... instead of the ceph-volume lvm activate ... used on previous versions.
https://github.com/ceph/ceph/blob/v16.2.11/src/cephadm/cephadm#L3070-L3105
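Roughly, the change in the generated unit.run looks like the following (a minimal sketch, not the actual cephadm code; the helper name is made up, and the flag spelling should be checked against the linked source):

    # Sketch only: how the OSD activation command in unit.run changed.
    # osd_activate_cmd() is a hypothetical helper; the flags mirror the
    # ceph-volume CLI as referenced above, but verify against the actual
    # generated unit.run.
    def osd_activate_cmd(osd_id: str, osd_fsid: str, new_style: bool) -> str:
        if not new_style:
            # pre-16.2.11: go straight to LVM activation
            return f'ceph-volume lvm activate {osd_id} {osd_fsid} --no-systemd'
        # 16.2.11: generic activation, which probes raw devices before LVM
        return f'ceph-volume activate --osd-id {osd_id} --osd-uuid {osd_fsid} --no-systemd'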

The command tries raw activation before LVM:
https://github.com/ceph/ceph/blob/v16.2.11/src/ceph-volume/ceph_volume/activate/main.py#L45-L56
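Condensed, the ordering in the linked main.py is roughly this (a simplified paraphrase, not verbatim; raw_activate/lvm_activate stand in for the RAW/LVM activate classes, and any further fallback is omitted):

    # Simplified paraphrase of ceph-volume's generic activate flow:
    # raw activation is attempted first; LVM only runs if raw raises.
    def activate(osd_id, osd_fsid):
        try:
            # scans every device lsblk reports, probing for bluestore labels
            raw_activate(osd_id, osd_fsid)
            return
        except Exception as e:
            print(f'Activation via raw failed: {e}')
        try:
            lvm_activate(osd_id, osd_fsid)
            return
        except Exception as e:
            print(f'Activation via LVM failed: {e}')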

Raw activation probes for the correct device in an O(n^2) loop over all devices found by lsblk.
https://github.com/ceph/ceph/blob/v16.2.11/src/ceph-volume/ceph_volume/devices/raw/list.py#L81-L113
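The quadratic behaviour follows a pattern like this (an illustrative sketch of the linked listing code, not the exact implementation; _get_bluestore_info() is the real helper, the other names are made up):

    # Illustrative: every device reported by lsblk can trigger another walk
    # over the device list (LV parents/children, dm-crypt, multipath), and
    # each candidate is probed with _get_bluestore_info(), which gives
    # O(len(devs)^2) probes in total.
    devs = lsblk_all()                                  # 318 entries on our nodes
    result = {}
    for dev in devs:
        for candidate in related_devices(dev, devs):    # rescans devs
            info = _get_bluestore_info(candidate)       # slow on busy/spinning disks
            if info:
                result[info['osd_uuid']] = info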

On our nodes, len(devs) turns out to be 318 for 40 physical storage devices, because already-mapped LVM, DB/WAL, dm-crypt and multipath devices are all counted, resulting in ~100k _get_bluestore_info() checks on slow devices.
In our case the activate command takes about 40 minutes before raw activation fails; it then falls back to LVM and quickly succeeds.
That is far longer than the unit's TimeoutStartSec, so the unit fails to start and the upgrade is paused (see the mgr log excerpt below).
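For scale (back-of-envelope from the numbers above; the per-probe latency is inferred from the observed runtime, not measured directly):

    # Back-of-envelope using the figures in this report:
    devs = 318
    probes = devs ** 2                  # ~101k _get_bluestore_info() calls
    runtime_s = 40 * 60                 # ~40 minutes before raw activation gives up
    print(probes, runtime_s / probes)   # ~101124 probes, ~24 ms per probe
    # Whatever TimeoutStartSec the unit uses (on the order of minutes), it is
    # exceeded long before the LVM fallback ever runs.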

Feb 13 09:56:47 mon0 conmon[280000]: mgr.mon0.vasejl (mgr.71915487) 545835 : cephadm [ERR] cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-121f50c6-147e-11ec-8d46-3868dd37cc30-osd-0
Feb 13 09:56:47 mon0 conmon[280000]: /bin/podman: stderr Error: inspecting object: no such container ceph-121f50c6-147e-11ec-8d46-3868dd37cc30-osd-0
Feb 13 09:56:47 mon0 conmon[280000]: Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-121f50c6-147e-11ec-8d46-3868dd37cc30-osd.0
Feb 13 09:56:47 mon0 conmon[280000]: /bin/podman: stderr Error: inspecting object: no such container ceph-121f50c6-147e-11ec-8d46-3868dd37cc30-osd.0
Feb 13 09:56:47 mon0 conmon[280000]: Deploy daemon osd.0 ...
Feb 13 09:56:47 mon0 conmon[280000]: Non-zero exit code 1 from systemctl start ceph-121f50c6-147e-11ec-8d46-3868dd37cc30@osd.0
Feb 13 09:56:47 mon0 conmon[280000]: systemctl: stderr Job for ceph-121f50c6-147e-11ec-8d46-3868dd37cc30@osd.0.service failed because a timeout was exceeded.
Feb 13 09:56:47 mon0 conmon[280000]: systemctl: stderr See "systemctl status ceph-121f50c6-147e-11ec-8d46-3868dd37cc30@osd.0.service" and "journalctl -xe" for details.
Feb 13 09:56:47 mon0 conmon[280000]: Traceback (most recent call last):
Feb 13 09:56:47 mon0 conmon[280000]:   File "/var/lib/ceph/121f50c6-147e-11ec-8d46-3868dd37cc30/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 9248, in <module>
Feb 13 09:56:47 mon0 conmon[280000]:     main()
Feb 13 09:56:47 mon0 conmon[280000]:   File "/var/lib/ceph/121f50c6-147e-11ec-8d46-3868dd37cc30/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 9236, in main
Feb 13 09:56:47 mon0 conmon[280000]:     r = ctx.func(ctx)
Feb 13 09:56:47 mon0 conmon[280000]:   File "/var/lib/ceph/121f50c6-147e-11ec-8d46-3868dd37cc30/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 1990, in _default_image
Feb 13 09:56:47 mon0 conmon[280000]:     return func(ctx)
Feb 13 09:56:47 mon0 conmon[280000]:   File "/var/lib/ceph/121f50c6-147e-11ec-8d46-3868dd37cc30/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 5041, in command_deploy
Feb 13 09:56:47 mon0 conmon[280000]:     ports=daemon_ports)
Feb 13 09:56:47 mon0 conmon[280000]:   File "/var/lib/ceph/121f50c6-147e-11ec-8d46-3868dd37cc30/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 2952, in deploy_daemon
Feb 13 09:56:47 mon0 conmon[280000]:     c, osd_fsid=osd_fsid, ports=ports)
Feb 13 09:56:47 mon0 conmon[280000]:   File "/var/lib/ceph/121f50c6-147e-11ec-8d46-3868dd37cc30/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 3197, in deploy_daemon_units
Feb 13 09:56:47 mon0 conmon[280000]:     call_throws(ctx, ['systemctl', 'start', unit_name])
Feb 13 09:56:47 mon0 conmon[280000]:   File "/var/lib/ceph/121f50c6-147e-11ec-8d46-3868dd37cc30/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 1657, in call_throws
Feb 13 09:56:47 mon0 conmon[280000]:     raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
Feb 13 09:56:47 mon0 conmon[280000]: RuntimeError: Failed command: systemctl start ceph-121f50c6-147e-11ec-8d46-3868dd37cc30@osd.0: Job for ceph-121f50c6-147e-11ec-8d46-3868dd37cc30@osd.0.service failed because a timeout was exceeded.
Feb 13 09:56:47 mon0 conmon[280000]: See "systemctl status ceph-121f50c6-147e-11ec-8d46-3868dd37cc30@osd.0.service" and "journalctl -xe" for details.
Feb 13 09:56:47 mon0 conmon[280000]: Traceback (most recent call last):
Feb 13 09:56:47 mon0 conmon[280000]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1456, in _remote_connection
Feb 13 09:56:47 mon0 conmon[280000]:     yield (conn, connr)
Feb 13 09:56:47 mon0 conmon[280000]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1353, in _run_cephadm
Feb 13 09:56:47 mon0 conmon[280000]:     code, '\n'.join(err)))
Feb 13 09:56:47 mon0 conmon[280000]: orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-121f50c6-147e-11ec-8d46-3868dd37cc30-osd-0
Feb 13 09:56:47 mon0 conmon[280000]: /bin/podman: stderr Error: inspecting object: no such container ceph-121f50c6-147e-11ec-8d46-3868dd37cc30-osd-0
Feb 13 09:56:47 mon0 conmon[280000]: Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-121f50c6-147e-11ec-8d46-3868dd37cc30-osd.0
Feb 13 09:56:47 mon0 conmon[280000]: /bin/podman: stderr Error: inspecting object: no such container ceph-121f50c6-147e-11ec-8d46-3868dd37cc30-osd.0
Actions #1

Updated by Guillaume Abrioux about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Guillaume Abrioux
Actions #2

Updated by Maximilian Matzinger 12 months ago

Fixed in 16.2.12.
