Bug #62958


cephadm: mgr-nfs upgrade test fails trying to redeploy first mgr daemon

Added by Adam King 8 months ago. Updated 7 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From the cephadm logs:

2023-09-08 19:03:19,673 7f017b1f1b80 DEBUG Determined image: 'quay.ceph.io/ceph-ci/ceph@sha256:29eb1b22bdc86e11facd8e3b821e546994d614ae2a0aec9d47234c7aede558d5'
2023-09-08 19:03:19,693 7f017b1f1b80 INFO Redeploy daemon mgr.smithi012.wqsagl ...
2023-09-08 19:06:22,875 7f017b1f1b80 INFO Non-zero exit code 1 from systemctl daemon-reload
2023-09-08 19:06:22,875 7f017b1f1b80 INFO systemctl: stderr Failed to reload daemon: Connection timed out
2023-09-08 19:06:23,243 7fb3be39fb80 DEBUG --------------------------------------------------------------------------------
cephadm ['--timeout', '895', 'check-host']
2023-09-08 19:06:23,287 7fb3be39fb80 INFO podman (/usr/bin/podman) version 4.6.1 is present
2023-09-08 19:06:23,287 7fb3be39fb80 INFO systemctl is present
2023-09-08 19:06:23,287 7fb3be39fb80 INFO lvcreate is present
2023-09-08 19:14:25,205 7fb3be39fb80 WARNING No time sync service is running; checked for ['chrony.service', 'chronyd.service', 'systemd-timesyncd.service', 'ntpd.service', 'ntp.service', 'ntpsec.service', 'openntpd.service']
2023-09-08 19:14:25,206 7fb3be39fb80 ERROR ERROR: No time synchronization is active

So, for whatever reason, `systemctl daemon-reload` is timing out.

https://pulpito.ceph.com/adking-2023-09-22_21:58:46-orch:cephadm-wip-adk-testing-2023-09-22-1218-distro-default-smithi/7401264
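
Some host-side checks that might help narrow this down during the hang window (a diagnostic sketch; these specific commands are my suggestion, not something the test runs):

# time a reload directly to confirm the hang is host-wide rather than cephadm-specific
time systemctl daemon-reload

# check whether the systemd manager has a backlog of queued jobs
systemctl list-jobs

# daemon-reload is a D-Bus call to PID 1, so check that the bus itself is responsive
busctl status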

Actions #1

Updated by Adam King 8 months ago

NOTE: in an interactive rerun of this, I found that simply resuming the upgrade (it gets paused when upgrading the first mgr fails) lets it proceed to completion successfully, so whatever is blocking the reload seems to be temporary. However, the failure has reproduced consistently when running this test on the main branch.
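
For reference, the manual recovery from that interactive rerun amounts to the standard upgrade commands:

# the upgrade pauses itself after the mgr redeploy failure
cephadm shell -- ceph orch upgrade status

# resuming was enough for the upgrade to run to completion
cephadm shell -- ceph orch upgrade resume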

Actions #2

Updated by Adam King 7 months ago

The failure seems to reproduce consistently. Log segment from a later run:

2023-10-02 18:19:48,632 7fec3da8ab80 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24', '--timeout', '895', 'inspect-image']
2023-10-02 18:19:48,711 7fec3da8ab80 DEBUG /usr/bin/podman: stdout bea6d6056140a1237279d00840c835e057b2d146f9ce7dc51391e06705a39910,[quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24]
2023-10-02 18:19:51,782 7fec3da8ab80 DEBUG ceph: stdout ceph version 18.0.0-6505-g37d71fd3 (37d71fd39f09a164b3518d8aef9e20133197e5f9) reef (dev)
2023-10-02 18:19:52,137 7ff867ac7b80 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24', '--timeout', '895', '_orch', 'deploy', '--fsid', '0f4cacde-614f-11ee-8db5-212e2dc638e7']
2023-10-02 18:19:52,168 7ff867ac7b80 DEBUG Loaded deploy configuration: {'fsid': '0f4cacde-614f-11ee-8db5-212e2dc638e7', 'name': 'mgr.smithi037.otxkbg', 'image': '', 'deploy_arguments': [], 'params': {'tcp_ports': [8443, 9283, 8765], 'allow_ptrace': True}, 'meta': {'service_name': 'mgr', 'ports': [8443, 9283, 8765], 'ip': None, 'deployed_by': ['quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24'], 'rank': None, 'rank_generation': None, 'extra_container_args': None, 'extra_entrypoint_args': None}, 'config_blobs': {'config': '# minimal ceph.conf for 0f4cacde-614f-11ee-8db5-212e2dc638e7\n[global]\n\tfsid = 0f4cacde-614f-11ee-8db5-212e2dc638e7\n\tmon_host = [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] [v2:172.21.15.99:3300/0,v1:172.21.15.99:6789/0]\n', 'keyring': '[mgr.smithi037.otxkbg]\nkey = AQDVBxtl0V+EBRAAL7HxayTcNPSffvcNIumVsw==\n'}}
2023-10-02 18:19:52,168 7ff867ac7b80 DEBUG Determined image: 'quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24'
2023-10-02 18:19:52,189 7ff867ac7b80 INFO Redeploy daemon mgr.smithi037.otxkbg ...
2023-10-02 18:22:53,628 7ff867ac7b80 INFO Non-zero exit code 1 from systemctl daemon-reload
2023-10-02 18:22:53,628 7ff867ac7b80 INFO systemctl: stderr Failed to reload daemon: Connection timed out
2023-10-02 18:22:54,040 7ff76eb10b80 DEBUG --------------------------------------------------------------------------------
cephadm ['--timeout', '895', 'check-host']
2023-10-02 18:22:54,086 7ff76eb10b80 INFO podman (/usr/bin/podman) version 4.6.1 is present
2023-10-02 18:22:54,086 7ff76eb10b80 INFO systemctl is present
2023-10-02 18:22:54,086 7ff76eb10b80 INFO lvcreate is present
2023-10-02 18:30:22,670 7f923cd15b80 DEBUG --------------------------------------------------------------------------------
cephadm ['shell', '--', 'ceph', 'orch', 'upgrade', 'status']
2023-10-02 18:30:22,697 7f923cd15b80 DEBUG /usr/bin/podman: 4.6.1
2023-10-02 18:30:22,699 7f923cd15b80 DEBUG Using default config: /etc/ceph/ceph.conf
2023-10-02 18:30:52,751 7ff76eb10b80 WARNING No time sync service is running; checked for ['chrony.service', 'chronyd.service', 'systemd-timesyncd.service', 'ntpd.service', 'ntp.service', 'ntpsec.service', 'openntpd.service']
2023-10-02 18:30:52,752 7ff76eb10b80 ERROR ERROR: No time synchronization is active

Actions #3

Updated by Adam King 7 months ago

`journalctl -f`, during the window in which `systemctl daemon-reload` would time out, was just printing

Oct 02 18:29:18 smithi037 kernel: SELinux: inode_doinit_use_xattr:  getxattr returned 2 for dev=0:148 ino=1
Oct 02 18:29:18 smithi037 kernel: SELinux: inode_doinit_use_xattr:  getxattr returned 2 for dev=0:148 ino=1
Oct 02 18:29:18 smithi037 kernel: SELinux: inode_doinit_use_xattr:  getxattr returned 2 for dev=0:148 ino=1
Oct 02 18:29:18 smithi037 kernel: SELinux: inode_doinit_use_xattr:  getxattr returned 2 for dev=0:148 ino=1

hundreds of times per second. Additionally, even just trying `systemctl -l` seemed to hang for me during that time.
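
To quantify the flood and to tie dev=0:148 back to a filesystem, something like the following should work (the window timestamps are illustrative, taken from the log above):

# count the SELinux messages in the kernel ring buffer over a one-minute window
journalctl -k --since "2023-10-02 18:29:00" --until "2023-10-02 18:30:00" | grep -c inode_doinit_use_xattr

# resolve anonymous device 0:148 to its mount; field 3 of mountinfo is major:minor
awk '$3 == "0:148"' /proc/self/mountinfo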

Actions #4

Updated by Adam King 7 months ago

`ceph health detail` at the time:

[root@smithi037 ~]# cephadm shell -- ceph health detail
Inferring fsid 0f4cacde-614f-11ee-8db5-212e2dc638e7
Using recent ceph image quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24
HEALTH_WARN Upgrading daemon mgr.smithi037.otxkbg on host smithi037 failed.
[WRN] UPGRADE_REDEPLOY_DAEMON: Upgrading daemon mgr.smithi037.otxkbg on host smithi037 failed.
    Upgrade daemon: mgr.smithi037.otxkbg: cephadm exited with an error code: 1, stderr: Redeploy daemon mgr.smithi037.otxkbg ...
Non-zero exit code 1 from systemctl daemon-reload
systemctl: stderr Failed to reload daemon: Connection timed out
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 7681, in <module>
  File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 7669, in main
  File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 5092, in command_deploy_from
  File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 5110, in _common_deploy
  File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 5147, in _dispatch_deploy
  File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 2570, in deploy_daemon
  File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 2803, in deploy_daemon_units
  File "/tmp/tmp7dyruzaw.cephadm.build/cephadmlib/call_wrappers.py", line 308, in call_throws
RuntimeError: Failed command: systemctl daemon-reload: Failed to reload daemon: Connection timed out
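
Assuming the host recovers on its own (as the interactive rerun suggests), the single failed daemon could also be retried directly instead of resuming the whole upgrade; a sketch using the daemon name and image from this run:

cephadm shell -- ceph orch daemon redeploy mgr.smithi037.otxkbg \
    quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24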
