Bug #62958
cephadm: mgr-nfs upgrade test fails trying to redeploy first mgr daemon
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
From the cephadm logs:
2023-09-08 19:03:19,673 7f017b1f1b80 DEBUG Determined image: 'quay.ceph.io/ceph-ci/ceph@sha256:29eb1b22bdc86e11facd8e3b821e546994d614ae2a0aec9d47234c7aede558d5'
2023-09-08 19:03:19,693 7f017b1f1b80 INFO Redeploy daemon mgr.smithi012.wqsagl ...
2023-09-08 19:06:22,875 7f017b1f1b80 INFO Non-zero exit code 1 from systemctl daemon-reload
2023-09-08 19:06:22,875 7f017b1f1b80 INFO systemctl: stderr Failed to reload daemon: Connection timed out
2023-09-08 19:06:23,243 7fb3be39fb80 DEBUG -------------------------------------------------------------------------------- cephadm ['--timeout', '895', 'check-host']
2023-09-08 19:06:23,287 7fb3be39fb80 INFO podman (/usr/bin/podman) version 4.6.1 is present
2023-09-08 19:06:23,287 7fb3be39fb80 INFO systemctl is present
2023-09-08 19:06:23,287 7fb3be39fb80 INFO lvcreate is present
2023-09-08 19:14:25,205 7fb3be39fb80 WARNING No time sync service is running; checked for ['chrony.service', 'chronyd.service', 'systemd-timesyncd.service', 'ntpd.service', 'ntp.service', 'ntpsec.service', 'openntpd.service']
2023-09-08 19:14:25,206 7fb3be39fb80 ERROR ERROR: No time synchronization is active
So, for whatever reason, the `systemctl daemon-reload` is timing out.
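For context, cephadm drives `systemctl` through a subprocess wrapper. A minimal Python sketch (illustrative only, not cephadm's actual wrapper; the function name and timeout value are made up) shows how an outright hang would surface differently from what the logs show here:

```python
import subprocess

def run_or_timeout(cmd, timeout_s):
    """Run a command, distinguishing a hang (wrapper-side timeout)
    from a completed run with a non-zero exit code.
    Illustrative sketch only -- not cephadm's real call wrapper."""
    try:
        res = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child never returned within timeout_s; it gets killed.
        return ("timeout", None)
    return ("exited", res.returncode)
```

Note that in the failure above, `systemctl daemon-reload` did actually return (exit code 1), but only after systemd's own D-Bus reply timed out ("Failed to reload daemon: Connection timed out") — so cephadm sees a failed command rather than a hung one.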
Updated by Adam King 8 months ago
NOTE: I did an interactive rerun of this at one point and found that if you just go in and resume the upgrade (it gets paused when upgrading the first mgr fails), the upgrade then proceeds to completion successfully. So this issue seems to be transient. However, it has reproduced consistently when running this test on the main branch.
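For anyone hitting the same paused upgrade, the resume step described above uses the standard orchestrator commands (run from a cephadm shell on a host with the admin keyring):

```shell
# Confirm the upgrade is paused after the failed mgr redeploy
ceph orch upgrade status

# Resume the paused upgrade; per the note above, it then runs to completion
ceph orch upgrade resume
```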
Updated by Adam King 7 months ago
This log segment seems to reproduce consistently:
2023-10-02 18:19:48,632 7fec3da8ab80 DEBUG -------------------------------------------------------------------------------- cephadm ['--image', 'quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24', '--timeout', '895', 'inspect-image']
2023-10-02 18:19:48,711 7fec3da8ab80 DEBUG /usr/bin/podman: stdout bea6d6056140a1237279d00840c835e057b2d146f9ce7dc51391e06705a39910,[quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24]
2023-10-02 18:19:51,782 7fec3da8ab80 DEBUG ceph: stdout ceph version 18.0.0-6505-g37d71fd3 (37d71fd39f09a164b3518d8aef9e20133197e5f9) reef (dev)
2023-10-02 18:19:52,137 7ff867ac7b80 DEBUG -------------------------------------------------------------------------------- cephadm ['--image', 'quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24', '--timeout', '895', '_orch', 'deploy', '--fsid', '0f4cacde-614f-11ee-8db5-212e2dc638e7']
2023-10-02 18:19:52,168 7ff867ac7b80 DEBUG Loaded deploy configuration: {'fsid': '0f4cacde-614f-11ee-8db5-212e2dc638e7', 'name': 'mgr.smithi037.otxkbg', 'image': '', 'deploy_arguments': [], 'params': {'tcp_ports': [8443, 9283, 8765], 'allow_ptrace': True}, 'meta': {'service_name': 'mgr', 'ports': [8443, 9283, 8765], 'ip': None, 'deployed_by': ['quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24'], 'rank': None, 'rank_generation': None, 'extra_container_args': None, 'extra_entrypoint_args': None}, 'config_blobs': {'config': '# minimal ceph.conf for 0f4cacde-614f-11ee-8db5-212e2dc638e7\n[global]\n\tfsid = 0f4cacde-614f-11ee-8db5-212e2dc638e7\n\tmon_host = [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] [v2:172.21.15.99:3300/0,v1:172.21.15.99:6789/0]\n', 'keyring': '[mgr.smithi037.otxkbg]\nkey = AQDVBxtl0V+EBRAAL7HxayTcNPSffvcNIumVsw==\n'}}
2023-10-02 18:19:52,168 7ff867ac7b80 DEBUG Determined image: 'quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24'
2023-10-02 18:19:52,189 7ff867ac7b80 INFO Redeploy daemon mgr.smithi037.otxkbg ...
2023-10-02 18:22:53,628 7ff867ac7b80 INFO Non-zero exit code 1 from systemctl daemon-reload
2023-10-02 18:22:53,628 7ff867ac7b80 INFO systemctl: stderr Failed to reload daemon: Connection timed out
2023-10-02 18:22:54,040 7ff76eb10b80 DEBUG -------------------------------------------------------------------------------- cephadm ['--timeout', '895', 'check-host']
2023-10-02 18:22:54,086 7ff76eb10b80 INFO podman (/usr/bin/podman) version 4.6.1 is present
2023-10-02 18:22:54,086 7ff76eb10b80 INFO systemctl is present
2023-10-02 18:22:54,086 7ff76eb10b80 INFO lvcreate is present
2023-10-02 18:30:22,670 7f923cd15b80 DEBUG -------------------------------------------------------------------------------- cephadm ['shell', '--', 'ceph', 'orch', 'upgrade', 'status']
2023-10-02 18:30:22,697 7f923cd15b80 DEBUG /usr/bin/podman: 4.6.1
2023-10-02 18:30:22,699 7f923cd15b80 DEBUG Using default config: /etc/ceph/ceph.conf
2023-10-02 18:30:52,751 7ff76eb10b80 WARNING No time sync service is running; checked for ['chrony.service', 'chronyd.service', 'systemd-timesyncd.service', 'ntpd.service', 'ntp.service', 'ntpsec.service', 'openntpd.service']
2023-10-02 18:30:52,752 7ff76eb10b80 ERROR ERROR: No time synchronization is active
Updated by Adam King 7 months ago
`journalctl -f` during the time where the `systemctl daemon-reload` would time out was just printing
Oct 02 18:29:18 smithi037 kernel: SELinux: inode_doinit_use_xattr: getxattr returned 2 for dev=0:148 ino=1
Oct 02 18:29:18 smithi037 kernel: SELinux: inode_doinit_use_xattr: getxattr returned 2 for dev=0:148 ino=1
Oct 02 18:29:18 smithi037 kernel: SELinux: inode_doinit_use_xattr: getxattr returned 2 for dev=0:148 ino=1
[... same message repeated ...]
hundreds of times per second. Additionally, even running a plain `systemctl -l` seemed to hang for me during that time.
Updated by Adam King 7 months ago
`ceph health detail` at the time:
[root@smithi037 ~]# cephadm shell -- ceph health detail
Inferring fsid 0f4cacde-614f-11ee-8db5-212e2dc638e7
Using recent ceph image quay.ceph.io/ceph-ci/ceph@sha256:77acc4c088be6af3a1febbb71c133f46cf301e630beda5ebf693dedd0cbb6c24
HEALTH_WARN Upgrading daemon mgr.smithi037.otxkbg on host smithi037 failed.
[WRN] UPGRADE_REDEPLOY_DAEMON: Upgrading daemon mgr.smithi037.otxkbg on host smithi037 failed.
    Upgrade daemon: mgr.smithi037.otxkbg: cephadm exited with an error code: 1, stderr: Redeploy daemon mgr.smithi037.otxkbg ...
    Non-zero exit code 1 from systemctl daemon-reload
    systemctl: stderr Failed to reload daemon: Connection timed out
    Traceback (most recent call last):
      File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 7681, in <module>
      File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 7669, in main
      File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 5092, in command_deploy_from
      File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 5110, in _common_deploy
      File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 5147, in _dispatch_deploy
      File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 2570, in deploy_daemon
      File "/tmp/tmp7dyruzaw.cephadm.build/__main__.py", line 2803, in deploy_daemon_units
      File "/tmp/tmp7dyruzaw.cephadm.build/cephadmlib/call_wrappers.py", line 308, in call_throws
    RuntimeError: Failed command: systemctl daemon-reload: Failed to reload daemon: Connection timed out
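The traceback bottoms out in `call_throws`, which raises `RuntimeError` whenever the wrapped command exits non-zero. A simplified sketch of that pattern (not cephadm's actual implementation in `cephadmlib/call_wrappers.py`):

```python
import subprocess

def call_throws(cmd):
    """Run a command and raise RuntimeError carrying its stderr on a
    non-zero exit. Simplified sketch of the pattern shown in the
    traceback above -- not cephadm's actual call_throws."""
    res = subprocess.run(cmd, capture_output=True, text=True)
    if res.returncode != 0:
        raise RuntimeError(
            f"Failed command: {' '.join(cmd)}: {res.stderr.strip()}"
        )
    return res.stdout, res.stderr
```

This is why the timed-out `systemctl daemon-reload` aborts the whole mgr redeploy (and pauses the upgrade) rather than being retried.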