Bug #56391
Teuthology jobs scheduled with rhel result in "the output has been hidden due to the fact that 'no_log: true' was specified for this result"
Description
/a/yuriw-2022-06-24_15:13:42-rados-wip-yuri4-testing-2022-06-22-1415-pacific-distro-default-smithi/teuthology.log
2022-06-24T15:59:07.388 INFO:teuthology.repo_utils:Resetting repo at /home/teuthworker/src/git.ceph.com_git_ceph-cm-ansible_main to origin/main
2022-06-24T15:59:07.410 INFO:teuthology.task.ansible:Playbook: [{'import_playbook': 'ansible_managed.yml'}, {'import_playbook': 'teuthology.yml'}, {'hosts': 'testnodes', 'tasks': [{'set_fact': {'ran_from_cephlab_playbook': True}}]}, {'import_playbook': 'testnodes.yml'}, {'import_playbook': 'container-host.yml'}, {'import_playbook': 'cobbler.yml'}, {'import_playbook': 'paddles.yml'}, {'import_playbook': 'pulpito.yml'}, {'hosts': 'testnodes', 'become': True, 'tasks': [{'name': 'Touch /ceph-qa-ready', 'file': {'path': '/ceph-qa-ready', 'state': 'touch'}, 'when': 'ran_from_cephlab_playbook|bool'}]}]
2022-06-24T15:59:07.412 DEBUG:teuthology.task.ansible:Running ansible-playbook -v --extra-vars '{"ansible_ssh_user": "ubuntu"}' -i /etc/ansible/hosts --limit smithi114.front.sepia.ceph.com,smithi175.front.sepia.ceph.com /home/teuthworker/src/git.ceph.com_git_ceph-cm-ansible_main/cephlab.yml
2022-06-24T16:01:50.160 INFO:teuthology.task.ansible:Archiving ansible failure log at: /home/teuthworker/archive/yuriw-2022-06-24_15:13:42-rados-wip-yuri4-testing-2022-06-22-1415-pacific-distro-default-smithi/6897479/ansible_failures.yaml
2022-06-24T16:01:50.163 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_082ae7ef4302fa54665ed0a2535e8e254118dcfd/teuthology/run_tasks.py", line 106, in run_tasks
manager.__enter__()
File "/home/teuthworker/src/git.ceph.com_git_teuthology_082ae7ef4302fa54665ed0a2535e8e254118dcfd/teuthology/task/__init__.py", line 123, in __enter__
self.begin()
File "/home/teuthworker/src/git.ceph.com_git_teuthology_082ae7ef4302fa54665ed0a2535e8e254118dcfd/teuthology/task/ansible.py", line 422, in begin
super(CephLab, self).begin()
File "/home/teuthworker/src/git.ceph.com_git_teuthology_082ae7ef4302fa54665ed0a2535e8e254118dcfd/teuthology/task/ansible.py", line 265, in begin
self.execute_playbook()
File "/home/teuthworker/src/git.ceph.com_git_teuthology_082ae7ef4302fa54665ed0a2535e8e254118dcfd/teuthology/task/ansible.py", line 291, in execute_playbook
self._handle_failure(command, status)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_082ae7ef4302fa54665ed0a2535e8e254118dcfd/teuthology/task/ansible.py", line 316, in _handle_failure
raise AnsibleFailedError(failures)
teuthology.exceptions.AnsibleFailedError: {'smithi175.front.sepia.ceph.com': {'attempts': 12, 'censored': "the output has been hidden due to the fact that 'no_log: true' was specified for this result", 'changed': True}, 'smithi114.front.sepia.ceph.com': {'attempts': 12, 'censored': "the output has been hidden due to the fact that 'no_log: true' was specified for this result", 'changed': True}}
2022-06-24T16:01:50.332 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=53b2d09a94b44934b3b4565b4e7a2b17
Seen in Pacific and Octopus:
https://pulpito.ceph.com/yuriw-2022-06-24_15:13:42-rados-wip-yuri4-testing-2022-06-22-1415-pacific-distro-default-smithi/
https://pulpito.ceph.com/yuriw-2022-06-24_16:58:58-rados-wip-yuri-testing-2022-06-24-0817-octopus-distro-default-smithi/
There seems to be an old yaml fragment somewhere that needs to be updated.
/a/yuriw-2022-06-24_15:13:42-rados-wip-yuri4-testing-2022-06-22-1415-pacific-distro-default-smithi/6897479/ansible_failures.yaml
smithi175.front.sepia.ceph.com:
  attempts: 12
  censored: 'the output has been hidden due to the fact that ''no_log: true'' was
    specified for this result'
  changed: true
smithi114.front.sepia.ceph.com:
  attempts: 12
  censored: 'the output has been hidden due to the fact that ''no_log: true'' was
    specified for this result'
  changed: true
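For reference, a result that contains only 'attempts', 'censored', and 'changed' is what Ansible records when a task that combines retries with no_log: true exhausts its attempts. A minimal sketch of a task with that shape (illustrative only; the task name is taken from the retry messages in ansible.log, and the command and its arguments are assumptions, not the actual ceph-cm-ansible task body):

# Illustrative sketch of a retried, censored task. When every attempt fails,
# Ansible keeps retrying until 'retries' is exhausted and then reports the
# result with the output replaced by the 'censored' message, matching the
# ansible_failures.yaml above.
- name: Register with subscription-manager.
  command: subscription-manager register --activationkey=... --org=...  # placeholder arguments
  register: registration
  retries: 12
  delay: 10
  until: registration.rc == 0
  no_log: true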
Related issues
Related to Bug #41387: "Register with subscription-manager." fails after 5 retries
History
#1 Updated by Laura Flores over 1 year ago
- Subject changed from Teuthology jobs scheduled with rhel 8.3 result in "the output has been hidden due to the fact that 'no_log: true' was specified for this result" to Teuthology jobs scheduled with rhel 8.3 and 8.4 result in "the output has been hidden due to the fact that 'no_log: true' was specified for this result"
Pacific experiences this failure with rhel 8.4, while Octopus experiences it with rhel 8.3.
Runs looked fine just a few days ago:
https://pulpito.ceph.com/yuriw-2022-06-21_16:28:27-rados-wip-yuri4-testing-2022-06-21-0704-pacific-distro-default-smithi/
https://pulpito.ceph.com/yuriw-2022-06-22_17:35:19-rados-wip-yuri5-testing-2022-06-22-0914-octopus-distro-default-smithi/
#2 Updated by Laura Flores over 1 year ago
Sentry reports this failure happening across the board: https://sentry.ceph.com/organizations/ceph/issues/36/events/53b2d09a94b44934b3b4565b4e7a2b17/events/?project=2
All occurrences started today, June 24th. The main branch is also affected, on rhel 8.6.
#3 Updated by Neha Ojha over 1 year ago
I wonder if this has something to do with the issue:
/a/prsrivas-2022-06-24_15:53:46-rgw-wip-rgw-gc-break-chain-distro-default-smithi/6897697/ansible.log
Friday 24 June 2022 21:52:37 +0000 (0:00:00.043) 0:03:04.413 ***********

TASK [common : Determine if node is registered with subscription-manager.] *****
fatal: [smithi107.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
...ignoring
Friday 24 June 2022 21:52:37 +0000 (0:00:00.922) 0:03:05.336 ***********
fatal: [smithi165.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
...ignoring
...
FAILED - RETRYING: Register with subscription-manager. (12 retries left).
FAILED - RETRYING: Register with subscription-manager. (12 retries left).
FAILED - RETRYING: Register with subscription-manager. (11 retries left).
FAILED - RETRYING: Register with subscription-manager. (11 retries left).
FAILED - RETRYING: Register with subscription-manager. (10 retries left).
FAILED - RETRYING: Register with subscription-manager. (10 retries left).
FAILED - RETRYING: Register with subscription-manager. (9 retries left).
FAILED - RETRYING: Register with subscription-manager. (9 retries left).
FAILED - RETRYING: Register with subscription-manager. (8 retries left).
FAILED - RETRYING: Register with subscription-manager. (8 retries left).
FAILED - RETRYING: Register with subscription-manager. (7 retries left).
FAILED - RETRYING: Register with subscription-manager. (7 retries left).
FAILED - RETRYING: Register with subscription-manager. (6 retries left).
FAILED - RETRYING: Register with subscription-manager. (6 retries left).
FAILED - RETRYING: Register with subscription-manager. (5 retries left).
FAILED - RETRYING: Register with subscription-manager. (5 retries left).
FAILED - RETRYING: Register with subscription-manager. (4 retries left).
#4 Updated by Laura Flores over 1 year ago
Yeah, that's coming from this task here:
https://github.com/ceph/ceph-cm-ansible/blob/main/roles/common/tasks/rhel-entitlements.yml#L38-L43
- name: Determine if node is registered with subscription-manager.
  command: subscription-manager identity
  register: subscription
  ignore_errors: true
  changed_when: false
  no_log: true
Doesn't seem like there's anything wrong with the task itself, and this file hasn't undergone any recent changes. But the problem seems to be that we can't successfully register a rhel node with the subscription manager.
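If we ever need to see what subscription-manager actually printed, one option (just a sketch, not something in the current playbook; debug_subscription is a made-up variable) would be to gate no_log on an extra var:

# Hypothetical debugging variant of the task above: no_log stays true by
# default, but passing -e debug_subscription=true lets the real
# subscription-manager error show up in ansible.log.
- name: Determine if node is registered with subscription-manager.
  command: subscription-manager identity
  register: subscription
  ignore_errors: true
  changed_when: false
  no_log: "{{ not (debug_subscription | default(false) | bool) }}"

That would keep the output hidden in normal runs while still allowing a one-off debug run against a single test node.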
#5 Updated by Laura Flores over 1 year ago
- Related to Bug #41387: "Register with subscription-manager." fails after 5 retries added
#6 Updated by Laura Flores over 1 year ago
The same issue was fixed before in the related Tracker I linked.
#7 Updated by David Galloway over 1 year ago
- Status changed from New to In Progress
- Assignee set to David Galloway
So the Satellite VM's disk filled up. I suspect there was already filesystem corruption, but growing the partition, logical volume, and filesystem caused xfs to be very unhappy. The postgres DB won't start anymore.
I am restoring the Satellite server from a backup but anticipate it will take days to sync and stabilize as the most recent snapshot was from 2019.
#8 Updated by Laura Flores over 1 year ago
I see. Thanks for looking into it David.
#9 Updated by Laura Flores over 1 year ago
- Subject changed from Teuthology jobs scheduled with rhel 8.3 and 8.4 result in "the output has been hidden due to the fact that 'no_log: true' was specified for this result" to Teuthology jobs scheduled with rhel result in "the output has been hidden due to the fact that 'no_log: true' was specified for this result"
#11 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
/a/yuriw-2022-06-30_14:20:05-rados-wip-yuri3-testing-2022-06-28-1737-distro-default-smithi/6907408
#12 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
/a/yuriw-2022-06-29_13:30:16-rados-wip-yuri3-testing-2022-06-28-1737-distro-default-smithi/6905506/
#13 Updated by Matan Breizman over 1 year ago
/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973864
#14 Updated by Matan Breizman over 1 year ago
Not the same error message, but it's also a failure at the same ansible stage on a rhel node:
2022-08-15T23:56:29.852 DEBUG:teuthology.task.ansible:Running ansible-playbook -v --extra-vars '{"ansible_ssh_user": "ubuntu"}' -i /etc/ansible/hosts --limit smithi078.front.sepia.ceph.com,smithi165.front.sepia.ceph.com /home/teuthworker/src/git.ceph.com_git_ceph-cm-ansible_main/cephlab.yml
2022-08-16T00:05:36.030 INFO:teuthology.task.ansible:Archiving ansible failure log at: /home/teuthworker/archive/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973866/ansible_failures.yaml
/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973866
from ansible.log (/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973866/ansible.log) :
TASK [common : Determine if node is registered with subscription-manager.] *****
fatal: [smithi165.front.sepia.ceph.com]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
...ignoring
FAILED - RETRYING: Register with subscription-manager. (12 retries left).
#15 Updated by David Galloway over 1 year ago
Matan Breizman wrote:
Not the same error message but it's also a failure with same ansible stage on a rhel node:
[...]/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973866
from ansible.log (/a/yuriw-2022-08-15_17:54:08-rados-wip-yuri2-testing-2022-08-15-0848-quincy-distro-default-smithi/6973866/ansible.log) :
[...]
That is a non-fatal error, and the system did go on to register successfully. The real error was:
failed: [smithi165.front.sepia.ceph.com] (item={'key': 'nvme0n1', 'value': {'virtual': 1, 'links': {'ids': ['lvm-pv-uuid-O1eT7A-94aA-KB2Q-NwlE-Lq6O-L8F2-03YTlQ', 'nvme-INTEL_SSDPEDMD400G4_CVFT623300CE400BGN', 'nvme-nvme.8086-43564654363233333030434534303042474e-494e54454c205353445045444d443430304734-00000001'], 'uuids': [], 'labels': [], 'masters': ['dm-0', 'dm-1', 'dm-2', 'dm-3', 'dm-4']}, 'vendor': None, 'model': 'INTEL SSDPEDMD400G4', 'sas_address': None, 'sas_device_handle': None, 'removable': '0', 'support_discard': '512', 'partitions': {}, 'rotational': '0', 'scheduler_mode': 'none', 'sectors': '781422768', 'sectorsize': '512', 'size': '372.61 GB', 'host': 'Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)', 'holders': ['vg_nvme-lv_2', 'vg_nvme-lv_5', 'vg_nvme-lv_3', 'vg_nvme-lv_1', 'vg_nvme-lv_4']}}) => {"ansible_loop_var": "item", "item": {"key": "nvme0n1", "value": {"holders": ["vg_nvme-lv_2", "vg_nvme-lv_5", "vg_nvme-lv_3", "vg_nvme-lv_1", "vg_nvme-lv_4"], "host": "Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)", "links": {"ids": ["lvm-pv-uuid-O1eT7A-94aA-KB2Q-NwlE-Lq6O-L8F2-03YTlQ", "nvme-INTEL_SSDPEDMD400G4_CVFT623300CE400BGN", "nvme-nvme.8086-43564654363233333030434534303042474e-494e54454c205353445045444d443430304734-00000001"], "labels": [], "masters": ["dm-0", "dm-1", "dm-2", "dm-3", "dm-4"], "uuids": []}, "model": "INTEL SSDPEDMD400G4", "partitions": {}, "removable": "0", "rotational": "0", "sas_address": null, "sas_device_handle": null, "scheduler_mode": "none", "sectors": "781422768", "sectorsize": "512", "size": "372.61 GB", "support_discard": "512", "vendor": null, "virtual": 1}}, "msg": "Failed to connect to the host via ssh: ssh: connect to host smithi165.front.sepia.ceph.com port 22: No route to host", "unreachable": true}
I suspect this was caused by https://tracker.ceph.com/issues/45512