Bug #57175
cephadm: don't try to write client/os tuning files to offline hosts
Description
If a host is known to be offline, we shouldn't keep trying to handle OS tuning profiles or client files on that host. Doing so can raise an exception that blocks further progress in the serve loop:
2022-08-17T21:46:17.371392+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.395290+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.417827+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.443274+0000 mgr.vm-00.eudebg [DBG] Opening connection to root@192.168.122.80 with ssh options '-F /tmp/cephadm-conf-f351v8jc -i /tmp/cephadm-identity-nxo5q57o'
2022-08-17T21:46:20.400567+0000 mgr.vm-00.eudebg [ERR] Can't communicate with remote host `192.168.122.80`, possibly because python3 is not installed there or you are missing NOPASSWD in sudoers. [Errno 113] Connect call failed ('192.168.122.80', 22)
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 103, in redirect_log
    yield
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 82, in _remote_connection
    preferred_auth=['publickey'], options=ssh_options)
  File "/lib/python3.6/site-packages/asyncssh/connection.py", line 6804, in connect
    'Opening SSH connection to')
  File "/lib/python3.6/site-packages/asyncssh/connection.py", line 299, in _connect
    local_addr=local_addr)
  File "/lib64/python3.6/asyncio/base_events.py", line 794, in create_connection
    raise exceptions[0]
  File "/lib64/python3.6/asyncio/base_events.py", line 781, in create_connection
    yield from self.sock_connect(sock, address)
  File "/lib64/python3.6/asyncio/selector_events.py", line 439, in sock_connect
    return (yield from fut)
  File "/lib64/python3.6/asyncio/selector_events.py", line 469, in _sock_connect_cb
    raise OSError(err, 'Connect call failed %s' % (address,))
OSError: [Errno 113] Connect call failed ('192.168.122.80', 22)
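The fix amounts to checking the known-offline host set before attempting any per-host write, so a dead host cannot raise and abort the whole serve loop. A minimal sketch of that guard, using hypothetical names (`write_tuning_files`, `HostConnectionError`) rather than the actual cephadm internals:

```python
class HostConnectionError(Exception):
    """Stand-in for the error raised when an SSH connection fails."""
    pass


def write_tuning_files(hosts, offline_hosts, write_fn):
    """Run write_fn on each reachable host; return the hosts skipped.

    hosts: all hosts the files should go to.
    offline_hosts: hosts already known to be unreachable.
    write_fn: callable performing the per-host write (may raise
              HostConnectionError if the host went down mid-loop).
    """
    skipped = []
    for host in hosts:
        if host in offline_hosts:
            # Known-offline host: don't even open an SSH connection.
            skipped.append(host)
            continue
        try:
            write_fn(host)
        except HostConnectionError:
            # A host that drops mid-loop must not block the remaining hosts.
            skipped.append(host)
    return skipped
```

The key point is that the offline check happens before the connection attempt, and any connection failure is contained per host instead of propagating out of the loop.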
Updated by Adam King over 1 year ago
- Status changed from In Progress to Pending Backport
Updated by Backport Bot over 1 year ago
- Copied to Backport #57377: quincy: cephadm: don't try to write client/os tuning files to offline hosts added
Updated by Adam King over 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Laura Flores over 1 year ago
We tried upgrading the Gibba cluster to the quincy-release for 17.2.4 and experienced this issue, but with a drained host:
We noticed some issues with the monitor before upgrading (gibba004), so we drained the host and proceeded with the upgrade using these commands:
HISTORY:
998 ceph -s
999 ssh gibba004
1000 ping gibba004
1001 ceph -s
1002 ceph orch drain gibba004 --force
1003 ceph orch host drain gibba004 --force
1004 ceph -s
1005 ceph orch host rm gibba004 --force --offline
1006 ceph -s
1007 ssh gibba006
1008 ssh gibba008
1009 ceph -s
1010 top
1011 ceph -s
1012 ceph orch daemon rm mon.gibba004 --force
1013 ceph orch host drain gibba004 --force
1014 ceph -s
1015 ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3
1016 ceph orch upgrade status
We saw the upgrade in progress:
[root@gibba001 ~]# ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": ""
}
Every 2.0s: ceph -s gibba001: Mon Sep 19 21:32:14 2022
  cluster:
    id:     f9d4cf6a-edcf-11ec-a96a-3cecef3d8fb8
    health: HEALTH_WARN
            6 failed cephadm daemon(s)
            1 hosts fail cephadm check

  services:
    mon: 4 daemons, quorum gibba001,gibba002,gibba003,gibba005 (age 5m)
    mgr: gibba006.enemnj(active, since 15m), standbys: gibba008.tfggyq
    osd: 950 osds: 925 up (since 2m), 925 in (since 4w)

  data:
    pools:   2 pools, 8193 pgs
    objects: 123.12M objects, 470 GiB
    usage:   3.6 TiB used, 8.7 TiB / 12 TiB avail
    pgs:     8193 active+clean

  io:
    client: 18 KiB/s rd, 784 KiB/s wr, 19 op/s rd, 35 op/s wr
  progress:
    Upgrade to quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3 (0s)
      [............................]
And got the mgr to upgrade:
[lflores@gibba001 ~]$ sudo ceph versions
{
    "mon": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 4
    },
    "mgr": {
        "ceph version 17.2.3-768-g6b3d60fd (6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3) quincy (stable)": 2
    },
    "osd": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 925
    },
    "mds": {},
    "overall": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 929,
        "ceph version 17.2.3-768-g6b3d60fd (6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3) quincy (stable)": 2
    }
}
But later, we hit an UPGRADE_EXCEPTION:
[lflores@gibba001 ~]$ sudo ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:a0d58276ba1e4af4163da27ac218a0c6eacdf182af71cabc56072075c7c47890",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
        "mgr"
    ],
    "progress": "5/1036 daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception",
    "is_paused": true
}
Health detail from that time:
[lflores@gibba001 ~]$ sudo ceph health detail
HEALTH_WARN 4 failed cephadm daemon(s); 1 hosts fail cephadm check
[WRN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon prometheus.gibba001 on gibba001 is in error state
    daemon osd.725 on gibba005 is in error state
    daemon osd.66 on gibba006 is in error state
    daemon osd.768 on gibba014 is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
    host gibba004 (172.21.2.104) failed check: Can't communicate with remote host `172.21.2.104`, possibly because python3 is not installed there. [Errno 113] Connect call failed ('172.21.2.104', 22)
Now, after removing the host, we are having better luck with the upgrade:
ceph orch host rm gibba004 --force --offline
ceph orch upgrade stop
ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3
The upgrade has not yet completed, so I will update once I'm sure that removing the host fixed the issue.
Updated by Vikhyat Umrao over 1 year ago
Laura Flores wrote:
We tried upgrading the Gibba cluster to the quincy-release for 17.2.4 and experienced this issue, but with a drained host:
We noticed some issues with the monitor before upgrading (gibba004), so we drained the host and proceeded with the upgrade using these commands:
Yep, the gibba004 node was down.
[...]
We saw the upgrade in progress:
[...]And got the mgr to upgrade:
[...]But later, we hit an UPGRADE_EXCEPTION:
Yes, the main question is why the upgrade was trying to ping the drained host!
[...]
Health detail from that time:
[...]Now with removing the host, we are having better luck with the upgrade.
[...]The upgrade has not yet completed, so I will update once I'm sure that removing the host fixed the issue.
Updated by Michael Fritch 12 months ago
- Backport changed from quincy to quincy, pacific
Updated by Michael Fritch 12 months ago
- Status changed from Resolved to Pending Backport
Updated by Backport Bot 12 months ago
- Copied to Backport #59649: pacific: cephadm: don't try to write client/os tuning files to offline hosts added
Updated by Laura Flores 5 months ago
- Related to Bug #63756: Can't communicate with remote host, possibly because the host is not reachable or python3 is not installed on the host added