Bug #57175


cephadm: don't try to write client/os tuning files to offline hosts

Added by Adam King over 1 year ago. Updated 12 months ago.

Status:
Pending Backport
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
backport_processed
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
47666
Crash signature (v1):
Crash signature (v2):

Description

If a host is known to be offline, we shouldn't continue trying to handle OS tuning profiles or client files on that host. Doing so can actually cause an exception that will block further progress in the serve loop:

2022-08-17T21:46:17.371392+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.395290+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.417827+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.443274+0000 mgr.vm-00.eudebg [DBG] Opening connection to root@192.168.122.80 with ssh options '-F /tmp/cephadm-conf-f351v8jc -i /tmp/cephadm-identity-nxo5q57o'
2022-08-17T21:46:20.400567+0000 mgr.vm-00.eudebg [ERR] Can't communicate with remote host `192.168.122.80`, possibly because python3 is not installed there or you are missing NOPASSWD in sudoers. [Errno 113] Connect call failed ('192.168.122.80', 22)
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 103, in redirect_log
    yield
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 82, in _remote_connection
    preferred_auth=['publickey'], options=ssh_options)
  File "/lib/python3.6/site-packages/asyncssh/connection.py", line 6804, in connect
    'Opening SSH connection to')
  File "/lib/python3.6/site-packages/asyncssh/connection.py", line 299, in _connect
    local_addr=local_addr)
  File "/lib64/python3.6/asyncio/base_events.py", line 794, in create_connection
    raise exceptions[0]
  File "/lib64/python3.6/asyncio/base_events.py", line 781, in create_connection
    yield from self.sock_connect(sock, address)
  File "/lib64/python3.6/asyncio/selector_events.py", line 439, in sock_connect
    return (yield from fut)
  File "/lib64/python3.6/asyncio/selector_events.py", line 469, in _sock_connect_cb
    raise OSError(err, 'Connect call failed %s' % (address,))
OSError: [Errno 113] Connect call failed ('192.168.122.80', 22)
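
A minimal sketch of the kind of guard being described, assuming hypothetical names (offline_hosts, write_os_tuning_profiles, write_client_files) rather than the actual cephadm code paths:

def write_tuning_and_client_files(mgr, hosts):
    # Skip hosts the orchestrator already knows are offline before
    # attempting to write client files or OS tuning profiles to them.
    for host in hosts:
        if host in mgr.offline_hosts:  # hypothetical set maintained by periodic host checks
            mgr.log.debug(f'Skipping client/os tuning file writes for offline host {host}')
            continue
        try:
            mgr.write_os_tuning_profiles(host)  # hypothetical helper
            mgr.write_client_files(host)        # hypothetical helper
        except OSError as e:
            # A single unreachable host should not block the rest of the serve loop.
            mgr.log.warning(f'Failed writing files to host {host}: {e}')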

Related issues 3 (1 open, 2 closed)

Related to Orchestrator - Bug #63756: Can't communicate with remote host, possibly because the host is not reachable or python3 is not installed on the host (New)

Copied to Orchestrator - Backport #57377: quincy: cephadm: don't try to write client/os tuning files to offline hosts (Resolved, Adam King)
Copied to Orchestrator - Backport #59649: pacific: cephadm: don't try to write client/os tuning files to offline hosts (Resolved, Michael Fritch)
Actions #1

Updated by Adam King over 1 year ago

  • Backport set to quincy
Actions #2

Updated by Adam King over 1 year ago

  • Status changed from In Progress to Pending Backport
Actions #3

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57377: quincy: cephadm: don't try to write client/os tuning files to offline hosts added
Actions #4

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #5

Updated by Adam King over 1 year ago

  • Pull request ID set to 47666
Actions #6

Updated by Adam King over 1 year ago

  • Status changed from Pending Backport to Resolved
Actions #7

Updated by Laura Flores over 1 year ago

We tried upgrading the Gibba cluster to the quincy release for 17.2.4 and experienced this issue, but with a drained host:

We noticed some issues with the monitor on gibba004 before upgrading, so we drained the host and proceeded with the upgrade using these commands:

HISTORY:
  998  ceph -s
  999  ssh gibba004
 1000  ping gibba004
 1001  ceph -s
 1002  ceph orch drain gibba004 --force
 1003  ceph orch host drain gibba004 --force
 1004  ceph -s
 1005  ceph orch host rm gibba004 --force --offline
 1006  ceph -s
 1007  ssh gibba006
 1008  ssh gibba008
 1009  ceph -s
 1010  top
 1011  ceph -s
 1012  ceph orch daemon rm mon.gibba004 --force
 1013  ceph orch host drain gibba004 --force
 1014  ceph -s
 1015  ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3
 1016  ceph orch upgrade status

We saw the upgrade in progress:

[root@gibba001 ~]# ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "" 
}
Every 2.0s: ceph -s                                                                                                  gibba001: Mon Sep 19 21:32:14 2022

  cluster:
    id:     f9d4cf6a-edcf-11ec-a96a-3cecef3d8fb8
    health: HEALTH_WARN
            6 failed cephadm daemon(s)
            1 hosts fail cephadm check

  services:
    mon: 4 daemons, quorum gibba001,gibba002,gibba003,gibba005 (age 5m)
    mgr: gibba006.enemnj(active, since 15m), standbys: gibba008.tfggyq
    osd: 950 osds: 925 up (since 2m), 925 in (since 4w)

  data:
    pools:   2 pools, 8193 pgs
    objects: 123.12M objects, 470 GiB
    usage:   3.6 TiB used, 8.7 TiB / 12 TiB avail
    pgs:     8193 active+clean

  io:
    client:   18 KiB/s rd, 784 KiB/s wr, 19 op/s rd, 35 op/s wr

  progress:
    Upgrade to quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3 (0s)
      [............................]

And got the mgr to upgrade:

[lflores@gibba001 ~]$ sudo ceph versions
{
    "mon": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 4
    },
    "mgr": {
        "ceph version 17.2.3-768-g6b3d60fd (6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3) quincy (stable)": 2
    },
    "osd": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 925
    },
    "mds": {},
    "overall": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 929,
        "ceph version 17.2.3-768-g6b3d60fd (6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3) quincy (stable)": 2
    }
}

But later, we hit an UPGRADE_EXCEPTION:

[lflores@gibba001 ~]$ sudo ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:a0d58276ba1e4af4163da27ac218a0c6eacdf182af71cabc56072075c7c47890",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
        "mgr" 
    ],
    "progress": "5/1036 daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception",
    "is_paused": true
}

Health detail from that time:

[lflores@gibba001 ~]$ sudo ceph health detail
HEALTH_WARN 4 failed cephadm daemon(s); 1 hosts fail cephadm check
[WRN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon prometheus.gibba001 on gibba001 is in error state
    daemon osd.725 on gibba005 is in error state
    daemon osd.66 on gibba006 is in error state
    daemon osd.768 on gibba014 is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
    host gibba004 (172.21.2.104) failed check: Can't communicate with remote host `172.21.2.104`, possibly because python3 is not installed there. [Errno 113] Connect call failed ('172.21.2.104', 22)

Now that we have removed the host, we are having better luck with the upgrade.

ceph orch host rm gibba004 --force --offline
ceph orch upgrade stop
ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3

The upgrade has not yet completed, so I will update once I'm sure that removing the host fixed the issue.

Actions #8

Updated by Vikhyat Umrao over 1 year ago

Laura Flores wrote:

We tried upgrading the Gibba cluster to the quincy-release for 17.2.4 and experienced this issue, but with a drained host:

We noticed some issues with the monitor before upgrading (gibba004), so we drained the host and proceeded with the upgrade using these commands:

Yep, the gibba004 node was down.

[...]

We saw the upgrade in progress:
[...]

And got the mgr to upgrade:
[...]

But later, we hit an UPGRADE_EXCEPTION:

Yes, the main question is why the upgrade was trying to ping a drained host!

[...]

Health detail from that time:
[...]

Now with removing the host, we are having better luck with the upgrade.
[...]

The upgrade has not yet completed, so I will update once I'm sure that removing the host fixed the issue.

Actions #9

Updated by Michael Fritch 12 months ago

  • Backport changed from quincy to quincy, pacific
Actions #10

Updated by Michael Fritch 12 months ago

  • Status changed from Resolved to Pending Backport
Actions #11

Updated by Michael Fritch 12 months ago

  • Tags deleted (backport_processed)
Actions #12

Updated by Backport Bot 12 months ago

  • Copied to Backport #59649: pacific: cephadm: don't try to write client/os tuning files to offline hosts added
Actions #13

Updated by Backport Bot 12 months ago

  • Tags set to backport_processed
Actions #14

Updated by Laura Flores 5 months ago

  • Related to Bug #63756: Can't communicate with remote host, possibly because the host is not reachable or python3 is not installed on the host added