Bug #57175

cephadm: don't try to write client/os tuning files to offline hosts

Added by Adam King 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
backport_processed
Backport:
quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If a host is known to be offline, we shouldn't continue trying to handle OS tuning profiles or client files on that host. This can actually cause an exception that will block further progress in the serve loop:

2022-08-17T21:46:17.371392+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.395290+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.417827+0000 mgr.vm-00.eudebg [DBG] Running command: ls /etc/sysctl.d
2022-08-17T21:46:17.443274+0000 mgr.vm-00.eudebg [DBG] Opening connection to root@192.168.122.80 with ssh options '-F /tmp/cephadm-conf-f351v8jc -i /tmp/cephadm-identity-nxo5q57o'
2022-08-17T21:46:20.400567+0000 mgr.vm-00.eudebg [ERR] Can't communicate with remote host `192.168.122.80`, possibly because python3 is not installed there or you are missing NOPASSWD in sudoers. [Errno 113] Connect call failed ('192.168.122.80', 22)
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 103, in redirect_log
    yield
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 82, in _remote_connection
    preferred_auth=['publickey'], options=ssh_options)
  File "/lib/python3.6/site-packages/asyncssh/connection.py", line 6804, in connect
    'Opening SSH connection to')
  File "/lib/python3.6/site-packages/asyncssh/connection.py", line 299, in _connect
    local_addr=local_addr)
  File "/lib64/python3.6/asyncio/base_events.py", line 794, in create_connection
    raise exceptions[0]
  File "/lib64/python3.6/asyncio/base_events.py", line 781, in create_connection
    yield from self.sock_connect(sock, address)
  File "/lib64/python3.6/asyncio/selector_events.py", line 439, in sock_connect
    return (yield from fut)
  File "/lib64/python3.6/asyncio/selector_events.py", line 469, in _sock_connect_cb
    raise OSError(err, 'Connect call failed %s' % (address,))
OSError: [Errno 113] Connect call failed ('192.168.122.80', 22)
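
For reference, a minimal sketch of the guard this change describes: check whether a host is already known to be offline before writing client files or OS tuning profiles, rather than opening an SSH connection that can only fail. The names below (write_per_host_files, offline_hosts) are illustrative assumptions, not the actual cephadm code.

# Illustrative sketch only -- not the real cephadm implementation.
from typing import Iterable, Set

def write_per_host_files(hosts: Iterable[str], offline_hosts: Set[str]) -> None:
    for host in hosts:
        if host in offline_hosts:
            # Known-offline host: log and move on instead of raising
            # OSError: [Errno 113] and stalling the rest of the serve loop.
            print(f"skipping client/os tuning files for offline host {host}")
            continue
        print(f"writing client files and os tuning profiles to {host}")

# Example: the offline host is skipped and the loop keeps going.
write_per_host_files(["host1", "host2"], {"host2"})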

Related issues

Copied to Orchestrator - Backport #57377: quincy: cephadm: don't try to write client/os tuning files to offline hosts Resolved

History

#1 Updated by Adam King 3 months ago

  • Backport set to quincy

#2 Updated by Adam King 3 months ago

  • Status changed from In Progress to Pending Backport

#3 Updated by Backport Bot 3 months ago

  • Copied to Backport #57377: quincy: cephadm: don't try to write client/os tuning files to offline hosts added

#4 Updated by Backport Bot 3 months ago

  • Tags set to backport_processed

#5 Updated by Adam King 3 months ago

  • Pull request ID set to 47666

#6 Updated by Adam King 3 months ago

  • Status changed from Pending Backport to Resolved

#7 Updated by Laura Flores 2 months ago

We tried upgrading the Gibba cluster to the quincy-release for 17.2.4 and experienced this issue, but with a drained host:

We noticed some issues with the monitor on gibba004 before upgrading, so we drained the host and proceeded with the upgrade using these commands:

HISTORY:
  998  ceph -s
  999  ssh gibba004
 1000  ping gibba004
 1001  ceph -s
 1002  ceph orch drain gibba004 --force
 1003  ceph orch host drain gibba004 --force
 1004  ceph -s
 1005  ceph orch host rm gibba004 --force --offline
 1006  ceph -s
 1007  ssh gibba006
 1008  ssh gibba008
 1009  ceph -s
 1010  top
 1011  ceph -s
 1012  ceph orch daemon rm mon.gibba004 --force
 1013  ceph orch host drain gibba004 --force
 1014  ceph -s
 1015  ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3
 1016  ceph orch upgrade status

We saw the upgrade in progress:

[root@gibba001 ~]# ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "" 
}
Every 2.0s: ceph -s                                                                                                  gibba001: Mon Sep 19 21:32:14 2022

  cluster:
    id:     f9d4cf6a-edcf-11ec-a96a-3cecef3d8fb8
    health: HEALTH_WARN
            6 failed cephadm daemon(s)
            1 hosts fail cephadm check

  services:
    mon: 4 daemons, quorum gibba001,gibba002,gibba003,gibba005 (age 5m)
    mgr: gibba006.enemnj(active, since 15m), standbys: gibba008.tfggyq
    osd: 950 osds: 925 up (since 2m), 925 in (since 4w)

  data:
    pools:   2 pools, 8193 pgs
    objects: 123.12M objects, 470 GiB
    usage:   3.6 TiB used, 8.7 TiB / 12 TiB avail
    pgs:     8193 active+clean

  io:
    client:   18 KiB/s rd, 784 KiB/s wr, 19 op/s rd, 35 op/s wr

  progress:
    Upgrade to quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3 (0s)
      [............................]

And got the mgr to upgrade:

[lflores@gibba001 ~]$ sudo ceph versions
{
    "mon": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 4
    },
    "mgr": {
        "ceph version 17.2.3-768-g6b3d60fd (6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3) quincy (stable)": 2
    },
    "osd": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 925
    },
    "mds": {},
    "overall": {
        "ceph version 17.2.2-1-gf516549e (f516549e3e4815795ff0343ab71b3ebf567e5531) quincy (stable)": 929,
        "ceph version 17.2.3-768-g6b3d60fd (6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3) quincy (stable)": 2
    }
}

But later, we hit an UPGRADE_EXCEPTION:

[lflores@gibba001 ~]$ sudo ceph orch upgrade status
{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:a0d58276ba1e4af4163da27ac218a0c6eacdf182af71cabc56072075c7c47890",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
        "mgr" 
    ],
    "progress": "5/1036 daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception",
    "is_paused": true
}

Health detail from that time:

[lflores@gibba001 ~]$ sudo ceph health detail
HEALTH_WARN 4 failed cephadm daemon(s); 1 hosts fail cephadm check
[WRN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon prometheus.gibba001 on gibba001 is in error state
    daemon osd.725 on gibba005 is in error state
    daemon osd.66 on gibba006 is in error state
    daemon osd.768 on gibba014 is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
    host gibba004 (172.21.2.104) failed check: Can't communicate with remote host `172.21.2.104`, possibly because python3 is not installed there. [Errno 113] Connect call failed ('172.21.2.104', 22)

Now, after removing the host, we are having better luck with the upgrade:

ceph orch host rm gibba004 --force --offline
ceph orch upgrade stop
ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:6b3d60fd93a7222f1ca4ffabd5001bfab3f641f3

The upgrade has not yet completed, so I will update once I'm sure that removing the host fixed the issue.

#8 Updated by Vikhyat Umrao 2 months ago

Laura Flores wrote:

We tried upgrading the Gibba cluster to the quincy-release for 17.2.4 and experienced this issue, but with a drained host:

We noticed some issues with the monitor on gibba004 before upgrading, so we drained the host and proceeded with the upgrade using these commands:

Yep, the gibba004 node was down.

[...]

We saw the upgrade in progress:
[...]

And got the mgr to upgrade:
[...]

But later, we hit an UPGRADE_EXCEPTION:

Yes, the main question is why the upgrade was trying to ping the drained host!

[...]

Health detail from that time:
[...]

Now, after removing the host, we are having better luck with the upgrade:
[...]

The upgrade has not yet completed, so I will update once I'm sure that removing the host fixed the issue.
