Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check` - Orchestrator - Ceph

Bug #45627

closed

cephadm: frequently getting `1 hosts fail cephadm check`

Added by Sebastian Wagner almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: Matthew Oliver
Category: cephadm
Target version: Ceph - v15.2.4
Severity: 3 - minor
Pull request ID: 35281

Description

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ADK3Y2XHTIJ2YV6MFSQX4XPTQ4WP5ETM/

I can access all RBD devices and CephFS, and they work. All OSDs on server-1
are up.

    health: HEALTH_WARN
            1 hosts fail cephadm check
            failed to probe daemons or devices

I even restarted server-1. No luck.

I'm on server-1. cephadm complains that it cannot access server-1. In basic
terms, server-1 cannot access server-1 (192.168.0.1).

server-1: 192.168.0.1
server-2: 192.168.0.3

$ ssh -F =(ceph cephadm get-ssh-config) -i =(ceph config-key get mgr/cephadm/ssh_identity_key) root@server-1
> Success.

I think we have to rethink ssh connections. Looks like execnet can't handle being loaded within a long-running daemon.

This happens (unfortunately) frequently to me. Look for the active mgr
(ceph -s), and go restart the mgr service there (systemctl list-units |grep
mgr then systemctl restart NAMEOFSERVICE). This normally resolves that
error for me.

Related issues 3 (0 open — 3 closed)

Related to Orchestrator - Bug #45032: cephadm: Not recovering from `OSError: cannot send (already closed?)` (Resolved, assignee: Matthew Oliver)

Related to Orchestrator - Bug #45621: check-host returns terrible unhelpful error message (Duplicate)

Related to Orchestrator - Bug #45737: Module 'cephadm' has failed: cannot send (already closed?) (Duplicate)

History
Updated by Sebastian Wagner almost 4 years ago

Related to Bug #45032: cephadm: Not recovering from `OSError: cannot send (already closed?)` added

Updated by Sebastian Wagner almost 4 years ago

Description updated (diff)

Updated by Matthew Oliver almost 4 years ago

Assignee set to Matthew Oliver

I've updated the duplicate bug with:

I've managed to recreate this. I have 2 nodes, node1 (10.20.92.201) and node2 (10.20.92.202).

Node2 happens to be the current mgr.

So I do a node check:

node2:~ # ceph cephadm check-host node1
node1 (None) ok

If we look at the connections, we'll look on node1, because we can easily recreate the issue from there:

node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
ESTAB   0        0               10.20.92.201:22           10.20.92.202:55550    users:(("sshd",pid=3125,fd=4))

We can see the connection.

If I run it again, it'll reuse the same connection, because we're storing this connection to the node to reuse:

node2:~ # ceph cephadm check-host node1
node1 (None) ok

Now what happens if the other end (the non-active mgr node) closes its connection abruptly:

node1:~ # kill 3125
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
<no output>

The connection is gone, obviously. But back in the mgr, the stored connection object is still there, and we try to use it:

node2:~ # ceph cephadm check-host node1
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send
    message.to_io(self._io)
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io
    io.write(header + self.data)
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
    self._write(data)
ValueError: write to closed file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1485, in check_host
    error_ok=True, no_fsid=True)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1601, in _run_cephadm
    python = connr.choose_python()
  File "/usr/lib/python3.6/site-packages/remoto/backends/__init__.py", line 158, in wrapper
    self.channel.send("%s(%s)" % (name, arguments))
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 729, in send
    self.gateway._send(Message.CHANNEL_DATA, self.id, dumps_internal(item))
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send
    raise IOError("cannot send (already closed?)")
OSError: cannot send (already closed?)

So I'll look at adding some smarts to the mgr side to catch the exception and recreate the connection.
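
As a rough illustration of the "catch the exception and recreate the connection" idea, here is a minimal sketch. It does not use remoto; `FakeConnection` and `ConnectionCache` are hypothetical stand-ins invented for this example, and the only thing taken from the traceback above is the `OSError` message.

```python
class FakeConnection:
    """Hypothetical stand-in for a cached remote connection (not remoto)."""

    def __init__(self, host):
        self.host = host
        self.closed = False

    def run(self, command):
        if self.closed:
            # Same error text as in the traceback above.
            raise OSError("cannot send (already closed?)")
        return "%s: ran %r" % (self.host, command)


class ConnectionCache:
    """Hypothetical per-host connection cache that recovers from stale entries."""

    def __init__(self, factory):
        self._factory = factory
        self._cons = {}

    def _get(self, host):
        if host not in self._cons:
            self._cons[host] = self._factory(host)
        return self._cons[host]

    def run(self, host, command):
        try:
            return self._get(host).run(command)
        except OSError:
            # The cached connection is dead: drop it and retry once on a new one.
            self._cons.pop(host, None)
            return self._get(host).run(command)


cache = ConnectionCache(FakeConnection)
first = cache.run("node1", "check-host")
cache._cons["node1"].closed = True  # simulate killing sshd on the remote side
second = cache.run("node1", "check-host")
```

The retry is deliberately limited to one attempt, so a host that is really unreachable still surfaces an error instead of looping.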

Updated by Matthew Oliver almost 4 years ago

We just need to be connection aware.

Solution 1 ===================

Remoto annoyingly doesn't have a method in the connection to check to see if a connection is still established. So I wrote:

diff --git a/remoto/backends/__init__.py b/remoto/backends/__init__.py
index ff20b79..0332274 100644
--- a/remoto/backends/__init__.py
+++ b/remoto/backends/__init__.py
@@ -133,6 +133,11 @@ class BaseConnection(object):
             self.remote_module = LegacyModuleExecute(self.gateway, module, self.logger)
         return self.remote_module

+    def has_connection(self):
+        if self.gateway:
+            return self.gateway.hasreceiver()
+        return False
+

 class LegacyModuleExecute(object):
     """

So now it does, once I push that PR.

On our end all we need is:

diff --git a/src/pybind/mgr/cephadm/module.py b/src/pybind/mgr/cephadm/module.py
index 9db98b977b..e02f94a51a 100644
--- a/src/pybind/mgr/cephadm/module.py
+++ b/src/pybind/mgr/cephadm/module.py
@@ -816,10 +816,13 @@ class CephadmOrchestrator(orchestrator.Orchestrator, MgrModule):
         """ 
         Setup a connection for running commands on remote host.
         """ 
-        conn_and_r = self._cons.get(host)
-        if conn_and_r:
-            self.log.debug('Have connection to %s' % host)
-            return conn_and_r
+        conn, r = self._cons.get(host, (None, None))
+        if conn:
+            if conn.has_connection():
+                self.log.debug('Have connection to %s' % host)
+                return conn, r
+            else:
+                self._reset_con(host)
         n = self.ssh_user + '@' + host
         self.log.debug("Opening connection to {} with ssh options '{}'".format(
             n, self._ssh_options))

With these applied, we get the correct behaviour.

node1:~ # ceph cephadm check-host node1
node1 (None) ok

We have the ssh connection from node2 (the mgr) to node1:

node1:~ # ss -ntp  |grep ssh |grep 10.20.92.202
ESTAB   0        0               10.20.92.201:22           10.20.92.202:58592    users:(("sshd",pid=11037,fd=4))

Kill it:

node1:~ # kill 11037

No connection:

node1:~ # ss -ntp  |grep ssh |grep 10.20.92.202

Attempt to check again:

node1:~ # ceph cephadm check-host node1
node1 (None) ok

It worked. Has it recreated the connection?

node1:~ # ss -ntp  |grep ssh |grep 10.20.92.202
ESTAB   0        0               10.20.92.201:22           10.20.92.202:58604    users:(("sshd",pid=11307,fd=4))

Yup!

Solution 2 ====================

Solution 1 depends on a patch to remoto. Solution 2 is a workaround so we don't have to wait for that patch:
we could always send a trivial command like `pwd` through the stored connection first and reset the connection on failure.
But the remoto approach is cleaner.
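
A minimal sketch of what Solution 2 could look like, assuming a connection object with a `run()` method that raises `OSError` once the remote end is gone. `StubConnection` and `get_connection` are hypothetical names for this example and are not remoto or cephadm APIs.

```python
class StubConnection:
    """Hypothetical stand-in for a stored connection (not remoto)."""

    def __init__(self, host):
        self.host = host
        self.closed = False

    def run(self, command):
        if self.closed:
            raise OSError("cannot send (already closed?)")
        return "ok"


def get_connection(cons, host, factory, probe_command="pwd"):
    """Return a usable connection for host, probing any cached one first."""
    conn = cons.get(host)
    if conn is not None:
        try:
            conn.run(probe_command)  # cheap command used only as a liveness probe
            return conn
        except OSError:
            del cons[host]  # probe failed: reset the stored connection
    conn = factory(host)
    cons[host] = conn
    return conn


cons = {}
c1 = get_connection(cons, "node1", StubConnection)
c2 = get_connection(cons, "node1", StubConnection)  # probe passes, reused
c1.closed = True  # simulate the remote end going away
c3 = get_connection(cons, "node1", StubConnection)  # probe fails, recreated
```

The cost is one extra round trip on every reuse, which is one reason a direct liveness check on the connection is the cleaner option.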

Updated by Sebastian Wagner almost 4 years ago

I'd go with solution 1.

Plus a monkey patch. Something like

import remoto
if remoto.__version__ < ...:

    from remoto.backends import BaseConnection

    def baseconnection_get_connection(self): ...

    BaseConnection.get_connection = baseconnection_get_connection
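
A runnable sketch of a guarded monkey patch, under stated assumptions: `BaseConnection` and `StubGateway` below are simplified stand-ins so the example runs without remoto installed, the patched method is the `has_connection` from the proposed remoto diff, and the guard is a feature check (`hasattr`) rather than the version comparison, since the version threshold is left open above. `version_tuple` is a hypothetical helper, included only because comparing version strings directly is lexicographic and can mis-order releases.

```python
class BaseConnection:
    """Stand-in for remoto.backends.BaseConnection (simplified, hypothetical)."""

    def __init__(self, gateway=None):
        self.gateway = gateway


class StubGateway:
    """Stand-in for an execnet gateway; only models hasreceiver()."""

    def __init__(self, alive=True):
        self.alive = alive

    def hasreceiver(self):
        return self.alive


def baseconnection_has_connection(self):
    # Same logic as the has_connection() method in the proposed remoto patch.
    if self.gateway:
        return self.gateway.hasreceiver()
    return False


# Patch only when the class does not already provide the method, so a
# remoto release that ships has_connection() is left untouched.
if not hasattr(BaseConnection, "has_connection"):
    BaseConnection.has_connection = baseconnection_has_connection


def version_tuple(version):
    """Turn a purely numeric dotted version like '1.10' into (1, 10)."""
    return tuple(int(part) for part in version.split("."))


live = BaseConnection(StubGateway(alive=True)).has_connection()
dead = BaseConnection(StubGateway(alive=False)).has_connection()
missing = BaseConnection().has_connection()
```

If a version guard is preferred, comparing `version_tuple(...)` values avoids the string pitfall where `"1.10" < "1.2"` evaluates to true.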