Bug #45627

cephadm: frequently getting `1 hosts fail cephadm check`

Added by Sebastian Wagner about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Urgent
Category: cephadm
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ADK3Y2XHTIJ2YV6MFSQX4XPTQ4WP5ETM/

I can access all rbd devices and CephFS. They work. All OSDs in server-1
are up.

    health: HEALTH_WARN
            1 hosts fail cephadm check
            failed to probe daemons or devices

I even restarted server-1. No luck.

I'm on server-1. cephadm complains that it cannot access server-1. In basic
terms, server-1 cannot access server-1 (192.168.0.1).

server-1: 192.168.0.1
server-2: 192.168.0.3

$ ssh -F =(ceph cephadm get-ssh-config) -i =(ceph config-key get mgr/cephadm/ssh_identity_key) root@server-1
> Success.

I think we have to rethink ssh connections. Looks like execnet can't handle being loaded within a long-running daemon.

This happens (unfortunately) frequently to me. Look for the active mgr
(ceph -s), and go restart the mgr service there (systemctl list-units |grep
mgr then systemctl restart NAMEOFSERVICE). This normally resolves that
error for me.
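
The first step of this workaround, finding the active mgr, can also be scripted. Below is a minimal Python sketch, assuming `ceph mgr dump` prints JSON with an `active_name` field. The helper names `parse_active_mgr` and `find_active_mgr` are made up for illustration, and the systemd unit to restart is left out on purpose because its name depends on the deployment.

```python
import json
import subprocess


def parse_active_mgr(mgr_dump_json: str) -> str:
    """Return the active mgr name from the JSON printed by `ceph mgr dump`."""
    return json.loads(mgr_dump_json)["active_name"]


def find_active_mgr() -> str:
    """Ask the cluster which mgr is active (needs a working `ceph` CLI)."""
    out = subprocess.run(
        ["ceph", "mgr", "dump", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_active_mgr(out)
```

The returned name tells you which host to log in to before running the `systemctl` commands mentioned above.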

Related issues

Related to Orchestrator - Bug #45032: cephadm: Not recovering from `OSError: cannot send (already closed?)` Resolved
Related to Orchestrator - Bug #45621: check-host returns terrible unhelpful error message Duplicate
Related to Orchestrator - Bug #45737: Module 'cephadm' has failed: cannot send (already closed?) Duplicate

History

#1 Updated by Sebastian Wagner about 2 months ago

  • Related to Bug #45032: cephadm: Not recovering from `OSError: cannot send (already closed?)` added

#2 Updated by Sebastian Wagner about 2 months ago

  • Description updated (diff)

#3 Updated by Matthew Oliver about 2 months ago

  • Assignee set to Matthew Oliver

I've updated the duplicate bug with:

I've managed to recreate it. I have two nodes, node1 (10.20.92.201) and node2 (10.20.92.202).

Node2 happens to be the current mgr.

So I do a node check:

node2:~ # ceph cephadm check-host node1                                                                                                                                                                                                      
node1 (None) ok

Let's look at the connections on node1, because the issue is easy to recreate from there:

node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
ESTAB   0        0               10.20.92.201:22           10.20.92.202:55550    users:(("sshd",pid=3125,fd=4))

We can see the connection.

If I run it again, it reuses the same connection, because the mgr stores the connection to each node for reuse:

node2:~ # ceph cephadm check-host node1                                                                                                                                                                                                      
node1 (None) ok

Now what happens if the other end (the non-active mgr node) closes its connection abruptly:

node1:~ # kill 3125
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
<no output>

The connection is gone, obviously. But back in the mgr the stored connection object is still there, and we try to use it:

node2:~ # ceph cephadm check-host node1
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send
    message.to_io(self._io)
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io
    io.write(header + self.data)
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
    self._write(data)
ValueError: write to closed file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1485, in check_host
    error_ok=True, no_fsid=True)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1601, in _run_cephadm
    python = connr.choose_python()
  File "/usr/lib/python3.6/site-packages/remoto/backends/__init__.py", line 158, in wrapper
    self.channel.send("%s(%s)" % (name, arguments))
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 729, in send
    self.gateway._send(Message.CHANNEL_DATA, self.id, dumps_internal(item))
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send
    raise IOError("cannot send (already closed?)")
OSError: cannot send (already closed?)

So I'll look at adding some smarts to the mgr side to catch the exception and recreate the connection.
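
The catch-and-recreate idea can be sketched as follows. This is a minimal illustration, not the cephadm code: `FakeConnection` and `ConnectionCache` are hypothetical stand-ins, and the only detail taken from the traceback above is the `OSError` raised when sending on a closed connection.

```python
class FakeConnection:
    """Stand-in for a remoto connection; not the real remoto API."""

    def __init__(self, host):
        self.host = host
        self.closed = False

    def run(self, command):
        if self.closed:
            # Same error type and message as in the traceback above.
            raise OSError("cannot send (already closed?)")
        return "{} ok".format(self.host)


class ConnectionCache:
    """Caches one connection per host and reconnects once on failure."""

    def __init__(self, connect=FakeConnection):
        self._connect = connect
        self._cons = {}

    def get(self, host):
        if host not in self._cons:
            self._cons[host] = self._connect(host)
        return self._cons[host]

    def reset(self, host):
        self._cons.pop(host, None)

    def run(self, host, command):
        try:
            return self.get(host).run(command)
        except OSError:
            # The cached connection is dead: drop it and retry once.
            self.reset(host)
            return self.get(host).run(command)
```

Retrying only once keeps a host that is really unreachable from looping forever; the second failure propagates to the caller.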

#4 Updated by Matthew Oliver about 2 months ago

We just need to be connection aware.

Solution 1 ===================

Remoto annoyingly doesn't have a method in the connection to check to see if a connection is still established. So I wrote:

diff --git a/remoto/backends/__init__.py b/remoto/backends/__init__.py
index ff20b79..0332274 100644
--- a/remoto/backends/__init__.py
+++ b/remoto/backends/__init__.py
@@ -133,6 +133,11 @@ class BaseConnection(object):
             self.remote_module = LegacyModuleExecute(self.gateway, module, self.logger)
         return self.remote_module

+    def has_connection(self):
+        if self.gateway:
+            return self.gateway.hasreceiver()
+        return False
+

 class LegacyModuleExecute(object):
     """ 

So now it does, once I push that PR.

On our end all we need is:

diff --git a/src/pybind/mgr/cephadm/module.py b/src/pybind/mgr/cephadm/module.py
index 9db98b977b..e02f94a51a 100644
--- a/src/pybind/mgr/cephadm/module.py
+++ b/src/pybind/mgr/cephadm/module.py
@@ -816,10 +816,13 @@ class CephadmOrchestrator(orchestrator.Orchestrator, MgrModule):
         """ 
         Setup a connection for running commands on remote host.
         """ 
-        conn_and_r = self._cons.get(host)
-        if conn_and_r:
-            self.log.debug('Have connection to %s' % host)
-            return conn_and_r
+        conn, r = self._cons.get(host, (None, None))
+        if conn:
+            if conn.has_connection():
+                self.log.debug('Have connection to %s' % host)
+                return conn, r
+            else:
+                self._reset_con(host)
         n = self.ssh_user + '@' + host
         self.log.debug("Opening connection to {} with ssh options '{}'".format(
             n, self._ssh_options))

With these applied, we get the correct behaviour.

node1:~ # ceph cephadm check-host node1
node1 (None) ok

We have the ssh connection from node2 (the mgr) to node1:

node1:~ # ss -ntp  |grep ssh |grep 10.20.92.202
ESTAB   0        0               10.20.92.201:22           10.20.92.202:58592    users:(("sshd",pid=11037,fd=4))                                               

Kill it:

node1:~ # kill 11037

No connection:

node1:~ # ss -ntp  |grep ssh |grep 10.20.92.202

Attempt to check again:

node1:~ # ceph cephadm check-host node1
node1 (None) ok

It worked. Has it recreated the connection?

node1:~ # ss -ntp  |grep ssh |grep 10.20.92.202
ESTAB   0        0               10.20.92.201:22           10.20.92.202:58604    users:(("sshd",pid=11307,fd=4))

Yup!
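
The behaviour shown in this walkthrough (reuse a live connection, replace a dead one) can be modelled in isolation. The sketch below is an assumption-laden illustration: `FakeGateway`, `FakeConnection` and `get_connection` are hypothetical names, and only `has_connection()` and `hasreceiver()` mirror the diffs above.

```python
class FakeGateway:
    """Stand-in for an execnet gateway (hypothetical, for illustration)."""

    def __init__(self):
        self.alive = True

    def hasreceiver(self):
        return self.alive


class FakeConnection:
    """Stand-in for a remoto connection with the proposed method."""

    def __init__(self, host):
        self.host = host
        self.gateway = FakeGateway()

    def has_connection(self):
        # Same shape as the method proposed for remoto's BaseConnection.
        if self.gateway:
            return self.gateway.hasreceiver()
        return False


def get_connection(cons, host, connect=FakeConnection):
    """Return a live connection for host, replacing a stale cached one."""
    conn = cons.get(host)
    if conn is not None:
        if conn.has_connection():
            return conn
        del cons[host]  # stale entry, comparable to _reset_con(host)
    conn = connect(host)
    cons[host] = conn
    return conn
```

Calling `get_connection` twice returns the same object; after the gateway is marked dead, the next call returns a fresh one, which matches the new client port seen in the `ss` output.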

Solution 2 ====================

Solution 1 depends on a patch to remoto. Solution 2 is a workaround so we don't have to wait for it:
we could always send a cheap command like `pwd` through the connection first and reset the connection on failure.
But the remoto approach is cleaner.
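
For comparison, a probe-based check could look roughly like this. It is a sketch under assumptions: `FakeConnection`, `probe` and `get_checked_connection` are hypothetical names, and the fake `run` method only imitates a connection that raises `OSError` once it is closed.

```python
class FakeConnection:
    """Stand-in for a remoto connection; not the real remoto API."""

    def __init__(self, host):
        self.host = host
        self.closed = False

    def run(self, command):
        if self.closed:
            raise OSError("cannot send (already closed?)")
        return "/root"


def probe(conn, command="pwd"):
    """Return True if a cheap command still goes through the connection."""
    try:
        conn.run(command)
        return True
    except OSError:
        return False


def get_checked_connection(cons, host, connect=FakeConnection):
    """Reuse the cached connection only if the probe succeeds."""
    conn = cons.get(host)
    if conn is None or not probe(conn):
        conn = connect(host)
        cons[host] = conn
    return conn
```

The cost is one extra round trip every time a cached connection is reused, which is part of why a local liveness check is the cleaner option.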

#5 Updated by Sebastian Wagner about 2 months ago

I'd go with solution 1.

Plus a monkey patch. Something like

import remoto
if remoto.__version__ < ...:

    from remoto.backends import BaseConnection

    def baseconnection_get_connection(self): ...

    BaseConnection.get_connection = baseconnection_get_connection

Something like https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/cherrypy_backports.py
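
A runnable variant of this idea is sketched below. Two things are assumptions: it patches in `has_connection()` (the method from comment #4), and it uses feature detection with `hasattr` instead of a version comparison, because the version threshold is not given here. `BaseConnection` and `FakeGateway` are local stand-ins so the example runs without remoto installed.

```python
class BaseConnection:
    """Stand-in for remoto.backends.BaseConnection without has_connection()."""

    def __init__(self, gateway=None):
        self.gateway = gateway


class FakeGateway:
    """Stand-in for an execnet gateway (hypothetical, for illustration)."""

    def __init__(self, alive=True):
        self.alive = alive

    def hasreceiver(self):
        return self.alive


def _has_connection(self):
    # Same body as the method proposed for remoto in comment #4.
    if self.gateway:
        return self.gateway.hasreceiver()
    return False


def apply_backport(cls):
    """Add has_connection() only if the class does not already provide it."""
    if not hasattr(cls, "has_connection"):
        cls.has_connection = _has_connection
        return True
    return False
```

Once an installed remoto ships the method itself, `apply_backport` becomes a no-op, so the patch can stay in place across upgrades.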

#6 Updated by Matthew Oliver about 2 months ago

Good idea! Will push up a PR for ceph (and one for remoto) in the morning :)

#7 Updated by Sebastian Wagner about 2 months ago

  • Related to Bug #45621: check-host returns terrible unhelpful error message added

#8 Updated by Sebastian Wagner about 1 month ago

  • Related to Bug #45737: Module 'cephadm' has failed: cannot send (already closed?) added

#9 Updated by Matthew Oliver about 1 month ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 35281

#10 Updated by Sebastian Wagner about 1 month ago

  • Status changed from Fix Under Review to Resolved
  • Target version set to v15.2.4
