Bug #45627
cephadm: frequently getting `1 hosts fail cephadm check`
Description
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ADK3Y2XHTIJ2YV6MFSQX4XPTQ4WP5ETM/
I can access all rbd devices and CephFS; they work. All OSDs on server-1 are up.

    health: HEALTH_WARN
            1 hosts fail cephadm check
    failed to probe daemons or devices

I even restarted server-1. No luck. I'm on server-1, yet cephadm complains it cannot access server-1. In basic terms, server-1 cannot access server-1 (192.168.0.1).

    server-1: 192.168.0.1
    server-2: 192.168.0.3

    $ ssh -F =(ceph cephadm get-ssh-config) -i =(ceph config-key get mgr/cephadm/ssh_identity_key) root@server-1
    Success.
I think we have to rethink ssh connections. Looks like execnet can't handle being loaded within a long-running daemon.
This happens (unfortunately) frequently to me. Look for the active mgr (`ceph -s`), then restart the mgr service on that host (`systemctl list-units | grep mgr`, then `systemctl restart NAMEOFSERVICE`). This normally resolves the error for me.
Related issues
History
#1 Updated by Sebastian Wagner almost 4 years ago
- Related to Bug #45032: cephadm: Not recovering from `OSError: cannot send (already closed?)` added
#2 Updated by Sebastian Wagner almost 4 years ago
- Description updated (diff)
#3 Updated by Matthew Oliver almost 4 years ago
- Assignee set to Matthew Oliver
I've updated the duplicate bug with:
I've managed to recreate, I have 2 nodes, node1(10.20.92.201) and node2(10.20.92.202).
Node2 happens to be the current mgr.
So I do a node check:
node2:~ # ceph cephadm check-host node1
node1 (None) ok
To look at the connections, we'll check from node1, because we can easily recreate the issue from there:
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
ESTAB 0 0 10.20.92.201:22 10.20.92.202:55550 users:(("sshd",pid=3125,fd=4))
We can see the connection.
If I run it again, it reuses the same connection, because we cache the connection to the host for reuse:
node2:~ # ceph cephadm check-host node1
node1 (None) ok
Now, what happens if the other end (the non-active mgr node) closes its connection abruptly?
node1:~ # kill 3125
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
<no output>
The connection is gone, obviously. But back in the mgr, the cached connection object is still there, and we try to use it:
node2:~ # ceph cephadm check-host node1
Error EINVAL: Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send
message.to_io(self._io)
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io
io.write(header + self.data)
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
self._write(data)
ValueError: write to closed file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1485, in check_host
error_ok=True, no_fsid=True)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1601, in _run_cephadm
python = connr.choose_python()
File "/usr/lib/python3.6/site-packages/remoto/backends/__init__.py", line 158, in wrapper
self.channel.send("%s(%s)" % (name, arguments))
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 729, in send
self.gateway._send(Message.CHANNEL_DATA, self.id, dumps_internal(item))
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send
raise IOError("cannot send (already closed?)")
OSError: cannot send (already closed?)
So I'll look at adding some smarts to the mgr side to catch the exception and recreate the connection.
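The failure mode above can be sketched in plain Python: a cached object wrapping an I/O channel that the peer has already closed. Everything here (the `Connection` class, the cache dict) is an illustrative stand-in for the remoto/execnet machinery, not Ceph code.

```python
import io

class Connection:
    """Stand-in for a remoto/execnet connection: wraps an I/O channel."""
    def __init__(self):
        self._io = io.StringIO()

    def send(self, msg):
        self._io.write(msg)  # raises ValueError once the channel is closed

    def close(self):
        self._io.close()

# The mgr caches one connection per host for reuse.
cons = {"node1": Connection()}

# The peer tears down the channel abruptly (the `kill 3125` above) ...
cons["node1"].close()

# ... but the cached object survives, and the next use blows up,
# just like the "write to closed file" in the traceback.
try:
    cons["node1"].send("check-host")
except ValueError as e:
    print("stale connection:", e)
```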
#4 Updated by Matthew Oliver almost 4 years ago
We just need to be connection aware.
Solution 1
==========
Annoyingly, remoto's connection object has no method to check whether a connection is still established. So I wrote:
diff --git a/remoto/backends/__init__.py b/remoto/backends/__init__.py
index ff20b79..0332274 100644
--- a/remoto/backends/__init__.py
+++ b/remoto/backends/__init__.py
@@ -133,6 +133,11 @@ class BaseConnection(object):
self.remote_module = LegacyModuleExecute(self.gateway, module, self.logger)
return self.remote_module
+ def has_connection(self):
+ if self.gateway:
+ return self.gateway.hasreceiver()
+ return False
+
class LegacyModuleExecute(object):
"""
So now it does, once I push that PR.
On our end all we need is:
diff --git a/src/pybind/mgr/cephadm/module.py b/src/pybind/mgr/cephadm/module.py
index 9db98b977b..e02f94a51a 100644
--- a/src/pybind/mgr/cephadm/module.py
+++ b/src/pybind/mgr/cephadm/module.py
@@ -816,10 +816,13 @@ class CephadmOrchestrator(orchestrator.Orchestrator, MgrModule):
"""
Setup a connection for running commands on remote host.
"""
- conn_and_r = self._cons.get(host)
- if conn_and_r:
- self.log.debug('Have connection to %s' % host)
- return conn_and_r
+ conn, r = self._cons.get(host, (None, None))
+ if conn:
+ if conn.has_connection():
+ self.log.debug('Have connection to %s' % host)
+ return conn, r
+ else:
+ self._reset_con(host)
n = self.ssh_user + '@' + host
self.log.debug("Opening connection to {} with ssh options '{}'".format(
n, self._ssh_options))
With these applied, we get the correct behaviour.
node1:~ # ceph cephadm check-host node1
node1 (None) ok
We have the ssh connection from node2 (the mgr) to node1:
node1:~ # ss -ntp |grep ssh |grep 10.20.92.202
ESTAB 0 0 10.20.92.201:22 10.20.92.202:58592 users:(("sshd",pid=11037,fd=4))
Kill it:
node1:~ # kill 11037
No connection:
node1:~ # ss -ntp |grep ssh |grep 10.20.92.202
<no output>
Attempt to check again:
node1:~ # ceph cephadm check-host node1
node1 (None) ok
It worked. Has it recreated the connection?
node1:~ # ss -ntp |grep ssh |grep 10.20.92.202
ESTAB 0 0 10.20.92.201:22 10.20.92.202:58604 users:(("sshd",pid=11307,fd=4))
Yup!
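Stripped of the Ceph specifics, the caching pattern in the diff above can be sketched standalone: reuse a cached connection only while it still reports as live, otherwise drop it and dial a fresh one. `FakeConn`, `ConnCache`, and the dial counter are illustrative stand-ins, not remoto or Ceph code.

```python
class FakeConn:
    """Stand-in for a remoto connection with the new has_connection()."""
    def __init__(self, host):
        self.host = host
        self.alive = True

    def has_connection(self):
        return self.alive

class ConnCache:
    def __init__(self):
        self._cons = {}
        self.dials = 0  # counts how often we open a new connection

    def _reset_con(self, host):
        self._cons.pop(host, None)

    def get_connection(self, host):
        conn = self._cons.get(host)
        if conn:
            if conn.has_connection():
                return conn          # cached and still live: reuse it
            self._reset_con(host)    # stale: forget it, fall through
        self.dials += 1
        conn = FakeConn(host)
        self._cons[host] = conn
        return conn

cache = ConnCache()
c1 = cache.get_connection("node1")
c2 = cache.get_connection("node1")   # reused, no new dial
c1.alive = False                     # simulate the sshd being killed
c3 = cache.get_connection("node1")   # staleness detected, re-dialled
print(c1 is c2, c1 is c3, cache.dials)  # True False 2
```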
Solution 2
==========
Solution 1 requires a patch to remoto. Solution 2 is a workaround so we don't have to wait for that:
We could attempt to always send a command like `pwd` through and on failure reset the connection.
But the remoto approach is cleaner.
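As a sketch, the probe approach of Solution 2 might look like the following. `conn.run` is a hypothetical stand-in for executing a command over the remoto channel; the two fake connection classes exist only to exercise both paths.

```python
def connection_alive(conn):
    """Solution 2 sketch: probe with a cheap no-op command and treat any
    send failure as a dead connection."""
    try:
        conn.run("pwd")
        return True
    except (OSError, ValueError):
        return False

class LiveConn:
    def run(self, cmd):
        return "/root"

class DeadConn:
    def run(self, cmd):
        # Mimics execnet's error when the channel is gone.
        raise OSError("cannot send (already closed?)")

print(connection_alive(LiveConn()), connection_alive(DeadConn()))  # True False
```

The downside, as noted above, is an extra round trip on every cached lookup, which is why the remoto approach is cleaner.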
#5 Updated by Sebastian Wagner almost 4 years ago
I'd go with solution 1.
Plus a monkey patch. Something like:

import remoto
if remoto.__version__ < ...:
    from remoto.backends import BaseConnection
    def baseconnection_has_connection(self):
        ...
    BaseConnection.has_connection = baseconnection_has_connection
Something like https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/cherrypy_backports.py
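Fleshed out, the version-gated patch could look roughly like this. The `BaseConnection` class and version numbers here are local stand-ins for illustration; the real patch would import `BaseConnection` from `remoto.backends` and gate on whichever remoto release actually ships `has_connection`.

```python
class BaseConnection:
    """Stand-in for remoto.backends.BaseConnection."""
    gateway = None  # set once a connection is established

INSTALLED_REMOTO_VERSION = (1, 1, 4)  # pretend parsed remoto.__version__
FIXED_IN = (1, 2, 0)                  # assumed release with has_connection

if INSTALLED_REMOTO_VERSION < FIXED_IN and not hasattr(BaseConnection, "has_connection"):
    def baseconnection_has_connection(self):
        # Mirrors the remoto diff above: the connection is live only if
        # the execnet gateway still has a receiver.
        if self.gateway:
            return self.gateway.hasreceiver()
        return False
    BaseConnection.has_connection = baseconnection_has_connection

print(BaseConnection().has_connection())  # False: no gateway yet
```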
#6 Updated by Matthew Oliver almost 4 years ago
Good idea! Will push up a PR for ceph (and one for remoto) in the morning :)
#7 Updated by Sebastian Wagner almost 4 years ago
- Related to Bug #45621: check-host returns terrible unhelpful error message added
#8 Updated by Sebastian Wagner almost 4 years ago
- Related to Bug #45737: Module 'cephadm' has failed: cannot send (already closed?) added
#9 Updated by Matthew Oliver almost 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 35281
#10 Updated by Sebastian Wagner almost 4 years ago
- Status changed from Fix Under Review to Resolved
- Target version set to v15.2.4