I've updated the duplicate bug with:
I've managed to recreate the issue. I have 2 nodes: node1 (10.20.92.201) and node2 (10.20.92.202).
Node2 happens to be the current mgr.
So I do a node check:
node2:~ # ceph cephadm check-host node1
node1 (None) ok
Let's look at the connections on node1, because we can easily recreate the issue from there:
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
ESTAB 0 0 10.20.92.201:22 10.20.92.202:55550 users:(("sshd",pid=3125,fd=4))
We can see the connection.
If I run it again, it'll reuse the same connection, because the mgr stores the connection to each node for reuse:
node2:~ # ceph cephadm check-host node1
node1 (None) ok
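The reuse described above can be pictured as a per-host cache of connection objects. This is only a minimal sketch to illustrate the idea, not the cephadm module's actual code: ConnectionCache and fake_connect are hypothetical names, and a plain object() stands in for the real SSH-backed connection.

```python
from typing import Callable, Dict


class ConnectionCache:
    """Keeps one connection object per host and hands it back on later calls."""

    def __init__(self, connect: Callable[[str], object]) -> None:
        # Hypothetical factory: takes a hostname, returns a connection object.
        self._connect = connect
        self._cons: Dict[str, object] = {}

    def get(self, host: str) -> object:
        # Only open a new connection the first time a host is seen.
        if host not in self._cons:
            self._cons[host] = self._connect(host)
        return self._cons[host]


opened = []


def fake_connect(host: str) -> object:
    # Stand-in for opening an SSH session; records each call.
    opened.append(host)
    return object()


cache = ConnectionCache(fake_connect)
first = cache.get("node1")   # opens the connection
second = cache.get("node1")  # returns the stored object, no new connection
```

Note that nothing in such a cache notices when the remote end goes away, which is what the next step exercises.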
Now let's see what happens if the other end (the non-active mgr node) closes its connection abruptly:
node1:~ # kill 3125
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
<no output>
The connection is gone, obviously. But back in the mgr, the stored connection object is still there, and we try to use it:
node2:~ # ceph cephadm check-host node1
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send
    message.to_io(self._io)
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io
    io.write(header + self.data)
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
    self._write(data)
ValueError: write to closed file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1485, in check_host
    error_ok=True, no_fsid=True)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1601, in _run_cephadm
    python = connr.choose_python()
  File "/usr/lib/python3.6/site-packages/remoto/backends/__init__.py", line 158, in wrapper
    self.channel.send("%s(%s)" % (name, arguments))
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 729, in send
    self.gateway._send(Message.CHANNEL_DATA, self.id, dumps_internal(item))
  File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send
    raise IOError("cannot send (already closed?)")
OSError: cannot send (already closed?)
So I'll look at adding some smarts to the mgr side to catch the exception and recreate the connection.
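As a rough idea of what that could look like, here is a minimal sketch, assuming the stale connection surfaces as an OSError as in the traceback above. FakeConnection and Connections are hypothetical stand-ins, not the real remoto or cephadm classes, and the eventual fix may well look different.

```python
class FakeConnection:
    """Hypothetical stand-in for a remote connection; fails once closed."""

    def __init__(self, host):
        self.host = host
        self.closed = False

    def run(self, command):
        if self.closed:
            # Same exception type and message as in the traceback above.
            raise OSError("cannot send (already closed?)")
        return f"{self.host} ok"


class Connections:
    """Per-host connection store that reconnects once on a stale connection."""

    def __init__(self, connect):
        self._connect = connect  # hypothetical factory: host -> connection
        self._cons = {}

    def get(self, host):
        if host not in self._cons:
            self._cons[host] = self._connect(host)
        return self._cons[host]

    def run(self, host, command):
        try:
            return self.get(host).run(command)
        except OSError:
            # The stored connection is stale: drop it, reconnect, retry once.
            # A second failure is allowed to propagate to the caller.
            self._cons.pop(host, None)
            return self.get(host).run(command)


conns = Connections(FakeConnection)
result_before = conns.run("node1", "check-host")
stale = conns.get("node1")
stale.closed = True  # simulate the remote sshd being killed
result_after = conns.run("node1", "check-host")
```

Retrying only once keeps a genuinely unreachable host from looping forever; the second OSError would still be reported to the user.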