Bug #45032
closedcephadm: Not recovering from `OSError: cannot send (already closed?)`
0%
Description
Workaround for this was:
ceph mgr fail ; sleep 20 ; ceph orch host add mon001
Apr 09 17:27:31 mon000 bash[39359]: Warning: Permanently added 'mon001,10.201.32.111' (ECDSA) to the list of known hosts. Apr 09 17:27:32 mon000 bash[39359]: bash: python3: command not found Apr 09 17:27:32 mon000 bash[39359]: debug 2020-04-09T22:27:32.096+0000 7f652806d700 0 [cephadm ERROR root@mon001] Can't communicate with remote host, possibly be Apr 09 17:27:32 mon000 bash[39359]: debug 2020-04-09T22:27:32.097+0000 7f652806d700 -1 log_channel(cephadm) log [ERR] : Can't communicate with remote host, possib Apr 09 17:27:32 mon000 bash[39359]: debug 2020-04-09T22:27:32.099+0000 7f652806d700 0 [cephadm ERROR root] cannot send (already closed?) Apr 09 17:27:32 mon000 bash[39359]: Traceback (most recent call last): Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send Apr 09 17:27:32 mon000 bash[39359]: message.to_io(self._io) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io Apr 09 17:27:32 mon000 bash[39359]: io.write(header + self.data) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 410, in write Apr 09 17:27:32 mon000 bash[39359]: self.outfile.flush() Apr 09 17:27:32 mon000 bash[39359]: BrokenPipeError: [Errno 32] Broken pipe Apr 09 17:27:32 mon000 bash[39359]: During handling of the above exception, another exception occurred: Apr 09 17:27:32 mon000 bash[39359]: Traceback (most recent call last): Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1544, in _run_cephadm Apr 09 17:27:32 mon000 bash[39359]: conn, connr = self._get_connection(addr) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1507, in _get_connection Apr 09 17:27:32 mon000 bash[39359]: ssh_options=self._ssh_options) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 34, in __init__ Apr 09 17:27:32 mon000 bash[39359]: self.gateway = self._make_gateway(hostname) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 46, in _make_gateway Apr 09 17:27:32 mon000 bash[39359]: gateway.reconfigure(py2str_as_py3str=False, py3str_as_py2str=False) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway.py", line 72, in reconfigure Apr 09 17:27:32 mon000 bash[39359]: self._send(Message.RECONFIGURE, data=data) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send Apr 09 17:27:32 mon000 bash[39359]: raise IOError("cannot send (already closed?)") Apr 09 17:27:32 mon000 bash[39359]: OSError: cannot send (already closed?) Apr 09 17:27:32 mon000 bash[39359]: debug 2020-04-09T22:27:32.099+0000 7f652806d700 -1 log_channel(cephadm) log [ERR] : cannot send (already closed?) Apr 09 17:27:32 mon000 bash[39359]: Traceback (most recent call last): Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send Apr 09 17:27:32 mon000 bash[39359]: message.to_io(self._io) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io Apr 09 17:27:32 mon000 bash[39359]: io.write(header + self.data) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 410, in write Apr 09 17:27:32 mon000 bash[39359]: self.outfile.flush() Apr 09 17:27:32 mon000 bash[39359]: BrokenPipeError: [Errno 32] Broken pipe Apr 09 17:27:32 mon000 bash[39359]: During handling of the above exception, another exception occurred: Apr 09 17:27:32 mon000 bash[39359]: Traceback (most recent call last): Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1544, in _run_cephadm Apr 09 17:27:32 mon000 bash[39359]: conn, connr = self._get_connection(addr) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1507, in _get_connection Apr 09 17:27:32 mon000 bash[39359]: ssh_options=self._ssh_options) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 34, in __init__ Apr 09 17:27:32 mon000 bash[39359]: self.gateway = self._make_gateway(hostname) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 46, in _make_gateway Apr 09 17:27:32 mon000 bash[39359]: gateway.reconfigure(py2str_as_py3str=False, py3str_as_py2str=False) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway.py", line 72, in reconfigure Apr 09 17:27:32 mon000 bash[39359]: self._send(Message.RECONFIGURE, data=data) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send Apr 09 17:27:32 mon000 bash[39359]: raise IOError("cannot send (already closed?)") Apr 09 17:27:32 mon000 bash[39359]: OSError: cannot send (already closed?) Apr 09 17:27:32 mon000 bash[39359]: debug 2020-04-09T22:27:32.100+0000 7f652806d700 0 [cephadm ERROR orchestrator._interface] _Promise failed Apr 09 17:27:32 mon000 bash[39359]: Traceback (most recent call last): Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send Apr 09 17:27:32 mon000 bash[39359]: message.to_io(self._io) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io Apr 09 17:27:32 mon000 bash[39359]: io.write(header + self.data) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 410, in write Apr 09 17:27:32 mon000 bash[39359]: self.outfile.flush() Apr 09 17:27:32 mon000 bash[39359]: BrokenPipeError: [Errno 32] Broken pipe Apr 09 17:27:32 mon000 bash[39359]: During handling of the above exception, another exception occurred: Apr 09 17:27:32 mon000 bash[39359]: Traceback (most recent call last): Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 444, in do_work Apr 09 17:27:32 mon000 bash[39359]: res = self._on_complete_(*args, **kwargs) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 512, in <lambda> Apr 09 17:27:32 mon000 bash[39359]: return cls(on_complete=lambda x: f(*x), value=args, name=name, **c_kwargs) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1645, in add_host Apr 09 17:27:32 mon000 bash[39359]: error_ok=True, no_fsid=True) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1544, in _run_cephadm Apr 09 17:27:32 mon000 bash[39359]: conn, connr = self._get_connection(addr) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1507, in _get_connection Apr 09 17:27:32 mon000 bash[39359]: ssh_options=self._ssh_options) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 34, in __init__ Apr 09 17:27:32 mon000 bash[39359]: self.gateway = self._make_gateway(hostname) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 46, in _make_gateway Apr 09 17:27:32 mon000 bash[39359]: gateway.reconfigure(py2str_as_py3str=False, py3str_as_py2str=False) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway.py", line 72, in reconfigure Apr 09 17:27:32 mon000 bash[39359]: self._send(Message.RECONFIGURE, data=data) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send Apr 09 17:27:32 mon000 bash[39359]: raise IOError("cannot send (already closed?)") Apr 09 17:27:32 mon000 bash[39359]: OSError: cannot send (already closed?) Apr 09 17:27:32 mon000 bash[39359]: debug 2020-04-09T22:27:32.100+0000 7f652806d700 -1 log_channel(cephadm) log [ERR] : _Promise failed Apr 09 17:27:32 mon000 bash[39359]: Traceback (most recent call last): Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send Apr 09 17:27:32 mon000 bash[39359]: message.to_io(self._io) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io Apr 09 17:27:32 mon000 bash[39359]: io.write(header + self.data) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 410, in write Apr 09 17:27:32 mon000 bash[39359]: self.outfile.flush() Apr 09 17:27:32 mon000 bash[39359]: BrokenPipeError: [Errno 32] Broken pipe Apr 09 17:27:32 mon000 bash[39359]: During handling of the above exception, another exception occurred: Apr 09 17:27:32 mon000 bash[39359]: Traceback (most recent call last): Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 444, in do_work Apr 09 17:27:32 mon000 bash[39359]: res = self._on_complete_(*args, **kwargs) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 512, in <lambda> Apr 09 17:27:32 mon000 bash[39359]: return cls(on_complete=lambda x: f(*x), value=args, name=name, **c_kwargs) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1645, in add_host Apr 09 17:27:32 mon000 bash[39359]: error_ok=True, no_fsid=True) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1544, in _run_cephadm Apr 09 17:27:32 mon000 bash[39359]: conn, connr = self._get_connection(addr) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1507, in _get_connection Apr 09 17:27:32 mon000 bash[39359]: ssh_options=self._ssh_options) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 34, in __init__ Apr 09 17:27:32 mon000 bash[39359]: self.gateway = self._make_gateway(hostname) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 46, in _make_gateway Apr 09 17:27:32 mon000 bash[39359]: gateway.reconfigure(py2str_as_py3str=False, py3str_as_py2str=False) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway.py", line 72, in reconfigure Apr 09 17:27:32 mon000 bash[39359]: self._send(Message.RECONFIGURE, data=data) Apr 09 17:27:32 mon000 bash[39359]: File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send Apr 09 17:27:32 mon000 bash[39359]: raise IOError("cannot send (already closed?)") Apr 09 17:27:32 mon000 bash[39359]: OSError: cannot send (already closed?) Apr 09 17:27:32 mon000 bash[39359]: debug 2020-04-09T22:27:32.628+0000 7f653890e700 -1 mgr handle_command module 'orchestrator' command handler threw exception: c Apr 09 17:27:32 mon000 bash[39359]: debug 2020-04-09T22:27:32.628+0000 7f653890e700 -1 mgr.server reply reply (22) Invalid argument Traceback (most recent call la Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command Apr 09 17:27:32 mon000 bash[39359]: return self.handle_command(inbuf, cmd) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command Apr 09 17:27:32 mon000 bash[39359]: return dispatch[cmd['prefix']].call(self, cmd, inbuf) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call Apr 09 17:27:32 mon000 bash[39359]: return self.func(mgr, **kwargs) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda> Apr 09 17:27:32 mon000 bash[39359]: wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper Apr 09 17:27:32 mon000 bash[39359]: return func(*args, **kwargs) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/orchestrator/module.py", line 179, in _add_host Apr 09 17:27:32 mon000 bash[39359]: raise_if_exception(completion) Apr 09 17:27:32 mon000 bash[39359]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 628, in raise_if_exception Apr 09 17:27:32 mon000 bash[39359]: raise e Apr 09 17:27:32 mon000 bash[39359]: OSError: cannot send (already closed?)
Updated by Sebastian Wagner almost 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 35022
Updated by Sebastian Wagner almost 4 years ago
- Related to Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check` added
Updated by Sebastian Wagner almost 4 years ago
- Status changed from Fix Under Review to Pending Backport
I think we sill have this problem.
Updated by Tobias Fischer almost 4 years ago
same here after a reboot of the hosts:
root@one1-ceph4.storage.:~# ceph cephadm check-host one1-ceph4
one1-ceph4 (None) ok
root@one1-ceph4.storage.:~# ceph cephadm check-host one1-ceph5
Error EINVAL: Traceback (most recent call last):
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send
message.to_io(self._io)
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io
io.write(header + self.data)
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
self._write(data)
ValueError: write to closed file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1482, in check_host
error_ok=True, no_fsid=True)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1607, in _run_cephadm
python = connr.choose_python()
File "/lib/python3.6/site-packages/remoto/backends/_init__.py", line 158, in wrapper
self.channel.send("%s(%s)" % (name, arguments))
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 729, in send
self.gateway._send(Message.CHANNEL_DATA, self.id, dumps_internal(item))
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send
raise IOError("cannot send (already closed?)")
OSError: cannot send (already closed?)
root@one1-ceph4.storage.:~# ceph cephadm check-host one1-ceph6
Error EINVAL: Traceback (most recent call last):
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send
message.to_io(self._io)
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io
io.write(header + self.data)
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
self._write(data)
ValueError: write to closed file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1482, in check_host
error_ok=True, no_fsid=True)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1607, in _run_cephadm
python = connr.choose_python()
File "/lib/python3.6/site-packages/remoto/backends/_init__.py", line 158, in wrapper
self.channel.send("%s(%s)" % (name, arguments))
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 729, in send
self.gateway._send(Message.CHANNEL_DATA, self.id, dumps_internal(item))
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send
raise IOError("cannot send (already closed?)")
OSError: cannot send (already closed?)
Updated by Tobias Fischer almost 4 years ago
fixed after reboot of active mgr
root@one1-ceph4.storage.:~# ceph cephadm check-host one1-ceph5
one1-ceph5 (None) ok
root@one1-ceph4.storage.:~# ceph cephadm check-host one1-ceph6
one1-ceph6 (None) ok
Updated by Matthew Oliver almost 4 years ago
- Assignee set to Matthew Oliver
I've managed to recreate, I have 2 nodes, node1(10.20.92.201) and node2(10.20.92.202).
Node2 happens to be the current mgr.
So I do a node check:
node2:~ # ceph cephadm check-host node1
node1 (None) ok
If we look at the connections, we'll look on node1, because we can easily recreate the issue from there:
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
ESTAB 0 0 10.20.92.201:22 10.20.92.202:55550 users:(("sshd",pid=3125,fd=4))
We can see the connection.
If I run it again, it'll reuse the same connection, because we're storing this connection to the node to reuse:
node2:~ # ceph cephadm check-host node1
node1 (None) ok
Now what happens if the other end of the (the non active mgr) node closes it's connection abruptly:
node1:~ # kill 3125
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
<no output>
The connection is gone, obviously. But back in the mgr the stored connection object is still there, which we try and use:
node2:~ # ceph cephadm check-host node1
Error EINVAL: Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send
message.to_io(self._io)
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io
io.write(header + self.data)
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
self._write(data)
ValueError: write to closed file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1485, in check_host
error_ok=True, no_fsid=True)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1601, in _run_cephadm
python = connr.choose_python()
File "/usr/lib/python3.6/site-packages/remoto/backends/__init__.py", line 158, in wrapper
self.channel.send("%s(%s)" % (name, arguments))
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 729, in send
self.gateway._send(Message.CHANNEL_DATA, self.id, dumps_internal(item))
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send
raise IOError("cannot send (already closed?)")
OSError: cannot send (already closed?)
So I'll look at adding some smarts to the mgr side to catch the exception and recreate the connection.
Updated by Sebastian Wagner almost 4 years ago
- Pull request ID changed from 35022 to 35281
Updated by Sebastian Wagner almost 4 years ago
- Status changed from Pending Backport to Resolved
- Target version set to v15.2.4