https://tracker.ceph.com/
https://tracker.ceph.com/favicon.ico
2020-05-20T13:48:29Z
Ceph
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166229
2020-05-20T13:48:29Z
Sebastian Wagner
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-3 priority-5 priority-high3 closed" href="/issues/45032">Bug #45032</a>: cephadm: Not recovering from `OSError: cannot send (already closed?)`</i> added</li></ul>
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166232
2020-05-20T13:50:09Z
Sebastian Wagner
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/166232/diff?detail_id=170693">diff</a>)</li></ul>
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166740
2020-05-27T03:01:24Z
Matthew Oliver
moliver@suse.com
<ul><li><strong>Assignee</strong> set to <i>Matthew Oliver</i></li></ul><p>I've updated the duplicate bug with:</p>
<p>I've managed to recreate the issue. I have 2 nodes: node1 (10.20.92.201) and node2 (10.20.92.202).</p>
<p>Node2 happens to be the current mgr.</p>
<p>So I do a node check:</p>
<pre><code class="text syntaxhl"><span class="CodeRay">node2:~ # ceph cephadm check-host node1
node1 (None) ok
</span></code></pre>
<p>We'll look at the connections from node1, because we can easily recreate the issue from there:</p>
<pre><code class="text syntaxhl"><span class="CodeRay">node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
ESTAB 0 0 10.20.92.201:22 10.20.92.202:55550 users:(("sshd",pid=3125,fd=4))
</span></code></pre>
<p>We can see the connection.</p>
<p>If I run it again, it'll reuse the same connection, because we're storing this connection to the node to reuse:</p>
<pre><code class="text syntaxhl"><span class="CodeRay">node2:~ # ceph cephadm check-host node1
node1 (None) ok
</span></code></pre>
<p>Now see what happens if the other end (the non-active mgr node) closes its connection abruptly:</p>
<pre><code class="text syntaxhl"><span class="CodeRay">node1:~ # kill 3125
node1:~ # ss -ntp |grep 10.20.92.202 |grep ssh
<no output>
</span></code></pre>
<p>The connection is gone, obviously. But back in the mgr, the stored connection object is still there, and we try to use it:</p>
<pre><code class="text syntaxhl"><span class="CodeRay">node2:~ # ceph cephadm check-host node1
Error EINVAL: Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 997, in _send
message.to_io(self._io)
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 443, in to_io
io.write(header + self.data)
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
self._write(data)
ValueError: write to closed file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1485, in check_host
error_ok=True, no_fsid=True)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1601, in _run_cephadm
python = connr.choose_python()
File "/usr/lib/python3.6/site-packages/remoto/backends/__init__.py", line 158, in wrapper
self.channel.send("%s(%s)" % (name, arguments))
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 729, in send
self.gateway._send(Message.CHANNEL_DATA, self.id, dumps_internal(item))
File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 1003, in _send
raise IOError("cannot send (already closed?)")
OSError: cannot send (already closed?)
</span></code></pre>
<p>So I'll look at adding some smarts to the mgr side to catch the exception and recreate the connection.</p>
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166741
2020-05-27T05:25:54Z
Matthew Oliver
moliver@suse.com
<ul></ul><p>We just need to be connection aware.</p>
<p>Solution 1
===================</p>
<p>Annoyingly, remoto's connection class has no method to check whether a connection is still established. So I wrote:</p>
<pre><code class="diff syntaxhl"><span class="CodeRay"><span class="line comment">diff --git a/remoto/backends/__init__.py b/remoto/backends/__init__.py</span>
<span class="line comment">index ff20b79..0332274 100644</span>
<span class="line head"><span class="head">--- </span><span class="filename">a/remoto/backends/__init__.py</span></span>
<span class="line head"><span class="head">+++ </span><span class="filename">b/remoto/backends/__init__.py</span></span>
<span class="change"><span class="change">@@</span> -133,6 +133,11 <span class="change">@@</span></span> <span class="keyword">class</span> <span class="class">BaseConnection</span>(<span class="predefined">object</span>):
             <span class="predefined-constant">self</span>.remote_module = LegacyModuleExecute(<span class="predefined-constant">self</span>.gateway, module, <span class="predefined-constant">self</span>.logger)
         <span class="keyword">return</span> <span class="predefined-constant">self</span>.remote_module
<span class="line insert"><span class="insert">+</span>    <span class="keyword">def</span> <span class="function">has_connection</span>(<span class="predefined-constant">self</span>):</span>
<span class="line insert"><span class="insert">+</span>        <span class="keyword">if</span> <span class="predefined-constant">self</span>.gateway:</span>
<span class="line insert"><span class="insert">+</span>            <span class="keyword">return</span> <span class="predefined-constant">self</span>.gateway.hasreceiver()</span>
<span class="line insert"><span class="insert">+</span>        <span class="keyword">return</span> <span class="predefined-constant">False</span></span>
<span class="line insert"><span class="insert">+</span></span>
 <span class="keyword">class</span> <span class="class">LegacyModuleExecute</span>(<span class="predefined">object</span>):
     <span class="docstring"><span class="delimiter">"""</span><span class="content"> </span></span>
</span></code></pre>
<p>So remoto will have such a method once I push that PR.</p>
<p>On our end all we need is:</p>
<pre><code class="diff syntaxhl"><span class="CodeRay"><span class="line comment">diff --git a/src/pybind/mgr/cephadm/module.py b/src/pybind/mgr/cephadm/module.py</span>
<span class="line comment">index 9db98b977b..e02f94a51a 100644</span>
<span class="line head"><span class="head">--- </span><span class="filename">a/src/pybind/mgr/cephadm/module.py</span></span>
<span class="line head"><span class="head">+++ </span><span class="filename">b/src/pybind/mgr/cephadm/module.py</span></span>
<span class="change"><span class="change">@@</span> -816,10 +816,13 <span class="change">@@</span></span> <span class="keyword">class</span> <span class="class">CephadmOrchestrator</span>(orchestrator.Orchestrator, MgrModule):
         <span class="docstring"><span class="delimiter">"""</span><span class="content"> </span></span>
         Setup a connection <span class="keyword">for</span> running commands on remote host.
         <span class="docstring"><span class="delimiter">"""</span><span class="content"> </span></span>
<span class="line delete"><span class="delete">-</span>        conn_and_r = <span class="predefined-constant">self</span>._cons.get(host)</span>
<span class="line delete"><span class="delete">-</span>        <span class="keyword">if</span> conn_and_r:</span>
<span class="line delete"><span class="delete">-</span>            <span class="predefined-constant">self</span>.log.debug(<span class="string"><span class="delimiter">'</span><span class="content">Have connection to %s</span><span class="delimiter">'</span></span> % host)</span>
<span class="line delete"><span class="delete">-</span>            <span class="keyword">return</span> conn_and_r</span>
<span class="line insert"><span class="insert">+</span>        conn, r = <span class="predefined-constant">self</span>._cons.get(host, (<span class="predefined-constant">None</span>, <span class="predefined-constant">None</span>))</span>
<span class="line insert"><span class="insert">+</span>        <span class="keyword">if</span> conn:</span>
<span class="line insert"><span class="insert">+</span>            <span class="keyword">if</span> conn.has_connection():</span>
<span class="line insert"><span class="insert">+</span>                <span class="predefined-constant">self</span>.log.debug(<span class="string"><span class="delimiter">'</span><span class="content">Have connection to %s</span><span class="delimiter">'</span></span> % host)</span>
<span class="line insert"><span class="insert">+</span>                <span class="keyword">return</span> conn, r</span>
<span class="line insert"><span class="insert">+</span>            <span class="keyword">else</span>:</span>
<span class="line insert"><span class="insert">+</span>                <span class="predefined-constant">self</span>._reset_con(host)</span>
         n = <span class="predefined-constant">self</span>.ssh_user + <span class="string"><span class="delimiter">'</span><span class="content">@</span><span class="delimiter">'</span></span> + host
         <span class="predefined-constant">self</span>.log.debug(<span class="string"><span class="delimiter">"</span><span class="content">Opening connection to {} with ssh options '{}'</span><span class="delimiter">"</span></span>.format(
             n, <span class="predefined-constant">self</span>._ssh_options))
</span></code></pre>
<p>With these applied, we get the correct behaviour.</p>
<pre><code class="text syntaxhl"><span class="CodeRay">node1:~ # ceph cephadm check-host node1
node1 (None) ok
</span></code></pre>
<p>We have the ssh connection from node2 (the mgr) to node1:</p>
<pre><code class="text syntaxhl"><span class="CodeRay">node1:~ # ss -ntp |grep ssh |grep 10.20.92.202
ESTAB 0 0 10.20.92.201:22 10.20.92.202:58592 users:(("sshd",pid=11037,fd=4))
</span></code></pre>
<p>Kill it:<br /><pre><code class="text syntaxhl"><span class="CodeRay">node1:~ # kill 11037
</span></code></pre></p>
<p>No connection:<br /><pre><code class="text syntaxhl"><span class="CodeRay">node1:~ # ss -ntp |grep ssh |grep 10.20.92.202
</span></code></pre></p>
<p>Attempt to check again:<br /><pre><code class="text syntaxhl"><span class="CodeRay">node1:~ # ceph cephadm check-host node1
node1 (None) ok
</span></code></pre></p>
<p>It worked. Has it recreated the connection?</p>
<pre><code class="text syntaxhl"><span class="CodeRay">node1:~ # ss -ntp |grep ssh |grep 10.20.92.202
ESTAB 0 0 10.20.92.201:22 10.20.92.202:58604 users:(("sshd",pid=11307,fd=4))
</span></code></pre>
<p>Yup!</p>
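<p>The lookup logic from the cephadm diff above can be sketched in isolation as follows. This is only an illustration under stated assumptions: <code>FakeConnection</code> and <code>ConnectionCache</code> are hypothetical stand-ins, not remoto or cephadm classes; only the <code>has_connection()</code> name is taken from the proposed patch.</p>

```python
class FakeConnection:
    """Hypothetical stand-in for a remoto connection; not the real API."""

    def __init__(self, host):
        self.host = host
        self.alive = True

    def has_connection(self):
        # Mirrors the idea of the proposed BaseConnection.has_connection().
        return self.alive


class ConnectionCache:
    """Caches one connection per host and replaces dead ones on lookup."""

    def __init__(self, factory):
        self._factory = factory  # callable that opens a new connection
        self._cons = {}          # host -> connection

    def get(self, host):
        conn = self._cons.get(host)
        if conn is not None:
            if conn.has_connection():
                return conn
            # Stale entry: drop it so a fresh connection is opened below.
            del self._cons[host]
        conn = self._factory(host)
        self._cons[host] = conn
        return conn


cache = ConnectionCache(FakeConnection)
first = cache.get("node1")
assert cache.get("node1") is first  # a live connection is reused
first.alive = False                 # simulate the remote sshd being killed
second = cache.get("node1")
assert second is not first          # the dead connection was replaced
```

<p>The point of the design is that the liveness check happens at lookup time, so callers never receive a connection that is already known to be closed.</p>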
<p>Solution 2
====================</p>
<p>Solution 1 depends on a patch to remoto. Solution 2 is a workaround so that we don't have to wait for it.<br />We could always attempt to send a cheap command such as `pwd` through the connection and reset the connection on failure.<br />But the remoto approach is cleaner.</p>
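<p>A rough sketch of what the probe-based workaround might look like, assuming a connection object with a <code>run()</code> method that raises <code>OSError</code> once the peer has gone away. <code>FakeRemote</code> and <code>ProbingCache</code> are hypothetical names used for illustration only; they are not part of remoto or cephadm.</p>

```python
class FakeRemote:
    """Hypothetical stand-in for an SSH-backed connection object."""

    def __init__(self, host):
        self.host = host
        self.closed = False

    def run(self, command):
        if self.closed:
            # Same error text that the traceback in this issue ends with.
            raise OSError("cannot send (already closed?)")
        return "ok"


class ProbingCache:
    """Reuses a cached connection only after a cheap probe command succeeds."""

    def __init__(self, open_connection, probe_command="pwd"):
        self._open = open_connection
        self._probe_command = probe_command
        self._cons = {}

    def get(self, host):
        conn = self._cons.get(host)
        if conn is not None:
            try:
                conn.run(self._probe_command)
                return conn
            except OSError:
                # Probe failed: forget the dead connection and reconnect.
                self._cons.pop(host, None)
        conn = self._open(host)
        self._cons[host] = conn
        return conn


cache = ProbingCache(FakeRemote)
conn = cache.get("node1")
conn.closed = True                     # simulate the peer closing the session
assert cache.get("node1") is not conn  # a fresh connection replaces it
```

<p>The trade-off is one extra round trip per lookup, which is part of why the remoto-side check is the cleaner option.</p>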
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166748
2020-05-27T07:33:16Z
Sebastian Wagner
<ul></ul><p>I'd go with solution 1.</p>
<p>Plus a monkey patch. Something like</p>
<pre><code class="python syntaxhl"><span class="CodeRay"><span class="keyword">import</span> <span class="include">remoto</span>
<span class="keyword">if</span> remoto.__version__ < ...:
    <span class="keyword">from</span> <span class="include">remoto.backends</span> <span class="keyword">import</span> <span class="include">BaseConnection</span>
    <span class="keyword">def</span> <span class="function">baseconnection_get_connection</span>(<span class="predefined-constant">self</span>): ...
    BaseConnection.get_connection = baseconnection_get_connection
</span></code></pre>
<p>Something like <a class="external" href="https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/cherrypy_backports.py">https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/cherrypy_backports.py</a></p>
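<p>For illustration, a version-guarded backport of this kind could look roughly like the sketch below. Everything here is an assumption: the <code>BaseConnection</code> class is a local stand-in rather than the real remoto class, both version strings are made up, and the simple parser only handles purely numeric versions.</p>

```python
def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically.

    Assumes purely numeric components; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.split("."))


class BaseConnection:
    """Local stand-in for the library class being patched (hypothetical)."""

    def __init__(self, gateway=None):
        self.gateway = gateway


LIBRARY_VERSION = "1.1.4"      # hypothetical installed version
FIRST_FIXED_VERSION = "1.2.1"  # hypothetical release that ships the method


def _has_connection(self):
    # Same logic as the method proposed for upstream in this issue.
    if self.gateway:
        return self.gateway.hasreceiver()
    return False


needs_backport = parse_version(LIBRARY_VERSION) < parse_version(FIRST_FIXED_VERSION)
if needs_backport and not hasattr(BaseConnection, "has_connection"):
    # Attach the method only when the library is old and lacks it.
    BaseConnection.has_connection = _has_connection
```

<p>Guarding on both the version and <code>hasattr</code> means the patch becomes a no-op as soon as the library provides the method itself.</p>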
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166763
2020-05-27T09:33:03Z
Matthew Oliver
moliver@suse.com
<ul></ul><p>Good idea! Will push up a PR for ceph (and one for remoto) in the morning :)</p>
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166783
2020-05-27T12:23:56Z
Sebastian Wagner
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/45621">Bug #45621</a>: check-host returns terrible unhelpful error message</i> added</li></ul>
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166876
2020-05-28T15:46:03Z
Sebastian Wagner
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/45737">Bug #45737</a>: Module 'cephadm' has failed: cannot send (already closed?)</i> added</li></ul>
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=166918
2020-05-29T00:21:49Z
Matthew Oliver
moliver@suse.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Fix Under Review</i></li><li><strong>Pull request ID</strong> set to <i>35281</i></li></ul>
Orchestrator - Bug #45627: cephadm: frequently getting `1 hosts fail cephadm check`
https://tracker.ceph.com/issues/45627?journal_id=167765
2020-06-08T12:27:10Z
Sebastian Wagner
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Resolved</i></li><li><strong>Target version</strong> set to <i>v15.2.4</i></li></ul>