Ceph : Issues (https://tracker.ceph.com/, 2023-08-31, Ceph)
sepia - Bug #62650 (Resolved): Various SSH errors are preventing many jobs from completing properly (https://tracker.ceph.com/issues/62650, 2023-08-31, Zack Cerza)
<p>We've been having more and more issues with SSH errors recently:<br /><a class="external" href="https://sentry.ceph.com/organizations/ceph/issues/?end=2023-08-30T23%3A59%3A59&query=paramiko&start=2023-08-22T00%3A00%3A00&utc=true">https://sentry.ceph.com/organizations/ceph/issues/?end=2023-08-30T23%3A59%3A59&query=paramiko&start=2023-08-22T00%3A00%3A00&utc=true</a></p>
<p>I found a fix for the AttributeError: <a class="external" href="https://sentry.ceph.com/share/issue/e9092ab6059e4ea299350022b9b2cb52/">https://sentry.ceph.com/share/issue/e9092ab6059e4ea299350022b9b2cb52/</a> <a class="external" href="https://github.com/ceph/teuthology/pull/1886">https://github.com/ceph/teuthology/pull/1886</a> - but there's clearly more going on at this point.</p>
<p>This issue alone has occurred over 600 times in the last 24h: <a class="external" href="https://sentry.ceph.com/share/issue/ef95cc1bf37f4e89a849c9a1c5e26a6b/">https://sentry.ceph.com/share/issue/ef95cc1bf37f4e89a849c9a1c5e26a6b/</a></p>
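For context, one common way paramiko surfaces this class of failure is when a host's crypto policy rejects SHA-1-signed <code>ssh-rsa</code> offers. A minimal sketch (not the actual fix in the linked PRs; the helper name, hostname, and key path are illustrative) of steering paramiko away from the legacy algorithm:

```python
# Hedged sketch: newer distros' default crypto policies reject SHA-1-signed
# ssh-rsa, which older paramiko sessions still offer. One workaround is to
# stop offering the legacy algorithm. This helper only builds kwargs for
# paramiko.SSHClient().connect(); it is illustrative, not teuthology code.

LEGACY_PUBKEY_ALGOS = ["ssh-rsa"]  # SHA-1 signed; rejected by strict hosts

def connect_kwargs(host, keyfile):
    """Kwargs for paramiko.SSHClient().connect(**connect_kwargs(...))."""
    return {
        "hostname": host,
        "key_filename": keyfile,
        # disabled_algorithms is available in paramiko >= 2.9
        "disabled_algorithms": {"pubkeys": LEGACY_PUBKEY_ALGOS},
    }
```

Generating Ed25519 keys for the test nodes is the other obvious escape hatch, since those are accepted by the stricter policies.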
<p>I noticed that all of the hosts it affected were CentOS 9 Stream, and I've narrowed this particular issue down to an SSH key incompatibility.</p>

teuthology - Feature #56390 (In Progress): Running teuthology locally with loop device storage fo... (https://tracker.ceph.com/issues/56390, 2022-06-24, Zack Cerza)
<p>Our previous work getting the full teuthology stack running locally has been successful, but has limitations. On machines without extra storage available to dedicate to a Ceph cluster, we're unable to run tests that need OSDs. This includes most laptops.</p>
<p>Separately, I've been doing work on a <a href="https://github.com/ceph/ceph/compare/main...zmc:wip-box-rootless-podman?expand=1" class="external">cephadm tool</a> that can stand up a Ceph cluster using <code>/dev/loopN</code> as devices for OSDs. This work also depends on <a href="https://github.com/ceph/ceph/pull/46375" class="external">this ceph-volume PR</a>. These devices have to be set up as 'raw' as opposed to using LVM, because <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1623944#c16" class="external">podman will not support device-mapper</a>.</p>
<p>We can adapt this to work with or inside of teuthology by taking the docker-compose work we've done and combining it with the above to create something new. I've created an outline of how this might go:</p>
<ol>
<li>Decide if this should be implemented as part of teuthology or as an external tool
<ol>
<li>IMO it should be written in Python as opposed to bash; it needs to create containers, keep track of them, and destroy them later</li>
</ol>
</li>
<li>Convert <code>docker-compose.yml</code> to <code>podman run</code> commands (I did a version of this in the branch mentioned above). This is necessary because the only way to unmask <code>/sys/dev/block</code> is via a CLI flag. It was implemented <a href="https://github.com/containers/podman/pull/8408" class="external">in this PR</a>.
<ol>
<li>Decide if loop devs should be created on the host itself or within the testnode containers</li>
</ol>
</li>
<li>Modify the cephadm task (if necessary) to work with our containerized testnodes
<ol>
<li>Optionally, also allow it to use a cephadm container instead of downloading the Python module from a remote git repo</li>
</ol>
</li>
<li>Figure out how to start up new containers after we destroy the ones we've used</li>
</ol>
<p>Once 1 is decided, 2, 3, and 4 could be worked on in parallel. I think it might be smart to start work in a new repo until we decide whether it should (or must) be part of teuthology.</p>
<p>Feedback welcome!</p>

Orchestrator - Bug #56026 (Resolved): cephadm attempts to write a sysctl conf even if none is needed (https://tracker.ceph.com/issues/56026, 2022-06-13, Zack Cerza)
<p>In rootless containers, we can't (as far as I can tell) modify most sysctl settings. Even if the host is configured with the appropriate settings, cephadm will try to do so anyway - causing unnecessary failures.</p>

Dashboard - Feature #37298 (Rejected): mgr/dashboard: Support a more compact data format (Message... (https://tracker.ceph.com/issues/37298, 2018-11-16, Zack Cerza)

Dashboard - Feature #36675 (Closed): mgr/dashboard: Provide API endpoint providing minimal health... (https://tracker.ceph.com/issues/36675, 2018-11-01, Zack Cerza)
<p>Currently the dashboard polls both <code>/api/summary</code> and <code>/api/dashboard/health</code> every 5s. The latter endpoint returns a large amount of data, but the amount needed to actually render the dashboard is quite small by comparison.</p>
<p>I think it would make sense to get rid of <code>/api/dashboard/health</code> and replace it with e.g. <code>/api/health/full</code> and <code>/api/health/minimal</code>.</p>
<p>Another idea would be to merge the minimal health data into <code>/api/summary</code>, but that might warrant more discussion.</p>
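The minimal endpoint could be as simple as projecting the full health blob down to a whitelist of fields; a hedged sketch (the field names are illustrative, not the real <code>/api/dashboard/health</code> schema):

```python
# Hedged sketch of the "minimal" projection: keep only the fields the
# landing page needs to render, and drop the rest of the large payload.
# Field names below are illustrative placeholders.
MINIMAL_FIELDS = ("health", "mon_status", "osd_map")

def minimal_health(full):
    """Project the full health blob down to the landing page's needs."""
    return {k: full[k] for k in MINIMAL_FIELDS if k in full}
```

Whichever endpoint serves this, polling a payload this small every 5s is cheap compared to shipping the full blob each time.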
<p>A caveat here is that <code>/api/dashboard/health</code> currently provides logging data which will be moving out as a result of <a class="issue tracker-2 status-3 priority-3 priority-lowest closed" title="Feature: mgr/dashboard: Move Cluster/Audit logs from front page to dedicated "Logs" page (Resolved)" href="https://tracker.ceph.com/issues/24571">#24571</a> - perhaps the solution to that piece is to simply move that data out to a new <code>/api/logs</code> endpoint.</p>

Dashboard - Bug #36674 (Closed): mgr/dashboard: Enable compression for backend requests (https://tracker.ceph.com/issues/36674, 2018-11-01, Zack Cerza)
<p><a class="external" href="https://github.com/ceph/ceph/pull/24727">https://github.com/ceph/ceph/pull/24727</a></p>

mgr - Bug #22669 (Duplicate): KeyError: 'pg_deep' from prometheus plugin (https://tracker.ceph.com/issues/22669, 2018-01-11, Zack Cerza)
<p>On <code>12.2.2-78-g905b734-1xenial</code>:<br /><pre>
$ curl http://mira021:9283/metrics
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<title>500 Internal Server Error</title>
<style type="text/css">
#powered_by {
margin-top: 20px;
border-top: 2px solid black;
font-style: italic;
}
#traceback {
color: red;
}
</style>
</head>
<body>
<h2>500 Internal Server Error</h2>
<p>The server encountered an unexpected condition which prevented it from fulfilling the request.</p>
<pre id="traceback">Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
response.body = self.handler()
File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/lib/ceph/mgr/prometheus/module.py", line 386, in metrics
metrics = global_instance().collect()
File "/usr/lib/ceph/mgr/prometheus/module.py", line 324, in collect
self.get_pg_status()
File "/usr/lib/ceph/mgr/prometheus/module.py", line 266, in get_pg_status
self.metrics[path].set(value)
KeyError: 'pg_deep'
</pre>
<div id="powered_by">
<span>
Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
</span>
</div>
</body>
</html>
</pre></p>
<p>I'd hoped updating from <code>12.2.1-795-g5c9b93d-1xenial</code> would fix this, but it didn't.</p>

mgr - Bug #20692 (Resolved): mgr: 500 error when attempting to view filesystem data (https://tracker.ceph.com/issues/20692, 2017-07-19, Zack Cerza)
<p><a class="external" href="http://mira049.front.sepia.ceph.com:8000/filesystem/0/">http://mira049.front.sepia.ceph.com:8000/filesystem/0/</a><br /><pre>
500 Internal Server Error
The server encountered an unexpected condition which prevented it from fulfilling the request.
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
response.body = self.handler()
File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/lib/ceph/mgr/dashboard/module.py", line 449, in filesystem
"fs_status": global_instance().fs_status(int(fs_id))
File "/usr/lib/ceph/mgr/dashboard/module.py", line 359, in fs_status
mds_versions[metadata['ceph_version']].append(standby['name'])
KeyError: 'ceph_version'
Powered by CherryPy 3.5.0
</pre></p>
<pre>
root@mira049:~# ceph --version
ceph version 12.1.0-628-g8b08bd5 (8b08bd580d3001027b33a615bbb527fc2a5f26c6) luminous (rc)
</pre>

teuthology - Bug #19259 (In Progress): Reconciling teuthology master and rh branches (https://tracker.ceph.com/issues/19259, 2017-03-10, Zack Cerza)
<p>Here is a dump of the current delta between teuthology master and <a class="external" href="https://github.com/ceph/teuthology/compare/rh">https://github.com/ceph/teuthology/compare/rh</a><br />Current HEAD is <a class="external" href="https://github.com/ceph/teuthology/commit/3f330ef6d9943443a191ff324e7b2246df3995f2">https://github.com/ceph/teuthology/commit/3f330ef6d9943443a191ff324e7b2246df3995f2</a></p>
<p><code>bootstrap</code><br />We can revisit removing mariadb from requires; good opportunity to maybe drop coverage stuff</p>
<p><code>teuthology/orchestra/daemon.py</code><br /><a class="external" href="https://github.com/ceph/teuthology/pull/934">https://github.com/ceph/teuthology/pull/934</a></p>
<p><code>teuthology/run.py</code><br /><a class="external" href="https://github.com/ceph/teuthology/pull/1007">https://github.com/ceph/teuthology/pull/1007</a><br />Other changes noted:</p>
<ul>
<li>kernel</li>
<li>redhat-build</li>
<li>test-mode</li>
</ul>
<p><code>teuthology/task/__init__.py</code><br />Changes to <code>apply_overrides()</code> won't be accepted as-is; preferred naming uses underscores and not hyphens. Potentially we could add code to look for overrides using both naming conventions, but what do we do if we find both?</p>
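The dual-convention lookup could work roughly as sketched below; a hedged outline, with <code>normalize_overrides</code> as an illustrative name. One answer to "what do we do if we find both?" is to fail loudly whenever the two spellings disagree:

```python
# Hedged sketch: accept both hyphenated and underscored override keys,
# canonicalize to underscores, and raise if a config supplies conflicting
# values under both spellings for the same task.
def normalize_overrides(overrides):
    normalized = {}
    for key, value in overrides.items():
        canonical = key.replace("-", "_")
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(
                "conflicting overrides for %r (hyphen and underscore forms)"
                % canonical)
        normalized[canonical] = value
    return normalized
```

Identical values under both spellings are harmless and collapse to one entry; only a genuine disagreement aborts the run.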
<p><code>teuthology/task/ceph_ansible.py</code><br /><a class="external" href="https://github.com/ceph/teuthology/pull/1008">https://github.com/ceph/teuthology/pull/1008</a></p>
<p><code>teuthology/task/internal/__init__.py</code><br />Changes that should actually be in <a class="external" href="https://github.com/ceph/teuthology/pull/1007">https://github.com/ceph/teuthology/pull/1007</a> ?</p>
<p><code>teuthology/task/internal/lock_machines.py</code><br />Hack in lieu of multi-OS locking support. We need to add a proper feature for this.</p>
<p><code>teuthology/task/internal/redhat.py</code><br /><a class="external" href="https://github.com/ceph/teuthology/pull/1007">https://github.com/ceph/teuthology/pull/1007</a></p>
<p><code>teuthology/task/kernel.py</code><br />Will need a separate PR for this.</p>

teuthology - Bug #19257 (Resolved): remote processes started with wait=False appear to never finish (https://tracker.ceph.com/issues/19257, 2017-03-10, Zack Cerza)
<p><code>RemoteProcess.finished</code> is a property used by (among others) <code>RemoteProcess.poll()</code>. I noticed that it wasn't working properly when investigating the feasibility of immediately detecting when a daemon fails to start.</p>
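For reference, a hedged sketch of what a correctly polling <code>finished</code> property might look like; <code>FakeChannel</code> here is a self-contained stand-in for the real paramiko channel, and none of this is teuthology's actual code:

```python
# Hedged sketch: `finished` should re-query the transport on every access
# rather than caching a flag that never flips to True.
class FakeChannel:
    """Stand-in for a paramiko channel, so the sketch is self-contained."""
    def __init__(self):
        self._status = None

    def exit_status_ready(self):
        return self._status is not None

    def recv_exit_status(self):
        return self._status


class RemoteProcessSketch:
    def __init__(self, channel):
        self.channel = channel
        self.returncode = None

    @property
    def finished(self):
        # Poll the channel each time; latch the exit status once it arrives.
        if self.returncode is None and self.channel.exit_status_ready():
            self.returncode = self.channel.recv_exit_status()
        return self.returncode is not None
```

With a property shaped like this, the repro below would print <code>finished True</code> once the remote command exits.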
<p>Test case:<br /><pre>
$ cat test.py
#!/usr/bin/env python
from teuthology.orchestra import remote
import time
REMOTE_NAME = 'smithi019'
r = remote.Remote(REMOTE_NAME)
def run(cmd):
return r.run(
args=cmd,
wait=False,
check_status=False,
)
def p_finished(proc):
print "finished", proc.finished
if __name__ == "__main__":
p = run("echo hi; sleep 2; echo hi again; sleep 3; echo bye")
p_finished(p)
time.sleep(1.1)
p_finished(p)
time.sleep(1.1)
p_finished(p)
time.sleep(3.1)
p_finished(p)
</pre></p>
<pre>
$ ./test.py
2017-03-10 14:05:31,198.198 INFO:teuthology.orchestra.run.smithi019:Running: 'true'
2017-03-10 14:05:31,594.594 INFO:teuthology.orchestra.run.smithi019:Running: 'echo hi; sleep 2; echo hi again; sleep 3; echo bye'
finished False
2017-03-10 14:05:31,892.892 INFO:teuthology.orchestra.run.smithi019.stdout:hi
finished False
2017-03-10 14:05:33,894.894 INFO:teuthology.orchestra.run.smithi019.stdout:hi again
finished False
finished False
</pre>

CI - Bug #17863 (Resolved): kernel builds are missing 'extra' data in shaman (https://tracker.ceph.com/issues/17863, 2016-11-10, Zack Cerza)
<p>See:<br /><a class="external" href="https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=ubuntu%2F14.04%2Fx86_64&ref=jewel">https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=ubuntu%2F14.04%2Fx86_64&ref=jewel</a></p>
<p>And:<br /><a class="external" href="https://shaman.ceph.com/api/search/?status=ready&project=kernel&flavor=default&distros=ubuntu%2F14.04%2Fx86_64&ref=testing">https://shaman.ceph.com/api/search/?status=ready&project=kernel&flavor=default&distros=ubuntu%2F14.04%2Fx86_64&ref=testing</a></p>
<p>The ceph builds have things like:<br /><pre>
extra: {
build_url: "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=trusty,DIST=trusty,MACHINE_SIZE=huge/2251/",
root_build_cause: "SCMTRIGGER",
version: "10.2.3-358-g427f357",
node_name: "172.21.15.125+smithi125",
job_name: "ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=trusty,DIST=trusty,MACHINE_SIZE=huge",
package_manager_version: "10.2.3-358-g427f357-1trusty"
},
</pre></p>
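Consuming that field from a shaman record could look like the hedged sketch below (the helper name is illustrative, and the sample record in the test is abbreviated):

```python
# Hedged sketch: pull extra.package_manager_version from a shaman search
# result, tolerating records (like current kernel builds) with no `extra`.
def package_version(record):
    """Return extra.package_manager_version from a shaman record, or None."""
    try:
        return record["extra"]["package_manager_version"]
    except KeyError:
        # kernel builds are currently missing `extra` entirely -- the bug here
        return None
```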
<p>Specifically, we'll need <code>package_manager_version</code> for teuthology.</p>

CI - Bug #17729 (Resolved): ceph-release rpms have changed names (https://tracker.ceph.com/issues/17729, 2016-10-27, Zack Cerza)
<p>See:<br /><a class="external" href="http://pulpito.ceph.com/zack-2016-10-27_14:59:23-upgrade:jewel-x-master-distro-basic-smithi/498277/">http://pulpito.ceph.com/zack-2016-10-27_14:59:23-upgrade:jewel-x-master-distro-basic-smithi/498277/</a></p>
<p>Long story short:<br /><pre>
$ curl -I https://chacra.ceph.com/r/ceph/jewel/3a9fba20ec743699b69bd0181dd6c54dc01c64b9/centos/7/flavors/default/noarch/ceph-release-1-0.el7.noarch.rpm
HTTP/1.1 404 Not Found
Server: nginx
Date: Thu, 27 Oct 2016 21:36:15 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
$ curl -I http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/sha1/3a9fba20ec743699b69bd0181dd6c54dc01c64b9/noarch/ceph-release-1-0.el7.noarch.rpm
HTTP/1.1 200 OK
Date: Thu, 27 Oct 2016 21:31:22 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Thu, 21 Apr 2016 12:14:22 GMT
ETag: "b28-530fda682c380"
Accept-Ranges: bytes
Content-Length: 2856
Content-Type: application/x-redhat-package-manager
</pre></p>
<p>It appears that chacra has <code>ceph-release-1-1.el7.noarch.rpm</code> for some reason.</p>
<p>That package, for RPM distros at least, needs to be in a predictable location.</p>

CI - Bug #17627 (Resolved): inconsistent package versions (https://tracker.ceph.com/issues/17627, 2016-10-19, Zack Cerza)
<p>Sorry for the poor title, but I'm not sure what's going on here.</p>
<p>I'm testing a teuthology PR that does tag->sha1 mapping before querying shaman, and I hit an unrelated issue. The job I'm looking at is:<br /><a class="external" href="http://qa-proxy.ceph.com/teuthology/zack-2016-10-19_15:22:27-upgrade:jewel-x-master-distro-basic-vps/486318/teuthology.log">http://qa-proxy.ceph.com/teuthology/zack-2016-10-19_15:22:27-upgrade:jewel-x-master-distro-basic-vps/486318/teuthology.log</a></p>
<p>First, teuthology queries shaman using this URL:<br /><a class="external" href="https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=ubuntu%2F14.04%2Fx86_64&ref=jewel">https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=ubuntu%2F14.04%2Fx86_64&ref=jewel</a></p>
<p>The first item in the result is:<br /><pre>
{'archs': ['x86_64'],
'chacra_url': 'https://chacra.ceph.com/repos/ceph/jewel/3a9fba20ec743699b69bd0181dd6c54dc01c64b9/ubuntu/trusty/flavors/default/',
'distro': 'ubuntu',
'distro_codename': 'trusty',
'distro_version': '14.04',
'extra': {'build_url': 'https://jenkins.ceph.com/job/ceph-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=trusty,DIST=trusty,MACHINE_SIZE=huge/188/',
'job_name': 'ceph-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=trusty,DIST=trusty,MACHINE_SIZE=huge',
'node_name': '158.69.75.252+xenial_trusty_pbuilder_huge__2cdfb58f-56a5-4001-9f68-93dc62bfe990',
'package_manager_version': '10.2.0-1trusty',
'root_build_cause': 'MANUALTRIGGER',
'version': '10.2.0'},
'flavor': 'default',
'modified': '2016-10-19 20:27:40.516792',
'project': 'ceph',
'ref': 'jewel',
'sha1': '3a9fba20ec743699b69bd0181dd6c54dc01c64b9',
'status': 'ready',
'url': 'https://chacra.ceph.com/r/ceph/jewel/3a9fba20ec743699b69bd0181dd6c54dc01c64b9/ubuntu/trusty/flavors/default/'}
</pre></p>
<p>Note how it reports <code>version: '10.2.0'</code></p>
<p>However, look at the packages themselves:<br /><a class="external" href="https://chacra.ceph.com/r/ceph/jewel/3a9fba20ec743699b69bd0181dd6c54dc01c64b9/ubuntu/trusty/flavors/default/pool/main/c/ceph/">https://chacra.ceph.com/r/ceph/jewel/3a9fba20ec743699b69bd0181dd6c54dc01c64b9/ubuntu/trusty/flavors/default/pool/main/c/ceph/</a></p>
<pre>
ceph-base_10.2.3-1trusty_amd64.deb 19-Oct-2016 18:50 52318782
ceph-base_10.2.3-1xenial_amd64.deb 19-Oct-2016 20:20 50114332
ceph-base_10.2.3-1xenial_arm64.deb 19-Oct-2016 20:21 44040482
ceph-base_10.2.3-1~bpo80+1_amd64.deb 19-Oct-2016 19:40 51342962
ceph-common-dbg_10.2.3-1trusty_amd64.deb 19-Oct-2016 20:13 227623294
ceph-common-dbg_10.2.3-1xenial_amd64.deb 19-Oct-2016 18:46 211041846
ceph-common-dbg_10.2.3-1xenial_arm64.deb 19-Oct-2016 19:00 208923884
ceph-common-dbg_10.2.3-1~bpo80+1_amd64.deb 19-Oct-2016 19:40 209801072
ceph-common_10.2.3-1trusty_amd64.deb 19-Oct-2016 18:55 13501002
ceph-common_10.2.3-1xenial_amd64.deb 19-Oct-2016 19:50 14157514
ceph-common_10.2.3-1xenial_arm64.deb 19-Oct-2016 20:20 12110014
ceph-common_10.2.3-1~bpo80+1_amd64.deb
...
</pre>
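A hedged sketch of how one might check the shaman-reported version against the versions embedded in those filenames (the regex and helper names are illustrative):

```python
# Hedged sketch: extract the upstream version from Debian package filenames
# and flag any that disagree with what the shaman API reported.
import re

def deb_version(filename):
    """Upstream version from e.g. 'ceph-base_10.2.3-1trusty_amd64.deb'."""
    m = re.match(r"[a-z0-9-]+_([0-9.]+)-", filename)
    return m.group(1) if m else None

def mismatches(reported, filenames):
    """Filenames whose embedded version disagrees with the reported one."""
    return [f for f in filenames if deb_version(f) != reported]
```

Run against the listing above with the reported <code>10.2.0</code>, every package would be flagged, which is exactly the inconsistency this bug describes.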
<p>Edit:<br />For reference: <a class="external" href="http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/3a9fba20ec743699b69bd0181dd6c54dc01c64b9/version">http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/3a9fba20ec743699b69bd0181dd6c54dc01c64b9/version</a> -> <code>10.2.0-1trusty</code></p>

CI - Bug #17619 (Resolved): shaman.ceph.com's SSL cert has expired (https://tracker.ceph.com/issues/17619, 2016-10-19, Zack Cerza)
<p>The cert appears to have expired about five minutes ago. Teuthology jobs using shaman will fail until it's renewed.</p>

CI - Bug #17397 (Resolved): ubuntu builds are missing libcephfs-java (https://tracker.ceph.com/issues/17397, 2016-09-23, Zack Cerza)
<p>An example:<br /><a class="external" href="http://pulpito.ceph.com/zack-2016-09-23_12:32:36-powercycle-jewel-distro-basic-mira/433190">http://pulpito.ceph.com/zack-2016-09-23_12:32:36-powercycle-jewel-distro-basic-mira/433190</a><br />failed because:<br /><pre>
2016-09-23T22:30:20.784 INFO:teuthology.orchestra.run.mira028.stderr:E: Version '10.2.3-1trusty' for 'libcephfs-java' was not found
</pre></p>
<p>whereas this job:<br /><a class="external" href="http://pulpito.ceph.com/zack-2016-09-23_10:29:39-powercycle-jewel-distro-basic-mira/433122/">http://pulpito.ceph.com/zack-2016-09-23_10:29:39-powercycle-jewel-distro-basic-mira/433122/</a><br />needed and found:<br /><a class="external" href="http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/ecc23778eb545d8dd55e2e4735b53cc93f92e65b/pool/main/c/ceph/libcephfs-java_10.2.3-1trusty_all.deb">http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/ecc23778eb545d8dd55e2e4735b53cc93f92e65b/pool/main/c/ceph/libcephfs-java_10.2.3-1trusty_all.deb</a></p>