Ceph : Issues (https://tracker.ceph.com/)
RADOS - Bug #64788 (Fix Under Review): EpollDriver::del_event() crashes when the nic is unplugged
https://tracker.ceph.com/issues/64788 (2024-03-07T11:48:12Z, Kefu Chai <tchaikov@gmail.com>)
<p>librbd uses msgr to talk to its Ceph cluster. If the client's NIC is hot-unplugged, there is a chance that <code>EpollDriver::del_event()</code> crashes because <code>epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &……)</code> returns <code>-ENOENT</code>, and its caller, <code>EventCenter::delete_file_event()</code>, considers that a sign of a bug.</p>
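<p>Not the Ceph code itself, just a minimal Python sketch of the pattern the fix implies: treat ENOENT from EPOLL_CTL_DEL as "already removed" rather than a fatal error. Python's select.epoll stands in for the msgr EpollDriver, and the del_event helper name only mirrors the C++ method.</p>
<pre>
# Illustration only (Linux-only): tolerate ENOENT when deleting an fd
# that the kernel has already dropped from the epoll set.
import errno
import select
import socket

ep = select.epoll()
sock = socket.socket()
ep.register(sock.fileno(), select.EPOLLIN)


def del_event(epoll_obj, fd):
    """Remove fd from the epoll set, treating 'already gone' as success."""
    try:
        epoll_obj.unregister(fd)          # epoll_ctl(..., EPOLL_CTL_DEL, fd, ...)
    except OSError as e:
        if e.errno == errno.ENOENT:       # fd no longer in the set: not a bug
            return
        raise


del_event(ep, sock.fileno())
del_event(ep, sock.fileno())              # second call hits ENOENT and is ignored
</pre>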
Ceph - Backport #64508 (New): quincy: Debian bookworm package needs to explicitly specify cephadm...
https://tracker.ceph.com/issues/64508 (2024-02-20T17:57:40Z, Backport Bot)

Ceph - Bug #64069 (Pending Backport): Debian bookworm package needs to explicitly specify cephadm...
https://tracker.ceph.com/issues/64069 (2024-01-17T18:46:25Z, Chris Palmer)

<p>The bookworm/reef cephadm package needs updating to accommodate the latest change in /usr/share/doc/adduser/NEWS.Debian.gz:</p>
<pre><code>System user home defaults to /nonexistent if --home is not specified.
Packages that call adduser to create system accounts should explicitly
specify a location for /home (see Lintian check
maintainer-script-lacks-home-in-adduser).</code></pre>
<p>i.e. when creating the cephadm user as a system user, the package needs to explicitly specify the expected home directory of /home/cephadm.</p>
<p>A workaround is to manually create the user+directory before installing ceph.</p>
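<p>A hedged sketch of that workaround: pre-create the cephadm system user with an explicit home directory before installing the package. The adduser flags and the /bin/bash shell shown here are assumptions, not taken from the actual maintainer script.</p>
<pre>
# Illustration only (run as root): create the cephadm system user with an
# explicit --home so adduser does not default it to /nonexistent.
import pwd
import subprocess

try:
    pwd.getpwnam("cephadm")               # user already exists: nothing to do
except KeyError:
    subprocess.run(
        ["adduser", "--system", "--home", "/home/cephadm",
         "--shell", "/bin/bash", "cephadm"],
        check=True,
    )
</pre>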
<p>Kefu Chai has created PR55218 to address this.</p>

RADOS - Bug #50012 (Fix Under Review): Ceph-osd refuses to bind on an IP on the local loopback lo...
https://tracker.ceph.com/issues/50012 (2021-03-26T12:10:11Z, Kefu Chai <tchaikov@gmail.com>)

RADOS - Bug #49359 (New): osd: warning: unused variable
https://tracker.ceph.com/issues/49359 (2021-02-18T17:17:22Z, Patrick Donnelly <pdonnell@redhat.com>)
<pre>
/build/ceph-17.0.0-888-g5ce097b4/src/osd/osd_types.h: In member function 'void object_ref_delta_t::mut_ref(const hobject_t&, int)':
/build/ceph-17.0.0-888-g5ce097b4/src/osd/osd_types.h:5577:35: warning: unused variable '_' [-Wunused-variable]
[[maybe_unused]] auto [iter, _] = ref_delta.try_emplace(hoid, 0);
^
</pre>
<p>From: <a class="external" href="https://jenkins.ceph.com/job/ceph-dev-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=bionic,DIST=bionic,MACHINE_SIZE=gigantic/46981//consoleFull">https://jenkins.ceph.com/job/ceph-dev-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=bionic,DIST=bionic,MACHINE_SIZE=gigantic/46981//consoleFull</a></p>

Ceph - Bug #48498 (Fix Under Review): octopus: timeout when running the "ceph" command
https://tracker.ceph.com/issues/48498 (2020-12-08T16:00:28Z, Mathew Clarke)
<p>Running "Ubuntu 20.04.1" with kernel "5.4.77-217"</p>
<p>I've installed octopus version 15.2.5 from the distro armhf repo, as "deb-src <a class="external" href="https://download.ceph.com/debian-octopus/">https://download.ceph.com/debian-octopus/</a> focal main" fails with unmet dependencies (see issue: <a class="external" href="https://tracker.ceph.com/issues/45915">https://tracker.ceph.com/issues/45915</a>)</p>
<p>When I run "ceph" from a bash prompt I get the error below.</p>
<pre>
Traceback (most recent call last):
File "/usr/bin/ceph", line 1278, in <module>
retval = main()
File "/usr/bin/ceph", line 984, in main
cluster_handle = run_in_thread(rados.Rados,
File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1339, in run_in_thread
raise Exception("timed out")
Exception: timed out
</pre>
<p>Even when I run "ceph -h" I get the same error halfway through the help output. I'm new to ceph so any pointers on how I can troubleshoot this would be greatly appreciated.</p>
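<p>One way to narrow this down is to make the same librados connection the CLI makes (the traceback above shows it going through rados.Rados), but with a short explicit timeout so a monitor-connectivity problem fails fast instead of hanging. A minimal sketch, assuming the default /etc/ceph/ceph.conf and a readable client keyring; the timeout keyword and the 5-second value are assumptions about your python-rados version.</p>
<pre>
# Illustration only: connect to the cluster with python3-rados directly,
# with a short timeout, to separate CLI problems from monitor reachability.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect(timeout=5)                 # seconds; raises instead of hanging
print("connected, fsid:", cluster.get_fsid())
cluster.shutdown()
</pre>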
<p>Thanks</p>

RADOS - Bug #45457 (Pending Backport): CEPH Graylog Logging Missing "host" Field
https://tracker.ceph.com/issues/45457 (2020-05-10T02:39:52Z, Daniel Neilson)
<p>Hello,</p>
<p>I have tried sending CEPH logs to Graylog with the following configuration:</p>
<pre>
mon_cluster_log_to_graylog = true
mon_cluster_log_to_graylog host = 10.50.100.70
mon_cluster_log_to_graylog port = 12202
</pre>
<p>The server.log entries on the Graylog server show:</p>
<blockquote>
<p>2020-05-09T20:28:26.063-06:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=ed6876e1-9265-11ea-bc94-1ad2db8b5489, journalOffset=5055367, codec=gelf, payloadSize=317, timestamp=2020-05-10T02:28:26.062Z, remoteAddress=/10.10.10.12:36509}<br />java.lang.IllegalArgumentException: GELF message <ed6876e1-9265-11ea-bc94-1ad2db8b5489> (received from <10.10.10.12:36509>) has empty mandatory "host" field.</p>
</blockquote>
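<p>For comparison, GELF requires the mandatory "host" field in every message. A minimal sketch that sends a well-formed message so the Graylog input itself can be ruled out; it reuses the address and port from the configuration above and assumes that is a GELF UDP input, and the host value is an arbitrary placeholder.</p>
<pre>
# Illustration only: hand-built GELF message with the mandatory "host"
# field set, sent as uncompressed JSON over UDP to the Graylog input.
import json
import socket
import time

msg = {
    "version": "1.1",
    "host": "ceph-mon-test",               # the field Ceph's messages are missing
    "short_message": "test message from ceph mon",
    "timestamp": time.time(),
    "level": 6,
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(json.dumps(msg).encode("utf-8"), ("10.50.100.70", 12202))
</pre>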
<p>There do not appear to be any other options that can be enabled for Graylog logging in CEPH; I am not sure what else to try.</p>

teuthology - Bug #45384 (Fix Under Review): bootstrapping teuthology on Ubuntu 20.04 does not work
https://tracker.ceph.com/issues/45384 (2020-05-05T12:39:31Z, Michael Roesch)
<p>We are trying to get teuthology working on Ubuntu 20.04, following this guide: <a class="external" href="https://docs.ceph.com/teuthology/docs/LAB_SETUP.html">https://docs.ceph.com/teuthology/docs/LAB_SETUP.html</a>. However, we are currently stuck on the Worker part: when we execute "worker_start magna 1", after it git clones the repo it tries to execute the bootstrap script, which fails because the packages the bootstrap script expects are named differently on 20.04.<br />Changing the bootstrap script also does not work, because on every start the repo gets cloned again and overwrites the changes.</p>

RADOS - Bug #40772 (New): mon: pg size change delayed 1 minute because osdmap 35 delay
https://tracker.ceph.com/issues/40772 (2019-07-12T21:21:34Z, David Zafman <dzafman@redhat.com>)
<p>osd-recovery-prio.sh TEST_recovery_pool_priority fails intermittently due to a delay in recovery starting on a pg. The test requires simultaneous recovery of 2 PGs.</p>
<pre><code>ceph osd pool set $pool1 size 2
ceph osd pool set $pool2 size 2</code></pre>
<p>Running the test alone doesn't necessarily reproduce the problem, which I saw when running all the standalone tests (although they are run sequentially):</p>
<ol>
<li>../qa/run-standalone.sh "osd-recovery-prio.sh TEST_recovery_pool_priority"</li>
</ol>
<p>No pg[2.0 messages for almost 1 minute. Map 35 seems to be the one that has the size change.</p>
<pre>
2019-07-12T00:29:49.072-0700 7fc35220f700 10 osd.0 pg_epoch: 34 pg[2.0( v 31'600 (0'0,31'600] local-lis/les=29/30 n=200 ec=15/15 lis/c 29/29 les/c/f 30/30/0 29/29/15) [0] r=0 lpr=29 crt=31'600 lcod 31'599 mlcod 31'599 active+clean] do_peering_event: epoch_sent: 34 epoch_requested: 34 NullEvt
2019-07-12T00:30:44.588-0700 7fc35220f700 10 osd.0 pg_epoch: 34 pg[2.0( v 31'600 (0'0,31'600] local-lis/les=29/30 n=200 ec=15/15 lis/c 29/29 les/c/f 30/30/0 29/29/15) [0] r=0 lpr=29 crt=31'600 lcod 31'599 mlcod 31'599 active+clean] handle_advance_map: 35
</pre>
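<p>The gap above can also be found mechanically. A small sketch: the osd.0.log file name, the "pg[2.0" marker and the 30-second threshold are assumptions based on the excerpt above, not part of the test itself.</p>
<pre>
# Illustration only: report silences longer than 30s between consecutive
# "pg[2.0" lines in an OSD log.
from datetime import datetime

THRESHOLD = 30.0          # seconds
prev = None

with open("osd.0.log") as f:
    for line in f:
        if "pg[2.0" not in line:
            continue
        # e.g. "2019-07-12T00:29:49.072-0700 ..." -> first field is the timestamp
        stamp = line.split()[0]
        ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")
        if prev is not None and (ts - prev).total_seconds() > THRESHOLD:
            print(f"gap of {(ts - prev).total_seconds():.1f}s before: {line.rstrip()}")
        prev = ts
</pre>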
"message": "pg 2.0 is stuck undersized for 61.208213, current state active+recovering+undersized+degraded+remapped, last acting [0]" <br /> "message": "pg 2.0 is stuck undersized for 63.209241, current state active+recovering+undersized+degraded+remapped, last acting [0]" RADOS - Tasks #25186 (In Progress): setup repo for building dependencies like boost, rocksdb, whi...https://tracker.ceph.com/issues/251862018-07-31T03:57:05ZKefu Chaitchaikov@gmail.com
<p>we need to build boost, spdk, dpdk, fio, rocksdb, gperftools and seastar to prepare the build dependencies for each PR of ceph, and for each CI build. the list grows over time. it'd be much more efficient if we could cache the built artifacts in a repo.</p>
<p>we could use ppa or chacra for hosting the repo, and update the ceph-build and cmake scripts accordingly to pick up these pre-built packages.</p>
<p>action items:</p>
<ul>
<li>package ceph-libboost for centos on amd64/aarch64</li>
<li>package c-ares, libfmt, zstd and rocksdb.</li>
</ul>

RADOS - Bug #24515 (New): "[WRN] Health check failed: 1 slow ops, oldest one blocked for 32 sec, ...
https://tracker.ceph.com/issues/24515 (2018-06-13T20:13:58Z, Yuri Weinstein <yweinste@redhat.com>)
<p>This seems to be rhel specific</p>
<p>Run: <a class="external" href="http://pulpito.ceph.com/yuriw-2018-06-12_21:09:43-fs-master-distro-basic-smithi/">http://pulpito.ceph.com/yuriw-2018-06-12_21:09:43-fs-master-distro-basic-smithi/</a><br />Jobs: '2659781', '2659792', '2659770', '2659840'<br />Logs: <a class="external" href="http://qa-proxy.ceph.com/teuthology/yuriw-2018-06-12_21:09:43-fs-master-distro-basic-smithi/2659781/teuthology.log">http://qa-proxy.ceph.com/teuthology/yuriw-2018-06-12_21:09:43-fs-master-distro-basic-smithi/2659781/teuthology.log</a></p>
<pre>
2018-06-13T03:57:47.950 INFO:teuthology.orchestra.run.smithi075.stdout:2018-06-13 03:05:37.853886 mon.a (mon.0) 197 : cluster [WRN] Health check failed: 1 slow ops, oldest one blocked for 32 sec, mon.c has slow ops (SLOW_OPS)
....
2018-06-13T04:09:34.350 INFO:teuthology.run:Summary data:
{description: 'fs/snaps/{begin.yaml clusters/fixed-2-ucephfs.yaml mount/fuse.yaml
objectstore-ec/bluestore-comp.yaml overrides/{debug.yaml frag_enable.yaml whitelist_health.yaml
whitelist_wrongly_marked_down.yaml} supported-random-distros$/{rhel_latest.yaml}
tasks/snaptests.yaml}', duration: 4418.809953927994, failure_reason: '"2018-06-13
03:05:37.853886 mon.a (mon.0) 197 : cluster [WRN] Health check failed: 1 slow
ops, oldest one blocked for 32 sec, mon.c has slow ops (SLOW_OPS)" in cluster
log', flavor: basic, owner: scheduled_yuriw@teuthology, success: false}
</pre>

mgr - Bug #23756 (New): mgr is not updated with latest pgmap in qa/workunits/rados/test_large_oma...
https://tracker.ceph.com/issues/23756 (2018-04-16T07:24:08Z, Kefu Chai <tchaikov@gmail.com>)
<p>see <a class="external" href="http://pulpito.ceph.com/kchai-2018-04-12_15:56:07-rados-wip-kefu-testing-2018-04-12-2211-distro-basic-mira/2390083/">http://pulpito.ceph.com/kchai-2018-04-12_15:56:07-rados-wip-kefu-testing-2018-04-12-2211-distro-basic-mira/2390083/</a></p>
<p>need to test w/o <a class="external" href="https://github.com/ceph/ceph/pull/21410">https://github.com/ceph/ceph/pull/21410</a>. before this PR, we instruct the mgr with the OSD id to perform the scrub; if the mgr does not possess the latest pgmap, it will fail to send the OSD all the pgs to be scrubbed.</p>
<p>that's the case of kchai-2018-04-12_15:56:07-rados-wip-kefu-testing-2018-04-12-2211-distro-basic-mira/2390083.</p>

mgr - Bug #23136 (In Progress): mgr: disable then enable of Restful plugin does not work without ...
https://tracker.ceph.com/issues/23136 (2018-02-26T15:42:19Z, Hans van den Bogert <hansbogert@gmail.com>)
<p>The restful plugin does not work after a disable/enable cycle:</p>
<blockquote>
<pre>
curl -k https://mon03:8003
{
"api_version": 1,
"auth": "Use \"ceph restful create-key <key>\" to create a key pair, pass it as HTTP Basic auth to authenticate",
"doc": "See /doc endpoint",
"info": "Ceph Manager RESTful API server"
}
cephadmin@mon03:~$ ceph mgr module disable restful
cephadmin@mon03:~$ ceph mgr module enable restful
cephadmin@mon03:~$ curl --connect-timeout 20 -k https://mon03:8003
curl: (28) Operation timed out after 0 milliseconds with 0 out of 0 bytes received
cephadmin@mon03:~$
</pre>
</blockquote>

Ceph - Tasks #12797 (In Progress): create the upgrade test suite for gmt and sortbitmap change
https://tracker.ceph.com/issues/12797 (2015-08-26T16:19:39Z, Kefu Chai <tchaikov@gmail.com>)
<p>for more context, see <a class="external" href="http://tracker.ceph.com/issues/9732#note-16">http://tracker.ceph.com/issues/9732#note-16</a></p>

Ceph - Fix #10877 (In Progress): CLI error numbers are not described anywhere
https://tracker.ceph.com/issues/10877 (2015-02-13T16:37:54Z, Alfredo Deza <adeza@redhat.com>)
<p>While trying to run the following command:</p>
<pre>
rados lspools
</pre>
<p>I got this as the last line of the output:</p>
<pre>
couldn't connect to cluster! error -2
</pre>
<p>The man page doesn't list error numbers and searching around I saw mailing list threads that showed `-1` as well.</p>
<p>I understand that there is an error connecting to the cluster, but if the CLI is using error numbers, those should be<br />listed/explained somewhere, if possible in the man page too.</p>
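<p>For what it's worth, these appear to be negated POSIX errno values: -2 would be ENOENT ("No such file or directory", e.g. a conf or keyring file that could not be found) and -1 would be EPERM. A small sketch that translates them:</p>
<pre>
# Illustration only: map the CLI's negative error numbers back to their
# POSIX errno names and descriptions.
import errno
import os

for code in (-1, -2):
    e = -code
    print(f"error {code}: {errno.errorcode.get(e, '?')} ({os.strerror(e)})")

# error -1: EPERM (Operation not permitted)
# error -2: ENOENT (No such file or directory)
</pre>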