Ceph : Issues
https://tracker.ceph.com/ (exported 2023-06-06)
Ceph - Bug #61598 (New): gcc-14: FTBFS "error: call to non-'constexpr' function 'virtual unsigned...
https://tracker.ceph.com/issues/61598 (2023-06-06, Tim Serong, tserong@suse.com)
<p>gcc 14 has introduced a change which results in ceph build failures:</p>
<pre>
[ 270s] /home/abuild/rpmbuild/BUILD/ceph-18.0.0-4135-g87cd54281c8/src/osd/osd_types.h: In lambda function:
[ 270s] /home/abuild/rpmbuild/BUILD/ceph-18.0.0-4135-g87cd54281c8/src/common/dout.h:184:73: error: call to non-'constexpr' function 'virtual unsigned int DoutPrefixProvider::get_subsys() const'
[ 270s] 184 | dout_impl(pdpp->get_cct(), ceph::dout::need_dynamic(pdpp->get_subsys()), v) \
[ 270s] | ~~~~~~~~~~~~~~~~^~
[ 270s] /home/abuild/rpmbuild/BUILD/ceph-18.0.0-4135-g87cd54281c8/src/common/dout.h:155:58: note: in definition of macro 'dout_impl'
[ 270s] 155 | return (cctX->_conf->subsys.template should_gather<sub, v>()); \
[ 270s] | ^~~
[ 270s] /home/abuild/rpmbuild/BUILD/ceph-18.0.0-4135-g87cd54281c8/src/osd/osd_types.h:3618:3: note: in expansion of macro 'ldpp_dout'
[ 270s] 3618 | ldpp_dout(dpp, 10) << "build_prior all_probe " << all_probe << dendl;
[ 270s] | ^~~~~~~~~
[ 270s] /home/abuild/rpmbuild/BUILD/ceph-18.0.0-4135-g87cd54281c8/src/common/dout.h:51:20: note: 'virtual unsigned int DoutPrefixProvider::get_subsys() const' declared here
[ 270s] 51 | virtual unsigned get_subsys() const = 0;
[ 270s] | ^~~~~~~~~~
</pre>
<p>The gcc change is described at <a class="external" href="https://gcc.gnu.org/pipermail/gcc-patches/2023-May/617196.html">https://gcc.gnu.org/pipermail/gcc-patches/2023-May/617196.html</a>.</p>
<p>The ceph FTBFS was mentioned in a followup post at <a class="external" href="https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618384.html">https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618384.html</a>, and apparently this failure is now expected, as <code>DoutPrefixProvider::get_subsys()</code> isn't declared <code>constexpr</code> but really should be.</p>
<p>I tried to fix this experimentally by simply declaring <code>get_subsys()</code> as <code>constexpr</code>, e.g.:</p>
<pre>
diff --git a/src/common/dout.h b/src/common/dout.h
index a1375fbb910..6e91750708a 100644
--- a/src/common/dout.h
+++ b/src/common/dout.h
@@ -61,7 +61,7 @@ class NoDoutPrefix : public DoutPrefixProvider {
   std::ostream& gen_prefix(std::ostream& out) const override { return out; }
   CephContext *get_cct() const override { return cct; }
-  unsigned get_subsys() const override { return subsys; }
+  constexpr unsigned get_subsys() const override { return subsys; }
 };
 
 // a prefix provider with static (const char*) prefix
@@ -88,7 +88,7 @@ class DoutPrefixPipe : public DoutPrefixProvider {
     return out;
   }
   CephContext *get_cct() const override { return dpp.get_cct(); }
-  unsigned get_subsys() const override { return dpp.get_subsys(); }
+  constexpr unsigned get_subsys() const override { return dpp.get_subsys(); }
   virtual void add_prefix(std::ostream& out) const = 0;
 };
</pre>
<p>...but that has some problems:</p>
<p>1) Instead of an outright build failure, I get <code>warning: virtual functions cannot be 'constexpr' before C++20 [-Winvalid-constexpr]</code>. I imagine this is undesirable.<br />2) Even if 1 <em>is</em> acceptable, there are plenty of other subclasses of <code>DoutPrefixProvider</code> which would all <em>also</em> need to have their <code>get_subsys()</code> methods declared <code>constexpr</code> for the build to complete.</p>
<p>TBH the whole <code>dout</code> thing is black magic to me, so I could really use some assistance with how best to fix this.</p>

Ceph - Bug #58501 (Resolved): ceph.spec.in: need to replace SUSE usrmerged macro with version check
https://tracker.ceph.com/issues/58501 (2023-01-19, Tim Serong, tserong@suse.com)
<p><a class="external" href="https://github.com/ceph/ceph/commit/e4c4a4ce97fff8a5b4efa747d9cffeabcceedd25">https://github.com/ceph/ceph/commit/e4c4a4ce97fff8a5b4efa747d9cffeabcceedd25</a> introduced the use of the <code>usrmerged</code> macro on SUSE distros to guard against installing the /sbin/mount.ceph symlink. This macro has since been deprecated and should be replaced with a version check instead (<code>%if 0%{?suse_version} < 1550</code>). See <a class="external" href="https://en.opensuse.org/openSUSE:Usr_merge">https://en.opensuse.org/openSUSE:Usr_merge</a> for more details.</p>

Ceph - Bug #57967 (Resolved): ceph-crash service should run as unprivileged user, not root (CVE-2...
https://tracker.ceph.com/issues/57967 (2022-11-03, Tim Serong, tserong@suse.com)
<p>As reported at <a class="external" href="https://www.openwall.com/lists/oss-security/2022/10/25/1">https://www.openwall.com/lists/oss-security/2022/10/25/1</a>, ceph-crash runs as root, which makes it vulnerable to a potential ceph-user-to-root privilege escalation. This is fixable by making the ceph-crash process drop privileges and run as the ceph user, just as the other ceph daemons do.</p>

Ceph - Bug #57893 (Pending Backport): make-dist creates ceph.spec with incorrect Release tag for ...
https://tracker.ceph.com/issues/57893 (2022-10-19, Tim Serong, tserong@suse.com)
<p><code>ceph.spec.in</code> says:</p>
<pre>
Name: ceph
Version: @PROJECT_VERSION@
Release: @RPM_RELEASE@%{?dist}
%if 0%{?fedora} || 0%{?rhel}
Epoch: 2
%endif
</pre>
<p>When the <code>make-dist</code> script generates the final <code>ceph.spec</code> file for RPM builds, it will set PROJECT_VERSION to the version from the latest tag (e.g.: 17.0.0), and set RPM_RELEASE to the number of additional commits plus the last commit hash (e.g.: 14883.gc49b81c7d61). This doesn't work properly when building in SUSE's Open Build Service, because OBS overwrites the Release tag with checkin and build counters (see <a class="external" href="https://en.opensuse.org/openSUSE:Package_versioning_guidelines">https://en.opensuse.org/openSUSE:Package_versioning_guidelines</a>).</p>
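<p>A rough sketch of the two numbering schemes (hypothetical Python, not the actual make-dist shell code) may make the difference clearer — the upstream split, and the downstream variant that folds everything into Version:</p>

```python
# Hypothetical sketch (not the actual make-dist script) of the two RPM
# numbering schemes for a `git describe` result like
# "v17.0.0-14883-gc49b81c7d61".

def split_describe(describe: str):
    """Split 'v<tag>-<ncommits>-g<hash>' into its three parts."""
    if describe.startswith("v"):
        describe = describe[1:]
    tag, ncommits, ghash = describe.rsplit("-", 2)
    return tag, ncommits, ghash

def upstream_scheme(describe: str) -> dict:
    # upstream make-dist: Version from the tag, Release carries the
    # commit count and hash -- which OBS then overwrites
    tag, ncommits, ghash = split_describe(describe)
    return {"Version": tag, "Release": f"{ncommits}.{ghash}"}

def suse_scheme(describe: str) -> dict:
    # downstream patch: fold everything into Version, leaving Release
    # free for OBS's checkin/build counters ("0" here is a placeholder)
    tag, ncommits, ghash = split_describe(describe)
    return {"Version": f"{tag}.{ncommits}+{ghash}", "Release": "0"}

assert upstream_scheme("v17.0.0-14883-gc49b81c7d61") == {
    "Version": "17.0.0", "Release": "14883.gc49b81c7d61"}
assert suse_scheme("v17.0.0-14883-gc49b81c7d61")["Version"] == \
    "17.0.0.14883+gc49b81c7d61"
```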
<p>We've long carried a downstream patch for <code>make-dist</code> to fix this, by putting everything in PROJECT_VERSION, so you end up with something like <code>Version: 17.0.0.14883+gc49b81c7d61</code> (see <a class="external" href="https://github.com/SUSE/ceph/commit/9ee636cdca3">https://github.com/SUSE/ceph/commit/9ee636cdca3</a>), so I figure I should really submit that upstream.</p>

Ceph - Bug #57860 (Pending Backport): disable system_pmdk on s390x for SUSE distros
https://tracker.ceph.com/issues/57860 (2022-10-13, Tim Serong, tserong@suse.com)
<p>Same as <a class="external" href="https://tracker.ceph.com/issues/56491">https://tracker.ceph.com/issues/56491</a> which addressed RHEL and Fedora not shipping libpmem on s390x, but for SUSE.</p>

Orchestrator - Bug #57609 (Resolved): applying osd service spec with size filter fails if there's...
https://tracker.ceph.com/issues/57609 (2022-09-20, Tim Serong, tserong@suse.com)
<p>This issue came up on a system with a 4KB virtual floppy disk drive.</p>
<p><code>ceph-volume inventory</code> gives:</p>
<pre>
Device Path  Size      rotates  available  Model name
/dev/fd0     4.00 KB   True     False
/dev/sda     50.00 GB  True     False      Virtual disk
/dev/sdb     50.00 GB  True     False      Virtual disk
/dev/sdc     50.00 GB  True     False      Virtual disk
/dev/sdd     50.00 GB  True     False      Virtual disk
</pre>
<p>Doing a simple <code>ceph orch apply osd --all-available-devices</code> works just fine, but service specs utilising size specifiers will fail to apply. For example:</p>
<pre>
service_id: at_least_8g
service_type: osd
placement:
  host_pattern: '*'
spec:
  data_devices:
    size: '8G:'
</pre>
<p>Applying the above will give the following error in <code>ceph log last cephadm</code>:</p>
<pre>
ceph.deployment.drive_group.DriveGroupValidationError: Failed to validate OSD spec "at_least_8g.data_devices": Unit 'KB' not supported
</pre>
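<p>The failure is in size-unit parsing. A hedged sketch of unit-aware matching that also accepts KB (hypothetical code, not ceph's actual SizeMatcher; binary units assumed):</p>

```python
# Hypothetical sketch of unit-aware size matching (not ceph's actual
# SizeMatcher): the real class only knew MB/GB/TB, so a device size
# reported in KB raised "Unit 'KB' not supported".
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def to_bytes(size: str) -> int:
    """Parse a ceph-volume style size string like '4.00 KB' into bytes."""
    number, unit = size.split()
    if unit.upper() not in _UNITS:
        raise ValueError(f"Unit '{unit}' not supported")
    return int(float(number) * _UNITS[unit.upper()])

def at_least(size: str, low: str) -> bool:
    """Would a device of `size` pass an 'at least' filter such as '8G:'?"""
    return to_bytes(size) >= to_bytes(low)

assert not at_least("4.00 KB", "8 GB")   # the virtual floppy is excluded
assert at_least("50.00 GB", "8 GB")      # real disks pass the filter
```

With KB understood, the tiny floppy device is simply filtered out instead of aborting spec validation.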
<p>The problem is that the SizeMatcher class only understands MB, GB and TB. When presented with a disk whose size is expressed in KB, it doesn't know what to do with it.</p>

Ceph - Bug #57497 (Pending Backport): openSUSE Leap 15.x needs to explicitly specify gcc-11
https://tracker.ceph.com/issues/57497 (2022-09-12, Tim Serong, tserong@suse.com)
<p>This is a recurrence of <a class="external" href="https://tracker.ceph.com/issues/55237">https://tracker.ceph.com/issues/55237</a>. I wrote <a class="external" href="https://github.com/ceph/ceph/commit/80949babab4">https://github.com/ceph/ceph/commit/80949babab4</a> to use gcc-c++ >= 11 on SUSE distros, which works fine on Tumbleweed (our latest and greatest), but not on openSUSE Leap 15, which has gcc 11, just not packaged in a way that lets that nice neat >= requirement work. So I need to re-instate part of <a class="external" href="https://github.com/ceph/ceph/pull/45845/commits/8ab5d7eea07">https://github.com/ceph/ceph/pull/45845/commits/8ab5d7eea07</a>.</p>

Ceph - Bug #57390 (Pending Backport): denc-mod-osd.so: undefined symbol _ZN4ceph25ErasureCodePlug...
https://tracker.ceph.com/issues/57390 (2022-09-02, Tim Serong, tserong@suse.com)
<p>When running <code>ceph-dencoder</code> on openSUSE Tumbleweed (built with GCC 12 and LTO, in case that's relevant), I get the following failure:</p>
<pre>
# ceph-dencoder
failed to dlopen("/usr/lib64/ceph/denc/denc-mod-osd.so"): /usr/lib64/ceph/denc/denc-mod-osd.so: undefined symbol: _ZN4ceph25ErasureCodePluginRegistry9singletonE
-h for help
</pre>
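<p>For reference, the one-line linkage fix looks something like this in CMake (a sketch; the exact target name and CMakeLists location are assumptions on my part):</p>

```cmake
# denc-mod-osd references ceph::ErasureCodePluginRegistry::singleton,
# which is defined in the erasure_code library, so link it explicitly
target_link_libraries(denc-mod-osd erasure_code)
```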
<p>This is fixable by adding "erasure_code" to denc-mod-osd's target_link_libraries.</p>

Ceph - Bug #56658 (Resolved): build: cephfs-shell fails to build/install with python setuptools >...
https://tracker.ceph.com/issues/56658 (2022-07-21, Tim Serong, tserong@suse.com)
<p>python setuptools v61 changed package discovery so that if it finds what it thinks are multiple top-level packages in a directory, it will fail to build. This was introduced by <a class="external" href="https://github.com/pypa/setuptools/pull/3177">https://github.com/pypa/setuptools/pull/3177</a>, and causes the ceph RPM build to fail with:</p>
<pre>
...
[ 9562s] error: Multiple top-level packages discovered in a flat-layout: ['top', 'CMakeFiles'].
[ 9562s]
[ 9562s] To avoid accidental inclusion of unwanted files or directories,
[ 9562s] setuptools will not proceed with this build.
[ 9562s]
[ 9562s] If you are trying to create a single distribution with multiple packages
[ 9562s] on purpose, you should not rely on automatic discovery.
[ 9562s] Instead, consider the following options:
[ 9562s]
[ 9562s] 1. set up custom discovery (`find` directive with `include` or `exclude`)
[ 9562s] 2. use a `src-layout`
[ 9562s] 3. explicitly set `py_modules` or `packages` with a list of names
[ 9562s]
[ 9562s] To find more information, look for "package discovery" on setuptools docs.
...
[ 9833s] RPM build errors:
[ 9833s] File not found: /home/abuild/rpmbuild/BUILDROOT/ceph-16.2.9.158+gd93952c7eea-2.3.x86_64/usr/lib/python3.10/site-packages/cephfs_shell-*.egg-info
[ 9833s] File not found: /home/abuild/rpmbuild/BUILDROOT/ceph-16.2.9.158+gd93952c7eea-2.3.x86_64/usr/bin/cephfs-shell
</pre>
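<p>For comparison, option 3 from the message above (explicitly naming the module so automatic discovery is skipped) would look roughly like this in a setup.cfg — a hypothetical sketch only, since the fix actually adopted moves cephfs-shell into its own subdirectory instead:</p>

```ini
# hypothetical setup.cfg sketch: an explicit py_modules list disables
# setuptools' automatic flat-layout package discovery
[options]
py_modules =
    cephfs_shell
```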
<p>This has been fixed in Fedora downstream by moving src/tools/cephfs/cephfs-shell to a separate subdirectory (see <a class="external" href="https://src.fedoraproject.org/rpms/ceph/blob/rawhide/f/0021-cephfs-shell.patch">https://src.fedoraproject.org/rpms/ceph/blob/rawhide/f/0021-cephfs-shell.patch</a>). I've confirmed this approach also works for openSUSE.</p>

Ceph - Bug #56466 (Resolved): pacific: boost 1.73.0 is incompatible with python 3.10
https://tracker.ceph.com/issues/56466 (2022-07-05, Tim Serong, tserong@suse.com)
<p>Ceph pacific includes boost 1.73.0, which uses the <code>_Py_fopen()</code> function, which is no longer available in python 3.10. This means it's not possible to build ceph pacific RPMs against python 3.10. Builds will fail with:</p>
<pre>[ 182s] libs/python/src/exec.cpp: In function 'boost::python::api::object boost::python::exec_file(const char*, api::object, api::object)':
[ 182s] libs/python/src/exec.cpp:109:14: error: '_Py_fopen' was not declared in this scope; did you mean '_Py_wfopen'?
[ 182s] 109 | FILE *fs = _Py_fopen(f, "r");
[ 182s] | ^~~~~~~~~
[ 182s] | _Py_wfopen
</pre>
<p>This is not a problem with quincy or newer, as those use boost 1.75.0, which includes a patch that switches to using fopen() for python versions >= 3.1.</p>

Ceph - Bug #55237 (Resolved): rpm: openSUSE build fails - needs explicit gcc version, also can't ...
https://tracker.ceph.com/issues/55237 (2022-04-08, Tim Serong, tserong@suse.com)
<p>Two issues here which are strictly speaking unrelated, but I thought it'd be less annoying to just fix the openSUSE build with one bug.</p>
<p>Issue 1: openSUSE Leap 15.3 and 15.4 use gcc 7 by default, which is not new enough to build ceph. Both distros do provide gcc 11, but we have to explicitly request that version if we want to use it.</p>
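<p>A sketch of what requesting the newer toolchain looks like in ceph.spec.in (a hypothetical conditional; the actual spec change may be shaped differently):</p>

```spec
%if 0%{?suse_version}
# Leap 15.x defaults to gcc 7; pull in the gcc 11 toolchain explicitly
BuildRequires:  gcc11-c++
%else
BuildRequires:  gcc-c++
%endif
```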
<p>Issue 2: Parquet, which in turn requires Arrow, can't currently be built for openSUSE. The problem here is that we don't have those dependencies packaged as RPMs, and when trying to build Arrow out of the submodule in the ceph source tree, one of its dependencies (xsimd) tries to download source from the internet, which doesn't work in the openSUSE Build Service (build workers have no internet access).</p>

Ceph - Bug #55079 (Pending Backport): rpm: remove contents of build directory at end of %install ...
https://tracker.ceph.com/issues/55079 (2022-03-28, Tim Serong, tserong@suse.com)
<p>I've been doing some measurements of disk usage during SUSE RPM builds (of Pacific, but this should roughly apply for newer Cephs too). In our particular build environment, which builds everything in VMs, we see something like this:</p>
<pre>
                     Filesystem  Size  Used  Avail  Use%  Mounted on
df start of build:   /dev/vda     53G   14G    40G   25%  /
df end of build:     /dev/vda     53G   31G    23G   58%  /
df end of install:   /dev/vda     53G   39G    15G   74%  /
df before clamscan:  /dev/vda     53G   41G    13G   78%  /
df after clamscan:   /dev/vda     53G   50G   3.9G   93%  /
</pre>
<p>So after compiling everything, we've consumed about 17GB (that's all the binaries and object files and whatnot that end up in the "build" directory in the source tree). Then, after %install (which installs everything in the build root, ready to be turned into actual RPMs), we've used another 8GB. The next part - the clamscan bit - is one of the rpmlint checks SUSE runs, which takes another 9G when it extracts all the built RPMs (including debuginfo RPMs), in order to scan them.</p>
<p>In summary, our build worker VMs currently need a bit over 50G disk to build Ceph.</p>
<p>If I add <code>rm -rf build</code> to the very end of the %install section, to get rid of the 17GB of built binaries, we go into clamscan with 24G used, rather than 41G used, and when clamscan finishes we're using 32G. This means the peak build disk usage with that change is about 39G, so we reduce our build worker's disk space requirements by about 11G (or 20%).</p>

RADOS - Bug #52553 (Resolved): pybind: rados.RadosStateError raised when closed watch object goes...
https://tracker.ceph.com/issues/52553 (2021-09-09, Tim Serong, tserong@suse.com)
<p>This one is easiest to demonstrate by example. Here's some code:</p>
<pre>
#!/usr/bin/env python3

import rados

def notify(notify_id, notifier_id, watch_id, data):
    pass

if __name__ == "__main__":
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("aquarium")
    watch = ioctx.watch("kvstore", notify)
    watch.close()
    cluster.shutdown()
</pre>
<p>If I run that, I see the following error output:</p>
<pre>Traceback (most recent call last):
  File "rados.pyx", line 477, in rados.Rados.require_state
rados.RadosStateError: RADOS rados state (You cannot perform that operation on a Rados object in state shutdown.)
Exception ignored in: 'rados.Watch.__dealloc__'
Traceback (most recent call last):
  File "rados.pyx", line 477, in rados.Rados.require_state
rados.RadosStateError: RADOS rados state (You cannot perform that operation on a Rados object in state shutdown.)
</pre>
<p>What's happening here is that even though I called <code>watch.close()</code>, once the watch later goes out of scope, its <code>__dealloc__()</code> method tries to close the watch <em>again</em>, after first calling <code>self.ioctx.rados.require_state("connected")</code>, which raises that exception.</p>
<p>The fix is easy:</p>
<pre>
diff --git a/src/pybind/rados/rados.pyx b/src/pybind/rados/rados.pyx
index 4a5db349516..8772942e7ca 100644
--- a/src/pybind/rados/rados.pyx
+++ b/src/pybind/rados/rados.pyx
@@ -2025,6 +2025,8 @@ cdef class Watch(object):
         return False
 
     def __dealloc__(self):
+        if self.id == 0:
+            return
         self.ioctx.rados.require_state("connected")
         self.close()
</pre>
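<p>The diff above makes <code>__dealloc__</code> a no-op once the watch id is zero. A plain-Python sketch of the same idempotent-teardown pattern (hypothetical classes, not the actual rados bindings; it assumes the id is zeroed on close, which is what the diff relies on):</p>

```python
# Plain-Python sketch of the idempotent-teardown pattern the fix adds:
# once the watch is closed its id is zeroed, so a second teardown
# attempt (explicit or from the finalizer) becomes a no-op instead of
# touching the already-shut-down cluster handle.
class Watch:
    def __init__(self):
        self.id = 1           # nonzero id == live watch
        self.close_calls = 0  # instrumentation for this sketch only

    def close(self):
        if self.id == 0:
            return            # already closed
        self.close_calls += 1
        self.id = 0           # mark closed: __del__ becomes a no-op

    def __del__(self):
        if self.id == 0:      # the guard added by the patch
            return
        self.close()

w = Watch()
w.close()
w.close()                     # explicit double close is harmless
assert w.close_calls == 1
del w                         # finalizer runs without error
```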
<p>The one thing I can't work out is how to write a test for this case: because the exception is raised in <code>__dealloc__</code>, it gets printed to stderr but is otherwise ignored, so I can't seem to catch it anywhere in src/test/pybind/test_rados.py.</p>

Orchestrator - Feature #45996 (New): adopted prometheus instance uses port 9095, regardless of or...
https://tracker.ceph.com/issues/45996 (2020-06-15, Tim Serong, tserong@suse.com)
<p>When adopting prometheus (<code>cephadm adopt --style legacy --name prometheus.HOSTNAME</code>), the new prometheus daemon starts listening on port 9095, regardless of what port the original daemon was running on. This is a problem for upgrades, as an existing grafana instance will still be looking at the old prometheus port number.</p>

mgr - Bug #37377 (New): ceph-mgr/influx: verify "no metadata" fix is complete
https://tracker.ceph.com/issues/37377 (2018-11-23, Tim Serong, tserong@suse.com)
<p>Seen while reviewing <a class="external" href="https://github.com/ceph/ceph/pull/25184">https://github.com/ceph/ceph/pull/25184</a>. The fix for <a class="external" href="http://tracker.ceph.com/issues/25191">http://tracker.ceph.com/issues/25191</a> in <a class="external" href="https://github.com/ceph/ceph/pull/22794">https://github.com/ceph/ceph/pull/22794</a> is applied to the get_pg_summary() function, but not to the get_daemon_stats() function. We need to verify whether this fix should also be applied to the latter function (my guess is "yes", but I don't know for certain).</p>