Ceph : Issues
https://tracker.ceph.com/
2024-01-18T14:28:46Z
CephFS - Bug #64085 (Fix Under Review): qa: stop testing on rhel8
https://tracker.ceph.com/issues/64085
2024-01-18T14:28:46Z
Patrick Donnelly <pdonnell@redhat.com>
<p>Causes packaging failures like:</p>
<p>/teuthology/vshankar-2024-01-10_06:48:43-fs-wip-vshankar-testing-20240103.072409-testing-default-smithi/7511104/teuthology.log</p>
<p>where cephfs-shell does not exist. centos8/rhel8 was deprecated for Squid.</p>

CephFS - Bug #64058 (Pending Backport): qa: Command failed (workunit test fs/snaps/untar_snap_rm.sh)
https://tracker.ceph.com/issues/64058
2024-01-17T06:13:10Z
Milind Changire
<pre>
2024-01-10T20:43:52.620+0000 7f8e9c4e3700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7-278-gd6f81946/rpm/el8/BUILD/ceph-17.2.7-278-gd6f81946/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f8e9c4e3700 time 2024-01-10T20:43:52.620029+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7-278-gd6f81946/rpm/el8/BUILD/ceph-17.2.7-278-gd6f81946/src/mds/MDSRank.cc: 941: ceph_abort_msg("abort() called")
ceph version 17.2.7-278-gd6f81946 (d6f81946113429824c706976328252a37cd7285f) quincy (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd7) [0x7f8eaacfcb07]
2: (MDSRank::abort(std::basic_string_view<char, std::char_traits<char> >)+0x7d) [0x55ce71c89ebd]
3: (CDentry::check_corruption(bool)+0x740) [0x55ce71f011b0]
4: (EMetaBlob::add_primary_dentry(EMetaBlob::dirlump&, CDentry*, CInode*, unsigned char)+0x47) [0x55ce71d5d4a7]
5: (EMetaBlob::add_dir_context(CDir*, int)+0x17b) [0x55ce72067b4b]
6: (MDCache::create_subtree_map()+0xc41) [0x55ce71dcfbe1]
7: (MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x55ce71ff679d]
8: (MDLog::start_submit_entry(LogEvent*, MDSLogContextBase*)+0xc1) [0x55ce71d543c1]
9: (MDCache::_fragment_old_purged(dirfrag_t, int, boost::intrusive_ptr<MDRequestImpl> const&)+0x33a) [0x55ce71dc3c7a]
10: (MDSContext::complete(int)+0x5f) [0x55ce71fdf49f]
11: (MDSIOContextBase::complete(int)+0x534) [0x55ce71fdfc34]
12: (Finisher::finisher_thread_entry()+0x18d) [0x7f8eaad9b8bd]
13: /lib64/libpthread.so.0(+0x81cf) [0x7f8ea9cea1cf]
14: clone()
</pre>
<p>archive_path: /home/teuthworker/archive/yuriw-2024-01-10_16:13:48-fs-wip-yuri6-testing-2024-01-05-0744-quincy-distro-default-smithi/7511909</p>

mgr - Bug #63882 (Pending Backport): pybind/mgr/devicehealth: "rados.ObjectNotFound: [errno 2] RA...
https://tracker.ceph.com/issues/63882
2023-12-21T13:47:37Z
Patrick Donnelly <pdonnell@redhat.com>
<p>We got a report of this failure:</p>
<pre>
"backtrace": [
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 342, in serve\n finished_loading_legacy = self.check_legacy_pool()",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 321, in check_legacy_pool\n if self._load_legacy_object(ioctx, obj.key):",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 295, in _load_legacy_object\n ioctx.operate_read_op(op, oid)",
" File \"rados.pyx\", line 3720, in rados.Ioctx.operate_read_op",
"rados.ObjectNotFound: [errno 2] RADOS object not found (Failed to operate read op for oid HGST_<redacted>)"
],
</pre>
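The traceback above shows <code>rados.ObjectNotFound</code> escaping from the legacy-pool scan when an object that appeared in the listing can no longer be read. A minimal sketch of the skip-and-continue approach, with stand-in names (<code>ObjectNotFound</code> and <code>load_one</code> are hypothetical here, not the actual module code):

```python
# Sketch: tolerate objects that vanish between listing and reading.

class ObjectNotFound(Exception):
    """Stand-in for rados.ObjectNotFound ([errno 2])."""

def load_legacy_objects(object_keys, load_one, log=print):
    """Load every legacy object, skipping any that cannot be read."""
    loaded, skipped = [], []
    for key in object_keys:
        try:
            loaded.append(load_one(key))
        except ObjectNotFound:
            # The object was listed but is gone (or unreadable) now;
            # log it and move on instead of crashing the serve() loop.
            log(f"legacy object {key} disappeared during scan; skipping")
            skipped.append(key)
    return loaded, skipped
```

The point is simply that a listed-but-unreadable object should degrade to a logged skip rather than an unhandled exception in the module's serve loop.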
<p>Strangely, the object appears in the pool listing but cannot be operated on. Make the module handle this robustly.</p>

CephFS - Fix #63432 (Fix Under Review): qa: run TestSnapshots.test_kill_mdstable for all mount types
https://tracker.ceph.com/issues/63432
2023-11-03T17:33:28Z
Patrick Donnelly <pdonnell@redhat.com>
<p>It currently runs only for fuse, and that hasn't had a successful run in months. The test skipped kernel mounts because the mount couldn't be "killed", but that is no longer true with the addition of netns.</p>

Ceph - Bug #63218 (Pending Backport): cmake: dependency ordering error for liburing and librocksdb
https://tracker.ceph.com/issues/63218
2023-10-16T15:01:52Z
Patrick Donnelly <pdonnell@redhat.com>
<pre>
./do_cmake.sh -DWITH_PYTHON3=3.6 -DWITH_BABELTRACE=OFF -DWITH_MANPAGE=OFF -DWITH_RBD=OFF -DWITH_KRBD=OFF -DWITH_RADOSGW=OFF -DWITH_LTTNG=OFF -DWITH_RDMA=OFF -DWITH_SEASTAR=OFF -DWITH_CEPH_DEBUG_MUTEX=ON
...
$ cmake --build build/ --verbose
...
FAILED: bin/ceph_test_keyvaluedb_iterators
: && /opt/rh/gcc-toolset-11/root/usr/bin/g++ -Og -g -rdynamic -pie src/test/ObjectMap/CMakeFiles/ceph_test_keyvaluedb_iterators.dir/test_keyvaluedb_iterators.cc.o src/test/ObjectMap/CMakeFiles/ceph_test_keyvaluedb_iterators.dir/KeyValueDBMemory.cc.o -o bin/ceph_test_keyvaluedb_iterators -Wl,-rpath,/home/pdonnell/scratch/build/lib lib/libos.a lib/libgmock_maind.a lib/libgmockd. a lib/libgtestd.a -lpthread -ldl lib/libglobal.a -ldl /usr/lib64/librt.so -lresolv -ldl lib/libblk.a /lib64/libaio.so src/liburing/src/liburing.a lib/libkv.a lib/libheap_profiler.a /lib64/libtcmalloc.so src/rocksdb/librocksdb.a /lib64/libsnappy.so /usr/lib64/liblz4.so /usr/lib64/libz.so /usr/lib64/libfuse.so lib/libceph-common.so.2 src/opentelemetry-cpp/sdk/src/trace/libopentelemetry_trace.a src/opentelemetry-cpp/sdk/src/resource/libopentelemetry_resources.a src/opentelemetry-cpp/sdk/src/common/libopentelemetry_common.a src/opentelemetry-cpp/exporters/jaeger/libopentelemetry_exporter_jaeger_trace.a src/opentelemetry-cpp/ext/src/http/client/curl/libopentelemetry_http_client_curl.a /usr/lib64/libcurl.so /usr/lib64/libthrift.so lib/libjson_spirit.a lib/libcommon_utf8.a lib/liberasure_code.a lib/libextblkdev.a -lcap boost/lib/libboost_thread.a boost/lib/libboost_chrono.a boost/lib/libboost_atomic.a boost/lib/libboost_system.a boost/lib/libboost_random.a boost/lib/libboost_program_options.a boost/lib/libboost_date_time.a boost/lib/libboost_iostreams.a boost/lib/libboost_regex.a lib/libfmtd.a /usr/lib64/libblkid.so -lpthread /usr/lib64/libcrypto.so /usr/lib64/libudev.so /usr/lib64/libz.so -ldl -lresolv -Wl,--as-needed -latomic && :
/opt/rh/gcc-toolset-11/root/usr/bin/ld: src/rocksdb/librocksdb.a(fs_posix.cc.o): in function `io_uring_wait_cqe_nr':
/home/pdonnell/scratch/build/src/liburing/src/include/liburing.h:494: undefined reference to `__io_uring_get_cqe'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: src/rocksdb/librocksdb.a(fs_posix.cc.o): in function `rocksdb::(anonymous namespace)::PosixFileSystem::AbortIO(std::vector<void*, std::allocator<void*> >&)':
/home/pdonnell/ceph/src/rocksdb/env/fs_posix.cc:1125: undefined reference to `io_uring_get_sqe'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /home/pdonnell/ceph/src/rocksdb/env/fs_posix.cc:1134: undefined reference to `io_uring_submit'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: src/rocksdb/librocksdb.a(fs_posix.cc.o): in function `rocksdb::CreateIOUring()':
/home/pdonnell/ceph/src/rocksdb/env/io_posix.h:272: undefined reference to `io_uring_queue_init'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: src/rocksdb/librocksdb.a(io_posix.cc.o): in function `io_uring_wait_cqe_nr':
/home/pdonnell/scratch/build/src/liburing/src/include/liburing.h:494: undefined reference to `__io_uring_get_cqe'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: src/rocksdb/librocksdb.a(io_posix.cc.o): in function `rocksdb::PosixRandomAccessFile::MultiRead(rocksdb::FSReadRequest*, unsigned long, rocksdb::IOOptions const&, rocksdb::IODebugContext*)':
/home/pdonnell/ceph/src/rocksdb/env/io_posix.cc:674: undefined reference to `io_uring_get_sqe'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /home/pdonnell/ceph/src/rocksdb/env/io_posix.cc:684: undefined reference to `io_uring_submit_and_wait'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: src/rocksdb/librocksdb.a(io_posix.cc.o): in function `rocksdb::PosixRandomAccessFile::ReadAsync(rocksdb::FSReadRequest&, rocksdb::IOOptions const&, std::function<void (rocksdb::FSReadRequest const&, void*)>, void*, void**, std::function<void (void*)>*, rocksdb::IODebugContext*)':
/home/pdonnell/ceph/src/rocksdb/env/io_posix.cc:901: undefined reference to `io_uring_get_sqe'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /home/pdonnell/ceph/src/rocksdb/env/io_posix.cc:910: undefined reference to `io_uring_submit'
collect2: error: ld returned 1 exit status
</pre>
<p>This is on vossi04. I'm not sure what change to the system caused this error, but the linking order is clearly wrong.</p>

CephFS - Feature #62849 (In Progress): mds/FSMap: add field indicating the birth time of the epoch
https://tracker.ceph.com/issues/62849
2023-09-15T16:35:25Z
Patrick Donnelly <pdonnell@redhat.com>
<p>So you can easily see when the FSMap epoch was published (real time) without looking at each file system's mdsmap. In fact, none of the file systems' mdsmaps may be updated if, e.g., only a standby was added.</p>
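The proposed behaviour can be sketched as follows (hypothetical names in Python, not the real C++ FSMap code): every epoch bump records its own publish time, even when no individual file system's mdsmap changed:

```python
import time

# Sketch: an FSMap-like object whose epoch carries a birth time,
# independent of the per-filesystem mdsmaps.

class FSMapSketch:
    def __init__(self):
        self.epoch = 0
        self.btime = None          # birth time of the current epoch
        self.standbys = []

    def _bump(self):
        self.epoch += 1
        self.btime = time.time()   # the field this feature adds

    def add_standby(self, name):
        self.standbys.append(name)
        self._bump()               # epoch changes; no mdsmap is touched
```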
<p>Add tests to confirm that the time is added to the FSMap and that it's accurate.</p>

CephFS - Bug #62764 (New): qa: use stdin-killer for kclient mounts
https://tracker.ceph.com/issues/62764
2023-09-07T19:53:55Z
Patrick Donnelly <pdonnell@redhat.com>
<p>To reduce the number of dead jobs caused by, e.g., a umount command stuck in uninterruptible sleep.</p>

CephFS - Bug #62763 (Fix Under Review): qa: use stdin-killer for ceph-fuse mounts
https://tracker.ceph.com/issues/62763
2023-09-07T19:52:25Z
Patrick Donnelly <pdonnell@redhat.com>
<p>To reduce the number of dead jobs caused by, e.g., a umount command stuck in uninterruptible sleep.</p>

CephFS - Bug #62577 (Pending Backport): mds: log a message when exiting due to asok "exit" command
https://tracker.ceph.com/issues/62577
2023-08-24T19:43:08Z
Patrick Donnelly <pdonnell@redhat.com>
<p>So it's clear what caused the call to suicide.</p>

CephFS - Bug #62208 (Fix Under Review): mds: use MDSRank::abort to ceph_abort so necessary sync i...
https://tracker.ceph.com/issues/62208
2023-07-27T13:32:34Z
Patrick Donnelly <pdonnell@redhat.com>
<p>The MDS calls ceph_abort("msg") in various places. If there are any pending cluster log messages to be sent to the mons, those messages may be lost when the MDS hard-stops due to abort. To guarantee those messages are flushed before aborting, use the new MDSRank::abort method instead of ceph_abort.</p>
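The flush-then-abort idea can be sketched generically (Python pseudocode for the pattern, not the actual C++ MDS implementation; all names here are hypothetical):

```python
# Sketch: drain any queued cluster-log messages synchronously,
# then hard-stop, so the final messages are not lost.

class DaemonSketch:
    def __init__(self, send):
        self._pending = []     # cluster log messages not yet delivered
        self._send = send      # callable that delivers one message

    def clog(self, msg):
        self._pending.append(msg)

    def abort(self, why):
        # MDSRank::abort-style: log the reason, flush everything
        # pending, and only then terminate.
        self.clog(why)
        while self._pending:
            self._send(self._pending.pop(0))
        raise SystemExit(why)  # stands in for ceph_abort()
```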
<p>See discussion: <a class="external" href="https://github.com/ceph/ceph/pull/52638#discussion_r1276289186">https://github.com/ceph/ceph/pull/52638#discussion_r1276289186</a></p>

CephFS - Bug #59425 (Pending Backport): qa: RuntimeError: more than one file system available
https://tracker.ceph.com/issues/59425
2023-04-11T14:10:25Z
Patrick Donnelly <pdonnell@redhat.com>
<pre>Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_8d156aede5efdae00b53d8d3b8d127082980e7ec/teuthology/run_tasks.py", line 109, in run_tasks
manager.__enter__()
File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/home/teuthworker/src/git.ceph.com_ceph-c_687b814f6a8c4db74daf97d825bbfa90b5560fa3/qa/tasks/ceph.py", line 1893, in task
healthy(ctx=ctx, config=dict(cluster=config['cluster']))
File "/home/teuthworker/src/git.ceph.com_ceph-c_687b814f6a8c4db74daf97d825bbfa90b5560fa3/qa/tasks/ceph.py", line 1474, in healthy
ceph_fs.wait_for_daemons(timeout=300)
File "/home/teuthworker/src/git.ceph.com_ceph-c_687b814f6a8c4db74daf97d825bbfa90b5560fa3/qa/tasks/cephfs/filesystem.py", line 1097, in wait_for_daemons
status = self.getinfo(refresh=True)
File "/home/teuthworker/src/git.ceph.com_ceph-c_687b814f6a8c4db74daf97d825bbfa90b5560fa3/qa/tasks/cephfs/filesystem.py", line 545, in getinfo
raise RuntimeError("more than one file system available")
RuntimeError: more than one file system available
2023-04-11T00:43:27.591 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=e97063c90400452b932c9b98d75f076a
</pre>
<p>/ceph/teuthology-archive/pdonnell-2023-04-11_00:14:25-fs-wip-pdonnell-testing-20230410.205400-quincy-distro-default-smithi/7237945/teuthology.log</p>
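The failure comes from <code>getinfo</code> raising as soon as more than one file system exists. One way the selection could be made robust is to pick a file system by name and fail only on genuine ambiguity; a sketch with a hypothetical helper (not the actual qa/tasks code):

```python
# Sketch: select a file system from an fsmap-style listing by name,
# instead of insisting on exactly one existing.

def pick_filesystem(fsmap, name=None):
    """fsmap: list of dicts, each with at least a 'name' key."""
    if name is not None:
        for fs in fsmap:
            if fs["name"] == name:
                return fs
        raise RuntimeError(f"no file system named {name!r}")
    if len(fsmap) == 1:
        return fsmap[0]
    raise RuntimeError("more than one file system available")
```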
<p>Problem also exists on main. Caused by: <a class="external" href="https://github.com/ceph/ceph/pull/50896">https://github.com/ceph/ceph/pull/50896</a></p>

RADOS - Bug #58974 (Pending Backport): mon/MonmapMonitor: do not propose on error in prepare_update
https://tracker.ceph.com/issues/58974
2023-03-13T16:30:55Z
Patrick Donnelly <pdonnell@redhat.com>
<p>See discussion: <a class="external" href="https://github.com/ceph/ceph/pull/50404#discussion_r1133791746">https://github.com/ceph/ceph/pull/50404#discussion_r1133791746</a></p>

CephFS - Bug #48673 (Pending Backport): High memory usage on standby replay MDS
https://tracker.ceph.com/issues/48673
2020-12-18T08:21:58Z
Daniel Persson
<p>Hi.</p>
<p>We have recently installed a Ceph cluster with about 27M objects. The filesystem seems to have 15M files.</p>
<p>The MDS is configured with a 20GB mds_cache_memory_limit. If we look at the nodes, memory stays a bit above the limit on the active node (node 4), but not excessively so.</p>
<pre>
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2165668 ceph 20 0 27.6g 26.1g 22088 S 12.3 13.9 2081:55 ceph-mds
</pre>
<p>However, we have problems with the standby-replay node (node 3), which has a large memory footprint.</p>
<pre>
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2166195 ceph 20 0 40.7g 38.2g 21000 S 0.7 20.4 86:31.18 ceph-mds
</pre>
<p>This level has remained constant for days. The cluster has raised warnings that reset a couple of times, even though the memory footprint has not changed.</p>
<pre>
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mdsnode3(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files
</pre>
<p>The nodes also run a couple of OSDs, and we don't want them to be affected now that the Xmas holidays are coming up, so I thought I'd open a ticket here and see if we can get any suggestions on preventive measures from now on.</p>
<p>If you want any extra information, please ask.</p>
<p>Best regards<br />Daniel</p>

CephFS - Feature #18154 (Fix Under Review): qa: enable mds thrash exports tests
https://tracker.ceph.com/issues/18154
2016-12-06T14:58:00Z
John Spray <jcspray@gmail.com>
<p>Currently:</p>
<pre>
$ git grep thrash.exports
suites/experimental/multimds/tasks/fsstress_thrash_subtrees.yaml: mds thrash exports: 1
suites/marginal/multimds/thrash/exports.yaml: mds thrash exports: 1
</pre>
<p>This needs to be part of our normal multimds testing.</p>

CephFS - Feature #7320 (Fix Under Review): qa: thrash directory fragmentation
https://tracker.ceph.com/issues/7320
2014-02-03T16:54:52Z
Sage Weil <sage@newdream.net>
<p>Define killpoints for directory fragmentation. Create tests as in <a class="external" href="https://github.com/ceph/ceph/pull/28004">https://github.com/ceph/ceph/pull/28004</a></p>