Ceph : Issues (https://tracker.ceph.com/, 2024-03-27T16:01:30Z)
CephFS - Bug #65182 (Pending Backport): mds: quiesce_inode op waiting on remote auth pins is not ... (https://tracker.ceph.com/issues/65182, 2024-03-27T16:01:30Z, Patrick Donnelly <pdonnell@redhat.com>)
<pre>
{
"description": "internal op quiesce_path:mds.1:1048 fp=#0x1/volumes/_nogroup/sv_new_1_def_11/0d61d4d2-d869-46f0-93a0-d9b9e74401c2",
"initiated_at": "2024-03-26T10:06:14.974850+0000",
"age": 101818.022728012,
"duration": 101818.025116246,
"continuous": true,
"type_data": {
"result": -2147483648,
"flag_point": "cleaned up request",
"reqid": {
"entity": {
"type": "mds",
"num": 1
},
"tid": 1048
},
"op_type": "internal_op",
"internal_op": 5384,
"op_name": "quiesce_path",
"events": [
{
"time": "2024-03-26T10:06:14.974850+0000",
"event": "initiated"
},
{
"time": "2024-03-26T10:06:14.974850+0000",
"event": "throttled"
},
{
"time": "2024-03-26T10:06:14.974850+0000",
"event": "header_read"
},
{
"time": "2024-03-26T10:06:14.974850+0000",
"event": "all_read"
},
{
"time": "2024-03-26T10:06:14.974850+0000",
"event": "dispatched"
},
{
"time": "2024-03-26T10:06:14.974869+0000",
"event": "acquired locks"
},
{
"time": "2024-03-26T10:06:14.974879+0000",
"event": "acquired locks"
},
{
"time": "2024-03-26T10:06:14.974888+0000",
"event": "acquired locks"
},
{
"time": "2024-03-26T10:06:14.974898+0000",
"event": "acquired locks"
},
{
"time": "2024-03-26T10:06:21.501232+0000",
"event": "killing request"
},
{
"time": "2024-03-26T10:06:21.501253+0000",
"event": "cleaned up request"
}
],
"locks": []
}
},
...
{
"description": "internal op quiesce_inode:mds.1:1049 fp=#0x100008e255a fp2=#0x100008e255a",
"initiated_at": "2024-03-26T10:06:14.974908+0000",
"age": 101818.022670109,
"duration": 101818.02511086701,
"continuous": true,
"type_data": {
"result": -2147483648,
"flag_point": "quiesce complete for non-auth inode",
"reqid": {
"entity": {
"type": "mds",
"num": 1
},
"tid": 1049
},
"op_type": "internal_op",
"internal_op": 5385,
"op_name": "quiesce_inode",
"events": [
{
"time": "2024-03-26T10:06:14.974908+0000",
"event": "initiated"
},
{
"time": "2024-03-26T10:06:14.974908+0000",
"event": "throttled"
},
{
"time": "2024-03-26T10:06:14.974908+0000",
"event": "header_read"
},
{
"time": "2024-03-26T10:06:14.974908+0000",
"event": "all_read"
},
{
"time": "2024-03-26T10:06:14.974908+0000",
"event": "dispatched"
},
{
"time": "2024-03-26T10:06:14.974977+0000",
"event": "requesting remote authpins"
},
{
"time": "2024-03-26T10:06:21.615411+0000",
"event": "acquired locks"
},
{
"time": "2024-03-26T10:06:21.615458+0000",
"event": "quiesce complete for non-auth inode"
}
],
"locks": [
{
"object": {
"is_auth": false,
"auth_state": {
"replicas": {}
},
"replica_state": {
"authority": [
0,
-2
],
"replica_nonce": 1
},
"auth_pins": 0,
"is_frozen": false,
"is_freezing": false,
"pins": {
"request": 1,
"lock": 1
},
"nref": 2
},
"object_string": "[inode 0x100008e255a [...2ae,head] /volumes/_nogroup/sv_new_1_def_11/0d61d4d2-d869-46f0-93a0-d9b9e74401c2/ rep@0.1 v1696 snaprealm=0x55b78d09f440 f(v0 m2024-03-26T10:05:13.326074+0000 10=2+8) n(v56 rc2024-03-26T10:17:04.624239+0000 b2670077140 31541=28967+2574)/n(v0 rc2024-03-26T09:40:15.892764+0000 b1027604480 138=3+135) (inest mix) (iquiesce lock x=1 by request(mds.1:1049 nref=3)) | request=1 lock=1 0x55b78d1b4580]",
"lock": {
"gather_set": [],
"state": "lock",
"type": "iquiesce",
"is_leased": false,
"num_rdlocks": 0,
"num_wrlocks": 0,
"num_xlocks": 1,
"xlock_by": {
"reqid": {
"entity": {
"type": "mds",
"num": 1
},
"tid": 1049
}
}
},
"flags": 4,
"wrlock_target": -1
}
]
}
},
</pre>
<p>This is an op dump from a QE test cluster. The quiesce_path op was killed, and shortly afterwards the quiesce_inode op received the remote authpins that allowed it to proceed. However, MDCache::request_kill does not actually kill a request that is waiting on remote authpins, so the request is allowed to proceed with its quiesce.</p>
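<p>A minimal sketch of the gap being described, using stand-in types rather than the real MDCache/MDRequestImpl API; the flag and the "fix direction" below are hypothetical:</p>
<pre>
// Illustrative only; these are stand-in types, not the real Ceph classes.
#include <iostream>
#include <memory>

struct FakeRequest {
  bool killed = false;
  bool waiting_for_remote_authpins = false;  // hypothetical flag
};

// Marking the op killed is not enough: if it is parked waiting for remote
// authpins, the later authpin ack wakes it up and the quiesce proceeds anyway.
void request_kill_sketch(const std::shared_ptr<FakeRequest>& mdr) {
  mdr->killed = true;
  if (mdr->waiting_for_remote_authpins) {
    // Fix direction: tear the request down here rather than relying on the
    // killed flag alone.
    std::cout << "abort the remote authpin wait and finish the request\n";
  }
}

int main() {
  auto mdr = std::make_shared<FakeRequest>();
  mdr->waiting_for_remote_authpins = true;
  request_kill_sketch(mdr);
}
</pre>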
<p><a class="external" href="https://github.com/ceph/ceph/pull/56376">https://github.com/ceph/ceph/pull/56376</a> would have reintroduced <a class="external" href="https://tracker.ceph.com/issues/61875">https://tracker.ceph.com/issues/61875</a> as it puts the snap mapper keys back into the pg meta object. Oddly, a teuthology run on that branch which seems to have included tests with both snapshots and osd restarts did not show crashes associated with this regression and at least one case that seems like it should have exercised the relevant code passed. A quick glance over PGLog.cc::FuturizedShardStoreReader doesn't show any changes, so it should have crashed in the final else branch of FuturizedShardStoreLogReader::process_entry at e.decode_with_checksum.</p>
<p>Tasks:<br />- Confirm that the crimson-rados suite actually combines snapshots with OSD restarts<br />- Work out why the existing suite didn't fail the above PR<br />- Amend the tests to cover the gap</p>
CephFS - Bug #65039 (Triaged): mds: standby-replay segmentation fault in md_log_replay (https://tracker.ceph.com/issues/65039, 2024-03-21T14:19:46Z, Patrick Donnelly <pdonnell@redhat.com>)
<pre>
2024-03-21T03:15:55.310 INFO:journalctl@ceph.mds.h.smithi060.stdout:Mar 21 03:15:55 smithi060 ceph-87dd0fc6-e72e-11ee-95c9-87774f69a715-mds-h[71557]: *** Caught signal (Segmentation fault) **
2024-03-21T03:15:55.310 INFO:journalctl@ceph.mds.h.smithi060.stdout:Mar 21 03:15:55 smithi060 ceph-87dd0fc6-e72e-11ee-95c9-87774f69a715-mds-h[71557]: in thread 7f7135d7c700 thread_name:md_log_replay
</pre>
<p>From: /teuthology/pdonnell-2024-03-21_02:37:43-fs:workload-main-distro-default-smithi/7614435/teuthology.log</p>
<p>I logged into the machine and collected a gdb stack trace (attached). Initially I was looking for a deadlock, not a segmentation fault. The signal handler for SIGSEGV got deadlocked (predictably) because it was using malloc:</p>
<pre>
Thread 26 (Thread 0x7f7135d7c700 (LWP 72204)):
#0 0x00007f7148e163d0 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
#1 0x00007f7148e162d3 in SpinLock::SlowLock() () from /lib64/libtcmalloc.so.4
#2 0x00007f7148e05a55 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#3 0x00007f7148e093e3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) () from /lib64/libtcmalloc.so.4
#4 0x00007f71484409b3 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*> () from /usr/lib64/ceph/libceph-common.so.2
#5 0x00007f7148440aa9 in ceph::ClibBackTrace::demangle[abi:cxx11](char const*) () from /usr/lib64/ceph/libceph-common.so.2
#6 0x00007f7148441025 in ceph::ClibBackTrace::print(std::ostream&) const () from /usr/lib64/ceph/libceph-common.so.2
#7 0x000055c9ae7266dd in handle_oneshot_fatal_signal (signum=11) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/global/signal_handler.cc:331
#8 <signal handler called>
#9 0x00007f7148e05603 in tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**) () from /lib64/libtcmalloc.so.4
#10 0x00007f7148e058ae in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
#11 0x00007f7148e05971 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#12 0x00007f7148e093e3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) () from /lib64/libtcmalloc.so.4
#13 0x000055c9ae311e17 in EMetaBlob::fullbit::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/include/compact_map.h:27
#14 0x000055c9ae31429d in EMetaBlob::dirlump::_decode_bits (this=0x55c9b25c9770) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/events/EMetaBlob.h:609
#15 0x000055c9ae31c397 in EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/events/EMetaBlob.h:296
#16 0x000055c9ae322551 in EUpdate::replay(MDSRank*) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/journal.cc:2252
#17 0x000055c9ae64dd97 in MDLog::_replay_thread (this=0x55c9b18e6000) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:421
#18 0x000055c9ae6543b1 in MDLog::ReplayThread::entry (this=<optimized out>) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/MDLog.h:181
#19 0x00007f71471331ca in start_thread () from /lib64/libpthread.so.0
#20 0x00007f71456308d3 in clone () from /lib64/libc.so.6
</pre>
<p>Unfortunately, I didn't get a chance to dig into frame #13 to see why it segfaulted.</p>
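<p>Side note on the handler deadlock above: demangling the backtrace allocates, and malloc is not async-signal-safe, so a fault taken while a tcmalloc lock is held can never reacquire that lock from inside the handler. A generic sketch of the async-signal-safe style (fixed message, write(2) only); this is not the src/global/signal_handler.cc code:</p>
<pre>
// Generic illustration, not Ceph's signal handler: a fatal-signal handler
// that avoids malloc entirely by using only async-signal-safe calls.
#include <signal.h>
#include <unistd.h>

static void fatal_handler(int signum) {
  // No allocation, no iostreams, no demangling: just write a fixed message.
  static const char msg[] = "*** caught fatal signal, aborting ***\n";
  // write(2) is async-signal-safe; the result is deliberately ignored.
  (void)!write(STDERR_FILENO, msg, sizeof(msg) - 1);
  // Restore the default disposition and re-raise so a core dump is produced.
  signal(signum, SIG_DFL);
  raise(signum);
}

int main() {
  struct sigaction sa = {};
  sa.sa_handler = fatal_handler;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGSEGV, &sa, nullptr);
  // ... the rest of the program; a SIGSEGV now avoids the malloc re-entry.
}
</pre>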
rgw - Bug #64841 (New): java_s3tests: testObjectCreateBadExpectMismatch failure (https://tracker.ceph.com/issues/64841, 2024-03-11T17:06:12Z, Casey Bodley <cbodley@redhat.com>)
<p>ex. <a class="external" href="http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-10_14:50:40-rgw-wip-cbodley2-testing-distro-default-smithi/7589576/teuthology.log">http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-10_14:50:40-rgw-wip-cbodley2-testing-distro-default-smithi/7589576/teuthology.log</a><br /><pre>
2024-03-10T16:40:31.654 INFO:teuthology.orchestra.run.smithi060.stdout:suite > Object tests > ObjectTest.testObjectCreateBadExpectMismatch STARTED
2024-03-10T16:40:32.455 INFO:teuthology.orchestra.run.smithi060.stdout:
2024-03-10T16:40:32.455 INFO:teuthology.orchestra.run.smithi060.stdout:suite > Object tests > ObjectTest.testObjectCreateBadExpectMismatch FAILED
2024-03-10T16:40:32.455 INFO:teuthology.orchestra.run.smithi060.stdout: com.amazonaws.services.s3.model.AmazonS3Exception at ObjectTest.java:525
</pre></p>
<p>Scanning the rgw log <a class="external" href="http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-10_14:50:40-rgw-wip-cbodley2-testing-distro-default-smithi/7589576/remote/smithi060/log/rgw.ceph.client.0.log.gz">http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-10_14:50:40-rgw-wip-cbodley2-testing-distro-default-smithi/7589576/remote/smithi060/log/rgw.ceph.client.0.log.gz</a>, I see this sequence of requests:<br /><pre>
2024-03-10T16:40:31.653+0000 CreateBucket test-c2dd4bb1-7d2b-431d-a7b0-200c96d8349313
2024-03-10T16:40:32.445+0000 status ok
2024-03-10T16:40:33.409+0000 ListObjectVersions
2024-03-10T16:40:34.465+0000 status ok
2024-03-10T16:40:34.469+0000 a4d1c640 1 failed to read header: bad method
2024-03-10T16:40:34.885+0000 ListObjectVersions
2024-03-10T16:40:35.645+0000 status ok
2024-03-10T16:40:35.649+0000 DeleteBucket
</pre></p>
<p>that "bad method" error corresponds to the test's PutObject request that's supposed to pass `Expect: 200`. the "bad method" error comes from beast's http parser when it's trying to read the next request from a connection. i assume this means we either read too many bytes from the previous request, or too few</p>
<p>There are several other occurrences of this "bad method" error, but none led to test failures. Scanning the rgw logs of successful java_s3tests runs, I don't see any "bad method" errors.</p>
rgw - Bug #64571 (New): lifecycle transition crashes since merge end-to-end tracing (https://tracker.ceph.com/issues/64571, 2024-02-26T15:58:44Z, Casey Bodley <cbodley@redhat.com>)
<p>regression from <a class="external" href="https://github.com/ceph/ceph/pull/52114">https://github.com/ceph/ceph/pull/52114</a>, whose test results included these failures <a class="external" href="https://pulpito.ceph.com/yuvalif-2024-01-30_08:46:48-rgw-wip-end2end-tracing-distro-default-smithi/">https://pulpito.ceph.com/yuvalif-2024-01-30_08:46:48-rgw-wip-end2end-tracing-distro-default-smithi/</a></p>
<p>example rgw/lifecycle job from <a class="external" href="http://qa-proxy.ceph.com/teuthology/yuriw-2024-02-23_19:55:04-rgw-main-distro-default-smithi/7572634/teuthology.log">http://qa-proxy.ceph.com/teuthology/yuriw-2024-02-23_19:55:04-rgw-main-distro-default-smithi/7572634/teuthology.log</a><br /><pre>
2024-02-24T03:30:13.667 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_set_invalid_date PASSED [ 76%]
2024-02-24T03:30:23.431 INFO:tasks.rgw.client.0.smithi171.stderr:daemon-helper: command crashed with signal 11
2024-02-24T03:30:28.964 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-02-24T03:30:34.468 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-02-24T03:30:39.972 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-02-24T03:30:45.475 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-02-24T03:30:50.979 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~22s
2024-02-24T03:30:56.483 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~28s
2024-02-24T03:31:01.986 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~33s
2024-02-24T03:31:05.290 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition FAILED [ 76%]
2024-02-24T03:31:05.291 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition ERROR [ 76%]
2024-02-24T03:31:07.490 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~39s
2024-02-24T03:31:12.994 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~44s
2024-02-24T03:31:18.497 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~50s
2024-02-24T03:31:20.174 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_single_rule_multi_trans FAILED [ 76%]
2024-02-24T03:31:20.174 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_transition_single_rule_multi_trans ERROR [ 76%]
2024-02-24T03:31:24.001 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~55s
2024-02-24T03:31:29.506 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~61s
2024-02-24T03:31:35.009 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~66s
2024-02-24T03:31:36.263 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent_transition FAILED [ 76%]
2024-02-24T03:31:36.263 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_set_noncurrent_transition ERROR [ 76%]
2024-02-24T03:31:40.513 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~72s
2024-02-24T03:31:46.017 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~77s
2024-02-24T03:31:48.736 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_transition FAILED [ 77%]
2024-02-24T03:31:48.736 INFO:teuthology.orchestra.run.smithi171.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_transition ERROR [ 77%]
</pre></p>
<p>example rgw/cloud-transition job from <a class="external" href="http://qa-proxy.ceph.com/teuthology/yuriw-2024-02-23_19:55:04-rgw-main-distro-default-smithi/7572619/teuthology.log">http://qa-proxy.ceph.com/teuthology/yuriw-2024-02-23_19:55:04-rgw-main-distro-default-smithi/7572619/teuthology.log</a><br /><pre>
2024-02-24T03:12:02.182 INFO:tasks.rgw.client.0.smithi106.stderr:daemon-helper: command crashed with signal 11
2024-02-24T03:12:05.128 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-02-24T03:12:10.732 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~6s
2024-02-24T03:12:16.337 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~11s
2024-02-24T03:12:21.947 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~17s
2024-02-24T03:12:27.553 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~22s
2024-02-24T03:12:33.157 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~28s
2024-02-24T03:12:38.761 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~34s
2024-02-24T03:12:44.365 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~39s
2024-02-24T03:12:49.968 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~45s
2024-02-24T03:12:55.572 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~50s
2024-02-24T03:13:01.177 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~56s
2024-02-24T03:13:06.781 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~62s
2024-02-24T03:13:12.386 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~67s
2024-02-24T03:13:17.990 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~73s
2024-02-24T03:13:23.594 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~78s
2024-02-24T03:13:29.198 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~84s
2024-02-24T03:13:34.803 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~90s
2024-02-24T03:13:40.407 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~95s
2024-02-24T03:13:42.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition FAILED [ 25%]
2024-02-24T03:13:42.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition ERROR [ 25%]
2024-02-24T03:13:46.011 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~101s
2024-02-24T03:13:51.615 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~106s
2024-02-24T03:13:57.219 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~112s
2024-02-24T03:13:59.823 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_multiple_transition FAILED [ 50%]
2024-02-24T03:13:59.824 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_multiple_transition ERROR [ 50%]
2024-02-24T03:14:02.823 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~118s
2024-02-24T03:14:08.427 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~123s
2024-02-24T03:14:14.031 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~129s
2024-02-24T03:14:19.635 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~135s
2024-02-24T03:14:21.200 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_cloud_transition FAILED [ 75%]
2024-02-24T03:14:21.200 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_noncur_cloud_transition ERROR [ 75%]
2024-02-24T03:14:25.240 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~140s
2024-02-24T03:14:30.844 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~146s
2024-02-24T03:14:35.192 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition_large_obj FAILED [100%]
2024-02-24T03:14:35.193 INFO:teuthology.orchestra.run.smithi106.stdout:s3tests_boto3/functional/test_s3.py::test_lifecycle_cloud_transition_large_obj ERROR [100%]
2024-02-24T03:14:35.193 INFO:teuthology.orchestra.run.smithi106.stdout:
</pre></p>
Linux kernel client - Bug #64172 (Fix Under Review): Test failure: test_multiple_path_r (tasks.ce... (https://tracker.ceph.com/issues/64172, 2024-01-25T05:55:58Z, Venky Shankar <vshankar@redhat.com>)
<p>/a/vshankar-2024-01-22_07:03:31-fs-wip-vshankar-testing-20240119.075157-1-testing-default-smithi/7525717</p>
<p>The test setup grants a "read" cap on a file system path (directory), remounts that directory as the file system root, and then reads the created files.</p>
<p>MDS logs: ./remote/smithi157/log/ceph-mds.c.log.gz</p>
<pre>
2024-01-22T08:27:55.205+0000 7f81a7600640 1 -- [v2:172.21.15.157:6835/1231338113,v1:172.21.15.157:6837/1231338113] <== client.17628 v1:192.168.0.1:0/312855551 9 ==== client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6 ==== 176+0+0 (unknown 772642831 0 0) 0x55ab1d52bb00 con 0x55ab1d52e400
2024-01-22T08:27:55.205+0000 7f81a7600640 4 mds.0.server handle_client_request client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6
2024-01-22T08:27:55.205+0000 7f81a7600640 20 mds.0.356 get_session have 0x55ab1d202f00 client.17628 v1:192.168.0.1:0/312855551 state open
2024-01-22T08:27:55.205+0000 7f81a7600640 15 mds.0.server oldest_client_tid=5
2024-01-22T08:27:55.205+0000 7f81a7600640 7 mds.0.cache request_start request(client.17628:5 nref=2 cr=0x55ab1d52bb00)
2024-01-22T08:27:55.205+0000 7f81a7600640 7 mds.0.server dispatch_client_request client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6
2024-01-22T08:27:55.205+0000 7f81a7600640 20 Session check_access path
2024-01-22T08:27:55.205+0000 7f81a7600640 10 MDSAuthCap is_capable inode(path / owner 0:0 mode 041777) by caller 0:0 mask 0 new 0:0 cap: MDSAuthCaps[allow r fsname=cephfs path="/dir1/dir12", allow r fsname=cephfs path="/dir2/dir22"]
2024-01-22T08:27:55.205+0000 7f81a7600640 7 mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6
2024-01-22T08:27:55.205+0000 7f81a7600640 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x0
2024-01-22T08:27:55.205+0000 7f81a7600640 20 mds.0.server lat 0.000095
2024-01-22T08:27:55.205+0000 7f81a7600640 10 mds.0.356 send_message_client client.17628 v1:192.168.0.1:0/312855551 client_reply(???:5 = -13 (13) Permission denied) v1
2024-01-22T08:27:55.205+0000 7f81a7600640 1 -- [v2:172.21.15.157:6835/1231338113,v1:172.21.15.157:6837/1231338113] --> v1:192.168.0.1:0/312855551 -- client_reply(???:5 = -13 (13) Permission denied) v1 -- 0x55ab1d5b4700 con 0x55ab1d52e400
</pre>
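<p>The denial in that log is straight path matching: the caps only cover /dir1/dir12 and /dir2/dir22, while the client is asking about the filesystem root (#0x1, path "/"), which lies under neither. A toy sketch of that kind of check (not the real MDSAuthCaps code):</p>
<pre>
// Toy illustration, not MDSAuthCaps::is_capable: why a request on "/" is
// denied when the caps only grant read under specific subtrees.
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

bool path_allowed(const std::vector<std::string>& cap_paths,
                  std::string_view req_path) {
  for (const auto& cap : cap_paths) {
    // Allowed when the requested path sits inside a capped subtree.
    if (req_path.size() >= cap.size() &&
        req_path.compare(0, cap.size(), cap) == 0)
      return true;
  }
  return false;
}

int main() {
  std::vector<std::string> caps = {"/dir1/dir12", "/dir2/dir22"};
  std::cout << path_allowed(caps, "/dir1/dir12/file") << "\n";  // 1 (allowed)
  std::cout << path_allowed(caps, "/") << "\n";                 // 0 => EACCES
}
</pre>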
<p><a class="external" href="https://bugzilla.redhat.com/show_bug.cgi?id=2251165">https://bugzilla.redhat.com/show_bug.cgi?id=2251165</a></p>
<p>Description of problem:</p>
<p>Version-Release number of selected component:<br />ceph-common-2:18.2.1-1.fc39</p>
<p>Additional info:<br />reporter: libreport-2.17.11<br />cmdline: /usr/bin/python3.12 /usr/bin/ceph -s<br />backtrace_rating: 4<br />runlevel: N 5<br />executable: /usr/bin/python3.12<br />journald_cursor: s=9f8a7a66b4194fdcbd75dcd3edf4da87;i=173e8c976;b=a08b8db920744522980a5387af245706;m=2743cc1c;t=60accf74a277f;x=cef1ac3a8dc81a9d<br />comment: <br />cgroup: 0::/user.slice/user-1000.slice/user/app.slice/app-org.kde.konsole-44b42a69b68946748c9899bd38ac8c6d.scope<br />kernel: 6.6.2-200.fc39.x86_64<br />uid: 1000<br />rootdir: /<br />crash_function: CommonSafeTimer<std::mutex>::timer_thread<br />type: CCpp<br />package: ceph-common-2:18.2.1-1.fc39<br />reason: python3.12 killed by SIGSEGV</p>
<p>Truncated backtrace:<br />Thread no. 1 (3 frames)<br /> #0 CommonSafeTimer<std::mutex>::timer_thread at /usr/src/debug/ceph-18.2.1-1.fc39.x86_64/src/common/Timer.cc:103<br /> #1 CommonSafeTimerThread<std::mutex>::entry at /usr/src/debug/ceph-18.2.1-1.fc39.x86_64/src/common/Timer.cc:33<br /> #3 clone3 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78</p>
CephFS - Backport #63590 (In Progress): reef: qa: fs:mixed-clients kernel_untar_build failure (https://tracker.ceph.com/issues/63590, 2023-11-20T02:20:32Z, Backport Bot)
<p><a class="external" href="https://github.com/ceph/ceph/pull/54711">https://github.com/ceph/ceph/pull/54711</a></p> CephFS - Backport #63589 (In Progress): quincy: qa: fs:mixed-clients kernel_untar_build failurehttps://tracker.ceph.com/issues/635892023-11-20T02:20:24ZBackport Bot
<p><a class="external" href="https://github.com/ceph/ceph/pull/54712">https://github.com/ceph/ceph/pull/54712</a></p> CephFS - Feature #61866 (Fix Under Review): MDSMonitor: require --yes-i-really-mean-it when faili...https://tracker.ceph.com/issues/618662023-07-01T23:59:29ZPatrick Donnellypdonnell@redhat.com
<p>If an MDS is already struggling with a backlog of untrimmed journal segments or an oversized cache, restarting it may only create new problems, including very slow recovery. In particular, if the MDS falls far behind on trimming its journal, with 1M or more segments, replay can take hours or longer.</p>
<p>We already track these warnings in MDSMonitor, so add a simple check to help the operator or support folks avoid shooting themselves in the foot.</p>
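<p>A rough sketch of the shape of that check, with made-up helper and flag names (this is not the actual MDSMonitor code):</p>
<pre>
// Sketch only; not the real MDSMonitor command handler.
#include <iostream>
#include <sstream>
#include <string>

// Returns true if an `mds fail` command should be allowed to proceed.
bool check_mds_fail_allowed(bool behind_on_trimming, bool cache_oversized,
                            bool yes_i_really_mean_it, std::ostream& out) {
  if ((behind_on_trimming || cache_oversized) && !yes_i_really_mean_it) {
    out << "MDS has trim/cache health warnings; failing it now may lead to a "
           "very long replay and recovery. Pass --yes-i-really-mean-it to "
           "proceed anyway.";
    return false;
  }
  return true;
}

int main() {
  std::ostringstream ss;
  if (!check_mds_fail_allowed(true /*behind on trimming*/, false,
                              false /*no override flag*/, ss))
    std::cout << ss.str() << "\n";
}
</pre>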
rgw - Backport #58913 (New): quincy: multisite reshard: old buckets with num_shards=0 get reshard... (https://tracker.ceph.com/issues/58913, 2023-03-03T19:48:01Z, Backport Bot)
rgw - Bug #58891 (Pending Backport): multisite reshard: old buckets with num_shards=0 get reshard... (https://tracker.ceph.com/issues/58891, 2023-03-01T15:19:55Z, Casey Bodley <cbodley@redhat.com>)
<p>The new reshard strategy doesn't take into account the old semantics for num_shards=0:</p>
<pre><code class="cpp syntaxhl"><span class="CodeRay"> <span class="comment">// Represents the number of bucket index object shards:</span>
<span class="comment">// - value of 0 indicates there is no sharding (this is by default before this</span>
<span class="comment">// feature is implemented).</span>
<span class="comment">// - value of UINT32_T::MAX indicates this is a blind bucket.</span>
uint32_t num_shards;
</span></code></pre>
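<p>A small sketch of the normalization a reshard decision needs before comparing shard counts; the helper name is made up and this is not the rgw code:</p>
<pre>
// Sketch only; not the rgw reshard code.
#include <cstdint>
#include <limits>
#include <optional>

// Effective number of index shards, or nullopt for blind buckets that have
// no bucket index and must never be resharded.
std::optional<uint32_t> effective_num_shards(uint32_t num_shards) {
  if (num_shards == std::numeric_limits<uint32_t>::max())
    return std::nullopt;   // blind bucket
  if (num_shards == 0)
    return 1;              // legacy encoding: a single, unsharded index object
  return num_shards;
}
</pre>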
rgw - Bug #57905 (Pending Backport): multisite: terminate called after throwing an instance of 'c... (https://tracker.ceph.com/issues/57905, 2022-10-20T13:30:01Z, Casey Bodley <cbodley@redhat.com>)
<p>example from rgw/multisite suite: <a class="external" href="http://qa-proxy.ceph.com/teuthology/cbodley-2022-10-19_23:28:37-rgw-wip-cbodley-testing-distro-default-smithi/7075088/teuthology.log">http://qa-proxy.ceph.com/teuthology/cbodley-2022-10-19_23:28:37-rgw-wip-cbodley-testing-distro-default-smithi/7075088/teuthology.log</a></p>
<p>The tcmalloc warnings make it look like we're decoding something, reading an implausibly large 'size', and then failing to decode that many bytes.</p>
<pre>
2022-10-20T05:29:45.277 DEBUG:tasks.util.rgw:rgwadmin: cmd=['adjust-ulimits', 'ceph-coverage', '/home/ubuntu/cephtest/archive/coverage', 'radosgw-admin', '--log-to-stderr', '--format', 'json', '-n', 'client.0', '--cluster', 'c1', 'bucket', 'sync', 'checkpoint', '--bucket', 'swwtcn-52', '--source-zone', 'a1', '--retry-delay-ms', '5000', '--timeout-sec', '300', '--rgw-zone', 'a2', '--rgw-zonegroup', 'a', '--rgw-realm', 'test-realm', '--cluster', 'c1', '--debug-rgw', '1', '--debug-ms', '0']
2022-10-20T05:29:45.277 DEBUG:teuthology.orchestra.run.smithi150:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster c1 bucket sync checkpoint --bucket swwtcn-52 --source-zone a1 --retry-delay-ms 5000 --timeout-sec 300 --rgw-zone a2 --rgw-zonegroup a --rgw-realm test-realm --cluster c1 --debug-rgw 1 --debug-ms 0
2022-10-20T05:29:45.336 INFO:teuthology.orchestra.run.smithi150.stderr:ignoring --setuser ceph since I am not root
2022-10-20T05:29:45.337 INFO:teuthology.orchestra.run.smithi150.stderr:ignoring --setgroup ceph since I am not root
2022-10-20T05:29:45.381 INFO:teuthology.orchestra.run.smithi150.stderr:2022-10-20T05:29:45.380+0000 7f59b3460780 1 waiting to reach incremental sync..
2022-10-20T05:29:47.652 INFO:tasks.rgw.c1.client.0.smithi150.stdout:tcmalloc: large alloc 13655506944 bytes == 0x560fc6c8c000 @ 0x7f7e06715760 0x7f7e06736c64 0x7f7cc5270166 0x7f7cc526ee93 0x560fba1059e9 0x560fba2170a4 0x560fba1f69a6 0x560fba2346c1 0x560fba234f44 0x560fba1c2525 0x560fb9f429f3 0x560fb9f443b7 0x560fb9e9ad96 0x560fb9e9b94a 0x560fbaca884f
2022-10-20T05:29:47.657 INFO:tasks.rgw.c1.client.0.smithi150.stdout:tcmalloc: large alloc 9825697792 bytes == 0x5612f6420000 @ 0x7f7e06715760 0x7f7e06736c64 0x7f7cc5270166 0x7f7cc526ee93 0x560fba1059e9 0x560fba2170a4 0x560fba1f69a6 0x560fba2346c1 0x560fba234f44 0x560fba1c2525 0x560fb9f429f3 0x560fb9f443b7 0x560fb9e9ad96 0x560fb9e9b94a 0x560fbaca884f
2022-10-20T05:29:50.382 INFO:teuthology.orchestra.run.smithi150.stderr:2022-10-20T05:29:50.381+0000 7f59b3460780 1 waiting to reach incremental sync..
2022-10-20T05:29:51.336 INFO:tasks.rgw.c1.client.0.smithi150.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
2022-10-20T05:29:51.336 INFO:tasks.rgw.c1.client.0.smithi150.stdout: what(): End of buffer [buffer:2]
2022-10-20T05:29:51.337 INFO:tasks.rgw.c1.client.0.smithi150.stdout:*** Caught signal (Aborted) **
2022-10-20T05:29:51.337 INFO:tasks.rgw.c1.client.0.smithi150.stdout: in thread 7f7cf11dc700 thread_name:radosgw
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: ceph version 18.0.0-564-g492571cb (492571cb93a9d1551a1968e5374657023093a0a8) reef (dev)
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f7e0596ccf0]
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 2: gsignal()
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 3: abort()
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 4: /lib64/libstdc++.so.6(+0x9009b) [0x7f7e050b309b]
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 5: /lib64/libstdc++.so.6(+0x9653c) [0x7f7e050b953c]
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 6: /lib64/libstdc++.so.6(+0x95559) [0x7f7e050b8559]
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 7: __gxx_personality_v0()
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 8: /lib64/libgcc_s.so.1(+0x10b03) [0x7f7e04a99b03]
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 9: _Unwind_Resume()
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 10: radosgw(+0x524ec4) [0x560fb9d5cec4]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 11: radosgw(+0x653ecd) [0x560fb9e8becd]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 12: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f7e050e5ba3]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 13: /lib64/libpthread.so.0(+0x81ca) [0x7f7e059621ca]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 14: clone()
</pre>
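<p>A generic illustration of that failure mode (this is not the rgw sync code): a garbled or misaligned length prefix first triggers the huge allocation, and the following copy then runs off the end of the buffer:</p>
<pre>
// Generic illustration only; not Ceph's decode machinery.
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

struct Cursor {
  const uint8_t* p;
  size_t remaining;
  void copy(void* dst, size_t n) {
    if (n > remaining)
      throw std::runtime_error("end of buffer");  // analogous to end_of_buffer
    std::memcpy(dst, p, n);
    p += n;
    remaining -= n;
  }
};

std::vector<uint8_t> decode_blob(Cursor& c) {
  uint32_t len = 0;
  c.copy(&len, sizeof(len));      // a garbled prefix yields an absurd length
  std::vector<uint8_t> out(len);  // => "tcmalloc: large alloc ..." warning
  c.copy(out.data(), len);        // => throws: far fewer bytes actually remain
  return out;
}
</pre>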
CephFS - Bug #57655 (Pending Backport): qa: fs:mixed-clients kernel_untar_build failure (https://tracker.ceph.com/issues/57655, 2022-09-23T01:03:33Z, Patrick Donnelly <pdonnell@redhat.com>)
<pre>
2022-09-12T12:12:00.425 INFO:tasks.workunit.client.1.smithi176.stderr:fs/compat.o: warning: objtool: missing symbol for section .text
2022-09-12T12:12:00.487 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/binfmt_misc.o
2022-09-12T12:12:00.842 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/binfmt_script.o
2022-09-12T12:12:00.980 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/binfmt_elf.o
2022-09-12T12:12:01.273 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/compat_binfmt_elf.o
2022-09-12T12:12:01.278 INFO:tasks.workunit.client.1.smithi176.stdout: AR kernel/built-in.a
2022-09-12T12:12:01.714 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/mbcache.o
2022-09-12T12:12:01.739 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/posix_acl.o
2022-09-12T12:12:01.742 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/coredump.o
2022-09-12T12:12:01.777 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/drop_caches.o
2022-09-12T12:12:01.795 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/fhandle.o
2022-09-12T12:12:02.186 INFO:tasks.workunit.client.1.smithi176.stdout: CC fs/dcookies.o
2022-09-12T12:12:02.982 INFO:tasks.workunit.client.1.smithi176.stderr:fs/dcookies.o: warning: objtool: missing symbol for section .text
2022-09-12T12:12:02.999 INFO:tasks.workunit.client.1.smithi176.stdout: AR fs/built-in.a
2022-09-12T12:12:03.195 DEBUG:teuthology.orchestra.run:got remote process result: 2
2022-09-12T12:12:03.196 INFO:tasks.workunit:Stopping ['kernel_untar_build.sh'] on client.1...
</pre>
<p>Seen: /ceph/teuthology-archive/dparmar-2022-09-12_11:38:14-fs:mixed-clients-main-distro-default-smithi/7029223/teuthology.log</p>
<p>and more recently: /ceph/teuthology-archive/pdonnell-2022-09-22_12:22:37-fs-wip-pdonnell-testing-20220920.234701-distro-default-smithi/7041086/teuthology.log</p>
Calamari - Support #14437 (New): dashboard widgets IOPs and Usage are blank (https://tracker.ceph.com/issues/14437, 2016-01-20T09:35:10Z, deng pei <dengpei_dp@126.com>)
<p>dashboard widgets IOPs and Usage are blank</p>
<p>Calamari server and Ceph node salt service status:</p>
<pre>
root@calamari:~# /etc/init.d/salt-master status
 * salt-master is running
root@calamari:~# /etc/init.d/salt-minion status
 * salt-minion is running

root@mon1:~# /etc/init.d/salt-minion status
 * salt-minion is running
</pre>
<p>Calamari server and Ceph node salt versions:</p>
<pre>
root@mon1:~# salt-minion --version
salt-minion 2014.7.5 (Helium)
root@mon1:~# diamond --version
Diamond version 3.4.67

root@calamari:~# salt-master --version
salt-master 2014.7.5 (Helium)
root@calamari:~# salt-minion --version
salt-minion 2014.7.5 (Helium)
</pre>