Ceph : Issues
https://tracker.ceph.com/ (2024-03-28T12:42:52Z, Ceph)
Ceph - Bug #65199 (New): autoscaler: Scale PGs based on number of objects
https://tracker.ceph.com/issues/65199 (2024-03-28T12:42:52Z, Niklas Hambuechen)
<p>Ceph's autoscaler scales PGs based on Bytes stored. It seemingly ignores number of objects. This creates problems for pools with many small files.</p>
<p>It creates even more problems for pools with an apparent byte size of 0, but millions of objects; such pools get created when following CephFS-on-EC best practices in the docs.</p>
<p>Red Hat docs describe:</p>
<p><a class="external" href="https://access.redhat.com/documentation/de-de/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs#viewing-placement-group-scaling-recommendations">https://access.redhat.com/documentation/de-de/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs#viewing-placement-group-scaling-recommendations</a></p>
<blockquote>
<p><strong>BIAS</strong>, is a pool property that is used by the PG autoscaler to scale some pools faster than others, in terms of number of PGs. It is essentially a multiplier used to give more PG to a pool than the default number of PGs. This property is <strong>particularly used for metadata pools which might be small in size but have large number of objects</strong>, so scaling them faster is important for better performance.</p>
</blockquote>
<p>(Note these docs are better than the upstream Ceph docs on BIAS, which are much shorter: <a class="external" href="https://docs.ceph.com/en/reef/rados/operations/placement-groups/">https://docs.ceph.com/en/reef/rados/operations/placement-groups/</a>)</p>
<p>So this confirms that BIAS (pg_autoscale_bias) can be used to partially address the "many small objects" problem, using a constant factor.</p>
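<p>For reference, the bias is a per-pool property set with the standard CLI; the pool name and factor below are purely illustrative:</p>
<pre>
ceph osd pool get cephfs_metadata pg_autoscale_bias     # defaults to 1.0
ceph osd pool set cephfs_metadata pg_autoscale_bias 4.0
</pre>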
<p>But the constant factor stops working when the objects are 0-sized.</p>
<p>This happens when following CephFS best practices: <a class="external" href="https://docs.ceph.com/en/reef/cephfs/createfs/#creating-pools">https://docs.ceph.com/en/reef/cephfs/createfs/#creating-pools</a></p>
<blockquote>
<p>The data pool used to create the file system is the “default” data pool and the location for storing all inode backtrace information, which is used for hard link management and disaster recovery. For this reason, all CephFS inodes have at least one object in the default data pool.<br /><strong>If erasure-coded pools are planned for file system data, it is best to configure the default as a replicated pool</strong> to improve small-object write and read performance when updating backtraces. Separately, another erasure-coded data pool can be added (see also Erasure code) that can be used on an entire hierarchy of directories and files (see also File layouts).</p>
</blockquote>
<p>If you do what is described here ("default" pool on replicated, directory File Layout on EC), you end up with pools like this in `ceph df`:</p>
<pre>
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 203 MiB 26 609 MiB 90.00 5 GiB
data 2 32 0 B 112.23M 0 B 0 61 TiB
data_ec 3 168 124 TiB 115.30M 186 TiB 50.53 121 TiB
metadata 4 128 63 GiB 32.87k 189 GiB 90.00 5 GiB
</pre>
<p>Note how <strong>the `data` pool that stores the inodes has 112 M objects but 0 Bytes stored</strong>. Apparently the inode backtrace objects carry no data payload, so they do not show up in the STORED column.</p>
<p>Because the stored byte size is 0, the autoscaler assigns no more than 32 PGs.</p>
<p>This means that there are ~4 M objects per PG. If the objects are on HDDs that can do 100 seeks per second, running scrubbing, recovery, or balancing (which needs to seek all objects in a PG) will <strong>take at least 11 hours</strong>. And this does not even take EC overhead factors into account.</p>
<p>If there were 1 B objects, handling a single PG would take > 100 hours.</p>
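<p>As an interim, manual workaround (not from the docs; the pool name and the ~100k objects-per-PG target below are illustrative), the autoscaler can be turned off for the affected pool and pg_num pinned from the object count by hand:</p>
<pre>
# sketch of a manual workaround for the 112M-object pool from the `ceph df` output above
ceph osd pool set data pg_autoscale_mode off   # stop the autoscaler from shrinking the pool again
ceph osd pool set data pg_num 1024             # ~112M objects / 1024 PGs ~= 110k objects per PG
</pre>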
<p>There seems to be nothing in Ceph that scales PGs automatically based on the number of objects. This issue requests that this be added.</p>
<p>This would:</p>
<ul>
<li>Ensure that the CephFS EC recommendations actually make sense and do not produce operational problems.</li>
<li>Improve Ceph's default behaviour for many small files/objects, without the user having to set BIAS manually.</li>
</ul>

RADOS - Backport #65198 (In Progress): squid: Failed to encode map X with expected CRC
https://tracker.ceph.com/issues/65198 (2024-03-28T10:29:12Z, Backport Bot)

RADOS - Bug #65186 (New): OSDs unreachable in upgrade test
https://tracker.ceph.com/issues/65186 (2024-03-27T20:28:19Z, Laura Flores)
<p>/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7616011/remote/smithi087/log/a8e8c570-e819-11ee-95cd-87774f69a715<br /><pre><code class="text syntaxhl"><span class="CodeRay">2024-03-22T07:19:18.215315+0000 mon.a (mon.0) 10 : cluster 0 Standby manager daemon x restarted
2024-03-22T07:19:18.215450+0000 mon.a (mon.0) 11 : cluster 0 Standby manager daemon x started
2024-03-22T07:19:18.215315+0000 mon.a (mon.0) 10 : cluster 0 Standby manager daemon x restarted
2024-03-22T07:19:18.215450+0000 mon.a (mon.0) 11 : cluster 0 Standby manager daemon x started
2024-03-22T07:19:18.277027+0000 mon.a (mon.0) 12 : cluster 0 mgrmap e33: y(active, since 63s), standbys: x
2024-03-22T07:19:18.414028+0000 mon.a (mon.0) 13 : cluster 1 Active manager daemon y restarted
2024-03-22T07:19:18.414630+0000 mon.a (mon.0) 14 : cluster 4 Health check failed: 8 osds(s) are not reachable (OSD_UNREACHABLE)
2024-03-22T07:19:18.414953+0000 mon.a (mon.0) 15 : cluster 1 Activating manager daemon y
2024-03-22T07:19:18.427127+0000 mon.a (mon.0) 16 : cluster 0 osdmap e81: 8 total, 8 up, 8 in
2024-03-22T07:19:18.277027+0000 mon.a (mon.0) 12 : cluster 0 mgrmap e33: y(active, since 63s), standbys: x
2024-03-22T07:19:18.427673+0000 mon.a (mon.0) 17 : cluster 0 mgrmap e34: y(active, starting, since 0.0129348s), standbys: x
2024-03-22T07:19:18.414028+0000 mon.a (mon.0) 13 : cluster 1 Active manager daemon y restarted
2024-03-22T07:19:18.433869+0000 osd.4 (osd.4) 3 : cluster 3 failed to encode map e81 with expected crc
2024-03-22T07:19:18.435418+0000 osd.2 (osd.2) 3 : cluster 3 failed to encode map e81 with expected crc
2024-03-22T07:19:18.414630+0000 mon.a (mon.0) 14 : cluster 4 Health check failed: 8 osds(s) are not reachable (OSD_UNREACHABLE)
2024-03-22T07:19:18.443967+0000 osd.4 (osd.4) 4 : cluster 3 failed to encode map e81 with expected crc
</span></code></pre></p>
<p>Likely connected to <a class="external" href="https://tracker.ceph.com/issues/63389">https://tracker.ceph.com/issues/63389</a>.</p>

RADOS - Bug #65185 (New): OSD_SCRUB_ERROR, inconsistent pg in upgrade tests
https://tracker.ceph.com/issues/65185 (2024-03-27T20:21:29Z, Laura Flores)
<p>/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7616025/remote/smithi098/log/b1f19696-e81a-11ee-95cd-87774f69a715/ceph.log.gz<br /><pre><code class="text syntaxhl"><span class="CodeRay">2024-03-22T09:20:00.000187+0000 mon.a (mon.0) 7863 : cluster 4 [ERR] OSD_SCRUB_ERRORS: 1 scrub errors
2024-03-22T09:20:00.000194+0000 mon.a (mon.0) 7864 : cluster 4 [ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
2024-03-22T09:19:59.897409+0000 mon.a (mon.0) 7860 : cluster 0 osdmap e3595: 8 total, 8 up, 8 in
2024-03-22T09:20:00.000202+0000 mon.a (mon.0) 7865 : cluster 4 pg 103.14 is active+clean+inconsistent, acting [5,1,2]
2024-03-22T09:20:00.000151+0000 mon.a (mon.0) 7861 : cluster 4 Health detail: HEALTH_ERR noscrub flag(s) set; 1 scrub errors; Possible data damage: 1 pg inconsistent
</span></code></pre></p>
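<p>For anyone triaging this: the inconsistent PG named in the log can be inspected and repaired with the standard tooling; a sketch, with the PG id taken from the log above:</p>
<pre>
rados list-inconsistent-obj 103.14 --format=json-pretty   # show which object(s) and shards disagree
ceph pg repair 103.14                                     # ask the primary to repair the PG
</pre>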
<p>More in this run: <a class="external" href="https://pulpito.ceph.com/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/">https://pulpito.ceph.com/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/</a></p>

rgw - Bug #65160 (Fix Under Review): rgw/lc: A few buckets stuck in UNINITIAL state
https://tracker.ceph.com/issues/65160 (2024-03-26T18:37:34Z, Soumya Koduri)
<p>We are observing that the lifecycle policy for certain buckets remains stuck in the UNINITIAL state when the LC transition policy is applied to multiple buckets.</p>
<p>There are a total of 428 buckets in the LC list, out of which 5 have been stuck in the UNINITIAL state for a few days. All belong to the same LC shard, lc.4.</p>
<p>-------------- snippet of the buckets stuck ----------</p>
<pre><code>
        "status": "COMPLETE"
    },
    {
        "bucket": ":lucileg.356-bucky-4188-12:1e614482-6c1c-49cb-81ef-8fb70c1ff4e8.208655.6",
        "shard": "lc.4",
        "started": "Thu, 01 Jan 1970 00:00:00 GMT",
        "status": "UNINITIAL"
    },
    {
        "bucket": ":lucileg.356-bucky-4188-31:1e614482-6c1c-49cb-81ef-8fb70c1ff4e8.218596.16",
        "shard": "lc.4",
        "started": "Thu, 01 Jan 1970 00:00:00 GMT",
        "status": "UNINITIAL"
    },
    {
        "bucket": ":lucileg.356-bucky-4188-42:1e614482-6c1c-49cb-81ef-8fb70c1ff4e8.208655.24",
        "shard": "lc.4",
        "started": "Thu, 01 Jan 1970 00:00:00 GMT",
        "status": "UNINITIAL"
    },
    {
        "bucket": ":lucileg.356-bucky-4188-63:1e614482-6c1c-49cb-81ef-8fb70c1ff4e8.218596.32",
        "shard": "lc.4",
        "started": "Thu, 01 Jan 1970 00:00:00 GMT",
        "status": "UNINITIAL"
    },
    {
        "bucket": ":lucileg.356-bucky-4188-95:1e614482-6c1c-49cb-81ef-8fb70c1ff4e8.218596.47",
        "shard": "lc.4",
        "started": "Thu, 01 Jan 1970 00:00:00 GMT",
        "status": "UNINITIAL"
    },
    {
        "bucket": ":maryx.267-bucky-4525-37:1e614482-6c1c-49cb-81ef-8fb70c1ff4e8.133978.25",
        "shard": "lc.4",
        "started": "Wed, 21 Feb 2024 11:08:55 GMT",
        "status": "COMPLETE"
    }
</code></pre>
<p>How reproducible:</p>
<p>Seen intermittently when LC is applied simultaneously on multiple buckets (100 at a time).</p>
<p>Reproducibility: 2/3 times.</p>
<p>Steps to Reproduce:</p>
<p>1. Create around 100 buckets and upload 10 objects per bucket.<br />2. Apply an LC rule on all the buckets to transition the objects to a 'cold' storage class.<br />3. Repeat the test (steps 1 and 2) around 2-3 times.</p>
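<p>For concreteness, a transition rule of the kind used in step 2 might look like the sketch below (bucket name, endpoint, rule ID and the storage-class name "cold" are placeholders, not the exact values from this run):</p>
<pre>
cat > lc.json <<'EOF'
{
  "Rules": [
    {
      "ID": "transition-to-cold",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [ { "Days": 1, "StorageClass": "cold" } ]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
    --endpoint-url http://rgw.example.com:8080 \
    --bucket bucky-4188-12 \
    --lifecycle-configuration file://lc.json
</pre>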
<p>Actual results:</p>
<p>We observed that 5 buckets are stuck in the UNINITIAL state for a few days.</p>
<p>Expected results:</p>
<p>The LC process should not remain stuck in the UNINITIAL state.</p>

Ceph QA - QA Run #65148 (QA Testing): wip-vshankar-testing-20240326.105515-reef
https://tracker.ceph.com/issues/65148 (2024-03-26T10:59:35Z, Venky Shankar <vshankar@redhat.com>)
<p><a class="external" href="https://github.com/ceph/ceph/pull/55803">https://github.com/ceph/ceph/pull/55803</a> - reef: client: Fix return in removexattr for xattrs from `system.` namespace<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56016">https://github.com/ceph/ceph/pull/56016</a> - reef: mds: relax certain asserts in mdlog replay thread<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56049">https://github.com/ceph/ceph/pull/56049</a> - reef: mds: allow lock state to be LOCK_MIX_SYNC in replica for filelock<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56167">https://github.com/ceph/ceph/pull/56167</a> - reef: qa: correct usage of DEBUGFS_META_DIR in dedent<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56169">https://github.com/ceph/ceph/pull/56169</a> - reef: qa: set mds config with `config set` for a particular test<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56171">https://github.com/ceph/ceph/pull/56171</a> - reef: qa: `fs volume rename` requires `fs fail` and `refuse_client_session` set<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56211">https://github.com/ceph/ceph/pull/56211</a> - reef: qa: enhance labeled perf counters test for cephfs-mirror<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56392">https://github.com/ceph/ceph/pull/56392</a> - reef: qa/suites/fs/nfs: use standard health ignorelist<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56396">https://github.com/ceph/ceph/pull/56396</a> - reef: qa: change log-whitelist to log-ignorelist<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56398">https://github.com/ceph/ceph/pull/56398</a> - reef: mds: ensure snapclient is synced before corruption check<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56424">https://github.com/ceph/ceph/pull/56424</a> - reef: qa/suites/fs/workload: enable snap_schedule early</p>

rgw - Feature #65131 (New): perf counters for CreateMultipartUpload, AbortMultipartUpload, Comple...
https://tracker.ceph.com/issues/65131 (2024-03-26T02:08:51Z, Paul Cuzner)
<p>The rgw daemon exposes a perf counter, "rgw.puts", representing the number of put ops performed by the gateway.</p>
<p>For small objects the value is fine, but beyond 16MB the counter appears to increment once per 16MB chunk instead of once per object. For example, a workload of 64MB PUTs shows 4x the value it should when cross-checked with the client.<br />(GET counters are not affected by this issue.)</p>
<p>To reproduce, use warp with something like<br /><pre>
warp put --warp-client warp-1 --host rgw-1 --access-key $ACCESS_KEY --secret-key $SECRET_KEY --bucket $bucket --obj.size 64MB --concurrent 1 --duration 1m
</pre></p>
<p>And grab the counter stats via the admin socket, e.g.<br /><pre>
#!/usr/bin/bash
# Sample the rgw "put" perf counter from the admin socket every 5 seconds
# and emit CSV lines of "<unix timestamp>,<cumulative put count>".
write_stats () {
    put_count=$(ceph daemon /var/run/ceph/ceph-client.rgw.group2.storage-13-09008.ujvjki.7.94246064814240.asok perf dump | jq ".rgw.put")
    now=$(date '+%s')
    echo "${now},${put_count}"
}

while true; do
    write_stats
    sleep 5
done
</pre></p>
<p>Here are my results.</p>
<p>From the rgw stats script:<br /><pre>
1711416974,2108690
1711416980,2108696
1711416985,2108727
1711416990,2108759
1711416995,2108790
1711417000,2108823
1711417005,2108858
1711417010,2108891
1711417015,2108923
1711417021,2108954
1711417026,2108985
1711417031,2109014
1711417036,2109044
1711417041,2109062
1711417046,2109062
</pre><br />The above shows a delta of 31-32 every 5s, so in theory the PUT rate is around 6 ops/sec.</p>
<p>However, the results from warp show:<br /><pre>
started workload, 1 client(s) with 64MB objects at Tue Mar 26 01:36:14 UTC 2024
warp: Benchmark data written to "put-experiment/ec8-2/client_count_1/clients_1_64MB_PUT.csv.zst"
----------------------------------------
Operation: PUT. Concurrency: 1
* Average: 94.18 MiB/s, 1.54 obj/s
Throughput, split into 58 x 1s:
* Fastest: 109.1MiB/s, 1.79 obj/s
* 50% Median: 94.4MiB/s, 1.55 obj/s
* Slowest: 84.2MiB/s, 1.38 obj/s
warp: Cleanup done.
workload completed, 1 client(s) with 64MB objects at Tue Mar 26 01:37:20 UTC 2024
</pre></p>
<p>If you account for the reporting unit in rgw being 16MB chunks rather than 64MB objects and divide by 4, the RGW-derived rate becomes ~1.5 obj/s, which matches the value reported by warp.</p>
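<p>Working through the arithmetic (31.5 is just the midpoint of the observed 31-32 delta):</p>
<pre>
# counter delta per second, divided by 4 to undo the suspected 16MB-chunk counting
echo "scale=2; (31.5 / 5) / 4" | bc    # 1.57, close to warp's reported 1.54 obj/s
</pre>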
<p>This affects any monitoring or grafana graphs that show put ops, and any that show PUT latency, since latency calculations may rely on a count of ops processed in an interval.</p>

Ceph QA - QA Run #65048 (QA Needs Approval): wip-yuri11-testing-2024-03-21-0851-reef
https://tracker.ceph.com/issues/65048 (2024-03-21T15:52:03Z, Yuri Weinstein <yweinste@redhat.com>)
<p>--- done. these PRs were included:<br /><a class="external" href="https://github.com/ceph/ceph/pull/54729">https://github.com/ceph/ceph/pull/54729</a> - reef: qa/cephfs: improvements for name generators in test_volumes.py<br /><a class="external" href="https://github.com/ceph/ceph/pull/54981">https://github.com/ceph/ceph/pull/54981</a> - reef: osd: Apply randomly selected scheduler type across all OSD shards <br /><a class="external" href="https://github.com/ceph/ceph/pull/56120">https://github.com/ceph/ceph/pull/56120</a> - reef: rgw/beast: Enable SSL session-id reuse speedup mechanism<br /><a class="external" href="https://github.com/ceph/ceph/pull/56151">https://github.com/ceph/ceph/pull/56151</a> - reef: qa: Add benign cluster warning from ec-inconsistent-hinfo test to ignorelist</p>
<p>OLD:<br />--- done. these PRs were included:<br />
<a class="external" href="https://github.com/ceph/ceph/pull/54729">https://github.com/ceph/ceph/pull/54729</a> - reef: qa/cephfs: improvements for name generators in test_volumes.py<br />
<a class="external" href="https://github.com/ceph/ceph/pull/54942">https://github.com/ceph/ceph/pull/54942</a> - reef: cephfs: add command "ceph fs swap" <br />
<a class="external" href="https://github.com/ceph/ceph/pull/54981">https://github.com/ceph/ceph/pull/54981</a> - reef: osd: Apply randomly selected scheduler type across all OSD shards <br />
<a class="external" href="https://github.com/ceph/ceph/pull/56120">https://github.com/ceph/ceph/pull/56120</a> - reef: rgw/beast: Enable SSL session-id reuse speedup mechanism<br />
<a class="external" href="https://github.com/ceph/ceph/pull/56151">https://github.com/ceph/ceph/pull/56151</a> - reef: qa: Add benign cluster warning from ec-inconsistent-hinfo test to ignorelist</p>

CephFS - Bug #65043 (Triaged): Unable to set timestamp to value > UINT32_MAX
https://tracker.ceph.com/issues/65043 (2024-03-21T15:09:56Z, Sachin Prabhu)
<p>I have been testing the samba integration with cephfs at the backend using smbtorture - a tool provided by Samba to test for protocol correctness. The test in question is <br />smb2.timestamps</p>
<p>This sets mtime, btime and atime on a file to various times and then checks the return values for consistency. I noticed that the problem arises when the timestamp's seconds value is either negative or > UINT32_MAX.</p>
<p>It was brought to my notice that CephFS uses a u32 for the tv_sec element of utime_t, and this explains the issue we see:<br /><a class="external" href="https://github.com/ceph/ceph/blob/main/src/include/utime.h#L51">https://github.com/ceph/ceph/blob/main/src/include/utime.h#L51</a></p>
<p>To demonstrate the problem, I have attached a reproducer which sets the atime/ctime to UINT32_MAX, UINT32_MAX+1 and UINT32_MAX+2 for a file opened on a cephfs filesystem and for a file opened in the cwd of the directory from which the test is run. The timestamps on the local filesystem are set correctly while those on the cephfs system fail for values > UINT32_MAX.</p>
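<p>(As an aside, the same wraparound can also be probed from a shell on a mounted CephFS; this is only a hedged sketch based on the behaviour described here, and the mount point is an assumption:)</p>
<pre>
# set mtime to UINT32_MAX + 1 via utimensat and read it back
touch -d @4294967296 /mnt/cephfs/ts-test
stat -c %Y /mnt/cephfs/ts-test      # reportedly wraps to 0 on CephFS
touch -d @4294967296 /tmp/ts-test
stat -c %Y /tmp/ts-test             # 4294967296 on a local filesystem
</pre>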
<p>Reproducer output:</p>
<pre>
gcc -o test -D_FILE_OFFSET_BITS=64 -lcephfs test.c
./test

Ceph: Create a new file
cephfs: Get time: 1711023453
local: Create a new file
local: Get time: 1711023453

Set time to UINT32_MAX
cephfs: Set time: 4294967295
cephfs: Get time: 4294967295
local: Set time: 4294967295
local: Get time: 4294967295

Set time to UINT32_MAX+1
cephfs: Set time: 4294967296
cephfs: Get time: 0
local: Set time: 4294967296
local: Get time: 4294967296

Set time to UINT32_MAX+2
cephfs: Set time: 4294967297
cephfs: Get time: 1
local: Set time: 4294967297
local: Get time: 4294967297
</pre>
<p>Is there any chance that we will see 64 bit values used in the future?</p>

rgw - Bug #64971 (Need More Info): Rgw lifecycle skip
https://tracker.ceph.com/issues/64971 (2024-03-18T18:39:53Z, Erik Skytthe)
<p>Running ceph ver. 18.2.0.<br />We observe that bucket lifecycle cleanup is being skipped on a random day each week. <br />Should lifecycle cleanup not run every day of the week?<br />I have tried to find traces in the logs, but did not succeed.</p>
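<p>For reference, the daily run window is controlled by the rgw_lifecycle_work_time option (00:00-06:00 by default), and processing can be checked or kicked by hand while investigating; commands one can use:</p>
<pre>
radosgw-admin lc list       # per-bucket shard status, as in the listing below
radosgw-admin lc process    # manually trigger lifecycle processing outside the configured window
</pre>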
<pre><code class="text syntaxhl"><span class="CodeRay">
March 8, 2024 at 10:56 AM
Ran on the 6th but not on the 7th??
radosgw-admin lc list
[
{
"bucket": ":simple-search-solr-bibdk-1-3:5442c8b2-5744-428b-b2c2-5b631018a838.190697630.1",
"shard": "lc.8",
"started": "Wed, 06 Mar 2024 23:00:00 GMT",
"status": "COMPLETE"
},
{
"bucket": ":simple-search-solr-bibdk-1-4:5442c8b2-5744-428b-b2c2-5b631018a838.262119777.1",
"shard": "lc.15",
"started": "Wed, 06 Mar 2024 23:00:00 GMT",
"status": "COMPLETE"
},
...
</span></code></pre>

rgw - Bug #64841 (New): java_s3tests: testObjectCreateBadExpectMismatch failure
https://tracker.ceph.com/issues/64841 (2024-03-11T17:06:12Z, Casey Bodley <cbodley@redhat.com>)
<p>ex. <a class="external" href="http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-10_14:50:40-rgw-wip-cbodley2-testing-distro-default-smithi/7589576/teuthology.log">http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-10_14:50:40-rgw-wip-cbodley2-testing-distro-default-smithi/7589576/teuthology.log</a><br /><pre>
2024-03-10T16:40:31.654 INFO:teuthology.orchestra.run.smithi060.stdout:suite > Object tests > ObjectTest.testObjectCreateBadExpectMismatch STARTED
2024-03-10T16:40:32.455 INFO:teuthology.orchestra.run.smithi060.stdout:
2024-03-10T16:40:32.455 INFO:teuthology.orchestra.run.smithi060.stdout:suite > Object tests > ObjectTest.testObjectCreateBadExpectMismatch FAILED
2024-03-10T16:40:32.455 INFO:teuthology.orchestra.run.smithi060.stdout: com.amazonaws.services.s3.model.AmazonS3Exception at ObjectTest.java:525
</pre></p>
<p>scanning the rgw log <a class="external" href="http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-10_14:50:40-rgw-wip-cbodley2-testing-distro-default-smithi/7589576/remote/smithi060/log/rgw.ceph.client.0.log.gz">http://qa-proxy.ceph.com/teuthology/cbodley-2024-03-10_14:50:40-rgw-wip-cbodley2-testing-distro-default-smithi/7589576/remote/smithi060/log/rgw.ceph.client.0.log.gz</a> i see this sequence of requests:<br /><pre>
2024-03-10T16:40:31.653+0000 CreateBucket test-c2dd4bb1-7d2b-431d-a7b0-200c96d8349313
2024-03-10T16:40:32.445+0000 status ok
2024-03-10T16:40:33.409+0000 ListObjectVersions
2024-03-10T16:40:34.465+0000 status ok
2024-03-10T16:40:34.469+0000 a4d1c640 1 failed to read header: bad method
2024-03-10T16:40:34.885+0000 ListObjectVersions
2024-03-10T16:40:35.645+0000 status ok
2024-03-10T16:40:35.649+0000 DeleteBucket
</pre></p>
<p>that "bad method" error corresponds to the test's PutObject request that's supposed to pass `Expect: 200`. the "bad method" error comes from beast's http parser when it's trying to read the next request from a connection. i assume this means we either read too many bytes from the previous request, or too few</p>
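<p>For reference, the shape of that request can be approximated with a plain HTTP client; this is only an unsigned sketch (host, bucket and key are placeholders), so rgw will reject it with 403, but it shows the non-standard Expect value being sent alongside a small body:</p>
<pre>
curl -v -X PUT \
  -H "Expect: 200" \
  --data-binary "echo" \
  "http://localhost:8000/test-bucket/key1"
</pre>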
<p>there are several other occurrences of this "bad method" error, but none led to test failures. scanning the rgw log of successful java_s3tests runs, i don't see any "bad method" errors</p>

rgw - Backport #64500 (In Progress): reef: multisite: Deadlock in RGWDeleteMultiObj with default ...
https://tracker.ceph.com/issues/64500 (2024-02-19T18:03:18Z, Backport Bot)
<p><a class="external" href="https://github.com/ceph/ceph/pull/55655">https://github.com/ceph/ceph/pull/55655</a></p>

rgw - Backport #64326 (In Progress): reef: RGW: Cloud sync module fails to sync folders
https://tracker.ceph.com/issues/64326 (2024-02-05T21:28:54Z, Backport Bot)
<p><a class="external" href="https://github.com/ceph/ceph/pull/56554">https://github.com/ceph/ceph/pull/56554</a></p>

RADOS - Bug #63389 (Pending Backport): Failed to encode map X with expected CRC
https://tracker.ceph.com/issues/63389 (2023-11-01T13:26:34Z, Navid Golpa)
<p>During the upgrade of Ceph from Quincy to Reef we encountered a problem as we upgraded each OSD. Every time an OSD was restarted to upgrade it to Reef, the MONs would get spammed with<br /><pre>
failed to encode map X with expected crc
</pre></p>
<p>Network load on the MON would skyrocket. The problem was identical to what was described by Kefu here in 2016:<br /><a class="external" href="https://lore.kernel.org/all/CAJE9aONFauhy7v6n9bT11Sga+e0Qgi8hWu=gr-zoxuAq5Yv+cA@mail.gmail.com/T/">https://lore.kernel.org/all/CAJE9aONFauhy7v6n9bT11Sga+e0Qgi8hWu=gr-zoxuAq5Yv+cA@mail.gmail.com/T/</a></p>
<p>We did not follow the recommendation in that post of downgrading the MONs, upgrading the OSDs first, and then upgrading the MONs again. Instead we powered through the upgrade by taking a one-day downtime and upgrading all remaining OSDs. Once all OSDs were upgraded, the errors went away and the cluster was back to normal operation.</p>

rgw - Bug #63373 (Pending Backport): multisite: Deadlock in RGWDeleteMultiObj with default rgw_mu...
https://tracker.ceph.com/issues/63373 (2023-10-31T11:08:18Z, Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>)
<p>On running a multisite setup with Reef (18.2.0) we run into a situation where the client-facing RGWs (the ones with rgw_run_sync_thread = false) deadlock after running a warp s3 benchmark with around 10 clients. The deadlock is reproducible even when rgw_run_sync_thread = true is set. On inspecting one of the RGWs with eu-stack we see that all of the RGW threads are blocked at `RGWDataChangesLog::add_entry`; we see mainly the following 2 patterns in the active threads:</p>
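<p>(For context, per-thread stacks like the ones quoted below can be gathered from the hung process with elfutils' eu-stack; a sketch, assuming the daemon binary is named radosgw:)</p>
<pre>
eu-stack -p "$(pidof radosgw)" > rgw-threads.txt   # one backtrace per thread of the running gateway
</pre>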
<pre>
#0 0x00007f9ba0a7d82d __lll_lock_wait
#1 0x00007f9ba0a76ad9 __pthread_mutex_lock
#2 0x000055d9caf70e4b LazyFIFO::lazy_init(DoutPrefixProvider const*, optional_yield)
#3 0x000055d9caf79968 RGWDataChangesFIFO::push(DoutPrefixProvider const*, int, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long...(=)
#4 0x000055d9caf6c938 RGWDataChangesLog::add_entry(DoutPrefixProvider const*, RGWBucketInfo const&, rgw::bucket_log_layout_generation const&, int, optional_yield)
#5 0x000055d9cac8cba5 add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield)
#6 0x000055d9cac9da60 RGWRados::Bucket::UpdateIndex::complete(...)
#8 0x000055d9cacc4e67 RGWRados::Object::Write::write_meta(...)
</pre>
<p>And the other pattern:<br /><pre>
#0 0x00007f9ba0a7a45c pthread_cond_wait@@GLIBC_2.3.2
#1 0x00007f9b9ff35860 std::condition_variable::wait(std::unique_lock<std::mutex>&)
#2 0x000055d9caf6c323 RGWDataChangesLog::add_entry(DoutPrefixProvider const*, RGWBucketInfo const&, rgw::bucket_log_layout_generation const&, int, optional_yield)
#3 0x000055d9cac8cba5 add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield)
#4 0x000055d9cac9da60 RGWRados::Bucket::UpdateIndex::complete()
</pre></p>
<p>On a more detailed analysis via GDB on the lazy init threads we see</p>
<pre>
#0 0x00007f9ba0a7d82d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1 0x00007f9ba0a76ad9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2 0x000055d9caf70e4b in __gthread_mutex_lock (__mutex=0x55d9cc496de8)
at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/x86_64-redhat-linux/bits/gthr-default.h:749
#3 std::mutex::lock (this=0x55d9cc496de8) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/std_mutex.h:100
#4 std::unique_lock<std::mutex>::lock (this=0x7f9a54644ec0) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_lock.h:139
#5 std::unique_lock<std::mutex>::unique_lock (__m=..., this=0x7f9a54644ec0) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_lock.h:69
#6 LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d897c000, y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
#7 0x000055d9caf79968 in LazyFIFO::push (y=..., bl=..., dpp=0x55d9d897c000, this=0x55d9cc496dc0)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:305
#8 RGWDataChangesFIFO::push (this=0x55d9cd28d300, dpp=0x55d9d897c000, index=40, bl=..., y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_datalog.cc:286
#9 0x000055d9caf6c938 in RGWDataChangesLog::add_entry (this=0x55d9cc546200, dpp=0x55d9d897c000, bucket_info=..., gen=...,
shard_id=<optimized out>, y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:197
#10 0x000055d9cac8cba5 in add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield) ()
at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/stl_iterator.h:1083
</pre>
<p>Inspecting the mutex we see</p>
<pre>
(gdb) f 6
#6 LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d897c000, y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
264 /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h: No such file or directory.
(gdb) p this
$77 = (LazyFIFO * const) 0x55d9cc496dc0
(gdb) p this->m
$78 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 286845, __nusers = 1, __kind = 0, __spins = 0,
__elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000}`\004\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
</pre>
<p>Looking at the thread corresponding to the owner of the mutex</p>
<pre>
(gdb) thread 261
[Switching to thread 261 (Thread 0x7f9b16bde700 (LWP 286845))]
#0 0x00007f9ba0a7a45c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
(gdb) bt
#0 0x00007f9ba0a7a45c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1 0x00007f9b9ff35860 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from target:/lib64/libstdc++.so.6
#2 0x000055d9caf6c323 in ceph::common::RefCountedCond::wait (this=0x55d9d044d400)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/common/RefCountedObj.h:111
#3 RGWDataChangesLog::add_entry (this=0x55d9cc546200, dpp=<optimized out>, bucket_info=..., gen=..., shard_id=<optimized out>, y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_datalog.cc:689
#4 0x000055d9cac8cba5 in add_datalog_entry(DoutPrefixProvider const*, RGWDataChangesLog*, RGWBucketInfo const&, unsigned int, optional_yield) ()
at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/stl_iterator.h:1083
#5 0x000055d9cac9dcd8 in RGWRados::Bucket::UpdateIndex::complete_del(DoutPrefixProvider const*, long, unsigned long, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >&, std::__cxx11::list<rgw_obj_index_key, std::allocator<rgw_obj_index_key> >*, optional_yield) () at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_rados.cc:6374
#6 0x000055d9cacb041b in RGWRados::Object::Delete::delete_obj(optional_yield, DoutPrefixProvider const*) ()
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_rados.cc:5347
#7 0x000055d9cad18942 in rgw::sal::RadosObject::RadosDeleteOp::delete_obj (this=0x55d9d0b39a00, dpp=0x55d9d003a6c0, y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_sal_rados.cc:2273
#8 0x000055d9caa3f859 in RGWDeleteMultiObj::handle_individual_object(rgw_obj_key const&, optional_yield, boost::asio::basic_deadline_timer<boost::posix_time::ptime, boost::asio::time_traits<boost::posix_time::ptime>, boost::asio::executor>*) ()
at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:421
#9 0x000055d9caa404f0 in void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, RGWDeleteMultiObj::execute(optional_yield)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > > >)#3}, boost::context::basic_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t) () at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/common/async/yield_context.h:40
#10 0x000055d9cb2860af in make_fcontext ()
#11 0x0000000000000000 in ?? ()
</pre>
<p>Randomly sampling other threads blocked in lazy_init, the owner of the mutex still appears to be thread 286845:<br /><pre>
(gdb) thread 596
[Switching to thread 596 (Thread 0x7f9a6f28f700 (LWP 287180))]
#0 0x00007f9ba0a7d82d in __lll_lock_wait () from target:/lib64/libpthread.so.0
(gdb) f 6
#6 LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d31dab00, y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
264 /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h: No such file or directory.
(gdb) p this->m
$79 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 286845, __nusers = 1, __kind = 0, __spins = 0,
__elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000}`\004\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
(gdb) thread 595
[Switching to thread 595 (Thread 0x7f9a6fa90700 (LWP 287179))]
#0 0x00007f9ba0a7d82d in __lll_lock_wait () from target:/lib64/libpthread.so.0
(gdb) f 6
#6 LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d680c200, y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
264 in /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h
(gdb) p this->m
$80 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 286845, __nusers = 1, __kind = 0, __spins = 0,
__elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000}`\004\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
(gdb) thread 589
[Switching to thread 589 (Thread 0x7f9a72a96700 (LWP 287173))]
#0 0x00007f9ba0a7d82d in __lll_lock_wait () from target:/lib64/libpthread.so.0
(gdb) f 6
#6 LazyFIFO::lazy_init (this=this@entry=0x55d9cc496dc0, dpp=dpp@entry=0x55d9d4462200, y=...)
at /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h:264
264 in /usr/src/debug/ceph-18.2.0-1.rh8.cern.x86_64/src/rgw/driver/rados/rgw_log_backing.h
(gdb) p this->m
$81 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 286845, __nusers = 1, __kind = 0, __spins = 0,
__elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000}`\004\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
</pre></p>
<p>Is this by any chance fixed in main already? What steps can we take to help debug this problem further?</p>