Ceph : Issues
https://tracker.ceph.com/ (2023-11-02T23:45:33Z)

Ceph - Bug #63425 (Pending Backport): tasks.cephadm: ceph.log No such file or directory
https://tracker.ceph.com/issues/63425 (2023-11-02T23:45:33Z, Dan van der Ster)

cephadm tasks don't have a cluster log to egrep for ERR|WRN|SEC, e.g.:

http://qa-proxy.ceph.com/teuthology/teuthology-2023-10-27_14:23:02-upgrade:pacific-x-quincy-distro-default-smithi/7438907/teuthology.log
<pre>
2023-10-27T16:06:59.111 DEBUG:teuthology.orchestra.run.smithi150:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/38cc7fce-74d9-11ee-8db9-212e2dc638e7/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-10-27T16:06:59.141 INFO:teuthology.orchestra.run.smithi150.stderr:grep: /var/log/ceph/38cc7fce-74d9-11ee-8db9-212e2dc638e7/ceph.log: No such file or directory
</pre>

https://pulpito.ceph.com/teuthology-2023-10-28_14:23:03-upgrade:quincy-x-reef-distro-default-smithi/7439369/
<pre>
2023-10-28T15:59:53.486 DEBUG:teuthology.orchestra.run.smithi007:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/10bc8c0a-75a0-11ee-8db9-212e2dc638e7/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-10-28T15:59:53.517 INFO:teuthology.orchestra.run.smithi007.stderr:grep: /var/log/ceph/10bc8c0a-75a0-11ee-8db9-212e2dc638e7/ceph.log: No such file or directory
</pre>

mon_cluster_log_to_file is not true for the cephadm task.
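
A minimal sketch of enabling the cluster log file, assuming the standard mon_cluster_log_to_file / mon_cluster_log_file options (not something the cephadm task currently does):
<pre>
# ask the mons to write the cluster log to a file
ceph config set mon mon_cluster_log_to_file true
# check where the mons will write it
ceph config get mon mon_cluster_log_file
</pre>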

Orchestrator - Bug #63379 (New): cephadm: assumes logrotate is installed on the host
https://tracker.ceph.com/issues/63379 (2023-10-31T17:59:12Z, Dan van der Ster)

cephadm sets up a logrotate.d policy on the container host, but this is not useful if the logrotate package itself is not installed. We found a slim Debian variant (GardenLinux) that does not ship logrotate out of the box.

With log rotation broken like this, a Ceph host can run out of disk space and take down a cluster.
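
A minimal pre-flight sketch of the kind of check cephadm could do, assuming a Debian-family host (package name and paths are illustrative):
<pre>
# cephadm drops its policy into /etc/logrotate.d/, which only has an effect
# if the logrotate package (and its cron/timer entry) is actually present
if ! command -v logrotate >/dev/null 2>&1; then
    apt-get update && apt-get install -y logrotate
fi
</pre>
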
I didn't find where cephadm installs general package dependencies (other than podman). Can logrotate be added as a dependency somewhere?

RADOS - Bug #62588 (New): ceph config set allows WHO to be osd.*, which is misleading
https://tracker.ceph.com/issues/62588 (2023-08-25T17:43:32Z, Dan van der Ster)

We came across a customer cluster whose operators used `ceph config set osd.* ...` thinking it would apply to *all* OSDs. In fact it applies to zero OSDs, because no OSD is literally named `osd.*`. `ceph config set` should validate that WHO refers to a daemon that can actually exist, e.g. osd.<int> or mds.<string>.
<pre>
# ceph config set osd.* osd_max_backfills 3
# ceph config dump
WHO MASK LEVEL OPTION VALUE RO
...
# ceph config dump
WHO MASK LEVEL OPTION VALUE RO
...
osd advanced osd_max_backfills 10
osd advanced osd_recovery_sleep_hdd 0.000000
osd.* advanced osd_max_backfills 3
mds basic mds_cache_memory_limit 2147483648
</pre>
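
For reference, a sketch of the intended ways to target OSDs with `ceph config set` (the daemon id and mask shown are illustrative):
<pre>
# all OSDs: use the bare daemon type, not a glob
ceph config set osd osd_max_backfills 3
# a single daemon: name it explicitly
ceph config set osd.12 osd_max_backfills 3
# a subset: use a mask such as a device class
ceph config set osd/class:hdd osd_max_backfills 3
</pre>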

rgw - Documentation #58092 (New): rgw_enable_gc_threads / lc_threads not documented on web
https://tracker.ceph.com/issues/58092 (2022-11-28T11:42:22Z, Dan van der Ster)

Options rgw_enable_gc_threads and rgw_enable_lc_threads are not rendered on docs.ceph.com.

I would expect them to be documented at https://github.com/ceph/ceph/blob/main/doc/radosgw/config-ref.rst but they are not present.

Is this intentional? Can we add them to config-ref.rst for rgw?

bluestore - Bug #56488 (Resolved): BlueStore doesn't defer small writes for pre-pacific hdd osds
https://tracker.ceph.com/issues/56488 (2022-07-07T08:31:27Z, Dan van der Ster)

We're upgrading clusters from v15.2.16 to v16.2.9, and our simple "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something is very wrong with deferred writes in pacific. I attached a plot from an example cluster, upgraded today.

The OSDs are 12 TB HDDs, formatted in nautilus with the default bluestore_min_alloc_size_hdd = 64kB, and each has a large flash block.db.

I found that the performance regression is because, with the default config in pacific, 4 kB writes are no longer deferred from those pre-pacific HDDs to flash. Here are example bench writes from both releases: https://pastebin.com/raw/m0yL1H9Z

I worked out that the issue is fixed if I set bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default; note the default was 32k in octopus).
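
For illustration, the probe and the workaround we applied (the pool name is from the probe quoted above; whether the option takes effect without an OSD restart is an assumption):
<pre>
# single-threaded 4 KiB write latency probe, as quoted above
rados bench -p test 10 write -b 4096 -t 1
# raise the deferred-write threshold above the 64 KiB min_alloc_size
# these pre-pacific HDD OSDs were formatted with (value in bytes)
ceph config set osd bluestore_prefer_deferred_size_hdd 131072
</pre>
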
I think this is related to the fixes in #52089 ("Deferred writes are unexpectedly applied to large writes on spinners", https://tracker.ceph.com/issues/52089) which landed in 16.2.6: _do_alloc_write now compares the prealloc size 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000), and the "strictly less than" condition prevents deferred writes from ever happening.

So I think this would impact anyone upgrading clusters with mixed hdd/ssd OSDs.

Should we increase the default bluestore_prefer_deferred_size_hdd to 128 kB, or is there in fact a bug here?

RADOS - Bug #56386 (Can't reproduce): Writes to a cephfs after metadata pool snapshot causes inco...
https://tracker.ceph.com/issues/56386 (2022-06-24T07:26:40Z, Dan van der Ster)

If you take a snapshot of the metadata pool and then decrease max_mds, metadata objects become inconsistent. Removing the pool snapshot and deep scrubbing again clears the inconsistent objects.

Details:
<pre>
# ceph versions
{
"mon": {
"ceph version 16.2.9-1 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)": 4
},
"mgr": {
"ceph version 16.2.9-1 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)": 4
},
"osd": {
"ceph version 16.2.9-1 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)": 4
},
"mds": {
"ceph version 16.2.9-1 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)": 4
},
"overall": {
"ceph version 16.2.9-1 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)": 16
}
}
# ceph -s
cluster:
id: 03a1871e-32d8-4f3c-9be1-a7d7e9846205
health: HEALTH_OK
services:
mon: 4 daemons, quorum cephoctopus-1,cephoctopus-2,cephoctopus-3,cephoctopus-4 (age 6h)
mgr: cephoctopus-1(active, since 9d), standbys: cephoctopus-3, cephoctopus-4, cephoctopus-2
mds: 2/2 daemons up, 2 standby
osd: 4 osds: 4 up (since 2d), 4 in (since 16M)
data:
volumes: 1/1 healthy
pools: 9 pools, 106 pgs
objects: 16.91k objects, 12 GiB
usage: 39 GiB used, 41 GiB / 80 GiB avail
pgs: 106 active+clean
# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 80 GiB 41 GiB 39 GiB 39 GiB 48.50
TOTAL 80 GiB 41 GiB 39 GiB 39 GiB 48.50
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 1 3.8 KiB 1 12 KiB 0 11 GiB
test 2 32 3.3 MiB 847 159 MiB 0.48 11 GiB
cephfs.cephfs.meta 3 8 193 MiB 102 560 MiB 1.67 11 GiB
cephfs.cephfs.data 4 32 102 MiB 981 470 MiB 1.41 11 GiB
volumes 5 8 2.5 GiB 6.35k 7.8 GiB 19.60 11 GiB
images 6 8 5.5 GiB 750 17 GiB 33.98 11 GiB
backups 7 8 1.5 GiB 472 4.6 GiB 12.42 11 GiB
vms 8 8 19 B 1 192 KiB 0 11 GiB
testpool 10 1 2.3 GiB 7.42k 8.1 GiB 20.07 11 GiB
# rados mksnap -p cephfs.cephfs.meta testsnap1
created pool cephfs.cephfs.meta snap testsnap1
# ceph fs set cephfs max_mds 1
# sleep 60
# for i in {0..7}; do ceph pg deep-scrub 3.$i; done
instructing pg 3.0 on osd.0 to deep-scrub
instructing pg 3.1 on osd.3 to deep-scrub
instructing pg 3.2 on osd.1 to deep-scrub
instructing pg 3.3 on osd.3 to deep-scrub
instructing pg 3.4 on osd.0 to deep-scrub
instructing pg 3.5 on osd.3 to deep-scrub
instructing pg 3.6 on osd.0 to deep-scrub
instructing pg 3.7 on osd.0 to deep-scrub
2022-06-24T09:09:19.245382+0200 mon.cephoctopus-1 (mon.0) 106142 : cluster [INF] daemon mds.xxx-2 finished stopping rank 1 in filesystem cephfs (now has 1 ranks)
2022-06-24T09:09:33.385965+0200 osd.1 (osd.1) 923 : cluster [ERR] 3.2 shard 3 soid 3:54ba6923:::mds1_openfiles.0:1 : omap_digest 0xffffffff != omap_digest 0x22bb480a from shard 1, omap_digest 0xffffffff != omap_digest 0x22bb480a from auth oi 3:54ba6923:::mds1_openfiles.0:1(7845'910 osd.1.0:15 dirty|omap|data_digest|omap_digest s 0 uv 908 dd ffffffff od 22bb480a alloc_hint [0 0 0])
2022-06-24T09:09:33.386347+0200 osd.1 (osd.1) 924 : cluster [ERR] 3.2 deep-scrub 0 missing, 1 inconsistent objects
2022-06-24T09:09:33.386383+0200 osd.1 (osd.1) 925 : cluster [ERR] 3.2 deep-scrub 2 errors
2022-06-24T09:09:35.615407+0200 osd.3 (osd.3) 593 : cluster [ERR] 3.3 soid 3:c27f454c:::mds1_sessionmap:1 : omap_digest 0x95baa69c != omap_digest 0xffffffff from shard 3
2022-06-24T09:09:35.615418+0200 osd.3 (osd.3) 594 : cluster [ERR] 3.3 shard 2 soid 3:c27f454c:::mds1_sessionmap:1 : omap_digest 0xffffffff != omap_digest 0x95baa69c from auth oi 3:c27f454c:::mds1_sessionmap:1(7845'1553 osd.3.0:54 dirty|omap|data_digest|omap_digest s 0 uv 1412 dd ffffffff od 95baa69c alloc_hint [0 0 0])
2022-06-24T09:09:35.615425+0200 osd.3 (osd.3) 595 : cluster [ERR] 3.3 shard 3 soid 3:c27f454c:::mds1_sessionmap:1 : omap_digest 0xffffffff != omap_digest 0x95baa69c from auth oi 3:c27f454c:::mds1_sessionmap:1(7845'1553 osd.3.0:54 dirty|omap|data_digest|omap_digest s 0 uv 1412 dd ffffffff od 95baa69c alloc_hint [0 0 0])
</pre>
<pre>
# rados list-inconsistent-obj 3.2 | jq .
{
"epoch": 7835,
"inconsistents": [
{
"object": {
"name": "mds1_openfiles.0",
"nspace": "",
"locator": "",
"snap": 1,
"version": 908
},
"errors": [
"omap_digest_mismatch"
],
"union_shard_errors": [
"omap_digest_mismatch_info"
],
"selected_object_info": {
"oid": {
"oid": "mds1_openfiles.0",
"key": "",
"snapid": 1,
"hash": 3298188586,
"max": 0,
"pool": 3,
"namespace": ""
},
"version": "7845'910",
"prior_version": "7818'909",
"last_reqid": "osd.1.0:15",
"user_version": 908,
"size": 0,
"mtime": "2022-06-14T18:22:20.287883+0200",
"local_mtime": "2022-06-14T18:22:20.288351+0200",
"lost": 0,
"flags": [
"dirty",
"omap",
"data_digest",
"omap_digest"
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0xffffffff",
"omap_digest": "0x22bb480a",
"expected_object_size": 0,
"expected_write_size": 0,
"alloc_hint_flags": 0,
"manifest": {
"type": 0
},
"watchers": {}
},
"shards": [
{
"osd": 0,
"primary": false,
"errors": [],
"size": 0,
"omap_digest": "0x22bb480a",
"data_digest": "0xffffffff"
},
{
"osd": 1,
"primary": true,
"errors": [],
"size": 0,
"omap_digest": "0x22bb480a",
"data_digest": "0xffffffff"
},
{
"osd": 3,
"primary": false,
"errors": [
"omap_digest_mismatch_info"
],
"size": 0,
"omap_digest": "0xffffffff",
"data_digest": "0xffffffff"
}
]
}
]
}
</pre>
<pre>
# rados list-inconsistent-obj 3.3 | jq .
{
"epoch": 7758,
"inconsistents": [
{
"object": {
"name": "mds1_sessionmap",
"nspace": "",
"locator": "",
"snap": 1,
"version": 1412
},
"errors": [
"omap_digest_mismatch"
],
"union_shard_errors": [
"omap_digest_mismatch_info"
],
"selected_object_info": {
"oid": {
"oid": "mds1_sessionmap",
"key": "",
"snapid": 1,
"hash": 849542723,
"max": 0,
"pool": 3,
"namespace": ""
},
"version": "7845'1553",
"prior_version": "7511'1467",
"last_reqid": "osd.3.0:54",
"user_version": 1412,
"size": 0,
"mtime": "2022-05-11T11:47:08.738876+0200",
"local_mtime": "2022-05-11T11:47:08.741128+0200",
"lost": 0,
"flags": [
"dirty",
"omap",
"data_digest",
"omap_digest"
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0xffffffff",
"omap_digest": "0x95baa69c",
"expected_object_size": 0,
"expected_write_size": 0,
"alloc_hint_flags": 0,
"manifest": {
"type": 0
},
"watchers": {}
},
"shards": [
{
"osd": 0,
"primary": false,
"errors": [],
"size": 0,
"omap_digest": "0x95baa69c",
"data_digest": "0xffffffff"
},
{
"osd": 2,
"primary": false,
"errors": [
"omap_digest_mismatch_info"
],
"size": 0,
"omap_digest": "0xffffffff",
"data_digest": "0xffffffff"
},
{
"osd": 3,
"primary": true,
"errors": [
"omap_digest_mismatch_info"
],
"size": 0,
"omap_digest": "0xffffffff",
"data_digest": "0xffffffff"
}
]
}
]
}
</pre>
<pre>
# rados lssnap -p cephfs.cephfs.meta
1 testsnap1 2022.06.24 09:06:56
1 snaps
# rados rmsnap -p cephfs.cephfs.meta testsnap1
removed pool cephfs.cephfs.meta snap testsnap1
# rados lssnap -p cephfs.cephfs.meta
0 snaps
# for i in {0..7}; do ceph pg deep-scrub 3.$i; done
instructing pg 3.0 on osd.0 to deep-scrub
instructing pg 3.1 on osd.3 to deep-scrub
instructing pg 3.2 on osd.1 to deep-scrub
instructing pg 3.3 on osd.3 to deep-scrub
instructing pg 3.4 on osd.0 to deep-scrub
instructing pg 3.5 on osd.3 to deep-scrub
instructing pg 3.6 on osd.0 to deep-scrub
instructing pg 3.7 on osd.0 to deep-scrub
# ceph health
HEALTH_OK
</pre>

RADOS - Feature #55764 (New): Adaptive mon_warn_pg_not_deep_scrubbed_ratio according to actual sc...
https://tracker.ceph.com/issues/55764 (2022-05-25T12:12:56Z, Dan van der Ster)

This request comes from the Science Users Working Group: https://pad.ceph.com/p/Ceph_Science_User_Group_20220524

For clusters with very large, highly utilized OSDs and intensive client IO, the defaults behind the PG_NOT_SCRUBBED and PG_NOT_DEEP_SCRUBBED warnings can be too aggressive. That is, it is not always possible to scrub all PGs daily and deep scrub all PGs weekly. Such clusters raise warnings that PGs are not scrubbed in time, leading to operator confusion.

Factors which impact the rate at which a cluster can scrub PGs include (a back-of-envelope estimate follows the list):

- osd_max_scrubs (defaults to 1 per OSD)
- the amount of data to be scrubbed per OSD (which keeps growing, and can be over 15 TB nowadays)
- the rate at which an OSD can satisfy scrub reads (can be in the low tens of MB/s for large HDDs busy with client IO)
- the size of a PG: e.g. a replica=3 PG locks three OSDs for scrubs, whereas an EC 4+2 PG locks six OSDs
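
A rough sketch of the kind of estimate the mon could make, using illustrative numbers (15 TB of data on an OSD, an effective scrub read rate of 30 MB/s, osd_max_scrubs=1):
<pre>
# 15 TB / 30 MB/s ~= 500,000 s, i.e. roughly 5-6 days for one full
# deep-scrub pass of a single OSD, before any client-IO-driven backoff
echo $(( 15 * 1000 * 1000 / 30 / 86400 ))   # -> 5 (days, integer arithmetic)
</pre>
With numbers like these, a 7-day deep-scrub interval leaves almost no headroom, and the warning ratio fires routinely.
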
Would it be possible for the MON to use an adaptive approach to issuing scrub timeout warnings? E.g. the mon could scale the mon_warn_pg_not_deep_scrubbed_ratio configs according to the above parameters, or perhaps by monitoring the actual time taken to complete scrubs. Note that the wallclock time to scrub a given PG should be uniform across a pool, but will vary widely from pool to pool (e.g. empty pools can be scrubbed quickly).

mgr - Feature #55303 (New): pg autoscaler: only warn about changes that will take many days
https://tracker.ceph.com/issues/55303 (2022-04-12T19:52:59Z, Dan van der Ster)

Motivation: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/Z2UPETNARNVEPTYYA5Q6J5QBCUWKWTZ2/

> We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting rebalancing of misplaced objects is overwhelming the cluster and impacting MON DB compaction, deep scrub repairs and us upgrading legacy bluestore OSDs. We have to pause the rebalancing of misplaced objects or we're going to fall over.
>
> Autoscaler-status tells us that we are reducing our PGs by 700'ish which will take us over 100 days to complete at our current recovery speed.

The autoscaler should not trigger such changes behind the operator's back. I propose that it should estimate the amount of time needed to carry out a split or merge operation, and only "warn" if that operation would take longer than a day (configurable), even if pg_autoscale_mode is on. Seeing the HEALTH_WARN, the operator can then schedule and carry out the pg split or merge at a time that suits their operations.
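
Until then, a sketch of how an operator can keep the autoscaler advisory-only (pool name is illustrative):
<pre>
# per pool: report the suggested pg_num change as a warning instead of applying it
ceph osd pool set mypool pg_autoscale_mode warn
# default for newly created pools
ceph config set global osd_pool_default_pg_autoscale_mode warn
</pre>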

RADOS - Feature #55169 (In Progress): crush: should validate rule outputs osds
https://tracker.ceph.com/issues/55169 (2022-04-04T09:16:35Z, Dan van der Ster)

In this thread https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2ZUJN75RLL4YYD4EHAUS5I4IL37A7UUL/ a user suffered a multi-day outage, with down PGs and OSDs crashing due to "start interval does not contain the required bound".

After a long story, the root cause was found to be that the user had injected a crush rule that had "choose" instead of "chooseleaf".
<pre>
rule csd-data-pool {
id 5
type erasure
min_size 3
max_size 5
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class big
step choose indep 0 type host <--- HERE!
step emit
}
</pre>
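
For reference, a sketch of how such a rule can be sanity-checked offline today with crushtool (rule id 5 and replica count are taken from the rule above; exact output varies):
<pre>
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt        # decompile and inspect the rule
crushtool -i crush.bin --test --rule 5 --num-rep 5 --show-mappings | head
crushtool -i crush.bin --test --rule 5 --num-rep 5 --show-bad-mappings
# a rule that emits buckets (hosts) instead of OSDs should stand out here
</pre>
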
Can we add better validation to prevent such mistakes?

Linux kernel client - Bug #55090 (New): mounting subvolume shows size/used bytes for entire fs, n...
https://tracker.ceph.com/issues/55090 (2022-03-28T14:51:47Z, Dan van der Ster)

When mounting a subvolume at its base directory, the kernel client correctly shows the size/usage of the subvolume:
<pre>
Filesystem Size Used Avail Use% Mounted on
xxx:6789:/volumes/_nogroup/4db8f9a6-926b-4306-8a6d-0e1b897c1d2f/d2ef5fea-040c-4ec1-b1bb-66073f9fc8ac 8.8T 0 8.8T 0% /cephfs
</pre>

However, if the client mounts a subdirectory of the subvolume, they see the size/usage of the entire cephfs:
<pre>
Filesystem Size Used Avail Use% Mounted on
xxx:6789:/volumes/_nogroup/4db8f9a6-926b-4306-8a6d-0e1b897c1d2f/d2ef5fea-040c-4ec1-b1bb-66073f9fc8ac/my/subdir 1.3P 430T 860T 34% /var/lib/service
</pre>

`ceph-fuse` does not have this behaviour: mounting at a subdir below the subvolume shows the "correct" subvolume size and usage.

RADOS - Feature #54525 (New): osd/mon: log memory usage during tick
https://tracker.ceph.com/issues/54525 (2022-03-10T21:15:46Z, Dan van der Ster)

The MDS has a nice feature: it prints the rss and other memory stats every couple of seconds at debug level 2.
<pre>
2022-03-10T22:13:50.779+0100 7f00f85aa700 2 mds.0.cache Memory usage: total 6652188, rss 5644928, heap 331992, baseline 307416, 599334 / 1484019 inodes have caps, 629138 caps, 0.423942 caps per inode
</pre>

Similar logging for the OSD (and MON, less urgently) would be ultra useful when debugging things like #53729 ("ceph-osd takes all memory before oom on boot", https://tracker.ceph.com/issues/53729).
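
Today this information is only available as on-demand spot checks, not as periodic log lines; a couple of commands that expose it (daemon names are illustrative):
<pre>
ceph daemon osd.0 dump_mempools    # per-mempool byte/item counts for the OSD
ceph tell osd.0 heap stats         # allocator-level heap statistics
</pre>
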
The MDS uses a MemoryModel class (see MDCache::check_memory_usage), so this should be pretty easy to copy.

rgw - Bug #54500 (Resolved): Trim olh entries with empty name from bi
https://tracker.ceph.com/issues/54500 (2022-03-08T20:11:31Z, Dan van der Ster)

Is there any legitimate use-case for an olh entry with key.name == ""? If not, let's trim them, e.g. during reshard, because...

#46456 ("OLH entries pending removal get mistakenly resharded to shard 0", https://tracker.ceph.com/issues/46456) had the effect of leaving several olh entries in shard 0 with an empty name. Historically, these polluted buckets with versioning / lc expiration, and it's difficult or impossible for an operator to clean them up with the existing cli tooling.

For detail, here's the sort of entry we are trying to remove from a bucket index (the idx prefix is 0x80, the "ugly namespace"):
<pre>
{
"type": "olh",
"idx": "�1001_tmp/uploads/1544726371-28197-0001-2685-9890cb8c67e31032c7660031601b8dcc",
"entry": {
"key": {
"name": "",
"instance": ""
},
"delete_marker": "false",
"epoch": 7,
"pending_log": [],
"tag": "uemrg479c5wg78kcrq7ezeb4mhc616vg",
"exists": "false",
"pending_removal": "true"
}
}
</pre>
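
A sketch of how an operator could at least find such entries today (bucket name is illustrative; bi list output is an array of entries shaped like the one above):
<pre>
radosgw-admin bi list --bucket=mybucket \
  | jq '.[] | select(.type == "olh" and .entry.key.name == "")'
</pre>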

RADOS - Bug #54396 (Resolved): Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears t...
https://tracker.ceph.com/issues/54396 (2022-02-24T08:41:47Z, Dan van der Ster)

See https://www.spinics.net/lists/ceph-users/msg71061.html
<pre>
This time around, after a few hours of snaptrimming, users complained of high IO
latency, and indeed Ceph reported "slow ops" on a number of OSDs and on the
active MDS. I attributed this to the snaptrimming and decided to reduce it by
initially setting osd_pg_max_concurrent_snap_trims to 1, which didn't seem to
help much, so I then set it to 0, which had the surprising effect of
transitioning all PGs back to active+clean (is this intended?). I also restarted
the MDS which seemed to be struggling. IO latency went back to normal
immediately.
</pre>

In the code, when osd_pg_max_concurrent_snap_trims is 0, PrimaryLogPG::AwaitAsyncWork::react(const DoSnapWork&) calls pg->snap_mapper.get_next_objects_to_trim looking for 0 snaps to trim. pg->snap_mapper.get_next_objects_to_trim returns ENOENT in this case, and DoSnapWork then erases the remaining snap_to_trim.
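
For comparison, a sketch of how snap trimming can be paused or throttled without setting the concurrency to 0 (the sleep value is illustrative):
<pre>
ceph osd set nosnaptrim                      # pause snap trimming cluster-wide
ceph config set osd osd_snap_trim_sleep 2.0  # or throttle it instead
ceph osd unset nosnaptrim
</pre>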

Ceph - Bug #54385 (Fix Under Review): better test mon and osd smart command
https://tracker.ceph.com/issues/54385 (2022-02-23T14:50:03Z, Dan van der Ster)

It appears the daemon smart command from mon and osd is not directly tested.

bluestore - Support #54315 (New): 1 fsck error per osd during nautilus -> octopus upgrade (S3 clu...
https://tracker.ceph.com/issues/54315 (2022-02-17T17:23:37Z, Dan van der Ster)

At the end of the conversion to per-pool omap, around half of our OSDs reported 1 error, but the log didn't show what the error was.
<pre>
2022-02-15T16:02:16.554+0100 7fdfde8d8f00 0 bluestore(/var/lib/ceph/osd/ceph-1247) _fsck_check_objects partial offload, done myself 7925084 of 7942492objects, threads 2
2022-02-15T16:02:16.678+0100 7fdfde8d8f00 1 bluestore(/var/lib/ceph/osd/ceph-1247) _fsck_on_open checking shared_blobs
2022-02-15T16:02:16.693+0100 7fdfde8d8f00 1 bluestore(/var/lib/ceph/osd/ceph-1247) _fsck_on_open checking pool_statfs
2022-02-15T16:17:37.407+0100 7fdfde8d8f00 1 bluestore(/var/lib/ceph/osd/ceph-1247) _fsck_on_open <<<FINISH>>> with 1 errors, 318 warnings, 319 repaired, 0 remaining in 1672.130946 seconds
</pre>

Full log is posted: ceph-post-file: 82f661a7-b10f-4a80-acaf-37f1268f275e
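
For anyone hitting the same thing, a sketch of how to chase the error offline (OSD id/path are from the log above; whether a standalone fsck reproduces it is an assumption):
<pre>
systemctl stop ceph-osd@1247
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1247
# the on-mount behaviour during the upgrade is governed by, e.g.:
ceph config get osd bluestore_fsck_quick_fix_on_mount
ceph config get osd bluestore_fsck_on_mount_deep
</pre>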