Ceph : Issues - https://tracker.ceph.com/ - 2023-10-31T17:59:12Z
Orchestrator - Bug #63379 (New): cephadm: assumes logrotate is installed on the host
https://tracker.ceph.com/issues/63379 - 2023-10-31T17:59:12Z - Dan van der Ster
<p>cephadm sets up logrotate.d on a container host, but this is not useful if the logrotate package is not installed.<br />We found a slim Debian variant (GardenLinux) that does not install logrotate ootb.</p>
<p>With logrotate broken like this, a ceph host can run out of disk space and take down a cluster.</p>
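<p>For illustration, a preflight check along these lines could catch the problem at deploy time (a minimal sketch only; the function and where it would be called from are hypothetical, not cephadm's actual code):</p>
<pre>
# Hypothetical sketch of a cephadm-style preflight check (not cephadm's
# actual code): verify that the logrotate binary exists before writing
# an /etc/logrotate.d/ snippet for the cluster.
import logging
import shutil

logger = logging.getLogger(__name__)

def check_logrotate_installed() -> bool:
    """Return True if logrotate is available on the host."""
    path = shutil.which('logrotate')
    if path is None:
        logger.warning(
            'logrotate not found on this host; '
            '/etc/logrotate.d entries written by cephadm will never run '
            'and ceph logs may eventually fill the disk')
        return False
    logger.debug('found logrotate at %s', path)
    return True
</pre>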
<p>I didn't find where cephadm installs general package dependencies (other than podman).<br />Can this be added as a dep somewhere?</p>

RADOS - Bug #62588 (New): ceph config set allows WHO to be osd.*, which is misleading
https://tracker.ceph.com/issues/62588 - 2023-08-25T17:43:32Z - Dan van der Ster
<p>We came across a customer cluster whose operator ran `ceph config set osd.* ...` thinking it would apply to <strong>all</strong> OSDs. <br />In fact it applies to zero OSDs, because there is no OSD literally named `osd.*`.<br />`ceph config set` should validate that WHO refers to a daemon that can exist, e.g. osd.<int> or mds.<string>.</p>
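<p>A sketch of the kind of WHO validation proposed, i.e. rejecting names that cannot refer to a real daemon (hypothetical code, not the monitor's implementation; the accepted patterns are only illustrative):</p>
<pre>
import re

# Illustrative patterns for config WHO targets: a bare daemon type, or a
# daemon type plus a concrete id (osd ids are integers, mds/mgr/mon/client
# names are strings). Anything else, e.g. "osd.*", would be rejected.
VALID_WHO = [
    re.compile(r'^(global|mon|mgr|osd|mds|client)$'),
    re.compile(r'^osd\.\d+$'),
    re.compile(r'^(mds|mgr|mon|client)\.[A-Za-z0-9._-]+$'),
]

def who_is_plausible(who: str) -> bool:
    """Return True if WHO could refer to a daemon that can actually exist."""
    return any(p.match(who) for p in VALID_WHO)

assert who_is_plausible('osd.3')
assert who_is_plausible('mds.cephfs-a')
assert not who_is_plausible('osd.*')   # the misleading case from this report
</pre>
<p>The reproduction below shows the current behaviour, where the bogus `osd.*` target is silently accepted into the config database:</p>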
<pre>
# ceph config set osd.* osd_max_backfills 3
# ceph config dump
WHO MASK LEVEL OPTION VALUE RO
...
# ceph config dump
WHO MASK LEVEL OPTION VALUE RO
...
osd advanced osd_max_backfills 10
osd advanced osd_recovery_sleep_hdd 0.000000
osd.* advanced osd_max_backfills 3
mds basic mds_cache_memory_limit 2147483648
</pre>

rgw - Documentation #58092 (New): rgw_enable_gc_threads / lc_threads not documented on web
https://tracker.ceph.com/issues/58092 - 2022-11-28T11:42:22Z - Dan van der Ster
<p>Options rgw_enable_gc_threads and rgw_enable_lc_threads are not rendered for docs.ceph.com.</p>
<p>I would expect those to be documented at <a class="external" href="https://github.com/ceph/ceph/blob/main/doc/radosgw/config-ref.rst">https://github.com/ceph/ceph/blob/main/doc/radosgw/config-ref.rst</a> but they are not present.</p>
<p>Is this intentional? Can we add them to config-ref.rst for rgw?</p>

RADOS - Feature #55764 (New): Adaptive mon_warn_pg_not_deep_scrubbed_ratio according to actual sc...
https://tracker.ceph.com/issues/55764 - 2022-05-25T12:12:56Z - Dan van der Ster
<p>This request comes from the Science Users Working Group <a class="external" href="https://pad.ceph.com/p/Ceph_Science_User_Group_20220524">https://pad.ceph.com/p/Ceph_Science_User_Group_20220524</a></p>
<p>For clusters with very large OSDs with high space usage and intensive client IO, the defaults related to the PG_NOT_SCRUBBED and PG_NOT_DEEP_SCRUBBED warnings can be too aggressive.<br />That is, it is not always possible to scrub all PGs daily and to deep-scrub all PGs weekly.<br />Such clusters raise warnings that PGs are not scrubbed in time, leading to operator confusion.</p>
Factors which impact the rate at which a cluster can scrub PGs might include (a rough estimate combining them is sketched after the list):
<ul>
<li>osd_max_scrubs (defaults to 1 per OSD)</li>
<li>the amount of data to be scrubbed per OSD (which is increasing, can be over 15TB nowadays).</li>
<li>the rate at which an OSD can satisfy scrub reads (can be in the low 10s of MBps for large HDDs busy with client IO).</li>
<li>the size of a PG: E.g. a replica=3 PG locks three OSDs for scrubs, whereas an EC4+2 PG locks six OSDs.</li>
</ul>
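<p>As a rough illustration of how these factors combine, the expected wallclock time to deep-scrub one OSD's worth of data could be estimated as follows (a back-of-the-envelope sketch; the helper and the example numbers are hypothetical):</p>
<pre>
def estimated_deep_scrub_days(bytes_per_osd: float,
                              scrub_read_bytes_per_s: float,
                              osd_max_scrubs: int = 1,
                              scrub_duty_cycle: float = 1.0) -> float:
    """Rough wallclock days needed to deep-scrub everything on one OSD."""
    effective_rate = scrub_read_bytes_per_s * osd_max_scrubs * scrub_duty_cycle
    return bytes_per_osd / effective_rate / 86400

# Example: a 15 TB HDD that sustains ~20 MB/s of scrub reads while busy with
# client IO needs close to 9 days for a full deep-scrub pass, i.e. the
# default weekly expectation cannot be met.
print(round(estimated_deep_scrub_days(15e12, 20e6), 1))
</pre>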
<p>Would it be possible for the MON to use an adaptive approach to issuing scrub timeout warnings? E.g. the mon could scale the mon_warn_pg_not_deep_scrubbed_ratio configs according to the above parameters, or perhaps by monitoring the actual time used to complete scrubs.<br />Note that the wallclock time to scrub a given PG should be uniform within a pool, but would vary widely from pool to pool (i.e. empty pools can be scrubbed quickly).</p>

mgr - Feature #55303 (New): pg autoscaler: only warn about changes that will take many days
https://tracker.ceph.com/issues/55303 - 2022-04-12T19:52:59Z - Dan van der Ster
<p>Motivation: <a class="external" href="https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/Z2UPETNARNVEPTYYA5Q6J5QBCUWKWTZ2/">https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/Z2UPETNARNVEPTYYA5Q6J5QBCUWKWTZ2/</a></p>
<blockquote>
<p>We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting rebalancing of<br />misplaced objects is overwhelming the cluster and impacting MON DB compaction, deep scrub<br />repairs and us upgrading legacy bluestore OSDs. We have to pause the rebalancing of<br />misplaced objects or we're going to fall over.</p>
<p>Autoscaler-status tells us that we are reducing our PGs by 700'ish which will take us<br />over 100 days to complete at our current recovery speed</p>
</blockquote>
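<p>For scale, the estimate quoted above amounts to the data a PG change would move divided by the observed recovery throughput; a hypothetical sketch of the check proposed below:</p>
<pre>
def pg_change_duration_days(objects_to_move: int,
                            avg_object_bytes: float,
                            recovery_bytes_per_s: float) -> float:
    """Rough days needed to complete a pg split/merge at the observed recovery rate."""
    return objects_to_move * avg_object_bytes / recovery_bytes_per_s / 86400

# If the estimate exceeds a configurable threshold (e.g. 1 day), the
# autoscaler would only raise a HEALTH_WARN instead of acting on its own.
WARN_THRESHOLD_DAYS = 1.0

def autoscaler_should_only_warn(estimated_days: float) -> bool:
    return estimated_days > WARN_THRESHOLD_DAYS
</pre>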
<p>The autoscaler should not trigger such changes behind the back of the operator.<br />I propose that it should estimate the amount of time to carry out a split or merge operation, and only "warn" if that operation would take longer than a day (configurable), even if autoscale_mode is on.<br />Seeing the HEALTH_WARN, the operator can then schedule and carry out the pg split or merge at a time that suits their operations.</p>

Linux kernel client - Bug #55090 (New): mounting subvolume shows size/used bytes for entire fs, n...
https://tracker.ceph.com/issues/55090 - 2022-03-28T14:51:47Z - Dan van der Ster
<p>When mounting a subvolume at the base dir of the subvolume, the kernel client correctly shows the size/usage of a subvol:<br /><pre>
Filesystem Size Used Avail Use% Mounted on
xxx:6789:/volumes/_nogroup/4db8f9a6-926b-4306-8a6d-0e1b897c1d2f/d2ef5fea-040c-4ec1-b1bb-66073f9fc8ac 8.8T 0 8.8T 0% /cephfs
</pre></p>
<p>However if the client mounts a subdir of the subvolume, they see the size/usage of the entire cephfs:<br /><pre>
Filesystem Size Used Avail Use% Mounted on
xxx:6789:/volumes/_nogroup/4db8f9a6-926b-4306-8a6d-0e1b897c1d2f/d2ef5fea-040c-4ec1-b1bb-66073f9fc8ac/my/subdir 1.3P 430T 860T 34% /var/lib/service
</pre></p>
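<p>The discrepancy can also be reproduced programmatically: statvfs() at the subvolume-root mount and at the subdir mount report different filesystem totals (a quick sketch using the mount points from the df output above, assuming both are present on one host):</p>
<pre>
import os

# One mount at the subvolume root, one at a subdir below the subvolume,
# as in the df output above.
for path in ('/cephfs', '/var/lib/service'):
    st = os.statvfs(path)
    size = st.f_blocks * st.f_frsize
    avail = st.f_bavail * st.f_frsize
    print(f'{path}: size={size / 2**40:.1f} TiB avail={avail / 2**40:.1f} TiB')

# With the kernel client, the subdir mount reports the totals of the entire
# cephfs rather than the subvolume quota.
</pre>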
<p>`ceph-fuse` does not have this behaviour -- mounting at a subdir below the subvolume shows the "correct" subvolume size and usage.</p>

RADOS - Feature #54525 (New): osd/mon: log memory usage during tick
https://tracker.ceph.com/issues/54525 - 2022-03-10T21:15:46Z - Dan van der Ster
<p>The MDS has a nice feature that it prints out the rss and other memory stats every couple seconds at debug level 2.<br /><pre>
2022-03-10T22:13:50.779+0100 7f00f85aa700 2 mds.0.cache Memory usage: total 6652188, rss 5644928, heap 331992, baseline 307416, 599334 / 1484019 inodes have caps, 629138 caps, 0.423942 caps per inode
</pre></p>
<p>Similar logging for the OSD (and MON, less urgently) would be ultra useful when debugging things like <a class="issue tracker-1 status-3 priority-6 priority-high2 closed" title="Bug: ceph-osd takes all memory before oom on boot (Resolved)" href="https://tracker.ceph.com/issues/53729">#53729</a></p>
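<p>What "memory usage during tick" boils down to is reading the daemon's own counters and printing them periodically; on Linux these are available in /proc/self/status. A minimal sketch (hypothetical helper, not Ceph code):</p>
<pre>
import re

def memory_usage_kb() -> dict:
    """Read VmSize/VmRSS/VmHWM (in kB) for the current process from /proc."""
    stats = {}
    with open('/proc/self/status') as f:
        for line in f:
            m = re.match(r'(VmSize|VmRSS|VmHWM):\s+(\d+) kB', line)
            if m:
                stats[m.group(1)] = int(m.group(2))
    return stats

# Logged once per tick, this gives the same at-a-glance history that the MDS
# "Memory usage: total ... rss ..." lines provide.
print(memory_usage_kb())
</pre>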
<p>The MDS uses a MemoryModel class, see MDCache::check_memory_usage, so this should be pretty easy to copy.</p>

bluestore - Support #54315 (New): 1 fsck error per osd during nautilus -> octopus upgrade (S3 clu...
https://tracker.ceph.com/issues/54315 - 2022-02-17T17:23:37Z - Dan van der Ster
<p>At the end of the conversion to per-pool omap, around half of our OSDs had 1 error, but the log didn't show the error.</p>
<pre>
2022-02-15T16:02:16.554+0100 7fdfde8d8f00 0 bluestore(/var/lib/ceph/osd/ceph-1247) _fsck_check_objects partial offload, done myself 7925084 of 7942492objects, threads 2
2022-02-15T16:02:16.678+0100 7fdfde8d8f00 1 bluestore(/var/lib/ceph/osd/ceph-1247) _fsck_on_open checking shared_blobs
2022-02-15T16:02:16.693+0100 7fdfde8d8f00 1 bluestore(/var/lib/ceph/osd/ceph-1247) _fsck_on_open checking pool_statfs
2022-02-15T16:17:37.407+0100 7fdfde8d8f00 1 bluestore(/var/lib/ceph/osd/ceph-1247) _fsck_on_open <<<FINISH>>> with 1 errors, 318 warnings, 319 repaired, 0 remaining in 1672.130946 seconds
</pre>
<p>Full log is posted: ceph-post-file: 82f661a7-b10f-4a80-acaf-37f1268f275e</p>

Ceph - Bug #54314 (New): OSD slow ops during Windows VM upgrades
https://tracker.ceph.com/issues/54314 - 2022-02-17T17:18:20Z - Dan van der Ster
<p>When our Windows team upgrades several tens of Windows VMs concurrently, this triggers thousands of slow ops on our OSDs, which in normal operation are performant and free of slow ops.<br />Cluster is running 15.2.15. HDD block, SSD block.db. HDDs have a media cache enabled (write-through cache).</p>
<p>I captured one osd like this just now. I posted logs to ceph-post-file: 04dd95b1-6227-4423-b5e0-e3a769903d19<br />There you have `ops` (the ongoing ops with slow reqs) the osd log with debug_bluestore=20 for around 10s, then the ops around 30s later (`ops.after`).</p>
<p>For example, lots of ops like this:</p>
<pre><code>"description": "osd_op(client.769991852.0:52672435 4.111d 4.b51a311d (undecoded) ondisk+write+known_if_redirected e3973877)",<br />"age": 143.48358313400001</code></pre>
<p>We're already aware that Windows VMs send sub-4k writes -- triggering a read-modify-write cycle on the 4096b OSDs. We are pushing out a change so the Windows VMs use emulated 4k sectors. Does this explain the entire performance issue?</p>
<p>In the `ops` file I posted, there are only three write ops (but lots of repops where I don't see the size). One of the writes is 3k:</p>
<pre><code>"description": "osd_op(client.769991852.0:52672421 4.111d 4:b88c58ad:::rbd_data.e5584f61a3e0e9.0000000000001362:head [write 585728~3072 in=3072b] snapc 1d3ca=[1d3ca,1d382] ondisk+write+known_if_redirected e3973877)",<br />"description": "osd_op(client.678845326.0:134585512 4.1ced 4:b73ec2ef:::rbd_data.6ff73a60bf4abc.0000000000001823:head [write 630784~409088 in=409088b] snapc 0=[] ondisk+write+known_if_redirected e3973877)",<br />"description": "osd_op(client.252924738.0:665572961 4.1ced 4:b73b9b4c:::rbd_data.67eac959e2bbdf.0000000000000309:head [write 2408448~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e3973877)",</code></pre>
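<p>A quick way to pull the write sizes out of an ops dump like the one above is to match the `[write offset~length ...]` portion of each description (a small sketch; the input file name is hypothetical, e.g. the saved output of `ceph daemon osd.N ops`):</p>
<pre>
import json
import re

# Parse a saved ops dump and count the writes below 4 KiB seen in the
# op descriptions.
write_re = re.compile(r'\[write \d+~(\d+)')

with open('ops.json') as f:
    ops = json.load(f)['ops']

sizes = []
for op in ops:
    m = write_re.search(op.get('description', ''))
    if m:
        sizes.append(int(m.group(1)))

small = sum(1 for s in sizes if s < 4096)
print(f'{len(sizes)} writes, {small} of them below 4 KiB')
</pre>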
<p>The osd debug_bluestore log shows plenty of <4k writes, though.</p>

CephFS - Bug #53645 (New): MDCache::shutdown_pass: ceph_assert(!migrator->is_importing())
https://tracker.ceph.com/issues/53645 - 2021-12-16T16:48:36Z - Dan van der Ster
<p>I'm running a pinning/multimds thrash test (see stressfs.sh attached) on a 3 node test cluster and occasionally seeing this crash while stopping:</p>
<pre>
-10> 2021-12-16 15:50:15.760 7f4ab1070700 7 mds.2.cache shutdown_pass
-9> 2021-12-16 15:50:15.760 7f4ab1070700 10 mds.2.cache shutdown_export_strays 0x61d ''
-8> 2021-12-16 15:50:15.760 7f4ab1070700 7 mds.2.cache trim bytes_used=41kB limit=2GB reservation=0.05% count=18446744073709551615
-7> 2021-12-16 15:50:15.760 7f4ab1070700 7 mds.2.cache trim_lru trimming 18446744073709551615 items from LRU size=10 mid=0 pintail=10 pinned=10
-6> 2021-12-16 15:50:15.760 7f4ab1070700 7 mds.2.cache trim_lru trimmed 0 items
-5> 2021-12-16 15:50:15.760 7f4ab1070700 5 mds.2.cache lru size now 10/0
-4> 2021-12-16 15:50:15.760 7f4ab1070700 7 mds.2.cache looking for subtrees to export to mds0
-3> 2021-12-16 15:50:15.760 7f4ab1070700 10 mds.2.log trim_all: 1/0/0
-2> 2021-12-16 15:50:15.760 7f4ab1070700 10 mds.2.log _trim_expired_segments waiting for 20758/71894891982 to expire
-1> 2021-12-16 15:50:15.761 7f4ab1070700 -1 /builddir/build/BUILD/ceph-14.2.22/src/mds/MDCache.cc: In function 'bool MDCache::shutdown_pass()' thread 7f4ab1070700 time 2021-12-16 15:50:15.760517
/builddir/build/BUILD/ceph-14.2.22/src/mds/MDCache.cc: 7786: FAILED ceph_assert(!migrator->is_importing())
ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7f4abcc3282a]
2: (()+0x274a44) [0x7f4abcc32a44]
3: (MDCache::shutdown_pass()+0x13a7) [0x55d286352e87]
4: (MDSRankDispatcher::tick()+0x2a8) [0x55d2862375c8]
5: (FunctionContext::finish(int)+0x30) [0x55d2862212e0]
6: (Context::complete(int)+0xd) [0x55d28621f46d]
7: (SafeTimer::timer_thread()+0x19c) [0x7f4abcd0e76c]
8: (SafeTimerThread::entry()+0x11) [0x7f4abcd10171]
9: (()+0x817a) [0x7f4abaa1217a]
10: (clone()+0x43) [0x7f4ab9466d83]
</pre>
<p>mds log is at ceph-post-file: 358f9d7b-5953-4a7d-818a-092d1a645b3c</p>
<p>There have been some changes in shutdown_pass related to ephemeral pinning, so we'll try this on a pacific test cluster next to see if it is still a bug.</p>

CephFS - Documentation #53558 (New): Document cephfs recursive accounting
https://tracker.ceph.com/issues/53558 - 2021-12-09T09:43:08Z - Dan van der Ster
<p>I cannot find any user documentation for cephfs recursive accounting.</p>
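<p>For context, the feature exposes recursive statistics as virtual extended attributes on every directory (ceph.dir.rbytes, ceph.dir.rfiles, ceph.dir.rsubdirs, ceph.dir.rentries, ceph.dir.rctime, ...). A small example of reading them (the path is hypothetical; any directory inside a CephFS mount works):</p>
<pre>
import os

# Read CephFS recursive accounting attributes for a directory.
path = '/cephfs/some/dir'
for attr in ('ceph.dir.rbytes', 'ceph.dir.rfiles', 'ceph.dir.rsubdirs'):
    value = os.getxattr(path, attr).decode()
    print(f'{attr} = {value}')
</pre>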
<p>We should add something similar to <a class="external" href="https://blog.widodh.nl/2015/04/playing-with-cephfs-recursive-statistics/">https://blog.widodh.nl/2015/04/playing-with-cephfs-recursive-statistics/</a></p>

Ceph - Feature #53140 (New): HEALTH_WARN about known issues of running version
https://tracker.ceph.com/issues/53140 - 2021-11-03T14:47:35Z - Dan van der Ster
<p>If the running version has a known issue (of configurable severity?), issue a health warning.<br />The proposed implementation would query an xml or json of known issues, keyed by version, e.g. 16.2.6-0.</p>

Ceph - Bug #52392 (New): crashes rebooting colocated mds/mgr with mon at boot
https://tracker.ceph.com/issues/52392 - 2021-08-24T12:49:27Z - Dan van der Ster
<p>I just updated a box in a test cluster to 15.2.14, then rebooted the box.<br />This box has MON/MGR/MDS/OSD all co-located.</p>
<p>The mgr and mds both crashed similarly while establishing a mon client, but they restarted correctly the second attempt.<br />Here are the crash infos -- for some reason I cannot post them (:</p>
<pre>
# ceph crash ls
ID ENTITY NEW
2021-08-24T12:39:17.798297Z_a95ebe7d-8043-4e23-b647-2a68ba0f5b8c mds.cephoctopus-2 *
2021-08-24T12:39:18.569890Z_3dae9cef-32b3-4eca-8630-22d8f8c92aca mgr.cephoctopus-2 *
# ceph crash post
malformed crash metadata: Expecting value: line 1 column 1 (char 0)
# ceph crash info 2021-08-24T12:39:17.798297Z_a95ebe7d-8043-4e23-b647-2a68ba0f5b8c
{
"backtrace": [
"(()+0x12b20) [0x7fa222114b20]",
"(MonClient::_add_conn(unsigned int)+0x1e1) [0x7fa2236541f1]",
"(MonClient::_add_conns()+0x2c9) [0x7fa223654969]",
"(MonClient::_reopen_session(int)+0x2b0) [0x7fa22365af70]",
"(MonClient::authenticate(double)+0x265) [0x7fa22365cbe5]",
"(MDSDaemon::init()+0x11d) [0x5566d9c6dc0d]",
"(main()+0xf86) [0x5566d9c5e7d6]",
"(__libc_start_main()+0xf3) [0x7fa220b4a493]",
"(_start()+0x2e) [0x5566d9c645ae]"
],
"ceph_version": "15.2.14",
"crash_id": "2021-08-24T12:39:17.798297Z_a95ebe7d-8043-4e23-b647-2a68ba0f5b8c",
"entity_name": "mds.cephoctopus-2",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mds",
"stack_sig": "f4399fc7e37a98751c21745d38e813191cf2c681fbf1a72ca37061c80aa2e013",
"timestamp": "2021-08-24T12:39:17.798297Z",
"utsname_hostname": "cephoctopus-2.cern.ch",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-193.14.2.el8_2.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Sun Jul 26 03:54:29 UTC 2020"
}
# ceph crash info 2021-08-24T12:39:18.569890Z_3dae9cef-32b3-4eca-8630-22d8f8c92aca
{
"backtrace": [
"(()+0x12b20) [0x7f75e23e8b20]",
"(std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph_mon_subscribe_item>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph_mon_subscribe_item> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph_mon_subscribe_item> > >::find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x3c) [0x7f75e409066c]",
"(MonSub::want(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned int)+0x44) [0x7f75e40901d4]",
"(MgrStandby::init()+0x175) [0x55d0fab6a6f5]",
"(main()+0x1b3) [0x55d0faa636e3]",
"(__libc_start_main()+0xf3) [0x7f75e0e1e493]",
"(_start()+0x2e) [0x55d0faa6865e]"
],
"ceph_version": "15.2.14",
"crash_id": "2021-08-24T12:39:18.569890Z_3dae9cef-32b3-4eca-8630-22d8f8c92aca",
"entity_name": "mgr.cephoctopus-2",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig": "bcec6d90a6621b7f045ab716abb2d49f1cc01bb5cf2cce3fed450b6f8a94dbc6",
"timestamp": "2021-08-24T12:39:18.569890Z",
"utsname_hostname": "cephoctopus-2.cern.ch",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-193.14.2.el8_2.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Sun Jul 26 03:54:29 UTC 2020"
}
</pre>
<p>I uploaded the mds and mgr log files:<br />ceph-post-file: 57b862ed-5af4-40d7-87e2-40a3ec0e236c<br />ceph-post-file: 5055a7f2-e754-4d43-84dc-68d44ad3e080</p>

RADOS - Bug #51194 (New): PG recovery_unfound after scrub repair failed on primary
https://tracker.ceph.com/issues/51194 - 2021-06-13T13:36:23Z - Dan van der Ster
<p>This comes from a mail I sent to the ceph-users ML: <a class="external" href="https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/3LR2PC2KXVK66KG7U6ADGIC2IRGVQIAL/">https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/3LR2PC2KXVK66KG7U6ADGIC2IRGVQIAL/</a></p>
<p>The cluster here is running v14.2.20 and is used for RBD images. It is critical for our operations, so we appreciate any advice on how to proceed.</p>
<p>PG 4.1904 went into state active+recovery_unfound+degraded+repair [1] during normal background scrubbing (but note that we have `osd scrub auto repair = true`).<br />The output of pg query is at <a class="external" href="https://pastebin.com/46XHVKps">https://pastebin.com/46XHVKps</a></p>
<p>The scrub detected a read error on the primary of this PG:</p>
<pre>
2021-06-13 03:15:11.559680 osd.951 (osd.951) 138 : cluster [DBG] 4.1904 repair starts
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
</pre><br />Here's the corresponding kernel log of the IO error on the primary:<br /><pre>
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DR
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Sense Key : Medium Error [current] [descript
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Add. Sense: Unrecovered read error
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 CDB: Read(16) 88 00 00 00 00 02 ba 8c 0b 00
Jun 13 04:00:46 xxx kernel: blk_update_request: critical medium error, dev sdp, sector 1171967531
</pre>
<p>But it seems that the other 2 osds could not repair this failed read on the primary because they don't have the correct version of the object:</p>
<pre>
2021-06-13 04:28:29.412765 osd.951 (osd.951) 140 : cluster [ERR] 4.1904 repair 0 missing, 1 inconsistent objects
2021-06-13 04:28:29.413320 osd.951 (osd.951) 141 : cluster [ERR] 4.1904 repair 1 errors, 1 fixed
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:30.863807 mgr.yyy (mgr.692832499) 648287 : cluster [DBG] pgmap v557097: 19456 pgs: 1 active+recovery_unfound+degraded+repair, 2 active+clean+scrubbing,
19423 active+clean, 30 active+clean+scrubbing+deep+repair; 1.3 PiB data, 4.0 PiB used, 2.1 PiB / 6.1 PiB avail; 350 MiB/s rd, 766 MiB/s wr, 16.93k op/s; 3/1063641423 objects degraded (0.000%); 1/354547141 objects unfound (0.000%)
</pre>
<p>There were no other IO errors or failures during this time window.</p>
<p>We compared the object version numbers with the osdmap epoch at the time the object went unfound -- indeed the osdmap was e3593555 when this PG was marked recovery_unfound. So this implies that the osds 14 and 344 have a <strong>newer</strong> version of the object, written while scrub/repair was ongoing:</p>
<pre>
2021-06-13 03:50:13.808204 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232629 : cluster [DBG] osdmap e3593554: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
...
2021-06-13 04:04:50.359786 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232802 : cluster [DBG] osdmap e3593555: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
...
2021-06-13 04:31:17.106383 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1233177 : cluster [DBG] osdmap e3593556: 1224 total, 1222 up, 1222 in
</pre>
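<p>For reference, the object versions in those log lines are eversion_t values of the form epoch'version; comparing them against the osdmap epochs above is straightforward (a trivial, hypothetical helper):</p>
<pre>
def parse_eversion(ev: str) -> tuple:
    """Split an eversion string like "3592634'367863320" into (epoch, version)."""
    epoch, version = ev.split("'")
    return int(epoch), int(version)

wanted = parse_eversion("3592634'367863320")   # version the primary asked for
local  = parse_eversion("3593555'368312656")   # version the replicas actually have
assert local > wanted  # the replicas hold a newer write, made during the repair
</pre>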
<p>Could this be a race in the PG repair? Something like:</p>
<p>1. Repair starts, osd.951 has an object read error when the object is at v 3592634'367863320.<br />2. Later on during the repair, that object is updated by the client, bumping it to v 3593555'368312656.<br />3. Repair completes a few minutes later, osd.951 asks its peers to recover the object at the older v 3592634'367863320 --> they don't have it anymore, and the object is marked lost.</p>
<p>How do we clean up from this? We are considering:</p>
<ul>
<li>run `ceph pg repair ..` again now and hope it repairs with the correct new version of the object.</li>
<li>Stop osd.951 and let the data backfill to other osds, then repair again.</li>
<li>Set osd.951's primary-affinity to zero so one of the other OSDs becomes primary, then repair again.</li>
</ul>
<p>Or maybe we should just run `mark_unfound_lost revert`? Will it take v 3593555'368312656 from the peer OSDs?</p>

RADOS - Feature #49505 (New): Warn about extremely anomalous commit_latencies
https://tracker.ceph.com/issues/49505 - 2021-02-26T13:53:32Z - Dan van der Ster
<p>In an EC cluster with ~500 hdd osds, we suffered a drop in write performance from 30GiB/s down to 3GiB/s due to one sick drive:</p>
<pre>
# ceph osd perf | sort -n -k3 | tail
78 50 50
427 50 50
85 52 52
227 52 52
335 53 53
252 54 54
30 57 57
455 59 59
186 64 64
256 2306 2306
</pre>
<p>Moments after stopping osd.256 the write performance returned to normal, 30GiB/s.</p>
<p>While debugging this, ceph status was reporting plenty of slow requests on 10s-100s of OSDs, but there were no other warnings which might have narrowed in on osd.256 as having a problem. (E.g. the network ping maps were all OK apparently). We thought to check `osd perf` and voila we found the sick drive like that. (And btw, this hdd has a healthy SMART status -- so in this case we suspect that the high latency is due to a poor SATA connection).</p>
<p>To make it simpler to find this type of problem, should we raise a health warning when an OSD has a commit_latency exceeding N times the mean, e.g. 10x?</p>
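<p>A sketch of the proposed check, assuming the latencies are taken from `ceph osd perf` in JSON form (the exact JSON field names used here are an assumption):</p>
<pre>
import json
import subprocess

# Flag OSDs whose commit latency is wildly above the cluster mean,
# e.g. more than 10x, as in the osd.256 example above.
OUTLIER_FACTOR = 10

out = subprocess.check_output(['ceph', 'osd', 'perf', '-f', 'json'])
infos = json.loads(out)['osd_perf_infos']   # field names assumed

latencies = {i['id']: i['perf_stats']['commit_latency_ms'] for i in infos}
mean = sum(latencies.values()) / max(len(latencies), 1)

for osd_id, lat in sorted(latencies.items(), key=lambda kv: kv[1], reverse=True):
    if mean > 0 and lat > OUTLIER_FACTOR * mean:
        print(f'osd.{osd_id}: commit_latency {lat} ms is >{OUTLIER_FACTOR}x the mean ({mean:.0f} ms)')
</pre>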