Ceph : Issues
https://tracker.ceph.com/
2024-02-26T18:46:22Z
Ceph
RADOS - Backport #64576 (New): quincy: Incorrect behavior on combined cmpext+write ops in the fac...
https://tracker.ceph.com/issues/64576
2024-02-26T18:46:22Z
Backport Bot
RADOS - Backport #64575 (New): reef: Incorrect behavior on combined cmpext+write ops in the face ...
https://tracker.ceph.com/issues/64575
2024-02-26T18:46:15Z
Backport Bot
RADOS - Bug #64333 (Pending Backport): PG autoscaler tuning => catastrophic ceph cluster crash
https://tracker.ceph.com/issues/64333
2024-02-06T15:33:18Z
Loïc Dachary
loic@dachary.org
<p>Posting this report on behalf of a Ceph user. They will follow up if there are any questions.</p>
<hr />
<p>After deploying some monitoring on the Ceph cluster nodes, we finally started the benchmark suite on the afternoon of Friday 2024-01-26. While doing so, we did a quick review of the Ceph pool settings for the shards/shards-data rbd pool, on which we had started to ingest images with winery.</p>
<p>During the review we noticed that the shards-data pool had very few PGs (64), which kept most OSDs idle, or at least very unevenly loaded. As the autoscaler was set up, we decided to just go ahead and enable the "bulk" flag on the `shards-data` pool to let the autoscaler scale up the number of PGs.</p>
<p>The autoscaler immediately moved the pool to 4096 PGs and started the data movement process.</p>
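<p>For reference, a minimal sketch of the action that triggered this, i.e. the librados equivalent of `ceph osd pool set shards-data bulk true`. The pool name comes from the report above; everything else (error handling, config paths) is illustrative only:</p>
<pre>
// Sketch: toggling a pool's "bulk" flag from a librados client via a mon
// command. Illustrative, not the reporter's actual tooling.
#include <rados/librados.hpp>
#include <iostream>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);            // default client.admin identity
  cluster.conf_read_file(nullptr);  // default ceph.conf search path
  if (cluster.connect() < 0) return 1;

  librados::bufferlist inbl, outbl;
  std::string outs;
  // JSON command as accepted by the monitors; the "bulk" pool flag exists
  // since Quincy and tells the autoscaler to give the pool its full PG
  // complement up front.
  std::string cmd =
      R"({"prefix": "osd pool set", "pool": "shards-data", "var": "bulk", "val": "true"})";
  int r = cluster.mon_command(cmd, inbl, &outbl, &outs);
  std::cout << "r=" << r << " " << outs << std::endl;
  cluster.shutdown();
  return r < 0 ? 1 : 0;
}
</pre>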
<p>As soon as the reallocation started, 10-15% of the OSDs crashed hard. This crash looks persistent (the OSDs crash again as soon as systemd restarts them), and therefore we consider that the data are lost and the cluster is unavailable.</p>
<p>Remedial steps attempted (some of them happened multiple times, so the order isn't guaranteed):</p>
<ul>
<li>manual restart of OSDs that were disabled by systemd after consecutive crashes
<ul><li>no difference; apparently the crash is persistent</li></ul></li>
<li>review of similar upstream tickets:
<ul>
<li><a class="external" href="https://tracker.ceph.com/issues/53584">https://tracker.ceph.com/issues/53584</a></li>
<li><a class="external" href="https://tracker.ceph.com/issues/55662">https://tracker.ceph.com/issues/55662</a></li>
</ul></li>
<li>attempt to set osd_read_ec_check_for_errors = true on all OSDs; no mitigation of the crash</li>
<li>revert of the bulk flag on the pool
<ul>
<li>autoscaler target config moved back to 64 PGs</li>
<li>no impact on data availability after restarting the crashed OSDs</li>
</ul></li>
<li>ceph osd set noout
<ul>
<li>stabilized the number of crashed OSDs (as no new reallocations are happening)</li>
<li>no revival of dead OSDs after restarting them</li>
</ul></li>
</ul>
<p>All the current diagnostic information is dumped below:</p>
<p>ceph status: <a class="external" href="https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-status-2024-01-29-143117.txt">https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-status-2024-01-29-143117.txt</a></p>
<pre><code>cluster:
    id:     e0a98ad0-fd1f-4079-894f-ed4554ce40c6
    health: HEALTH_ERR
            noout flag(s) set
            25 osds down
            7055371 scrub errors
            Reduced data availability: 138 pgs inactive, 103 pgs down
            Possible data damage: 30 pgs inconsistent
            Degraded data redundancy: 1797720/26981188 objects degraded (6.663%), 47 pgs degraded, 130 pgs undersized
            49 daemons have recently crashed</code></pre>
<pre><code>services:
    mon: 3 daemons, quorum dwalin001,dwalin003,dwalin002 (age 2d)
    mgr: dwalin003(active, since 2d), standbys: dwalin001, dwalin002
    osd: 240 osds: 190 up (since 7h), 215 in (since 2d); 73 remapped pgs
         flags noout</code></pre>
<pre><code>data:
    pools:   6 pools, 389 pgs
    objects: 3.85M objects, 15 TiB
    usage:   18 TiB used, 2.0 PiB / 2.0 PiB avail
    pgs:     35.476% pgs not active
             1797720/26981188 objects degraded (6.663%)
             134 active+clean
             73  down+remapped
             62  active+undersized
             29  down
             29  active+undersized+degraded
             22  active+clean+inconsistent
             21  undersized+peered
             11  undersized+degraded+peered
             4   active+undersized+degraded+inconsistent
             3   undersized+degraded+inconsistent+peered
             1   down+inconsistent</code></pre>
<p>ceph report: <a class="external" href="https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-report-2024-01-29-152825.txt">https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-report-2024-01-29-152825.txt</a><br />
ceph health detail: <a class="external" href="https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-health-detail-2024-01-29-143133.txt">https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-health-detail-2024-01-29-143133.txt</a><br />
ceph crash ls: <a class="external" href="https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-ls-2024-01-29-143402.txt">https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-ls-2024-01-29-143402.txt</a><br />
full logs (1.1 GB compressed, 31 GB uncompressed): <a class="external" href="https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-2024-01-26.tar.zst">https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-2024-01-26.tar.zst</a></p>
RADOS - Bug #64192 (Pending Backport): Incorrect behavior on combined cmpext+write ops in the fac...
https://tracker.ceph.com/issues/64192
2024-01-26T18:10:25Z
Ilya Dryomov
<p>There seems to be an expectation mismatch between the OSD and the Objecter (or at least the neorados bit of the Objecter). Based on a quick look, I'm inclined to say that this is an OSD bug.</p>
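<p>To make the setup concrete, here is a minimal sketch of the client side of such an op: a single compound operation carrying cmpext + write, similar to what librbd issues for compare-and-write. This assumes the cmpext/write signatures exposed in librados.hpp; pool and object names are illustrative:</p>
<pre>
// Sketch of a combined cmpext+write op. If the on-disk bytes at the given
// offset differ from cmp_bl, the OSD fails the op and the write is not
// applied; the mismatch offset is folded into the return value.
#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0) return 1;

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("rbd", ioctx) < 0) return 1;

  librados::bufferlist cmp_bl, write_bl;
  cmp_bl.append("expected-bytes");    // 14 bytes, like the 512~14 extent above
  write_bl.append("replacement-by");

  int cmpext_rval = 0;
  librados::ObjectWriteOperation op;
  op.cmpext(512, cmp_bl, &cmpext_rval);  // compare 14 bytes at offset 512
  op.write(512, write_bl);               // applied only if the compare passes
  int r = ioctx.operate("some-object", &op);
  // On mismatch, r == -(MAX_ERRNO + offset_of_first_differing_byte).
  std::cout << "operate r=" << r << " cmpext rval=" << cmpext_rval << std::endl;
  cluster.shutdown();
  return 0;
}
</pre>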
<p>1. A combined cmpext+write op arrives on the OSD:</p>
<pre>
2024-01-26T10:52:24.783-0500 7f6a861a1700 1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] <== client.4499 172.21.9.34:0/3763700879 4 ==== osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e304) v8 ==== 284+0+28 (crc 0 0 0) 0x5623c974e000 con 0x5623ca360480
</pre>
<p>2. cmpext fails the compare (as expected -- the client is sending data that doesn't match on purpose):</p>
<pre>
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] _handle_message: osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e304) v8
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_op: op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e304) v8
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] op_has_sufficient_caps session=0x5623c8f2b400 pool=59 (test-librbd-senta04-2299691-58 ) pool_app_metadata={rados={}} need_read_cap=1 need_write_cap=1 classes=[] -> yes
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e304) v8 may_write may_read -> write-ordered flags ondisk+write+known_if_redirected+supports_pool_eio
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] get_object_context: found obc in cache: obc(59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head rwstate(none n=0 w=0))
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] get_object_context: obc(59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head rwstate(none n=0 w=0)) oi: 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head(304'22 client.4499.0:70 dirty s 526 uv 22 alloc_hint [4194304 4194304 0]) exists: 1 ssc(59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:snapdir snapset: 0=[]:{} ref: 1 registered: 1 exists: 1)
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] find_object_context 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head @head oi=59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head(304'22 client.4499.0:70 dirty s 526 uv 22 alloc_hint [4194304 4194304 0])
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] maybe_handle_manifest_detail: 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head is not manifest object
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_op obc obc(59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head rwstate(excl n=1 w=0))
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] execute_ctx 0x5623c967c400
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] execute_ctx 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] ov 304'22 av 304'28 snapc 0=[] snapset 0=[]:{}
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_op 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b]
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_op cmpext 512~14 in=14b
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_extent_cmp
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_op 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [sync_read 512~14]
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_op sync_read 512~14
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_read
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] read got 14 / 14 bytes from obj 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head
2024-01-26T10:52:24.783-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] do_osd_ops error: (4100) Unknown error 4100
2024-01-26T10:52:24.783-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] execute_ctx alloc reply 0x5623c80b2d80 result -4100
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] PeeringState::calc_trim_to_aggressive limit = 304'25
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] op order client.4499 tid 76 last was 70
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] execute_ctx update_log_only -- result=-4100
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] record_write_error r=-4100
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] submit_log_entries 304'28 (0'0) error 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head by client.4499.0:76 0.000000 -4100 ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] new_repop: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=0 r=-4100)
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'27 (0'0,304'27] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] PeeringState::merge_new_log_entries 304'28 (0'0) error 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head by client.4499.0:76 0.000000 -4100 ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 update missing, append 304'28 (0'0) error 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head by client.4499.0:76 0.000000 -4100 ObjectCleanRegions clean_offsets: [(0, 18446744073709551615)], clean_omap: true, new_object: false
2024-01-26T10:52:24.787-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 lua=304'27 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] PeeringState::append_log_entries_update_missing trim_to bool = 1 trim_to = 0'0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 trim proposed trim_to = 0'0
2024-01-26T10:52:24.787-0500 7f6a645e6700 6 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 304'28, trimmed: , trimmed_dups: , clear_divergent_priors: 0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 _write_log_and_missing clearing up to 0'0 dirty_to_dups=0'0 dirty_from_dups=4294967295'18446744073709551615 write_from_dups=4294967295'18446744073709551615 trimmed_dups.size()=0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 _write_log_and_missing going to encode log.dups.size()=0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 _write_log_and_missing 1st round encoded log.dups.size()=0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 _write_log_and_missing 2st round encoded log.dups.size()=0
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 end of _write_log_and_missing
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 lua=304'27 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] op_applied version 304'28
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'25 crt=304'25 lcod 304'24 mlcod 304'24 active+clean ps=[2~1]] PeeringState::calc_trim_to_aggressive limit = 304'25
2024-01-26T10:52:24.787-0500 7f6a645e6700 10 osd.0 304 dequeue_op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e304) v8 finish
</pre>
<p>3. Before the op is ready, the OSD session gets reset (due to ms_inject_socket_failures injection on the client):</p>
<pre>
2024-01-26T10:52:24.787-0500 7f6a861a1700 1 --2- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] >> 172.21.9.34:0/3763700879 conn(0x5623ca360480 0x5623c97b8580 crc :-1 s=READY pgs=20 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
2024-01-26T10:52:24.787-0500 7f6a861a1700 1 --2- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] >> 172.21.9.34:0/3763700879 conn(0x5623ca360480 0x5623c97b8580 crc :-1 s=READY pgs=20 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2024-01-26T10:52:24.787-0500 7f6a75b6c700 2 osd.0 304 ms_handle_reset con 0x5623ca360480 session 0x5623c8f2b400
</pre>
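<p>(For context, this connection churn comes from the ms_inject_socket_failures option, which makes the messenger randomly drop sockets so that ops get resent and exercise the dup-detection path. A hedged sketch of how a test client might enable it:)</p>
<pre>
// Illustrative only: turn on messenger failure injection in a test client.
// With this set, connections are torn down at random and in-flight ops are
// resent (showing up as RETRY=1 on the OSD, as in step 4 below).
#include <rados/librados.hpp>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);
  cluster.conf_read_file(nullptr);
  // Test-only option: inject roughly one socket failure per 500 messages.
  cluster.conf_set("ms_inject_socket_failures", "500");
  if (cluster.connect() < 0) return 1;
  // ... run I/O here; some ops will now take the resend path ...
  cluster.shutdown();
  return 0;
}
</pre>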
<p>4. The client resends the op and dup detection on the OSD kicks in:</p>
<pre>
2024-01-26T10:52:24.791-0500 7f6a861a1700 1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] <== client.4499 172.21.9.34:0/3763700879 1 ==== osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8 ==== 284+0+28 (crc 0 0 0) 0x5623c974e780 con 0x5623ca3e6480
2024-01-26T10:52:24.791-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] _handle_message: osd_op(client.4499.0:76 59.0 59.95312129 (undecoded) ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] do_op: op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] RETRY=1 ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] op_has_sufficient_caps session=0x5623c974e280 pool=59 (test-librbd-senta04-2299691-58 ) pool_app_metadata={rados={}} need_read_cap=1 need_write_cap=1 classes=[] -> yes
2024-01-26T10:52:24.791-0500 7f6a645e6700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] do_op osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] RETRY=1 ondisk+retry+write+known_if_redirected+supports_pool_eio e304) v8 may_write may_read -> write-ordered flags ondisk+retry+write+known_if_redirected+supports_pool_eio
2024-01-26T10:52:24.791-0500 7f6a645e6700 3 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] do_op dup client.4499.0:76 version 304'28
</pre>
<p>^^^ It's not logged, but I suspect that PG::check_in_progress_op() returns with user_version = 0 and an empty op_returns vector here.</p>
<pre>
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] already_complete: 304'28
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] already_complete: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=0 r=-4100)
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 op_wq(0) _process 59.0 to_process <> waiting <> waiting_peering {}
2024-01-26T10:52:24.791-0500 7f6a645e6700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'28 (0'0,304'28] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'25 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] already_complete: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=0 r=-4100) not committed, returning false
</pre>
<p>5. The op becomes ready:</p>
<pre>
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'27 crt=304'26 lcod 304'26 mlcod 304'26 active+clean ps=[2~1]] repop_all_committed: repop tid 1813 all committed
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'26 active+clean ps=[2~1]] PeeringState::calc_min_last_complete_ondisk last_complete_ondisk is updated to: 304'27 from: 304'26
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] eval_repop repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] commit: repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)
</pre>
<p>6. Reply with user_version = 0 is sent:</p>
<pre>
2024-01-26T10:52:24.791-0500 7f6a685ee700 1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] --> 172.21.9.34:0/3763700879 -- osd_op_reply(76 rbd_data.1193ad1baaa.0000000000000000 [cmpext 512~14,write 512~14] v304'28 uv0 ondisk = -4100 ((4100) Unknown error 4100)) v8 -- 0x5623c92786c0 con 0x5623ca3e6480
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] PeeringState::prepare_stats_for_publish reporting purged_snaps [2~1]
2024-01-26T10:52:24.791-0500 7f6a685ee700 15 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] PeeringState::prepare_stats_for_publish publish_stats_to_osd 304:75
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] removing repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] q front is repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] finished operator() r=-4100
2024-01-26T10:52:24.791-0500 7f6a685ee700 10 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] sending commit on osd_op(client.4499.0:76 59.0 59:94848ca9:::rbd_data.1193ad1baaa.0000000000000000:head [cmpext 512~14 in=14b,write 512~14 in=14b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e304) v8 0x5623c80b2d80
</pre>
<p>7. Reply with user_version != 0 is sent:</p>
<pre>
2024-01-26T10:52:24.791-0500 7f6a685ee700 1 -- [v2:172.21.9.34:6802/39207300,v1:172.21.9.34:6803/39207300] --> 172.21.9.34:0/3763700879 -- osd_op_reply(76 rbd_data.1193ad1baaa.0000000000000000 [cmpext 512~14,write 512~14] v304'28 uv22 ondisk = -4100 ((4100) Unknown error 4100)) v8 -- 0x5623c80b2d80 con 0x5623ca360480
2024-01-26T10:52:24.791-0500 7f6a685ee700 20 osd.0 pg_epoch: 304 pg[59.0( v 304'30 (0'0,304'30] local-lis/les=301/302 n=9 ec=301/301 lis/c=301/301 les/c/f=302/302/0 sis=301) [0] r=0 lpr=301 luod=304'28 crt=304'26 lcod 304'27 mlcod 304'27 active+clean ps=[2~1]] remove_repop repgather(0x5623c982d800 304'28 rep_tid=1813 committed?=1 r=-4100)
</pre>
<p>8. On the client side, Objecter processes the reply with user_version = 0. This reply also happens to have rval = 0 for all ops, including cmpext even though it failed the compare:</p>
<pre>
2024-01-26T10:52:24.791-0500 7ff755ffb700 1 -- 172.21.9.34:0/3763700879 <== osd.0 v2:172.21.9.34:6802/39207300 1 ==== osd_op_reply(76 rbd_data.1193ad1baaa.0000000000000000 [cmpext 512~14,write 512~14] v304'28 uv0 ondisk = -4100 ((4100) Unknown error 4100)) v8 ==== 223+0+0 (crc 0 0 0) 0x7ff704078040 con 0x7ff72c025d50
2024-01-26T10:52:24.791-0500 7ff755ffb700 10 client.4499.objecter ms_dispatch 0x55bffcc34780 osd_op_reply(76 rbd_data.1193ad1baaa.0000000000000000 [cmpext 512~14,write 512~14] v304'28 uv0 ondisk = -4100 ((4100) Unknown error 4100)) v8
2024-01-26T10:52:24.791-0500 7ff755ffb700 10 client.4499.objecter in handle_osd_op_reply
2024-01-26T10:52:24.791-0500 7ff755ffb700 7 client.4499.objecter handle_osd_op_reply 76 ondisk uv 0 in 59.0 attempt 1
2024-01-26T10:52:24.791-0500 7ff755ffb700 10 client.4499.objecter op 0 rval 0 len 0
2024-01-26T10:52:24.791-0500 7ff755ffb700 10 client.4499.objecter op 1 rval 0 len 0
2024-01-26T10:52:24.791-0500 7ff755ffb700 15 client.4499.objecter handle_osd_op_reply completed tid 76
2024-01-26T10:52:24.791-0500 7ff755ffb700 15 client.4499.objecter _finish_op 76
</pre>
<p>Because of rval = 0 there, Adam's cmpext "handler" from <a class="external" href="https://github.com/ceph/ceph/pull/52495">https://github.com/ceph/ceph/pull/52495</a> doesn't set mismatch_offset and doesn't adjust the overall return code. -4100 is returned to the user (librbd in this case), which chokes on it: we expect an overall return code of -MAX_ERRNO (-4095) and mismatch_offset = 5.</p>
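<p>For readers unfamiliar with the encoding: on a cmpext mismatch the OSD folds the offset of the first differing byte into the errno, and the client-side handler is supposed to unfold it. A small sketch of the arithmetic, using the values from this log:</p>
<pre>
// How a cmpext mismatch code decomposes; values taken from the log above.
#include <cassert>
#include <cstdint>

constexpr int MAX_ERRNO = 4095;  // largest "plain" errno librados passes up

int main() {
  int rval = -4100;                // raw cmpext result returned by the OSD
  assert(rval < -MAX_ERRNO);       // anything below -MAX_ERRNO is a mismatch
  uint64_t mismatch_offset = static_cast<uint64_t>(-rval) - MAX_ERRNO;
  assert(mismatch_offset == 5);    // first differing byte within the extent
  // The handler should then surface mismatch_offset to the caller and clamp
  // the overall return code to -MAX_ERRNO (-4095), which is what librbd
  // expects; with the dummy uv0/rval=0 reply it never runs.
  return 0;
}
</pre>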
<p>For comparison, here is Objecter processing the same thing when retries aren't involved:</p>
<pre>
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter ms_dispatch 0x55bffca54cd0 osd_op_reply(67 rbd_data.102eba81544b.0000000000000000 [cmpext 512~14,write 512~14] v12'23 uv21 ondisk = -4100 ((4100) Unknown error 4100)) v8
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter in handle_osd_op_reply
2024-01-26T10:47:07.259-0500 7ff743fff700 7 client.4142.objecter handle_osd_op_reply 67 ondisk uv 21 in 2.0 attempt 0
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter op 0 rval -4100 len 0
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter ERROR: tid 67: handler function threw CmpExt mismatch [osd:4095]
2024-01-26T10:47:07.259-0500 7ff743fff700 10 client.4142.objecter op 1 rval 0 len 0
2024-01-26T10:47:07.259-0500 7ff743fff700 15 client.4142.objecter handle_osd_op_reply completed tid 67
2024-01-26T10:47:07.259-0500 7ff743fff700 15 client.4142.objecter _finish_op 67
</pre>
<p>Radoslaw, is it expected that PG::check_in_progress_op() returns with user_version = 0 and an empty op_returns vector, causing a dummy reply with user_version = 0 and rval = 0 for all ops to be sent first? Should such a dummy reply be sent at all?</p>
RADOS - Backport #64158 (New): quincy: CommandFailedError (rados/test_python.sh): "RADOS object n...
https://tracker.ceph.com/issues/64158
2024-01-24T20:05:21Z
Backport Bot
RADOS - Bug #63389 (Pending Backport): Failed to encode map X with expected CRC
https://tracker.ceph.com/issues/63389
2023-11-01T13:26:34Z
Navid Golpa
<p>During the upgrade of Ceph from Quincy to Reef we encountered a problem as we upgraded each OSD. Every time an OSD was restarted to upgrade it to Reef, the MONs would get spammed with<br /><pre>
failed to encode map X with expected crc
</pre></p>
<p>Network load on the MON would skyrocket. The problem was identical to what was described by Kefu here in 2016:<br /><a class="external" href="https://lore.kernel.org/all/CAJE9aONFauhy7v6n9bT11Sga+e0Qgi8hWu=gr-zoxuAq5Yv+cA@mail.gmail.com/T/">https://lore.kernel.org/all/CAJE9aONFauhy7v6n9bT11Sga+e0Qgi8hWu=gr-zoxuAq5Yv+cA@mail.gmail.com/T/</a></p>
<p>We did not follow the recommendation in that post of downgrading the MONs, upgrading the OSDs first, and then upgrading the MONs again. Instead we powered through the upgrade by taking a one-day downtime and upgrading all remaining OSDs. Once all OSDs were upgraded, the errors went away and the cluster returned to normal operation.</p>
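<p>For what it's worth, the mechanism behind the message is roughly this: incremental osdmaps carry the CRC of the full map as the monitor encoded it, and a daemon whose version re-encodes the map differently fails the CRC check and falls back to fetching the full map from a monitor. A conceptual sketch, not the actual Ceph code:</p>
<pre>
// Conceptual sketch of the CRC fallback that caused the network spike.
// All names and values here are illustrative stubs, not Ceph's real API.
#include <cstdint>
#include <iostream>

struct IncrementalMap { uint32_t full_crc; };

uint32_t encode_full_map_locally() { return 0xDEADBEEF; }  // stub re-encode
void request_full_map_from_mon() { std::cout << "fetching full map\n"; }

void apply_incremental(const IncrementalMap& inc) {
  // Apply the delta, then re-encode the full map and compare checksums.
  if (encode_full_map_locally() != inc.full_crc) {
    // Logged as "failed to encode map X with expected crc". Every affected
    // daemon re-fetching full maps is what makes mon network load skyrocket.
    request_full_map_from_mon();
  }
}

int main() {
  apply_incremental({0xCAFEF00D});  // mismatched CRC -> full-map fetch
  return 0;
}
</pre>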
Ceph - Bug #61400 (New): valgrind+ceph-mon: segmentation fault in rocksdb+tcmalloc
https://tracker.ceph.com/issues/61400
2023-05-24T14:28:51Z
Patrick Donnelly
pdonnell@redhat.com
<pre>
0> 2023-05-24T02:54:54.546+0000 708e7c0 -1 *** Caught signal (Segmentation fault) **
in thread 708e7c0 thread_name:ceph-mon
ceph version 18.0.0-4167-gfa0e62c4 (fa0e62c4a1d8e4a737d9cbe50224f70009b79b28) reef (dev)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x5827420]
2: (tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)+0x20) [0x55f50a0]
3: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)+0x20) [0x55f5370]
4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x80) [0x55f5430]
5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))+0x76) [0x55f8e46]
6: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x165) [0x5609015]
7: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x22b) [0x109b0dd]
8: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x65) [0x109a139]
9: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x760) [0xf56d80]
10: (MonitorDBStore::open(std::ostream&)+0xfd) [0xc5ca1d]
11: main()
12: __libc_start_main()
13: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>
<p>From: /ceph/teuthology-archive/pdonnell-2023-05-23_18:20:18-fs-wip-pdonnell-testing-20230523.134409-distro-default-smithi/7284230/remote/smithi007/log/ceph-mon.a.log.gz</p>
<p>This happened shortly after the ceph-mon starts.</p>
<p>Here's the error from valgrind as well:</p>
<pre>
<error>
<unique>0x1</unique>
<tid>1</tid>
<threadname>ceph-mon</threadname>
<kind>InvalidRead</kind>
<what>Invalid read of size 8</what>
<stack>
<frame>
<ip>0x55F50A0</ip>
<obj>/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.3</obj>
<fn>tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)</fn>
</frame>
<frame>
<ip>0x55F536F</ip>
<obj>/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.3</obj>
<fn>tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)</fn>
</frame>
<frame>
<ip>0x55F542F</ip>
<obj>/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.3</obj>
<fn>tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)</fn>
</frame>
<frame>
<ip>0x55F8E45</ip>
<obj>/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.3</obj>
<fn>tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))</fn>
</frame>
<frame>
<ip>0x5609014</ip>
<obj>/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.3</obj>
<fn>tcmalloc::allocate_full_cpp_throw_oom(unsigned long)</fn>
</frame>
<frame>
<ip>0x109B0DC</ip>
<obj>/usr/bin/ceph-mon</obj>
<fn>rocksdb::DBImpl::Open(rocksdb::DBOptions const&amp;, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; const&amp;, std::vector&lt;rocksdb::ColumnFamilyDescriptor, std::allocator&lt;rocksdb::ColumnFamilyDescriptor&gt; &gt; const&amp;, std::vector&lt;rocksdb::ColumnFamilyHandle*, std::allocator&lt;rocksdb::ColumnFamilyHandle*&gt; &gt;*, rocksdb::DB**, bool, bool)</fn>
</frame>
<frame>
<ip>0x109A138</ip>
<obj>/usr/bin/ceph-mon</obj>
<fn>rocksdb::DB::Open(rocksdb::DBOptions const&amp;, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; const&amp;, std::vector&lt;rocksdb::ColumnFamilyDescriptor, std::allocator&lt;rocksdb::ColumnFamilyDescriptor&gt; &gt; const&amp;, std::vector&lt;rocksdb::ColumnFamilyHandle*, std::allocator&lt;rocksdb::ColumnFamilyHandle*&gt; &gt;*, rocksdb::DB**)</fn>
</frame>
<frame>
<ip>0xF56D7F</ip>
<obj>/usr/bin/ceph-mon</obj>
<fn>RocksDBStore::do_open(std::ostream&amp;, bool, bool, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; const&amp;)</fn>
<dir>./obj-x86_64-linux-gnu/src/kv/./src/kv</dir>
<file>RocksDBStore.cc</file>
<line>1193</line>
</frame>
<frame>
<ip>0xC5CA1C</ip>
<obj>/usr/bin/ceph-mon</obj>
<fn>MonitorDBStore::open(std::ostream&amp;)</fn>
<dir>./obj-x86_64-linux-gnu/src/./src/mon</dir>
<file>MonitorDBStore.h</file>
<line>674</line>
</frame>
<frame>
<ip>0xC368BF</ip>
<obj>/usr/bin/ceph-mon</obj>
<fn>main</fn>
<dir>./obj-x86_64-linux-gnu/src/./src</dir>
<file>ceph_mon.cc</file>
<line>639</line>
</frame>
</stack>
<auxwhat>Address 0x20 is not stack'd, malloc'd or (recently) free'd</auxwhat>
</error>
</pre>
<p>From: /ceph/teuthology-archive/pdonnell-2023-05-23_18:20:18-fs-wip-pdonnell-testing-20230523.134409-distro-default-smithi/7284230/remote/smithi007/log/valgrind/mon.a.log.gz</p>
RADOS - Backport #59677 (New): quincy: osd:tick checking mon for new map
https://tracker.ceph.com/issues/59677
2023-05-08T17:10:15Z
Backport Bot
RADOS - Backport #59102 (In Progress): reef: msg/async: mismatch between in size/types of public_...
https://tracker.ceph.com/issues/59102
2023-03-17T19:03:32Z
Backport Bot
<p><a class="external" href="https://github.com/ceph/ceph/pull/52226">https://github.com/ceph/ceph/pull/52226</a></p>
RADOS - Bug #59100 (Pending Backport): msg/async: mismatch between in size/types of public_addr a...
https://tracker.ceph.com/issues/59100
2023-03-17T18:35:01Z
Radoslaw Zarzynski
rzarzyns@redhat.com
RADOS - Bug #57977 (Pending Backport): osd:tick checking mon for new map
https://tracker.ceph.com/issues/57977
2022-11-04T16:18:51Z
yite gu
<p>ceph version: 15.2.7<br />My cluster has an OSD down, and it is unable to rejoin the osdmap.<br /><pre>
# ceph osd dump
epoch 2614
fsid cf4db967-2c13-4921-9a1e-e83165bb2bbf
created 2022-03-24T07:59:06.550966+0000
modified 2022-11-04T08:02:42.559653+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 8
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client luminous
require_osd_release octopus
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 18 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'replicapool-ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 2591 lfor 0/0/823 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
max_osd 9
osd.0 up in weight 1 up_from 2176 up_thru 2281 down_at 2175 last_clean_interval [38,2170) [v2:10.173.211.24:6810/39,v1:10.173.211.24:6813/39] [v2:10.173.211.24:6815/39,v1:10.173.211.24:6818/39] exists,up 6654dff6-bcb8-4616-9ac2-9e024e230cee
osd.1 up in weight 1 up_from 2182 up_thru 2268 down_at 2181 last_clean_interval [30,2170) [v2:10.173.211.25:6808/39,v1:10.173.211.25:6809/39] [v2:10.173.211.25:6810/39,v1:10.173.211.25:6811/39] exists,up 0aed2e15-9ea8-45c7-88c9-4c190fb861a9
osd.2 down out weight 0 up_from 36 up_thru 2186 down_at 2189 last_clean_interval [16,33) [v2:10.173.211.24:6800/39,v1:10.173.211.24:6801/39] [v2:10.173.211.24:6802/39,v1:10.173.211.24:6803/39] autoout,exists 48816ed4-88ff-4284-a70e-041c8373bbe7
osd.3 up in weight 1 up_from 2179 up_thru 2275 down_at 2178 last_clean_interval [32,2170) [v2:10.173.211.25:6800/40,v1:10.173.211.25:6801/40] [v2:10.173.211.25:6802/40,v1:10.173.211.25:6803/40] exists,up c0e5263b-2e57-451b-a2af-1f108a29d868
osd.4 up in weight 1 up_from 2174 up_thru 2280 down_at 2173 last_clean_interval [36,2170) [v2:10.173.211.24:6807/39,v1:10.173.211.24:6809/39] [v2:10.173.211.24:6811/39,v1:10.173.211.24:6812/39] exists,up b699246d-f283-44af-8a68-0ef8852163be
osd.5 up in weight 1 up_from 2185 up_thru 2264 down_at 2184 last_clean_interval [42,2170) [v2:10.173.211.26:6808/40,v1:10.173.211.26:6810/40] [v2:10.173.211.26:6812/40,v1:10.173.211.26:6814/40] exists,up 00e70df6-412c-41eb-9eed-742c6ea3291f
osd.6 up in weight 1 up_from 2180 up_thru 2260 down_at 2179 last_clean_interval [32,2170) [v2:10.173.211.25:6814/41,v1:10.173.211.25:6816/41] [v2:10.173.211.25:6818/41,v1:10.173.211.25:6819/41] exists,up ef94c655-1c53-46a4-95fe-687ecddc1738
osd.7 up in weight 1 up_from 2176 up_thru 2266 down_at 2175 last_clean_interval [42,2170) [v2:10.173.211.26:6800/39,v1:10.173.211.26:6801/39] [v2:10.173.211.26:6802/39,v1:10.173.211.26:6803/39] exists,up 57108459-ce4c-4bb1-aea1-4c16aaaa6708
osd.8 up in weight 1 up_from 2186 up_thru 2248 down_at 2185 last_clean_interval [42,2170) [v2:10.173.211.26:6809/39,v1:10.173.211.26:6811/39] [v2:10.173.211.26:6813/39,v1:10.173.211.26:6815/39] exists,up 63f416e5-426b-4864-903d-b5c7b3da10bd
</pre></p>
<p>The osd.2 log reports the following:<br /><pre>
2022-11-04T07:44:30.345+0000 7f6adbd1b700 1 osd.2 2607 tick checking mon for new map
2022-11-04T07:45:00.653+0000 7f6adbd1b700 1 osd.2 2607 tick checking mon for new map
2022-11-04T07:45:31.601+0000 7f6adbd1b700 1 osd.2 2607 tick checking mon for new map
2022-11-04T07:46:01.773+0000 7f6adbd1b700 1 osd.2 2607 tick checking mon for new map
2022-11-04T07:46:32.611+0000 7f6adbd1b700 1 osd.2 2607 tick checking mon for new map
2022-11-04T07:47:02.735+0000 7f6adbd1b700 1 osd.2 2607 tick checking mon for new map
2022-11-04T07:47:33.586+0000 7f6adbd1b700 1 osd.2 2607 tick checking mon for new map
2022-11-04T07:48:04.463+0000 7f6adbd1b700 1 osd.2 2607 tick checking mon for new map
</pre></p>
Ceph - Backport #54275 (In Progress): quincy: rados: 16.2.7 multiple gcc-12 compile errors
https://tracker.ceph.com/issues/54275
2022-02-14T17:55:21Z
Backport Bot
Ceph - Bug #53896 (Pending Backport): rados: 16.2.7 multiple gcc-12 compile errors
https://tracker.ceph.com/issues/53896
2022-01-15T14:49:43Z
Kaleb KEITHLEY
<p>Fedora 36/rawhide</p>
<p>see <a class="external" href="https://kojipkgs.fedoraproject.org//work/tasks/945/81280945/build.log">https://kojipkgs.fedoraproject.org//work/tasks/945/81280945/build.log</a></p>
<p>...</p>
<pre>
[14/2059] /usr/bin/g++ -DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION -DBOOST_ASIO_USE_TS_EXECUTOR_AS_DEFAULT -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_REENTRANT -D_THREAD_SAFE -D__CEPH__ -D__STDC_FORMAT_MACROS -D__linux__ -I/builddir/build/BUILD/ceph-16.2.7/build/redhat-linux-build/src/include -I/builddir/build/BUILD/ceph-16.2.7/src -isystem /builddir/build/BUILD/ceph-16.2.7/build/redhat-linux-build/include -isystem /builddir/build/BUILD/ceph-16.2.7/src/xxHash -isystem /builddir/build/BUILD/ceph-16.2.7/src/rapidjson/include -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -g -DNDEBUG -fPIC -U_FORTIFY_SOURCE -Wall -fno-strict-aliasing -fsigned-char -Wtype-limits -Wignored-qualifiers -Wpointer-arith -Werror=format-security -Winit-self -Wno-unknown-pragmas -Wnon-virtual-dtor -Wno-ignored-qualifiers -ftemplate-depth-1024 -Wpessimizing-move -Wredundant-move -Wstrict-null-sentinel -Woverloaded-virtual -fno-new-ttp-matching -fstack-protector-strong -fdiagnostics-color=auto -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -std=c++17 -MD -MT src/CMakeFiles/rados_snap_set_diff_obj.dir/librados/snap_set_diff.cc.o -MF src/CMakeFiles/rados_snap_set_diff_obj.dir/librados/snap_set_diff.cc.o.d -o src/CMakeFiles/rados_snap_set_diff_obj.dir/librados/snap_set_diff.cc.o -c /builddir/build/BUILD/ceph-16.2.7/src/librados/snap_set_diff.cc
FAILED: src/CMakeFiles/rados_snap_set_diff_obj.dir/librados/snap_set_diff.cc.o
/usr/bin/g++ -DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION -DBOOST_ASIO_USE_TS_EXECUTOR_AS_DEFAULT -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_REENTRANT -D_THREAD_SAFE -D__CEPH__ -D__STDC_FORMAT_MACROS -D__linux__ -I/builddir/build/BUILD/ceph-16.2.7/build/redhat-linux-build/src/include -I/builddir/build/BUILD/ceph-16.2.7/src -isystem /builddir/build/BUILD/ceph-16.2.7/build/redhat-linux-build/include -isystem /builddir/build/BUILD/ceph-16.2.7/src/xxHash -isystem /builddir/build/BUILD/ceph-16.2.7/src/rapidjson/include -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -g -DNDEBUG -fPIC -U_FORTIFY_SOURCE -Wall -fno-strict-aliasing -fsigned-char -Wtype-limits -Wignored-qualifiers -Wpointer-arith -Werror=format-security -Winit-self -Wno-unknown-pragmas -Wnon-virtual-dtor -Wno-ignored-qualifiers -ftemplate-depth-1024 -Wpessimizing-move -Wredundant-move -Wstrict-null-sentinel -Woverloaded-virtual -fno-new-ttp-matching -fstack-protector-strong -fdiagnostics-color=auto -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -std=c++17 -MD -MT src/CMakeFiles/rados_snap_set_diff_obj.dir/librados/snap_set_diff.cc.o -MF src/CMakeFiles/rados_snap_set_diff_obj.dir/librados/snap_set_diff.cc.o.d -o src/CMakeFiles/rados_snap_set_diff_obj.dir/librados/snap_set_diff.cc.o -c /builddir/build/BUILD/ceph-16.2.7/src/librados/snap_set_diff.cc
In file included from /builddir/build/BUILD/ceph-16.2.7/src/include/rados/rados_types.hpp:10,
                 from /builddir/build/BUILD/ceph-16.2.7/src/librados/snap_set_diff.h:8,
                 from /builddir/build/BUILD/ceph-16.2.7/src/librados/snap_set_diff.cc:6:
/builddir/build/BUILD/ceph-16.2.7/src/include/rados/buffer.h:98:52: error: expected template-name before '<' token
   98 | struct unique_leakable_ptr : public std::unique_ptr<T, ceph::nop_delete<T>> {
      |                                                    ^
/builddir/build/BUILD/ceph-16.2.7/src/include/rados/buffer.h:98:52: error: expected '{' before '<' token
...
</pre>
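<p>For context, this is my reconstruction of the failure class, not the exact Ceph patch: gcc-12's libstdc++ stopped pulling <memory> in transitively from other standard headers, so a header that names std::unique_ptr without including <memory> itself now fails with exactly this diagnostic at the base-class position. A minimal sketch:</p>
<pre>
// My reconstruction of the failure (hypothetical repro, not the Ceph fix):
// with gcc-12, libstdc++ headers no longer include <memory> transitively,
// so naming std::unique_ptr as a base class without including <memory>
// yields "error: expected template-name before '<' token" at that line.
#include <memory>  // removing this include reproduces the error under gcc-12

template <class T>
struct nop_delete { void operator()(T*) const {} };

// Mirrors the declaration at src/include/rados/buffer.h:98.
template <class T>
struct unique_leakable_ptr : public std::unique_ptr<T, nop_delete<T>> {
  using std::unique_ptr<T, nop_delete<T>>::unique_ptr;
};

int main() {
  unique_leakable_ptr<int> p(new int(42));  // intentionally leaked: no-op deleter
  return (*p == 42) ? 0 : 1;
}
</pre>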
RADOS - Bug #53789 (Pending Backport): CommandFailedError (rados/test_python.sh): "RADOS object n...
https://tracker.ceph.com/issues/53789
2022-01-06T21:57:51Z
Laura Flores
<p>Description: rados/basic/{ceph clusters/{fixed-2 openstack} mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-low-osd-mem-target rados supported-random-distro$/{rhel_8} tasks/rados_python}</p>
<p>Failure Reason:<br /><pre><code class="text syntaxhl">Command failed (workunit test rados/test_python.sh) on smithi016 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_python.sh'
</code></pre></p>
<p>/a/lflores-2022-01-05_19:04:35-rados-wip-lflores-mgr-rocksdb-distro-default-smithi/6596519<br /><pre><code class="text syntaxhl">2022-01-05T19:57:39.635 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.TestWatchNotify.test_aio_notify ... ERROR
2022-01-05T19:57:39.644 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_rados_init_error ... ok
2022-01-05T19:57:39.704 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_rados_init ... ok
2022-01-05T19:57:39.719 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_ioctx_context_manager ... ok
2022-01-05T19:57:39.729 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_parse_argv ... ok
2022-01-05T19:57:39.734 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_parse_argv_empty_str ... ok
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:======================================================================
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:ERROR: test_rados.TestWatchNotify.test_aio_notify
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr:----------------------------------------------------------------------
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr:Traceback (most recent call last):
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr: File "/usr/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr: self.test(*self.arg)
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr: File "/home/ubuntu/cephtest/clone.client.0/src/test/pybind/test_rados.py", line 1533, in test_aio_notify
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr: assert_raises(NotConnected, watch1.check)
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr: File "/usr/lib64/python3.6/unittest/case.py", line 750, in assertRaises
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr: return context.handle('assertRaises', args, kwargs)
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr: File "/usr/lib64/python3.6/unittest/case.py", line 195, in handle
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr: callable_obj(*args, **kwargs)
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr: File "rados.pyx", line 2101, in rados.Watch.check
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr:rados.ObjectNotFound: [errno 2] RADOS object not found (check error)
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:----------------------------------------------------------------------
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:Ran 89 tests in 317.759s
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.740 INFO:tasks.workunit.client.0.smithi016.stderr:FAILED (errors=1)
2022-01-05T19:57:39.754 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-01-05T19:57:39.755 INFO:tasks.workunit:Stopping ['rados/test_python.sh'] on client.0...
2022-01-05T19:57:39.756 DEBUG:teuthology.orchestra.run.smithi016:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-01-05T19:57:40.032 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/run_tasks.py", line 91, in run_tasks
manager = run_one_task(taskname, ctx=ctx, config=config)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/run_tasks.py", line 70, in run_one_task
return task(**kwargs)
File "/home/teuthworker/src/git.ceph.com_ceph-c_11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46/qa/tasks/workunit.py", line 135, in task
coverage_and_limits=not config.get('no_coverage_and_limits', None))
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 84, in __exit__
for result in self:
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 98, in __next__
resurrect_traceback(result)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 30, in resurrect_traceback
raise exc.exc_info[1]
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 23, in capture_traceback
return func(*args, **kwargs)
File "/home/teuthworker/src/git.ceph.com_ceph-c_11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46/qa/tasks/workunit.py", line 427, in _run_tests
label="workunit test {workunit}".format(workunit=workunit)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/remote.py", line 509, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 455, in run
r.wait()
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 161, in wait
self._raise_for_status()
File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 183, in _raise_for_status
node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test_python.sh) on smithi016 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_python.sh'
</code></pre></p>
mgr - Bug #45591 (Pending Backport): mgr: FAILED ceph_assert(daemon != nullptr)
https://tracker.ceph.com/issues/45591
2020-05-18T20:31:14Z
Patrick Donnelly
pdonnell@redhat.com
<pre>
2020-05-16T11:54:45.842 INFO:tasks.ceph.mgr.x.smithi083.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1640-g543315c9344/rpm/el8/BUILD/ceph-16.0.0-1640-g543315c9344/src/mgr/DaemonServer.cc: In function 'bool DaemonServer::handle_report(ceph::ref_t<MMgrReport>&)' thread 7fe985580700 time 2020-05-16T11:54:45.841099+0000
2020-05-16T11:54:45.842 INFO:tasks.ceph.mgr.x.smithi083.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1640-g543315c9344/rpm/el8/BUILD/ceph-16.0.0-1640-g543315c9344/src/mgr/DaemonServer.cc: 610: FAILED ceph_assert(daemon != nullptr)
2020-05-16T11:54:45.844 INFO:tasks.ceph.mgr.x.smithi083.stderr: ceph version 16.0.0-1640-g543315c9344 (543315c934420269aa12ef2f9dec2c9eadb4fa6f) pacific (dev)
2020-05-16T11:54:45.844 INFO:tasks.ceph.mgr.x.smithi083.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fe9ab892d90]
2020-05-16T11:54:45.844 INFO:tasks.ceph.mgr.x.smithi083.stderr: 2: (()+0x275faa) [0x7fe9ab892faa]
2020-05-16T11:54:45.845 INFO:tasks.ceph.mgr.x.smithi083.stderr: 3: (DaemonServer::handle_report(boost::intrusive_ptr<MMgrReport> const&)+0x13fd) [0x557354eaa34d]
2020-05-16T11:54:45.845 INFO:tasks.ceph.mgr.x.smithi083.stderr: 4: (DaemonServer::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x177) [0x557354ec0e17]
2020-05-16T11:54:45.845 INFO:tasks.ceph.mgr.x.smithi083.stderr: 5: (DispatchQueue::entry()+0x126a) [0x7fe9abab1efa]
2020-05-16T11:54:45.846 INFO:tasks.ceph.mgr.x.smithi083.stderr: 6: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fe9abb549d1]
2020-05-16T11:54:45.846 INFO:tasks.ceph.mgr.x.smithi083.stderr: 7: (()+0x82de) [0x7fe9a9d1f2de]
2020-05-16T11:54:45.846 INFO:tasks.ceph.mgr.x.smithi083.stderr: 8: (clone()+0x43) [0x7fe9a88b2133]
2020-05-16T11:54:45.847 INFO:tasks.ceph.mgr.x.smithi083.stderr:*** Caught signal (Aborted) **
2020-05-16T11:54:45.847 INFO:tasks.ceph.mgr.x.smithi083.stderr: in thread 7fe985580700 thread_name:ms_dispatch
</pre>
<p>From: /ceph/teuthology-archive/pdonnell-2020-05-16_06:07:05-fs-wip-pdonnell-testing-20200516.030215-distro-basic-smithi/5060503/teuthology.log</p>