Ceph bluestore - Bug #52464: FAILED ceph_assert(current_shard->second->valid())
https://tracker.ceph.com/issues/52464

Updated by Jeff Layton (jlayton@redhat.com) on 2021-08-31T14:05:55Z (https://tracker.ceph.com/issues/52464?journal_id=202198)
<ul><li><strong>Subject</strong> changed from <i>osd crash in bluestore code</i> to <i>FAILED ceph_assert(current_shard->second->valid())</i></li></ul>

Updated by Neha Ojha (nojha@redhat.com) on 2021-08-31T14:32:32Z (https://tracker.ceph.com/issues/52464?journal_id=202203)
<ul><li><strong>Assignee</strong> set to <i>Gabriel BenHanokh</i></li></ul><p>Gabi, I am assigning it to you for now, since this looks related to NCB.</p>

Updated by Neha Ojha (nojha@redhat.com) on 2021-12-01T01:56:57Z (https://tracker.ceph.com/issues/52464?journal_id=206800)
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Can't reproduce</i></li></ul>

Updated by Radoslaw Zarzynski (rzarzyns@redhat.com) on 2022-09-20T17:21:18Z (https://tracker.ceph.com/issues/52464?journal_id=225998)
<ul><li><strong>Status</strong> changed from <i>Can't reproduce</i> to <i>New</i></li></ul><p>It happened again: <a class="external" href="https://github.com/rook/rook/issues/10936">https://github.com/rook/rook/issues/10936</a>. For logs see: <a class="external" href="https://github.com/rook/rook/issues/10936#issuecomment-1241428968">https://github.com/rook/rook/issues/10936#issuecomment-1241428968</a>.</p>

Updated by Yaarit Hatuka on 2022-09-20T21:51:54Z (https://tracker.ceph.com/issues/52464?journal_id=226004)
<ul></ul><p>Also reported via telemetry:<br /><a class="external" href="http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=245d51f4792af37e9d6d2e5ae03b75476234c20337d0320971b789032bd76e30&amp;orgId=1">http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=245d51f4792af37e9d6d2e5ae03b75476234c20337d0320971b789032bd76e30&amp;orgId=1</a></p>
<p>Please note that the earliest version in these reports is 16.2.0 and the most recent one is 16.2.9.<br />Some of the reporting clusters upgraded to 17.2.0, 17.2.1, and 17.2.3, but there are no crashes from these versions yet.</p>

Updated by Igor Fedotov (igor.fedotov@croit.io) on 2022-10-20T10:46:45Z (https://tracker.ceph.com/issues/52464?journal_id=227629)
<ul></ul><p>Neha Ojha wrote:</p>
<blockquote>
<p>Gabi, I am assigning it to you for now, since this looks related to NCB.</p>
</blockquote>
<p>No, apparently this is unrelated to NCB - one can see the same issue in 16.2.10 (as pointed out by Radek in comment #4), which lacks the NCB stuff...</p>

Updated by Igor Fedotov (igor.fedotov@croit.io) on 2022-10-20T10:52:27Z (https://tracker.ceph.com/issues/52464?journal_id=227630)
<ul><li><strong>Assignee</strong> changed from <i>Gabriel BenHanokh</i> to <i>Adam Kupczyk</i></li></ul><p>IMO this is rather related to the DB sharding introduced by <a class="external" href="https://github.com/ceph/ceph/pull/34006">https://github.com/ceph/ceph/pull/34006</a>.<br />Hence reassigning to Adam.</p>

Updated by Michael Collins on 2024-03-01T02:00:22Z (https://tracker.ceph.com/issues/52464?journal_id=255760)
<ul></ul><p>We've run into this bug twice recently. We don't have any urgency about re-provisioning these disks and are happy to run any tests you can think of to gain more insight into the problem:</p>
<p><a class="external" href="https://www.reddit.com/r/ceph/comments/1b3i0dl/osd_failed_ceph_assertret/">https://www.reddit.com/r/ceph/comments/1b3i0dl/osd_failed_ceph_assertret/</a></p>

Updated by yu song on 2024-03-05T06:51:04Z (https://tracker.ceph.com/issues/52464?journal_id=256013)
<ul></ul><p>We encountered the same issue on Ceph 16.2.10 and suspect it may be related to iterator reuse in RocksDB (see RocksDB PR <a class="external" href="https://github.com/facebook/rocksdb/pull/9258">https://github.com/facebook/rocksdb/pull/9258</a>). We are unsure how to reproduce this problem in BlueStore.</p>

Updated by Igor Fedotov (igor.fedotov@croit.io) on 2024-03-05T09:30:15Z (https://tracker.ceph.com/issues/52464?journal_id=256028)
<ul></ul><p>Michael Collins wrote:</p>
<blockquote>
<p>We've run into this bug twice recently. We don't have any urgency about re-provisioning these disks and are happy to run any tests you can think of to gain more insight into the problem:</p>
<p><a class="external" href="https://www.reddit.com/r/ceph/comments/1b3i0dl/osd_failed_ceph_assertret/">https://www.reddit.com/r/ceph/comments/1b3i0dl/osd_failed_ceph_assertret/</a></p>
</blockquote>
<p>Hi Michael,</p>
<p>I'm not sure why you think your issue is related to this ticket; the backtrace and the assertion are completely different. Here is the one from your Reddit post:</p>
<pre>
Mar 01 09:30:31 storage-09-08162 ceph-osd[3072960]: -4> 2024-03-01T09:30:31.692+0800 7fe158d2d700 -1 bdev(0x557579976000 /var/lib/ceph/osd/ceph-1141/block) _aio_thread got r=-61 ((61) No data available)
Mar 01 09:30:31 storage-09-08162 ceph-osd[3072960]: -3> 2024-03-01T09:30:31.692+0800 7fe158d2d700 -1 bdev(0x557579976000 /var/lib/ceph/osd/ceph-1141/block) _aio_thread translating the error to EIO for upper layer
Mar 01 09:30:31 storage-09-08162 ceph-osd[3072960]: -2> 2024-03-01T09:30:31.784+0800 7fe163dc63c0 -1 osd.1141 0 failed to load OSD map for epoch 1524611, got 0 bytes
Mar 01 09:30:31 storage-09-08162 ceph-osd[3072960]: -1> 2024-03-01T09:30:31.884+0800 7fe163dc63c0 -1 /build/ceph-17.2.6/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fe163dc63c0 time 2024-03-01T09:30:31.793897+0800
Mar 01 09:30:31 storage-09-08162 ceph-osd[3072960]: /build/ceph-17.2.6/src/osd/OSD.h: 704: FAILED ceph_assert(ret)
</pre>
<p>To me this looks like a low-level/hardware issue - BlueStore is unable to read from the underlying block device.</p>
<p>You might want to check the dmesg output for more information, and/or run smartctl -a against this disk's device. You can also use the dd tool to check whether the disk is readable.</p>

Updated by Igor Fedotov (igor.fedotov@croit.io) on 2024-03-05T09:32:46Z (https://tracker.ceph.com/issues/52464?journal_id=256029)
<ul></ul><p>Ah... I just reread Michael's post on Reddit - smartctl reports unrecoverable disk errors, according to it. I believe that's the culprit.</p>