<p>Ceph (https://tracker.ceph.com/): bluestore - Bug #23333: bluestore: ENODATA on aio</p>
<p>Sage Weil (sage@newdream.net), 2018-03-14T13:57:37Z, https://tracker.ceph.com/issues/23333?journal_id=109115</p>
<ul><li><strong>Project</strong> changed from <i>RADOS</i> to <i>bluestore</i></li><li><strong>Subject</strong> changed from <i>One OSD constantly crashes</i> to <i>bluestore: ENODATA on aio</i></li><li><strong>Status</strong> changed from <i>New</i> to <i>12</i></li></ul>
<p>Radoslaw Zarzynski (rzarzyns@redhat.com), 2018-03-19T13:50:20Z, https://tracker.ceph.com/issues/23333?journal_id=109287</p>
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Need More Info</i></li><li><strong>Assignee</strong> set to <i>Radoslaw Zarzynski</i></li></ul><p>This looks really interesting. The assertion failure came from <em>io_getevents</em> (called by one of the <em>bstore_aio</em> threads), which finished with <em>ENODATA</em>. Surprisingly, <em>man 2 io_getevents</em> doesn't list <em>ENODATA</em> as a possible error code. Maybe it's just undocumented, or maybe it originates in a lower layer (a driver?). What is certain is that BlueStore can currently handle only <em>EINTR</em> in a way other than an assertion.</p>
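<p>The handling described above can be sketched as a small Python model. This is not BlueStore's code: <em>reap</em> is a hypothetical stand-in for one <em>io_getevents</em>-style call that returns an event count or a negative errno, and the assertion is modelled as a raised <em>AssertionError</em>.</p>

```python
import errno


def reap_with_retry(reap):
    """Call reap() until it returns a non-negative event count.

    reap is a hypothetical stand-in for one io_getevents()-style call: it
    returns the number of completed events, or a negative errno on failure.
    Only -EINTR is retried; any other error is raised, mimicking an assertion.
    """
    while True:
        r = reap()
        if r >= 0:
            return r  # number of completed events
        if r == -errno.EINTR:
            continue  # interrupted by a signal: simply call again
        name = errno.errorcode.get(-r, str(r))
        raise AssertionError(f"unexpected reap error: {name}")


if __name__ == "__main__":
    # Two interrupted calls, then a call that reaps 3 events.
    results = iter([-errno.EINTR, -errno.EINTR, 3])
    print(reap_with_retry(lambda: next(results)))  # prints 3
```

<p>Under this policy any negative result other than <em>-EINTR</em>, including <em>-ENODATA</em>, takes the assertion path.</p>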
<p>Could you please provide more information about the OS and hardware used? The kernel version is especially important, as it seems necessary to look at the implementation details.</p>
<p>Robert Sander (ceph@gurubert.de), 2018-03-19T15:01:10Z, https://tracker.ceph.com/issues/23333?journal_id=109291</p>
<p>Radoslaw Zarzynski wrote:</p>
<blockquote>
<p>Could you please provide more information about the OS and hardware used? The kernel version is especially important, as it seems necessary to look at the implementation details.</p>
</blockquote>
<p>The Ceph OSDs run on Ubuntu 16.04.4 with kernel 4.13.0-36-generic on Dell PowerEdge R720xd servers with Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz CPUs. The HDDs and SSDs are connected to an LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 controller.</p>
<p>The BlueStore OSDs are on 4TB HDDs and are configured to use a 10GB partition on an SSD for the WAL/DB. The size is not optimal, but it was used because this partition had been the FileStore journal before the migration to BlueStore.</p>
<p>Do you need any other details?</p>
<p>Radoslaw Zarzynski (rzarzyns@redhat.com), 2018-03-19T17:17:48Z, https://tracker.ceph.com/issues/23333?journal_id=109328</p>
<p>Hi Robert,</p>
<p>thanks for providing the info! I've taken a look at the implementation of <em>io_getevents</em> in your kernel but can't find any obvious way to get <em>ENODATA</em>. Maybe it could (?) be computed accidentally by the loop in <a href="https://lxr.missinglinkelectronics.com/linux+v4.13/fs/aio.c#L1224" class="external"><em>aio_read_events_ring</em></a> (link to the vanilla kernel), but that seems improbable to me. Before diving into improbabilities, let's quickly separate user space from kernel space to rule out, e.g., error-code corruption in user space.</p>
<p>Could you try running the OSD under, e.g., <em>strace</em>? Something like <em>strace -f -e trace=io_getevents</em> would be really useful.</p>
<p>Radoslaw Zarzynski (rzarzyns@redhat.com), 2018-03-19T20:55:41Z, https://tracker.ceph.com/issues/23333?journal_id=109332</p>
<p>Errata: the issue is related to <em>aio_t::get_return_value</em>, not to <em>io_getevents</em>. Our debug message is misleading; I've just sent a <a href="https://github.com/ceph/ceph/pull/20963" class="external">pull request</a> to fix that.</p>
There is no point in getting output from <em>strace</em> at the moment. Instead, could you please:
<ul>
<li>provide log with increased debug log levels (<em>debug_osd = 20</em> and <em>debug_bdev = 30</em>),</li>
<li>check the hardware for failures (<em>dmesg</em>, SMART etc.)?</li>
</ul>
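<p>The distinction in the errata above can be illustrated with a small Python model: <em>io_getevents</em> itself can succeed and return a count of reaped events while an individual event still carries a negative errno in its <em>res</em> field. Only the field names follow the kernel's <em>struct io_event</em>; <em>IoEvent</em> and <em>get_return_value</em> are hypothetical stand-ins, not Ceph's <em>aio_t</em> implementation.</p>

```python
import errno
from dataclasses import dataclass


@dataclass
class IoEvent:
    """Mirror of the fields of the kernel's struct io_event (linux/aio_abi.h)."""
    data: int  # user cookie copied from the submitted iocb
    obj: int   # address of the submitted iocb
    res: int   # bytes transferred, or a negative errno for this one I/O
    res2: int  # secondary result, normally 0


def get_return_value(event):
    """Hypothetical analogue of aio_t::get_return_value: the per-I/O result."""
    return event.res


if __name__ == "__main__":
    # The reaping call succeeded and returned one event...
    events = [IoEvent(data=1, obj=0x1000, res=-errno.ENODATA, res2=0)]
    reaped = len(events)
    # ...but the I/O described by that event failed.
    r = get_return_value(events[0])
    print(reaped, errno.errorcode[-r])  # prints: 1 ENODATA
```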
<p>Also, big thanks to Mr Marcin Gibula for the consultations.</p>
<p>Robert Sander (ceph@gurubert.de), 2018-03-19T21:22:23Z, https://tracker.ceph.com/issues/23333?journal_id=109333</p>
<p>Radoslaw Zarzynski wrote:</p>
<blockquote>
Instead, could you please:
<ul>
<li>provide log with increased debug log levels (<em>debug_osd = 20</em> and <em>debug_bdev = 30</em>),</li>
<li>check the hardware for failures (<em>dmesg</em>, SMART etc.)?</li>
</ul>
</blockquote>
<p>The issue happened 36 times between March 12 6:31 and March 13 15:55, but not even once after that.<br />It's a little bit strange, but it seems to have healed itself.<br />I am not sure I want to increase the debug output on this production cluster.</p>
<p>Robert Sander (ceph@gurubert.de), 2018-03-19T21:25:02Z, https://tracker.ceph.com/issues/23333?journal_id=109334</p>
<p>Robert Sander wrote:</p>
<blockquote>
<p>It's a little bit strange, but it seems to have healed itself.</p>
</blockquote>
<p>Looking around, I just found related entries in /var/log/kern.log:</p>
<p>Mar 13 15:55:45 ceph02 kernel: [362540.919365] sd 0:0:4:0: [sde] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE<br />Mar 13 15:55:45 ceph02 kernel: [362540.919388] sd 0:0:4:0: [sde] tag#4 Sense Key : Medium Error [current] <br />Mar 13 15:55:45 ceph02 kernel: [362540.919394] sd 0:0:4:0: [sde] tag#4 Add. Sense: Read retries exhausted<br />Mar 13 15:55:45 ceph02 kernel: [362540.919401] sd 0:0:4:0: [sde] tag#4 CDB: Read(16) 88 00 00 00 00 01 38 af 66 f8 00 00 00 60 00 00<br />Mar 13 15:55:45 ceph02 kernel: [362540.919407] print_req_error: critical medium error, dev sde, sector 5245986552</p>
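<p>The CDB in the log can be decoded by hand to confirm which read failed. A minimal Python sketch, assuming the standard SCSI READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte transfer length in bytes 10 to 13); <em>decode_read16</em> is a hypothetical helper:</p>

```python
def decode_read16(cdb_hex):
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes.

    Layout: byte 0 opcode (0x88), byte 1 flags, bytes 2-9 logical block
    address (big-endian), bytes 10-13 transfer length in blocks
    (big-endian), byte 14 group number, byte 15 control.
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")
    length = int.from_bytes(cdb[10:14], "big")
    return lba, length


if __name__ == "__main__":
    lba, length = decode_read16("88 00 00 00 00 01 38 af 66 f8 00 00 00 60 00 00")
    print(lba, length)  # prints: 5245986552 96
```

<p>The decoded LBA equals the sector that <em>print_req_error</em> reports, and the request covered 96 blocks. Since the kernel reports sectors in 512-byte units while the CDB uses the device's logical block size, the match suggests 512-byte logical blocks on this drive.</p>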
<p>Looking back, it was always the same sector.</p>
<p>Radoslaw Zarzynski (rzarzyns@redhat.com), 2018-03-20T12:57:53Z, https://tracker.ceph.com/issues/23333?journal_id=109361</p>
<ul><li><strong>Status</strong> changed from <i>Need More Info</i> to <i>In Progress</i></li></ul><blockquote>
<p>Mar 13 15:55:45 ceph02 kernel: [362540.919407] print_req_error: critical medium error, dev sde, sector 5245986552</p>
</blockquote>
<p>This explains a lot. The <a href="https://lxr.missinglinkelectronics.com/linux/block/blk-core.c#L143" class="external">associated</a> <em>errno</em> for <em>BLK_STS_MEDIUM</em> is <em>ENODATA</em>. Thanks, Robert!</p>
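<p>The mapping referred to above can be sketched in Python. The table below is a partial, hand-transcribed copy of the block layer's status-to-errno table in <em>block/blk-core.c</em> (verify it against the kernel actually in use), and it uses Linux-only errno names such as <em>EBADE</em> and <em>EREMOTEIO</em>. <em>is_block_layer_error</em> is a hypothetical predicate, not the change that was eventually merged; it only shows that a check comparing against <em>-EIO</em> alone would miss <em>-ENODATA</em>.</p>

```python
import errno

# Partial, hand-transcribed subset of the block layer's status-to-errno
# table (block/blk-core.c). Linux-only errno names are used, as in the kernel.
BLK_STATUS_TO_ERRNO = {
    "BLK_STS_NOTSUPP": errno.EOPNOTSUPP,
    "BLK_STS_TIMEOUT": errno.ETIMEDOUT,
    "BLK_STS_NOSPC": errno.ENOSPC,
    "BLK_STS_TRANSPORT": errno.ENOLINK,
    "BLK_STS_TARGET": errno.EREMOTEIO,
    "BLK_STS_NEXUS": errno.EBADE,
    "BLK_STS_MEDIUM": errno.ENODATA,
    "BLK_STS_PROTECTION": errno.EILSEQ,
    "BLK_STS_RESOURCE": errno.ENOMEM,
    "BLK_STS_IOERR": errno.EIO,  # fallback when nothing more specific fits
}


def is_block_layer_error(r):
    """Hypothetical predicate: is the negative result r one of the errnos above?"""
    return r < 0 and -r in BLK_STATUS_TO_ERRNO.values()


if __name__ == "__main__":
    r = -BLK_STATUS_TO_ERRNO["BLK_STS_MEDIUM"]
    print(errno.errorcode[-r], is_block_layer_error(r), r == -errno.EIO)
    # prints: ENODATA True False
```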
<p>@Sage, do we want to alter our <em>allow_eio</em> policy? The kernel can set many other <em>E*</em> codes, <a href="https://lxr.missinglinkelectronics.com/linux/block/blk-core.c#L151" class="external">treating</a> <em>EIO</em> more as a fallback when nothing more specific fits.</p>
<p>Yuri Weinstein (yweinste@redhat.com), 2018-03-20T22:32:59Z, https://tracker.ceph.com/issues/23333?journal_id=109392</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-8 priority-6 priority-high2 closed" href="/issues/23426">Bug #23426</a>: aio thread got No space left on device</i> added</li></ul>
<p>Radoslaw Zarzynski (rzarzyns@redhat.com), 2018-04-09T18:54:19Z, https://tracker.ceph.com/issues/23333?journal_id=110617</p>
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Fix Under Review</i></li></ul><p>PR: <a class="external" href="https://github.com/ceph/ceph/pull/21306">https://github.com/ceph/ceph/pull/21306</a>.</p>
<p>Radoslaw Zarzynski (rzarzyns@redhat.com), 2018-04-09T18:54:47Z, https://tracker.ceph.com/issues/23333?journal_id=110618</p>
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>7</i></li></ul>
<p>Kefu Chai (tchaikov@gmail.com), 2018-04-11T15:07:00Z, https://tracker.ceph.com/issues/23333?journal_id=110903</p>
<ul><li><strong>Status</strong> changed from <i>7</i> to <i>Pending Backport</i></li></ul><p>I believe this change should be backported.</p>
<p>Nathan Cutler (ncutler@suse.cz), 2018-04-12T01:38:42Z, https://tracker.ceph.com/issues/23333?journal_id=110981</p>
<ul><li><strong>Backport</strong> set to <i>luminous</i></li></ul>
<p>Nathan Cutler (ncutler@suse.cz), 2018-04-12T01:39:17Z, https://tracker.ceph.com/issues/23333?journal_id=110982</p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/23672">Backport #23672</a>: luminous: bluestore: ENODATA on aio</i> added</li></ul>
<p>Nathan Cutler (ncutler@suse.cz), 2018-05-23T18:36:43Z, https://tracker.ceph.com/issues/23333?journal_id=113911</p>
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>