Ceph - Bug #13937: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())
https://tracker.ceph.com/issues/13937

Updated by Markus Blank-Burian (burian@muenster.de), 2015-12-01T16:01:12Z

Files added: logs.zip.001 through logs.zip.005 (attachments 2099-2103)

The first error on one of the two OSDs was actually this one (see logs):
<pre>
osd/ReplicatedPG.cc: 8071: FAILED assert(repop_queue.front() == repop)
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55e69ea65e73]
2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xfc4) [0x55e69e735234]
3: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x92) [0x55e69e7352f0]
4: (Context::complete(int)+0x9) [0x55e69e59bd6d]
5: (ECBackend::check_op(ECBackend::Op*)+0x12f) [0x55e69e8e21f9]
6: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply&)+0xad) [0x55e69e8e272b]
7: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x446) [0x55e69e8f9e96]
8: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55e69e70ea4e]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55e69e56fc1e]
10: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55e69e56fe52]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55e69e5872c7]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55e69ea53b16]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e69ea57060]
14: (Thread::entry_wrapper()+0x64) [0x55e69ea47c2c]
15: (()+0x7176) [0x7f5a6c289176]
16: (clone()+0x6d) [0x7f5a6a55d49d]
</pre>

Updated by Markus Blank-Burian (burian@muenster.de), 2015-12-02T07:31:35Z

Increasing the log level, I found that osd.27 was returning error -2 (ENOENT) for one of the EC shards. After killing osd.27 and letting osd.40 and osd.41 run for some time, the cluster now seems stable. Maybe this is connected to one of the 7 unfound objects the cluster had while osd.40 and osd.41 were down.
At the moment I have osd.40/41 running with debug osd 0/20 and osd.27 with 20/20 in case the error reappears. I will also do some further stress tests.
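
To make the failure mode concrete, here is a self-contained toy model (plain C++ with invented names, not Ceph source) of why a single shard returning -2 is enough to bring the OSD down: the recovery-read completion named in the backtraces asserts that the per-shard error map is empty, so one failed shard read becomes a process-ending assert instead of a degraded recovery.
<pre>
// Toy model of the failing check ("FAILED assert(res.errors.empty())" in
// osd/ECBackend.cc). All names below are illustrative, not Ceph's.
#include <cassert>
#include <map>
#include <vector>

struct ReadResult {
  // shard id -> errno-style result, e.g. -2 (-ENOENT) when the shard's piece is gone
  std::map<int, int> errors;
  std::vector<char> data;   // reconstructed object data when every shard read cleanly
};

void on_recovery_read_complete(const ReadResult& res) {
  // Mirrors the assert in the backtrace: any per-shard error is treated as
  // impossible, so the process aborts instead of handling the bad shard.
  assert(res.errors.empty());
  // ... normally the reconstructed data would be pushed to the shards missing it ...
}

int main() {
  ReadResult res;
  res.errors[27] = -2;             // what was observed here: osd.27 returned -2 for its shard
  on_recovery_read_complete(res);  // aborts, matching the reported crash
}
</pre>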

Updated by Samuel Just (sjust@redhat.com), 2016-01-26T15:55:43Z

Related to Feature #14513 (Test and improve ec handling of reads on objects with shards unexpectedly missing on a replica) added.

Updated by Samuel Just (sjust@redhat.com), 2016-01-26T15:55:54Z

Status changed from New to Can't reproduce.

I think this is an actual shortcoming with the way we handle corrupt OSDs; I'll open a feature ticket to improve that.

Updated by Kefu Chai (tchaikov@gmail.com), 2016-05-11T02:45:18Z

https://jenkins.ceph.com/job/ceph-pull-requests/5445/console
<pre>
0> 2016-05-10 21:45:26.748284 2b3e08339700 -1 osd/ECBackend.cc: In function 'virtual void OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)' thread 2b3e08339700 time 2016-05-10 21:45:26.741352
osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())
ceph version 10.2.0-833-g3d9f6c2 (3d9f6c27e4983a5c1a8ba3b9849e5a38f191541c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2b3df1a3857b]
2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2b3df15eedb2]
3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2b3df15dc769]
4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2b3df15d2bd3]
5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2b3df15d3ba8]
6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2b3df15dae16]
7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2b3df151785d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2b3df13d7955]
9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2b3df13d7b7d]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x657) [0x2b3df13dc367]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2b3df1a28a77]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2b3df1a2a9a0]
13: (()+0x8182) [0x2b3dfbf32182]
14: (clone()+0x6d) [0x2b3dfde4147d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>

Updated by David Zafman (dzafman@redhat.com), 2016-05-11T17:19:28Z

Status changed from Can't reproduce to 12. Assignee set to David Zafman.

Seen again recently, but the Jenkins job isn't accessible any more. Yuri saw this during a make check run of test/osd/osd-scrub-repair.sh.

osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())
<pre>
ceph version 10.2.0-757-g8aa0d57 (8aa0d571088defcdedb0844568f262cc43f5c23c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2ae0f6824b0b]
2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2ae0f63d79e2]
3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2ae0f63c5399]
4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2ae0f63bb803]
5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2ae0f63bc7d8]
6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2ae0f63c3a46]
7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2ae0f62fcf8d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2ae0f61bc5c5]
9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2ae0f61bc7ed]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x879) [0x2ae0f61c1219]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2ae0f6815007]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2ae0f6816f30]
13: (()+0x8182) [0x2ae100cbb182]
14: (clone()+0x6d) [0x2ae102bca47d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>

Updated by David Zafman (dzafman@redhat.com), 2016-05-13T18:19:03Z

This failure has been exposed by the parallelization of part of osd-scrub-repair.sh, which deletes objects from 3 OSDs simultaneously. That involves taking each OSD down and using ceph-objectstore-tool to remove the object. When the OSDs come back up together, recovery on the primary tries to recover a missing object, but it is also missing on the secondaries. This error case wasn't handled because it usually involves corruption and would normally be very rare.

Either we need to revert the parallelization or try to handle this error. To handle the error we need to fix the crash and make sure recovery can complete. The test expects to end up with an unfound object after repair.
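
A sketch of the direction described above (continuing the earlier toy model; invented names, not the change from the PR referenced later in this thread): instead of asserting, the read completion could hand the per-shard errors back to recovery, so the object can be recorded as unfound and recovery of the rest of the PG can complete.
<pre>
// Toy sketch only; this is not Ceph source and not the eventual fix.
#include <iostream>
#include <map>
#include <vector>

struct ReadResult {
  std::map<int, int> errors;   // shard id -> errno-style error
  std::vector<char> data;
};

// Hypothetical hook: record the object as unfound/unrecoverable instead of crashing.
void mark_recovery_failed(const std::map<int, int>& shard_errors) {
  for (const auto& e : shard_errors)
    std::cerr << "recovery read failed on shard " << e.first
              << " with error " << e.second << "\n";
}

void on_recovery_read_complete(const ReadResult& res) {
  if (!res.errors.empty()) {
    mark_recovery_failed(res.errors);   // degrade gracefully rather than abort the OSD
    return;
  }
  // ... push res.data to the shards that are missing the object ...
}

int main() {
  ReadResult res;
  res.errors[27] = -2;            // same scenario as before: one shard is missing its piece
  on_recovery_read_complete(res); // now reports the failure and continues instead of asserting
}
</pre>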

Updated by Aaron Bassett (aaron.bassett@nantomics.com), 2016-05-16T11:21:40Z

I've just hit this issue in a pre-production cluster. Using it with radosgw/civetweb for object storage, I have one storage pool using EC 2+1 for less important data. On one PG, I had one OSD go down with a bad disk, and another kept crashing with this error. After some snooping, it turned out the third OSD for that PG also had a bad disk and was throwing a read error on certain pieces, which would cause the primary OSD to crash with `FAILED assert(res.errors.empty())`. Unfortunately, at this point, these were the only two copies left. I was able to get things moving again by stopping both OSDs, removing the unreadable piece from both, then starting them again and waiting for the one on the bad disk to find the next bad part. Once I had removed everything it couldn't read, they were able to finish backfilling and I took the bad one out.

Obviously this caused data loss, but I'm OK with that since it's somewhat expected in a 2+1 pool, and I am still in the process of populating data anyway, so I kept track of the bad (rgw) keys and can delete and re-upload them.

Updated by Kefu Chai (tchaikov@gmail.com), 2016-05-17T09:19:35Z

https://jenkins.ceph.com/job/ceph-pull-requests/5688/consoleFull

Updated by Aaron Bassett (aaron.bassett@nantomics.com), 2016-05-17T14:01:38Z

I believe I've just hit this again in the same cluster. It looks like another bad disk, returning error -5 (EIO) on just a small handful of reads. When I took it out, it started crashing other OSDs as the cluster tried to backfill from it. It seems like a pretty bad situation to me.

Updated by Sage Weil (sage@newdream.net), 2016-05-18T21:08:41Z

David: it seems to me like we should handle the error, especially since it seems to happen in the real world (not just in the test).

Updated by Mustafa Muhammad (mustafa1024m@gmail.com), 2016-05-21T09:51:47Z

This affects my production cluster too; what can I do for the time being? Currently I am setting osd_recovery_max_active to 0 on the crashed OSD.

Updated by Dan van der Ster, 2016-05-25T11:14:55Z

+1 on a pre-prod jewel cluster here.

Updated by Markus Blank-Burian (burian@muenster.de), 2016-05-25T11:21:52Z

As a workaround, you can increase OSD logging to 0/20 on the crashing OSD. Then, shortly before the assert, there is a log message that shows the error codes and the shards which produced the errors. You can then stop the OSD(s) that produced the errors, start the crashing OSD again, and let recovery finish with the remaining OSDs. Once recovery of the affected PG is complete, you can start the stopped OSD(s) again.

Updated by Dan van der Ster, 2016-05-25T12:08:03Z

Cool, got that fixed. In my case this was an 8+3 pool, and 3 OSDs were crashing, while 1 of the 8 had the read error. So I had to juggle which OSDs were running and use osd_recovery_max_active=0 to let some run to recover the others. All good now.

Updated by David Zafman (dzafman@redhat.com), 2016-05-25T16:21:44Z

Status changed from 12 to In Progress.

Updated by Ken Dreyer (kdreyer@redhat.com), 2016-07-21T02:07:15Z

PR for master here: https://github.com/ceph/ceph/pull/9304 (currently marked DNM)

Updated by Pavan Rallabhandi, 2016-09-06T15:23:24Z

We are seeing this on Jewel version 10.2.2, with the RGW data pool on EC (8+3) jerasure and the failure domain set to host.

We had a drive failure, resulting in some of the PGs ending up in an inconsistent state, and the eventual down/out of the respective OSD caused a couple of other OSDs to go down with the reported stack trace.

Updated by Kefu Chai (tchaikov@gmail.com), 2016-09-20T04:15:01Z

Note: we should revert https://github.com/ceph/ceph/commit/5bc55338f5a1645bc651811fae2f89ad855ff86e#diff-1b6e7a54c1cf4f286f5835c21abd065dL4 once this issue is resolved.

Updated by Kefu Chai (tchaikov@gmail.com), 2016-10-28T09:21:54Z

Status changed from In Progress to Resolved.

Updated by Dan van der Ster, 2016-10-29T09:02:32Z

Will there be a jewel backport for this?

Updated by Aaron T (ceph-redmine@aarontc.com), 2016-11-20T09:57:36Z

Dan van der Ster wrote:

> Will there be a jewel backport for this?

Also curious about a Jewel backport, as my production Jewel cluster is encountering this problem regularly now.

Updated by Nathan Cutler (ncutler@suse.cz), 2016-11-20T21:31:25Z

Status changed from Resolved to Pending Backport. Backport set to jewel.

Updated by Nathan Cutler (ncutler@suse.cz), 2016-11-20T21:32:25Z

master PRs: https://github.com/ceph/ceph/pull/9304 and https://github.com/ceph/ceph/pull/11449

Updated by Nathan Cutler (ncutler@suse.cz), 2016-11-20T21:36:06Z

Copied to Backport #17970 (jewel: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())) added.

Updated by Nathan Cutler (ncutler@suse.cz), 2017-04-07T04:28:59Z

Status changed from Pending Backport to Resolved.

Updated by David Zafman (dzafman@redhat.com), 2017-06-29T19:31:55Z

Duplicated by Bug #13493 (osd: for ec, cascading crash during recovery if one shard is corrupted) added.