Bug #13937

osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())

Added by Markus Blank-Burian almost 2 years ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
Start date: 12/01/2015
Due date: -
% Done: 0%
Source: other
Tags: -
Backport: jewel
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Release: -
Needs Doc: No

Description

I am getting the following error with ceph v9.2.0 on two different OSDs. The error occurs shortly after startup.

 osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55b5305f5e73]
 2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0xe3) [0x55b5304987af]
 3: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x59) [0x55b530484ad1]
 4: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x1008) [0x55b530485ea6]
 5: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x216) [0x55b530489c66]
 6: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55b53029ea4e]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55b5300ffc1e]
 8: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55b5300ffe52]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55b5301172c7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55b5305e3b16]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b5305e7060]
 12: (Thread::entry_wrapper()+0x64) [0x55b5305d7c2c]
 13: (()+0x7176) [0x7f4cc5cdd176]
 14: (clone()+0x6d) [0x7f4cc3fb149d]

Configuration is rather simple:
3x mirror ("data")
3x mirror ("datahot", cache-tier for "dataec")
8+3 jerasure ("dataec")

pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 72574 flags hashpspool crash_replay_interval 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 4 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 72577 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 31 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 5 'datahot' replicated size 3 min_size 2 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 185334 flags hashpspool,incomplete_clones tier_of 6 cache_mode writeback target_bytes 1099511627776 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
pool 6 'dataec' erasure size 11 min_size 8 crush_ruleset 4 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 185334 lfor 185334 flags hashpspool tiers 5 read_tier 5 write_tier 5 stripe_width 131072

The data and datahot pools are used as data pools for CephFS. All pools share the same OSDs.
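For reference, a setup like the one above could be recreated with commands along these lines. This is a sketch only, using the jewel-era CLI: the profile name `ec83` is illustrative, while the pool names, pg counts, and target_max_bytes match the pool dump above.

```shell
# Sketch: an 8+3 jerasure pool fronted by a writeback cache tier.
ceph osd erasure-code-profile set ec83 k=8 m=3 plugin=jerasure
ceph osd pool create dataec 1024 1024 erasure ec83
ceph osd pool create datahot 128 128 replicated

# Attach "datahot" as a writeback cache tier in front of "dataec".
ceph osd tier add dataec datahot
ceph osd tier cache-mode datahot writeback
ceph osd tier set-overlay dataec datahot
ceph osd pool set datahot target_max_bytes 1099511627776   # 1 TiB
```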

logs.zip.001 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM

logs.zip.002 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM

logs.zip.003 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM

logs.zip.004 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM

logs.zip.005 (503 KB) Markus Blank-Burian, 12/01/2015 04:00 PM


Related issues

Related to Ceph - Feature #14513: Test and improve ec handling of reads on objects with shards unexpectedly missing on a replica Resolved 01/26/2016
Duplicated by RADOS - Bug #13493: osd: for ec, cascading crash during recovery if one shard is corrupted Duplicate 10/15/2015
Copied to Ceph - Backport #17970: jewel: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) Resolved

History

#1 Updated by Markus Blank-Burian almost 2 years ago

The first error on one of the two OSDs was actually this one (see logs):

osd/ReplicatedPG.cc: 8071: FAILED assert(repop_queue.front() == repop)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55e69ea65e73]
 2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xfc4) [0x55e69e735234]
 3: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x92) [0x55e69e7352f0]
 4: (Context::complete(int)+0x9) [0x55e69e59bd6d]
 5: (ECBackend::check_op(ECBackend::Op*)+0x12f) [0x55e69e8e21f9]
 6: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply&)+0xad) [0x55e69e8e272b]
 7: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x446) [0x55e69e8f9e96]
 8: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55e69e70ea4e]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55e69e56fc1e]
 10: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55e69e56fe52]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55e69e5872c7]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55e69ea53b16]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e69ea57060]
 14: (Thread::entry_wrapper()+0x64) [0x55e69ea47c2c]
 15: (()+0x7176) [0x7f5a6c289176]
 16: (clone()+0x6d) [0x7f5a6a55d49d]

#2 Updated by Markus Blank-Burian almost 2 years ago

Increasing the log level, I found that osd.27 was returning error -2 for some EC shard. So after killing osd.27 and letting osd.40 and osd.41 run for some time, it seems to be stable now. Maybe this is connected to one of the 7 unfound objects the cluster had when osd.40 and osd.41 were down.
At the moment I have osd.40/41 running with debug osd 0/20 and osd.27 with 20/20 in case the error reappears. I will also do some further stress tests.

#3 Updated by Samuel Just almost 2 years ago

  • Related to Feature #14513: Test and improve ec handling of reads on objects with shards unexpectedly missing on a replica added

#4 Updated by Samuel Just almost 2 years ago

  • Status changed from New to Can't reproduce

I think this is an actual shortcoming in the way we handle corrupt OSDs; I'll open a feature ticket to improve that.

#5 Updated by Kefu Chai over 1 year ago

https://jenkins.ceph.com/job/ceph-pull-requests/5445/console

     0> 2016-05-10 21:45:26.748284 2b3e08339700 -1 osd/ECBackend.cc: In function 'virtual void OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)' thread 2b3e08339700 time 2016-05-10 21:45:26.741352
osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())

 ceph version 10.2.0-833-g3d9f6c2 (3d9f6c27e4983a5c1a8ba3b9849e5a38f191541c)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2b3df1a3857b]
 2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2b3df15eedb2]
 3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2b3df15dc769]
 4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2b3df15d2bd3]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2b3df15d3ba8]
 6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2b3df15dae16]
 7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2b3df151785d]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2b3df13d7955]
 9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2b3df13d7b7d]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x657) [0x2b3df13dc367]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2b3df1a28a77]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2b3df1a2a9a0]
 13: (()+0x8182) [0x2b3dfbf32182]
 14: (clone()+0x6d) [0x2b3dfde4147d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#6 Updated by David Zafman over 1 year ago

  • Status changed from Can't reproduce to Verified
  • Assignee set to David Zafman

Seen again recently, but the Jenkins job isn't accessible any more. Yuri saw this during a make check running test/osd/osd-scrub-repair.sh.

osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())

ceph version 10.2.0-757-g8aa0d57 (8aa0d571088defcdedb0844568f262cc43f5c23c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2ae0f6824b0b]
2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2ae0f63d79e2]
3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2ae0f63c5399]
4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2ae0f63bb803]
5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2ae0f63bc7d8]
6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2ae0f63c3a46]
7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2ae0f62fcf8d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2ae0f61bc5c5]
9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2ae0f61bc7ed]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x879) [0x2ae0f61c1219]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2ae0f6815007]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2ae0f6816f30]
13: (()+0x8182) [0x2ae100cbb182]
14: (clone()+0x6d) [0x2ae102bca47d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#7 Updated by David Zafman over 1 year ago

This failure has been exposed by the parallelization of part of osd-scrub-repair.sh, which deletes objects from 3 OSDs simultaneously. This involves taking each OSD down and using ceph-objectstore-tool to remove the object. As the OSDs come back up together, recovery on the primary tries to recover a missing object, but the object is also missing on the secondaries. This error case wasn't handled because it usually involves corruption and would normally be very rare.

Either we need to revert the parallelization or handle this error. To handle it we need to fix the crash and make sure recovery can complete. The test expects to end up with an unfound object after repair.
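The object-removal step described above can be sketched as follows. The OSD id, store paths, pgid, and object name are all placeholders, and the OSD must be stopped while ceph-objectstore-tool operates on its store (for EC pools the pgid carries a shard suffix, e.g. `s0`).

```shell
# Hypothetical example of removing one shard of an object from an OSD,
# as the parallelized test does; id 0, pgid 6.0s0, and SOMEOBJ are
# placeholders, not values from this bug report.
systemctl stop ceph-osd@0
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --journal-path /var/lib/ceph/osd/ceph-0/journal \
    --pgid 6.0s0 SOMEOBJ remove
systemctl start ceph-osd@0
```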

#8 Updated by Aaron Bassett over 1 year ago

I've just hit this issue in a pre-production cluster. Using it with radosgw/civetweb for object storage, I have one storage pool using EC 2+1 for less important data. On one PG, I had one OSD go down with a bad disk, and another kept crashing with this error. After some snooping, it turned out the third OSD for that PG also had a bad disk and was throwing a read error on certain pieces, which would cause the head OSD to crash with `FAILED assert(res.errors.empty())`. Unfortunately, at this point, these were the only two copies left. I was able to get things moving again by stopping both OSDs, removing the unreadable piece from both, then starting them again and waiting for the one on the bad disk to hit the next bad part. Once I removed everything it couldn't read, they were able to finish backfilling and I took the bad one out.

Obviously, this caused data loss, but I'm OK with that since it's somewhat expected in a 2+1 pool, and I am still in the process of populating data anyway, so I kept track of the bad (rgw) keys so I can delete and re-upload them.

#10 Updated by Aaron Bassett over 1 year ago

I believe I've just hit this again in the same cluster. It looks like another bad disk, returning error -5 on just a small handful of reads. When I took it out, other OSDs started crashing as the cluster tried to backfill from it. It seems like a pretty bad situation to me.

#11 Updated by Sage Weil over 1 year ago

David: it seems to me like we should handle the error, especially since it seems to happen in the real world (not just the test).

#12 Updated by Mustafa Muhammad over 1 year ago

This affects my production cluster too. What can I do for the time being?
Currently I am setting osd_recovery_max_active of the crashed OSD to 0.

#13 Updated by Dan van der Ster over 1 year ago

+1 on a pre-prod jewel cluster here.

#14 Updated by Markus Blank-Burian over 1 year ago

As a workaround, you can increase OSD logging to 0/20 on the crashing OSD. Shortly before the assert there is a log message listing the error codes and the shards that produced them. Stop the OSD holding the bad shard, start the crashing OSD again, and let recovery finish with the remaining OSDs. Once recovery of the affected PG is complete, you can start the stopped OSD again.
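A sketch of the steps above, assuming osd.41 is the crashing OSD and osd.27 holds the bad shard (the ids are placeholders; if the crashing OSD is already down, set `debug osd = 0/20` in its ceph.conf section instead of using injectargs):

```shell
# 1. Raise logging on the crashing OSD so the failing shard is logged:
ceph tell osd.41 injectargs '--debug-osd 0/20'
# 2. Find the shard/OSD reporting read errors in the log, then stop it:
systemctl stop ceph-osd@27
# 3. Restart the crashing OSD and let the PG recover without osd.27:
systemctl start ceph-osd@41
# 4. Once recovery of the affected PG is complete, bring it back:
systemctl start ceph-osd@27
```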

#15 Updated by Dan van der Ster over 1 year ago

Cool, got that fixed. In my case this was an 8+3 pool, and 3 OSDs were crashing while 1 of the 8 had the read error. So I had to juggle which OSDs were running and use osd_recovery_max_active=0 to let some run in order to recover the others. All good now.

#16 Updated by David Zafman over 1 year ago

  • Status changed from Verified to In Progress

#17 Updated by Ken Dreyer over 1 year ago

PR for master here: https://github.com/ceph/ceph/pull/9304 (currently marked DNM)

#18 Updated by Pavan Rallabhandi about 1 year ago

We are seeing this on Jewel 10.2.2 with the RGW data pool on EC (8+3) jerasure and the failure domain set to host.

We had a drive failure, resulting in some of the PGs ending up in an inconsistent state, and the eventual down/out of the respective OSD resulted in a couple of other OSDs going down with the reported stack trace.

#20 Updated by Kefu Chai about 1 year ago

  • Status changed from In Progress to Resolved
  • Needs Doc set to No

#21 Updated by Dan van der Ster about 1 year ago

Will there be a jewel backport for this?

#22 Updated by Aaron Ten Clay 12 months ago

Dan van der Ster wrote:

Will there be a jewel backport for this?

Also curious about a Jewel backport, as my production Jewel cluster is encountering this problem regularly now.

#23 Updated by Nathan Cutler 12 months ago

  • Status changed from Resolved to Pending Backport
  • Backport set to jewel

#25 Updated by Nathan Cutler 12 months ago

  • Copied to Backport #17970: jewel: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) added

#26 Updated by Nathan Cutler 8 months ago

  • Status changed from Pending Backport to Resolved

#27 Updated by David Zafman 5 months ago

  • Duplicated by Bug #13493: osd: for ec, cascading crash during recovery if one shard is corrupted added
