Bug #13937 (Closed): osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())
Description
I am getting the following error with ceph v9.2.0 on two different OSDs. The error occurs shortly after startup.
osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55b5305f5e73]
2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0xe3) [0x55b5304987af]
3: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x59) [0x55b530484ad1]
4: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x1008) [0x55b530485ea6]
5: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x216) [0x55b530489c66]
6: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55b53029ea4e]
7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55b5300ffc1e]
8: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55b5300ffe52]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55b5301172c7]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55b5305e3b16]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b5305e7060]
12: (Thread::entry_wrapper()+0x64) [0x55b5305d7c2c]
13: (()+0x7176) [0x7f4cc5cdd176]
14: (clone()+0x6d) [0x7f4cc3fb149d]
Configuration is rather simple:
3x mirror ("data")
3x mirror ("datahot", cache-tier for "dataec")
8+3 jerasure ("dataec")
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 72574 flags hashpspool crash_replay_interval 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 4 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 72577 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 31 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 5 'datahot' replicated size 3 min_size 2 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 185334 flags hashpspool,incomplete_clones tier_of 6 cache_mode writeback target_bytes 1099511627776 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
pool 6 'dataec' erasure size 11 min_size 8 crush_ruleset 4 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 185334 lfor 185334 flags hashpspool tiers 5 read_tier 5 write_tier 5 stripe_width 131072
data and datahot are used as data-pools for cephfs. All pools share the same OSDs.
Files
Updated by Markus Blank-Burian over 8 years ago
- File logs.zip.001 logs.zip.001 added
- File logs.zip.002 logs.zip.002 added
- File logs.zip.003 logs.zip.003 added
- File logs.zip.004 logs.zip.004 added
- File logs.zip.005 logs.zip.005 added
first error on one of the two OSDs was actually this one (see logs):
osd/ReplicatedPG.cc: 8071: FAILED assert(repop_queue.front() == repop)
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55e69ea65e73]
2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xfc4) [0x55e69e735234]
3: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x92) [0x55e69e7352f0]
4: (Context::complete(int)+0x9) [0x55e69e59bd6d]
5: (ECBackend::check_op(ECBackend::Op*)+0x12f) [0x55e69e8e21f9]
6: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply&)+0xad) [0x55e69e8e272b]
7: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x446) [0x55e69e8f9e96]
8: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55e69e70ea4e]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55e69e56fc1e]
10: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55e69e56fe52]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55e69e5872c7]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55e69ea53b16]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e69ea57060]
14: (Thread::entry_wrapper()+0x64) [0x55e69ea47c2c]
15: (()+0x7176) [0x7f5a6c289176]
16: (clone()+0x6d) [0x7f5a6a55d49d]
Updated by Markus Blank-Burian over 8 years ago
Increasing the log level, I found that osd.27 was returning error -2 for some EC part. So after killing osd.27 and letting osd.40 and osd.41 run for some time, it seems to be stable now. Maybe this is connected to one of the 7 unfound objects the cluster had when osd.40 and osd.41 were down.
At the moment I have osd.40/41 running with debug osd 0/20 and osd.27 with 20/20 in case the error reappears. I will also do some further stress tests.
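The debug levels mentioned above can be changed on a running OSD without a restart. A minimal sketch, assuming a standard `ceph tell` setup (the OSD ids are the ones from this report):

```shell
# log level 0 on disk, 20 in memory, for the two OSDs that crashed
ceph tell osd.40 injectargs '--debug-osd 0/20'
ceph tell osd.41 injectargs '--debug-osd 0/20'

# full logging on the OSD suspected of returning the read error
ceph tell osd.27 injectargs '--debug-osd 20/20'
```

With `0/20` the in-memory log is only dumped to the log file when the daemon crashes, which keeps normal-operation overhead low.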
Updated by Samuel Just about 8 years ago
- Related to Feature #14513: Test and improve ec handling of reads on objects with shards unexpectedly missing on a replica added
Updated by Samuel Just about 8 years ago
- Status changed from New to Can't reproduce
I think this is an actual shortcoming in the way we handle corrupt OSDs; I'll open a feature ticket to improve that.
Updated by Kefu Chai almost 8 years ago
https://jenkins.ceph.com/job/ceph-pull-requests/5445/console
0> 2016-05-10 21:45:26.748284 2b3e08339700 -1 osd/ECBackend.cc: In function 'virtual void OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)' thread 2b3e08339700 time 2016-05-10 21:45:26.741352
osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())
ceph version 10.2.0-833-g3d9f6c2 (3d9f6c27e4983a5c1a8ba3b9849e5a38f191541c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2b3df1a3857b]
2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2b3df15eedb2]
3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2b3df15dc769]
4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2b3df15d2bd3]
5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2b3df15d3ba8]
6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2b3df15dae16]
7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2b3df151785d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2b3df13d7955]
9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2b3df13d7b7d]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x657) [0x2b3df13dc367]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2b3df1a28a77]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2b3df1a2a9a0]
13: (()+0x8182) [0x2b3dfbf32182]
14: (clone()+0x6d) [0x2b3dfde4147d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by David Zafman almost 8 years ago
- Status changed from Can't reproduce to 12
- Assignee set to David Zafman
Seen again recently, but the Jenkins job isn't accessible any more. Yuri saw this during a make check running test/osd/osd-scrub-repair.sh.
osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())
ceph version 10.2.0-757-g8aa0d57 (8aa0d571088defcdedb0844568f262cc43f5c23c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2ae0f6824b0b]
2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2ae0f63d79e2]
3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2ae0f63c5399]
4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2ae0f63bb803]
5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2ae0f63bc7d8]
6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2ae0f63c3a46]
7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2ae0f62fcf8d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2ae0f61bc5c5]
9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2ae0f61bc7ed]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x879) [0x2ae0f61c1219]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2ae0f6815007]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2ae0f6816f30]
13: (()+0x8182) [0x2ae100cbb182]
14: (clone()+0x6d) [0x2ae102bca47d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by David Zafman almost 8 years ago
This failure has been exposed by the parallelization of part of osd-scrub-repair.sh, which deletes objects from 3 OSDs simultaneously. This involves taking each OSD down and using ceph-objectstore-tool to remove the object. As the OSDs come back up together, recovery on the primary tries to recover a missing object, but the object is also missing on the secondaries. This error case wasn't handled because it usually involves corruption and would normally be very rare.
Either we need to revert the parallelization or try to handle this error. To handle the error we need to fix the crash and make sure recovery can complete. The test expects to end up with an unfound object after repair.
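For reference, the removal step the test performs looks roughly like the following. This is a sketch: the OSD id, store paths, and object name are illustrative, not taken from the test.

```shell
# ceph-objectstore-tool needs exclusive access to the store,
# so stop the OSD first
systemctl stop ceph-osd@0

# remove one object's shard from the stopped OSD's object store
# (paths and object name are placeholders)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --journal-path /var/lib/ceph/osd/ceph-0/journal \
    SOMEOBJ remove

systemctl start ceph-osd@0
```

When this is done on all k+m shards' OSDs at once (as the parallelized test effectively does for 3 OSDs), recovery finds the object missing everywhere, which is the unhandled case that trips the assert.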
Updated by Aaron Bassett almost 8 years ago
I've just hit this issue in a pre-production cluster. Using it with radosgw/civetweb for object storage, I have one storage pool using EC 2,1 for less important data. On one PG, one OSD went down with a bad disk, and another kept crashing with this error. After some snooping, it turned out the third OSD for that PG also had a bad disk and was throwing a read error on certain pieces, which would cause the head OSD to crash with `FAILED assert(res.errors.empty())`. Unfortunately, at this point, these were the only two copies left. I was able to get things moving again by stopping both OSDs, removing the unreadable piece from both, then starting them again and waiting for the one on the bad disk to find the next bad part. Once I removed everything it couldn't read, they were able to finish backfilling and I took the bad one out.
Obviously, this caused data loss, but I'm ok with that since it's somewhat expected in a 2,1 pool, and I am still in the process of populating data anyway, so I kept track of the bad (rgw) keys and can delete and re-upload them.
Updated by Kefu Chai almost 8 years ago
Updated by Aaron Bassett almost 8 years ago
I believe I've just hit this again in the same cluster. It looks like another bad disk, returning err -5 on just a small handful of reads. When I took it out, it started crashing other OSDs as the cluster tried to backfill from it. It seems like a pretty bad situation to me.
Updated by Sage Weil almost 8 years ago
David: it seems to me like we should handle the error, especially since it seems to happen in the real world (not just the test).
Updated by Mustafa Muhammad almost 8 years ago
This affects my production cluster too. What can I do for the time being?
Currently I am setting osd_recovery_max_active to 0 on the crashed OSD.
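A sketch of that workaround, assuming the crashed OSD is reachable via `ceph tell` (the id is a placeholder):

```shell
# stop the affected OSD from starting new recovery ops,
# which avoids the read path that triggers the assert
ceph tell osd.NN injectargs '--osd-recovery-max-active 0'
```

Note this only suppresses recovery on that OSD; the missing/unreadable shard still has to be dealt with eventually.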
Updated by Dan van der Ster almost 8 years ago
+1 on a pre-prod jewel cluster here.
Updated by Markus Blank-Burian almost 8 years ago
As a workaround, you can increase osd logging to 0/20 on the crashing OSD. Shortly before the assert there is a log message listing the error codes and the shards that produced them. Stop the OSD behind the erroring shard, start the crashing OSD again, and let recovery finish with the remaining OSDs. Once recovery of the affected PG is complete, you can start the stopped OSD again.
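That sequence can be sketched as follows; the OSD ids are placeholders and the service names assume a systemd deployment:

```shell
# 1. raise in-memory logging on the crashing OSD
ceph tell osd.CRASHING injectargs '--debug-osd 0/20'

# 2. after the next assert, read its log to find which shard
#    returned the error, then stop the OSD holding that shard
systemctl stop ceph-osd@BAD

# 3. restart the crashing OSD and wait for the affected PG
#    to finish recovering from the remaining shards
systemctl start ceph-osd@CRASHING
ceph -s            # watch until the PG is no longer recovering

# 4. bring the stopped OSD back
systemctl start ceph-osd@BAD
```

This works because an 8+3 pool can recover the object from any 8 of the remaining shards once the shard that returns the read error is out of the read set.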
Updated by Dan van der Ster almost 8 years ago
Cool, got that fixed. In my case this was an 8+3 pool, and 3 OSDs were crashing, while 1 of the 8 had the read error. So I had to juggle which OSDs were running and use osd_recovery_max_active=0 to let some run to recover the others. All good now.
Updated by David Zafman almost 8 years ago
- Status changed from 12 to In Progress
Updated by Ken Dreyer over 7 years ago
PR for master here: https://github.com/ceph/ceph/pull/9304 (currently marked DNM)
Updated by Pavan Rallabhandi over 7 years ago
We are seeing this on Jewel 10.2.2 with an RGW data pool on EC (8+3) jerasure with the failure domain set to host.
We had a drive failure, resulting in some of the PGs ending up in an inconsistent state, and an eventual down/out of the respective OSD resulted in a couple of other OSDs going down with the reported stack trace.
Updated by Kefu Chai over 7 years ago
note, should revert https://github.com/ceph/ceph/commit/5bc55338f5a1645bc651811fae2f89ad855ff86e#diff-1b6e7a54c1cf4f286f5835c21abd065dL4 once this issue is resolved.
Updated by Kefu Chai over 7 years ago
- Status changed from In Progress to Resolved
Updated by Dan van der Ster over 7 years ago
Will there be a jewel backport for this?
Updated by Aaron T over 7 years ago
Dan van der Ster wrote:
Will there be a jewel backport for this?
Also curious about a Jewel backport, as my production Jewel cluster is encountering this problem regularly now.
Updated by Nathan Cutler over 7 years ago
- Status changed from Resolved to Pending Backport
- Backport set to jewel
Updated by Nathan Cutler over 7 years ago
Updated by Nathan Cutler over 7 years ago
- Copied to Backport #17970: jewel: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) added
Updated by Nathan Cutler about 7 years ago
- Status changed from Pending Backport to Resolved
Updated by David Zafman almost 7 years ago
- Has duplicate Bug #13493: osd: for ec, cascading crash during recovery if one shard is corrupted added