Bug #13937 (closed)

osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())

Added by Markus Blank-Burian over 8 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%

Source:
other
Backport:
jewel
Regression:
No
Severity:
2 - major

Description

I am getting the following error with ceph v9.2.0 on two different OSDs. The error occurs shortly after startup.

 osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55b5305f5e73]
 2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0xe3) [0x55b5304987af]
 3: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x59) [0x55b530484ad1]
 4: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x1008) [0x55b530485ea6]
 5: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x216) [0x55b530489c66]
 6: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55b53029ea4e]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55b5300ffc1e]
 8: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55b5300ffe52]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55b5301172c7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55b5305e3b16]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b5305e7060]
 12: (Thread::entry_wrapper()+0x64) [0x55b5305d7c2c]
 13: (()+0x7176) [0x7f4cc5cdd176]
 14: (clone()+0x6d) [0x7f4cc3fb149d]

Configuration is rather simple:
3x mirror ("data")
3x mirror ("datahot", cache-tier for "dataec")
8+3 jerasure ("dataec")

pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 72574 flags hashpspool crash_replay_interval 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 4 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 72577 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 31 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 5 'datahot' replicated size 3 min_size 2 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 185334 flags hashpspool,incomplete_clones tier_of 6 cache_mode writeback target_bytes 1099511627776 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
pool 6 'dataec' erasure size 11 min_size 8 crush_ruleset 4 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 185334 lfor 185334 flags hashpspool tiers 5 read_tier 5 write_tier 5 stripe_width 131072

data and datahot are used as data-pools for cephfs. All pools share the same OSDs.
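
For context, a layout like this can be approximated with the ceph CLI roughly as follows. This is only a sketch under assumptions: the profile name "ec83" is made up, the custom CRUSH rules referenced by the dump above (crush_ruleset 3 and 4) are not recreated here, and option names such as ruleset-failure-domain changed in later releases.

  # EC profile and base pool: 8 data + 3 coding chunks (jerasure is the default plugin)
  ceph osd erasure-code-profile set ec83 k=8 m=3 ruleset-failure-domain=host
  ceph osd pool create dataec 1024 1024 erasure ec83

  # replicated cache pool placed in front of the EC pool as a writeback tier
  ceph osd pool create datahot 128 128 replicated
  ceph osd tier add dataec datahot
  ceph osd tier cache-mode datahot writeback
  ceph osd tier set-overlay dataec datahot

  # cache sizing and hit-set settings roughly matching the pool dump above
  ceph osd pool set datahot target_max_bytes 1099511627776
  ceph osd pool set datahot hit_set_type bloom
  ceph osd pool set datahot hit_set_period 3600
  ceph osd pool set datahot hit_set_count 1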


Files

logs.zip.001 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM
logs.zip.002 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM
logs.zip.003 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM
logs.zip.004 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM
logs.zip.005 (503 KB) Markus Blank-Burian, 12/01/2015 04:00 PM

Related issues 3 (0 open, 3 closed)

Related to Ceph - Feature #14513: Test and improve ec handling of reads on objects with shards unexpectedly missing on a replica (Resolved, David Zafman, 01/26/2016)

Has duplicate RADOS - Bug #13493: osd: for ec, cascading crash during recovery if one shard is corrupted (Duplicate, David Zafman, 10/15/2015)

Copied to Ceph - Backport #17970: jewel: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) (Resolved, David Zafman)

Actions #1

Updated by Markus Blank-Burian over 8 years ago

The first error on one of the two OSDs was actually this one (see logs):

osd/ReplicatedPG.cc: 8071: FAILED assert(repop_queue.front() == repop)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55e69ea65e73]
 2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xfc4) [0x55e69e735234]
 3: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x92) [0x55e69e7352f0]
 4: (Context::complete(int)+0x9) [0x55e69e59bd6d]
 5: (ECBackend::check_op(ECBackend::Op*)+0x12f) [0x55e69e8e21f9]
 6: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply&)+0xad) [0x55e69e8e272b]
 7: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x446) [0x55e69e8f9e96]
 8: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55e69e70ea4e]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55e69e56fc1e]
 10: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55e69e56fe52]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55e69e5872c7]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55e69ea53b16]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e69ea57060]
 14: (Thread::entry_wrapper()+0x64) [0x55e69ea47c2c]
 15: (()+0x7176) [0x7f5a6c289176]
 16: (clone()+0x6d) [0x7f5a6a55d49d]

Actions #2

Updated by Markus Blank-Burian over 8 years ago

After increasing the log level, I found that osd.27 was returning error -2 (ENOENT) for some EC shard. After killing osd.27 and letting osd.40 and osd.41 run for some time, the cluster seems to be stable now. Maybe this is connected to one of the 7 unfound objects the cluster had while osd.40 and osd.41 were down.
At the moment I have osd.40/41 running with debug osd 0/20 and osd.27 with 20/20 in case the error reappears. I will also do some further stress tests.

Actions #3

Updated by Samuel Just about 8 years ago

  • Related to Feature #14513: Test and improve ec handling of reads on objects with shards unexpectedly missing on a replica added
Actions #4

Updated by Samuel Just about 8 years ago

  • Status changed from New to Can't reproduce

I think this is an actual shortcoming in the way we handle corrupt OSDs; I'll open a feature ticket to improve that.

Actions #5

Updated by Kefu Chai almost 8 years ago

https://jenkins.ceph.com/job/ceph-pull-requests/5445/console

     0> 2016-05-10 21:45:26.748284 2b3e08339700 -1 osd/ECBackend.cc: In function 'virtual void OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)' thread 2b3e08339700 time 2016-05-10 21:45:26.741352
osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())

 ceph version 10.2.0-833-g3d9f6c2 (3d9f6c27e4983a5c1a8ba3b9849e5a38f191541c)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2b3df1a3857b]
 2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2b3df15eedb2]
 3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2b3df15dc769]
 4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2b3df15d2bd3]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2b3df15d3ba8]
 6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2b3df15dae16]
 7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2b3df151785d]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2b3df13d7955]
 9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2b3df13d7b7d]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x657) [0x2b3df13dc367]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2b3df1a28a77]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2b3df1a2a9a0]
 13: (()+0x8182) [0x2b3dfbf32182]
 14: (clone()+0x6d) [0x2b3dfde4147d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interp
Actions #6

Updated by David Zafman almost 8 years ago

  • Status changed from Can't reproduce to 12
  • Assignee set to David Zafman

Seen again recently, but the Jenkins job isn't accessible any more. Yuri saw this during a make check running test/osd/osd-scrub-repair.sh.

osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())

ceph version 10.2.0-757-g8aa0d57 (8aa0d571088defcdedb0844568f262cc43f5c23c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2ae0f6824b0b]
2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2ae0f63d79e2]
3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2ae0f63c5399]
4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2ae0f63bb803]
5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2ae0f63bc7d8]
6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2ae0f63c3a46]
7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2ae0f62fcf8d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2ae0f61bc5c5]
9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2ae0f61bc7ed]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x879) [0x2ae0f61c1219]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2ae0f6815007]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2ae0f6816f30]
13: (()+0x8182) [0x2ae100cbb182]
14: (clone()+0x6d) [0x2ae102bca47d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #7

Updated by David Zafman almost 8 years ago

This failure has been exposed by the parallelization of part of osd-scrub-repair.sh, which deletes objects from 3 OSDs simultaneously. This involves taking each OSD down and using ceph-objectstore-tool to remove the object. As the OSDs come back up together, recovery on the primary tries to recover a missing object, but the object is also missing on the secondaries. This error case wasn't handled because it usually involves corruption and would normally be very rare.

Either we need to revert the parallelization or try to handle this error. To handle the error we need to fix the crash and make sure recovery can complete. The test expects to end up with an unfound object after repair.
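
For reference, the shard removal the test performs looks roughly like the following. This is a sketch only, with a placeholder OSD id, PG id, and object name, an assumed systemd unit name, and FileStore-style paths as used by ceph-objectstore-tool in that era.

  # stop the OSD so ceph-objectstore-tool can open its object store
  systemctl stop ceph-osd@0

  # locate the object's shard in the EC PG (the trailing sN selects the shard)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --journal-path /var/lib/ceph/osd/ceph-0/journal \
      --pgid 6.1fs0 --op list SOMEOBJECT

  # remove that shard, then bring the OSD back up
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --journal-path /var/lib/ceph/osd/ceph-0/journal \
      --pgid 6.1fs0 SOMEOBJECT remove
  systemctl start ceph-osd@0

Doing this to several shards of the same object at once leaves recovery with no error-free source, which is the case the assert does not handle.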

Actions #8

Updated by Aaron Bassett almost 8 years ago

I've just hit this issue in a pre-production cluster. Using it with radosgw/civetweb for object storage, I have one storage pool using EC 2+1 for less important data. On one PG, I had one OSD go down with a bad disk, and another kept crashing with this error. After some snooping, it turned out the third OSD for that PG also had a bad disk and was throwing a read error on certain pieces, which would cause the primary OSD to crash with `FAILED assert(res.errors.empty())`. Unfortunately, at this point, these were the only two copies left. I was able to get things moving again by stopping both OSDs, removing the unreadable piece from both, then starting them again and waiting for the one on the bad disk to find the next bad part. Once I removed everything it couldn't read, they were able to finish backfilling and I took the bad one out.

Obviously, this caused data loss, but I'm OK with that since it's somewhat expected in a 2+1 pool. I am still in the process of populating data anyway, so I kept track of the bad (rgw) keys so I can delete and re-upload them.

Actions #10

Updated by Aaron Bassett almost 8 years ago

I believe I've just hit this again in the same cluster. It looks like another bad disk, returning err -5 (EIO) on just a small handful of reads. When I took it out, it started crashing other OSDs as the cluster tried to backfill from it. It seems like a pretty bad situation to me.

Actions #11

Updated by Sage Weil almost 8 years ago

David: it seems to me like we should handle the error, especially since it seems to happen in the real world (not just the test).

Actions #12

Updated by Mustafa Muhammad almost 8 years ago

This affects my production cluster too; what can I do for the time being?
Currently I am setting osd_recovery_max_active on the crashed OSD to 0.
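
For anyone else doing the same, a sketch of how that value can be set at runtime; osd.27 is a placeholder id, and this assumes the OSD is reachable via the monitors or its local admin socket:

  # via the monitors
  ceph tell osd.27 injectargs '--osd-recovery-max-active 0'

  # or locally on the OSD host, via the admin socket
  ceph daemon osd.27 config set osd_recovery_max_active 0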

Actions #13

Updated by Dan van der Ster almost 8 years ago

+1 on a pre-prod jewel cluster here.

Actions #14

Updated by Markus Blank-Burian almost 8 years ago

As a workaround, you can increase OSD logging to 0/20 on the crashing OSD. A short time before the assert, there is a log message that lists the error codes and the shards that produced them. You can then stop the OSD holding the bad shard, start the crashing OSD again, and let recovery finish with the remaining OSDs. Once recovery of the affected PG is complete, you can start the stopped OSD again.
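
A rough sketch of that procedure with the CLI; the OSD ids are placeholders (osd.40 crashing, osd.27 holding the bad shard), the grep pattern is only illustrative, and the systemd unit names are an assumption:

  # raise the in-memory log level on the crashing OSD so the crash dump
  # includes the per-shard read errors (or set 'debug osd = 0/20' in ceph.conf)
  ceph tell osd.40 injectargs '--debug-osd 0/20'

  # after the next crash, look for the shard/error listing just before the assert
  grep -B 50 'FAILED assert(res.errors.empty())' /var/log/ceph/ceph-osd.40.log | less

  # stop the OSD that returned the error, restart the crashing one, and let
  # the affected PG recover from the remaining shards
  systemctl stop ceph-osd@27
  systemctl start ceph-osd@40

  # once the PG has recovered, bring the stopped OSD back
  systemctl start ceph-osd@27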

Actions #15

Updated by Dan van der Ster almost 8 years ago

Cool, got that fixed. In my case this was an 8+3 pool, and 3 OSDs were crashing, while 1 of the 8 had the read error. So I had to juggle which OSDs were running and use osd_recovery_max_active=0 to let some run to recover the others. All good now.

Actions #16

Updated by David Zafman almost 8 years ago

  • Status changed from 12 to In Progress
Actions #17

Updated by Ken Dreyer over 7 years ago

PR for master here: https://github.com/ceph/ceph/pull/9304 (currently marked DNM)

Actions #18

Updated by Pavan Rallabhandi over 7 years ago

We are seeing this on Jewel version 10.2.2 with an RGW data pool on EC (8+3) jerasure with the failure domain set to host.

We had a drive failure, resulting in some PGs ending up in an inconsistent state, and the eventual down/out of the affected OSD caused a couple of other OSDs to go down with the reported stack trace.

Actions #20

Updated by Kefu Chai over 7 years ago

  • Status changed from In Progress to Resolved
Actions #21

Updated by Dan van der Ster over 7 years ago

Will there be a jewel backport for this?

Actions #22

Updated by Aaron T over 7 years ago

Dan van der Ster wrote:

Will there be a jewel backport for this?

Also curious about a Jewel backport, as my production Jewel cluster is encountering this problem regularly now.

Actions #23

Updated by Nathan Cutler over 7 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to jewel
Actions #25

Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #17970: jewel: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) added
Actions #26

Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved
Actions #27

Updated by David Zafman almost 7 years ago

  • Has duplicate Bug #13493: osd: for ec, cascading crash during recovery if one shard is corrupted added