Bug #13937 (closed)

osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())

Added by Markus Blank-Burian over 8 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%

Source:
other
Backport:
jewel
Regression:
No
Severity:
2 - major

Description

I am getting the following error with ceph v9.2.0 on two different OSDs. The error occurs shortly after startup.

 osd/ECBackend.cc: 201: FAILED assert(res.errors.empty())

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55b5305f5e73]
 2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0xe3) [0x55b5304987af]
 3: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x59) [0x55b530484ad1]
 4: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x1008) [0x55b530485ea6]
 5: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x216) [0x55b530489c66]
 6: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55b53029ea4e]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55b5300ffc1e]
 8: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55b5300ffe52]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55b5301172c7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55b5305e3b16]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b5305e7060]
 12: (Thread::entry_wrapper()+0x64) [0x55b5305d7c2c]
 13: (()+0x7176) [0x7f4cc5cdd176]
 14: (clone()+0x6d) [0x7f4cc3fb149d]

Configuration is rather simple:
3x mirror ("data")
3x mirror ("datahot", cache-tier for "dataec")
8+3 jerasure ("dataec")

pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 72574 flags hashpspool crash_replay_interval 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 4 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 72577 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 31 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 5 'datahot' replicated size 3 min_size 2 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 185334 flags hashpspool,incomplete_clones tier_of 6 cache_mode writeback target_bytes 1099511627776 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
pool 6 'dataec' erasure size 11 min_size 8 crush_ruleset 4 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 185334 lfor 185334 flags hashpspool tiers 5 read_tier 5 write_tier 5 stripe_width 131072

data and datahot are used as data-pools for cephfs. All pools share the same OSDs.
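
For context, a layout like this can be approximated with the ceph CLI roughly as follows. This is only a sketch under assumptions: the profile name "ec83" is made up, the custom CRUSH rules referenced by the dump above (crush_ruleset 3 and 4) are not recreated here, and option names such as ruleset-failure-domain changed in later releases.

  # EC profile and base pool: 8 data + 3 coding chunks (jerasure is the default plugin)
  ceph osd erasure-code-profile set ec83 k=8 m=3 ruleset-failure-domain=host
  ceph osd pool create dataec 1024 1024 erasure ec83

  # replicated cache pool placed in front of the EC pool as a writeback tier
  ceph osd pool create datahot 128 128 replicated
  ceph osd tier add dataec datahot
  ceph osd tier cache-mode datahot writeback
  ceph osd tier set-overlay dataec datahot

  # cache sizing and hit-set settings roughly matching the pool dump above
  ceph osd pool set datahot target_max_bytes 1099511627776
  ceph osd pool set datahot hit_set_type bloom
  ceph osd pool set datahot hit_set_period 3600
  ceph osd pool set datahot hit_set_count 1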


Files

logs.zip.001 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM
logs.zip.002 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM
logs.zip.003 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM
logs.zip.004 (1 MB) Markus Blank-Burian, 12/01/2015 04:00 PM
logs.zip.005 (503 KB) Markus Blank-Burian, 12/01/2015 04:00 PM

Related issues 3 (0 open, 3 closed)

Related to Ceph - Feature #14513: Test and improve ec handling of reads on objects with shards unexpectedly missing on a replica (Resolved, David Zafman, 01/26/2016)

Has duplicate RADOS - Bug #13493: osd: for ec, cascading crash during recovery if one shard is corrupted (Duplicate, David Zafman, 10/15/2015)

Copied to Ceph - Backport #17970: jewel: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) (Resolved, David Zafman)

Actions #1

Updated by Markus Blank-Burian over 8 years ago

The first error on one of the two OSDs was actually this one (see logs):

osd/ReplicatedPG.cc: 8071: FAILED assert(repop_queue.front() == repop)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55e69ea65e73]
 2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xfc4) [0x55e69e735234]
 3: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x92) [0x55e69e7352f0]
 4: (Context::complete(int)+0x9) [0x55e69e59bd6d]
 5: (ECBackend::check_op(ECBackend::Op*)+0x12f) [0x55e69e8e21f9]
 6: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply&)+0xad) [0x55e69e8e272b]
 7: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x446) [0x55e69e8f9e96]
 8: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1c0) [0x55e69e70ea4e]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x40c) [0x55e69e56fc1e]
 10: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x52) [0x55e69e56fe52]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x61f) [0x55e69e5872c7]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7ae) [0x55e69ea53b16]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e69ea57060]
 14: (Thread::entry_wrapper()+0x64) [0x55e69ea47c2c]
 15: (()+0x7176) [0x7f5a6c289176]
 16: (clone()+0x6d) [0x7f5a6a55d49d]

Actions #2

Updated by Markus Blank-Burian over 8 years ago

After increasing the log level, I found that osd.27 was returning error -2 (ENOENT) for some EC shard. After killing osd.27 and letting osd.40 and osd.41 run for some time, the cluster seems to be stable now. Maybe this is connected to one of the 7 unfound objects the cluster had while osd.40 and osd.41 were down.
At the moment I have osd.40/41 running with debug osd 0/20 and osd.27 with 20/20 in case the error reappears. I will also do some further stress tests.

Actions #3

Updated by Samuel Just about 8 years ago

  • Related to Feature #14513: Test and improve ec handling of reads on objects with shards unexpectedly missing on a replica added
Actions #4

Updated by Samuel Just about 8 years ago

  • Status changed from New to Can't reproduce

I think this is an actual shortcoming in the way we handle corrupt OSDs; I'll open a feature ticket to improve that.

Actions #5

Updated by Kefu Chai almost 8 years ago

https://jenkins.ceph.com/job/ceph-pull-requests/5445/console

     0> 2016-05-10 21:45:26.748284 2b3e08339700 -1 osd/ECBackend.cc: In function 'virtual void OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)' thread 2b3e08339700 time 2016-05-10 21:45:26.741352
osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())

 ceph version 10.2.0-833-g3d9f6c2 (3d9f6c27e4983a5c1a8ba3b9849e5a38f191541c)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2b3df1a3857b]
 2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2b3df15eedb2]
 3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2b3df15dc769]
 4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2b3df15d2bd3]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2b3df15d3ba8]
 6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2b3df15dae16]
 7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2b3df151785d]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2b3df13d7955]
 9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2b3df13d7b7d]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x657) [0x2b3df13dc367]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2b3df1a28a77]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2b3df1a2a9a0]
 13: (()+0x8182) [0x2b3dfbf32182]
 14: (clone()+0x6d) [0x2b3dfde4147d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interp
Actions #6

Updated by David Zafman almost 8 years ago

  • Status changed from Can't reproduce to 12
  • Assignee set to David Zafman

Seen again recently, but the Jenkins job isn't accessible any more. Yuri saw this during a make check running test/osd/osd-scrub-repair.sh.

osd/ECBackend.cc: 203: FAILED assert(res.errors.empty())

ceph version 10.2.0-757-g8aa0d57 (8aa0d571088defcdedb0844568f262cc43f5c23c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x2ae0f6824b0b]
2: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x192) [0x2ae0f63d79e2]
3: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x2ae0f63c5399]
4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x2ae0f63bb803]
5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xf68) [0x2ae0f63bc7d8]
6: (ECBackend::handle_message(std::shared_ptr<OpRequest>)+0x186) [0x2ae0f63c3a46]
7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xed) [0x2ae0f62fcf8d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x2ae0f61bc5c5]
9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x2ae0f61bc7ed]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x879) [0x2ae0f61c1219]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x2ae0f6815007]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x2ae0f6816f30]
13: (()+0x8182) [0x2ae100cbb182]
14: (clone()+0x6d) [0x2ae102bca47d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #7

Updated by David Zafman almost 8 years ago

This failure has been exposed by the parallelization of part of osd-scrub-repair.sh, which deletes objects from 3 OSDs simultaneously. This involves taking each OSD down and using ceph-objectstore-tool to remove the object. As the OSDs come back up together, recovery on the primary tries to recover a missing object, but the object is also missing on the secondaries. This error case wasn't handled because it usually involves corruption and would normally be very rare.

Either we need to revert the parallelization or try to handle this error. To handle the error we need to fix the crash and make sure recovery can complete. The test expects to end up with an unfound object after repair.
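
For reference, the shard removal the test performs looks roughly like the following. This is a sketch only, with a placeholder OSD id, PG id, and object name, an assumed systemd unit name, and FileStore-style paths as used by ceph-objectstore-tool in that era.

  # stop the OSD so ceph-objectstore-tool can open its object store
  systemctl stop ceph-osd@0

  # locate the object's shard in the EC PG (the trailing sN selects the shard)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --journal-path /var/lib/ceph/osd/ceph-0/journal \
      --pgid 6.1fs0 --op list SOMEOBJECT

  # remove that shard, then bring the OSD back up
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --journal-path /var/lib/ceph/osd/ceph-0/journal \
      --pgid 6.1fs0 SOMEOBJECT remove
  systemctl start ceph-osd@0

Doing this to several shards of the same object at once leaves recovery with no error-free source, which is the case the assert does not handle.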

Actions #8

Updated by Aaron Bassett almost 8 years ago

I've just hit this issue in a pre-production cluster. Using it with radosgw/civetweb for object storage, I have one storage pool using EC 2+1 for less important data. On one PG, I had one OSD go down with a bad disk, and another kept crashing with this error. After some snooping, it turned out the third OSD for that PG also had a bad disk and was throwing a read error on certain pieces, which would cause the primary OSD to crash with `FAILED assert(res.errors.empty())`. Unfortunately, at this point, these were the only two copies left. I was able to get things moving again by stopping both OSDs, removing the unreadable piece from both, then starting them again and waiting for the one on the bad disk to find the next bad part. Once I removed everything it couldn't read, they were able to finish backfilling and I took the bad one out.

Obviously, this caused data loss, but I'm OK with that since it's somewhat expected in a 2+1 pool. I am still in the process of populating data anyway, so I kept track of the bad (rgw) keys so I can delete and re-upload them.

Actions #10

Updated by Aaron Bassett almost 8 years ago

I believe I've just hit this again in the same cluster. It looks like another bad disk, returning err -5 (EIO) on just a small handful of reads. When I took it out, it started crashing other OSDs as the cluster tried to backfill from it. It seems like a pretty bad situation to me.

Actions #11

Updated by Sage Weil almost 8 years ago

David: it seems to me like we should handle the error, especially since it seems to happen in the real world (not just the test).

Actions #12

Updated by Mustafa Muhammad almost 8 years ago

This affects my production cluster too; what can I do for the time being?
Currently I am setting osd_recovery_max_active on the crashed OSD to 0.
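
For anyone else doing the same, a sketch of how that value can be set at runtime; osd.27 is a placeholder id, and this assumes the OSD is reachable via the monitors or its local admin socket:

  # via the monitors
  ceph tell osd.27 injectargs '--osd-recovery-max-active 0'

  # or locally on the OSD host, via the admin socket
  ceph daemon osd.27 config set osd_recovery_max_active 0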

Actions #13

Updated by Dan van der Ster almost 8 years ago

+1 on a pre-prod jewel cluster here.

Actions #14

Updated by Markus Blank-Burian almost 8 years ago

As a workaround, you can increase OSD logging to 0/20 on the crashing OSD. A short time before the assert, there is a log message that lists the error codes and the shards that produced them. You can then stop the OSD holding the bad shard, start the crashing OSD again, and let recovery finish with the remaining OSDs. Once recovery of the affected PG is complete, you can start the stopped OSD again.
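
A rough sketch of that procedure with the CLI; the OSD ids are placeholders (osd.40 crashing, osd.27 holding the bad shard), the grep pattern is only illustrative, and the systemd unit names are an assumption:

  # raise the in-memory log level on the crashing OSD so the crash dump
  # includes the per-shard read errors (or set 'debug osd = 0/20' in ceph.conf)
  ceph tell osd.40 injectargs '--debug-osd 0/20'

  # after the next crash, look for the shard/error listing just before the assert
  grep -B 50 'FAILED assert(res.errors.empty())' /var/log/ceph/ceph-osd.40.log | less

  # stop the OSD that returned the error, restart the crashing one, and let
  # the affected PG recover from the remaining shards
  systemctl stop ceph-osd@27
  systemctl start ceph-osd@40

  # once the PG has recovered, bring the stopped OSD back
  systemctl start ceph-osd@27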

Actions #15

Updated by Dan van der Ster almost 8 years ago

Cool, got that fixed. In my case this was an 8+3 pool, and 3 OSDs were crashing, while 1 of the 8 had the read error. So I had to juggle which OSDs were running and use osd_recovery_max_active=0 to let some run to recover the others. All good now.

Actions #16

Updated by David Zafman almost 8 years ago

  • Status changed from 12 to In Progress
Actions #17

Updated by Ken Dreyer over 7 years ago

PR for master here: https://github.com/ceph/ceph/pull/9304 (currently marked DNM)

Actions #18

Updated by Pavan Rallabhandi over 7 years ago

We are seeing this on Jewel version 10.2.2 with an RGW data pool on EC (8+3) jerasure with the failure domain set to host.

We had a drive failure, resulting in some PGs ending up in an inconsistent state, and the eventual down/out of the affected OSD caused a couple of other OSDs to go down with the reported stack trace.

Actions #20

Updated by Kefu Chai over 7 years ago

  • Status changed from In Progress to Resolved
Actions #21

Updated by Dan van der Ster over 7 years ago

Will there be a jewel backport for this?

Actions #22

Updated by Aaron T over 7 years ago

Dan van der Ster wrote:

Will there be a jewel backport for this?

Also curious about a Jewel backport, as my production Jewel cluster is encountering this problem regularly now.

Actions #23

Updated by Nathan Cutler over 7 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to jewel
Actions #25

Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #17970: jewel: osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) added
Actions #26

Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved
Actions #27

Updated by David Zafman almost 7 years ago

  • Has duplicate Bug #13493: osd: for ec, cascading crash during recovery if one shard is corrupted added