Bug #9263

closed

erasure-code: ECBackend crashes when mapping fails

Added by Loïc Dachary over 9 years ago. Updated over 9 years ago.

Status:
Rejected
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The mapping of an erasure coded PG fails (the acting set is [4,1,2147483647,3,2147483647,9,8,11]) and, although the plugin reports that it cannot read the required chunks ( ErasureCodeLRC: minimum_to_decode not enough chunks in 0,1,3,7 to read 0,1,2,3,4,5,6,7 ), the code proceeds with the read attempt and fails. This happened while running the rados tests for the LRC plugin, but I suspect the jerasure plugin would behave similarly. The decoding error events are:

   -96> 2014-08-28 11:08:32.954255 7f23a85d9700 10 osd.8 pg_epoch: 135 pg[1.3s6( v 89'140 lc 64'137 (0'0,89'140] local-les=135 n=8 ec=8 les/c 135/96 134/134/134) [4,1,2147483647,3,2147483647,9,8,11] r=6 lpr=134 pi=8-133/17 rops=1 crt=89'140 mlcod 0'0 active+recovering+degraded m=2 u=1] handle_recovery_read_complete: returned f23e9003/vpm03010811-268/head//1 (0, 1048576, [1(1),262144, 3(3),262144, 4(0),262144, 9(5),262144, 11(7),262144])
   -95> 2014-08-28 11:08:32.954301 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode want_to_read 0,1,2,3,4,5,6,7 available_chunks 0,1,3,5,7
   -94> 2014-08-28 11:08:32.954309 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode minimum = 0,1,3,5,7
   -93> 2014-08-28 11:08:32.954312 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode want_to_read 0,1,2,3,4,5,6,7 available_chunks 0,1,3,7
   -92> 2014-08-28 11:08:32.954320 7f23a85d9700 -1 ErasureCodeLRC: minimum_to_decode not enough chunks in 0,1,3,7 to read 0,1,2,3,4,5,6,7
   -91> 2014-08-28 11:08:32.954293 7f23a85d9700 10 osd.8 pg_epoch: 135 pg[1.3s6( v 89'140 lc 64'137 (0'0,89'140] local-les=135 n=8 ec=8 les/c 135/96 134/134/134) [4,1,2147483647,3,2147483647,9,8,11] r=6 lpr=134 pi=8-133/17 rops=1 crt=89'140 mlcod 0'0 active+recovering+degraded m=2 u=1] handle_recovery_read_complete: [0,262144, 1,262144, 3,262144, 5,262144, 7,262144]
   -90> 2014-08-28 11:08:32.954354 7f23a85d9700 -1 ErasureCodeLRC: decode_chunks want to read 6 with available_chunks = 0,1,3,5,7 end up being unable to read 6

which leads to:
osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f23a85d9700 time 2014-08-28 11:08:32.954359
osd/ECUtil.cc: 83: FAILED assert(r == 0)

 ceph version 0.84-758-g5a8de6f (5a8de6f276826af5922c4ac090af74957d6bde5b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb7923b]
 2: (ECUtil::decode(ECUtil::stripe_info_t const&, std::tr1::shared_ptr<ceph::ErasureCodeInterface>&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list> > >&, std::map<int, ceph::buffer::list*, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list*> > >&)+0xd71) [0xa86101]
 3: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x7c0) [0xa04760]
 4: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x121) [0xa13771]
 5: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0xa050e9]
 6: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x9f79c3]
 7: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x96d) [0x9fbe3d]
 8: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x17e) [0xa03bae]
 9: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x23b) [0x81a24b]
 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3d5) [0x684305]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x684866]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x821) [0xb69341]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb6b450]
 14: (()+0x8182) [0x7f23c4eca182]
 15: (clone()+0x6d) [0x7f23c343638d]


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #9253: ECBackend::continue_recovery_op assert when not enough shards (Duplicate, 08/27/2014)

Actions #1

Updated by Samuel Just over 9 years ago

I suspect the problem is that we are feeding back to the plugin the set of shards it already told us it could use for reconstruction. It's probably an LRC mismatch between the method which gives the set of shards required for decoding and the decode method.

Actions #2

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Rejected

I misread the logs; the actual failure is:

   -90> 2014-08-28 11:08:32.954354 7f23a85d9700 -1 ErasureCodeLRC: decode_chunks want to read 6 with available_chunks = 0,1,3,5,7 end up being unable to read 6

which was indeed incorrectly reported as recoverable earlier:
   -95> 2014-08-28 11:08:32.954301 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode want_to_read 0,1,2,3,4,5,6,7 available_chunks 0,1,3,5,7
   -94> 2014-08-28 11:08:32.954309 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode minimum = 0,1,3,5,7

