Bug #9263

closed

erasure-code: ECBackend crashes when mapping fails

Added by Loïc Dachary over 9 years ago. Updated over 9 years ago.

Status:
Rejected
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The mapping of an erasure coded PG fails (the acting set is [4,1,2147483647,3,2147483647,9,8,11]) and, although the plugin reports that it cannot read the required chunks ( ErasureCodeLRC: minimum_to_decode not enough chunks in 0,1,3,7 to read 0,1,2,3,4,5,6,7 ), the code proceeds with the read attempt and fails. This happened while running the rados tests for the LRC plugin, but I suspect the jerasure plugin would behave similarly. The decoding error events are:

   -96> 2014-08-28 11:08:32.954255 7f23a85d9700 10 osd.8 pg_epoch: 135 pg[1.3s6( v 89'140 lc 64'137 (0'0,89'140] local-les=135 n=8 ec=8 les/c 135/96 134/134/134) [4,1,2147483647,3,2147483647,9,8,11] r=6 lpr=134 pi=8-133/17 rops=1 crt=89'140 mlcod 0'0 active+recovering+degraded m=2 u=1] handle_recovery_read_complete: returned f23e9003/vpm03010811-268/head//1 (0, 1048576, [1(1),262144, 3(3),262144, 4(0),262144, 9(5),262144, 11(7),262144])
   -95> 2014-08-28 11:08:32.954301 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode want_to_read 0,1,2,3,4,5,6,7 available_chunks 0,1,3,5,7
   -94> 2014-08-28 11:08:32.954309 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode minimum = 0,1,3,5,7
   -93> 2014-08-28 11:08:32.954312 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode want_to_read 0,1,2,3,4,5,6,7 available_chunks 0,1,3,7
   -92> 2014-08-28 11:08:32.954320 7f23a85d9700 -1 ErasureCodeLRC: minimum_to_decode not enough chunks in 0,1,3,7 to read 0,1,2,3,4,5,6,7
   -91> 2014-08-28 11:08:32.954293 7f23a85d9700 10 osd.8 pg_epoch: 135 pg[1.3s6( v 89'140 lc 64'137 (0'0,89'140] local-les=135 n=8 ec=8 les/c 135/96 134/134/134) [4,1,2147483647,3,2147483647,9,8,11] r=6 lpr=134 pi=8-133/17 rops=1 crt=89'140 mlcod 0'0 active+recovering+degraded m=2 u=1] handle_recovery_read_complete: [0,262144, 1,262144, 3,262144, 5,262144, 7,262144]
   -90> 2014-08-28 11:08:32.954354 7f23a85d9700 -1 ErasureCodeLRC: decode_chunks want to read 6 with available_chunks = 0,1,3,5,7 end up being unable to read 6

which leads to:
osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f23a85d9700 time 2014-08-28 11:08:32.954359
osd/ECUtil.cc: 83: FAILED assert(r == 0)

 ceph version 0.84-758-g5a8de6f (5a8de6f276826af5922c4ac090af74957d6bde5b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb7923b]
 2: (ECUtil::decode(ECUtil::stripe_info_t const&, std::tr1::shared_ptr<ceph::ErasureCodeInterface>&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list> > >&, std::map<int, ceph::buffer::list*, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list*> > >&)+0xd71) [0xa86101]
 3: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x7c0) [0xa04760]
 4: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x121) [0xa13771]
 5: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0xa050e9]
 6: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x9f79c3]
 7: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x96d) [0x9fbe3d]
 8: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x17e) [0xa03bae]
 9: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x23b) [0x81a24b]
 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3d5) [0x684305]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x684866]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x821) [0xb69341]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb6b450]
 14: (()+0x8182) [0x7f23c4eca182]
 15: (clone()+0x6d) [0x7f23c343638d]


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #9253: ECBackend::continue_recovery_op assert when not enough shards (Duplicate, 08/27/2014)

Actions #1

Updated by Samuel Just over 9 years ago

I suspect the problem is that we are feeding back to the plugin the set of shards it already told us it could use for reconstruction. It's probably an LRC mismatch between the method which gives the set of shards required for decoding and the decode method.

Actions #2

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Rejected

I misread the logs; the actual failure is:

   -90> 2014-08-28 11:08:32.954354 7f23a85d9700 -1 ErasureCodeLRC: decode_chunks want to read 6 with available_chunks = 0,1,3,5,7 end up being unable to read 6

which was indeed incorrectly reported as recoverable earlier:
   -95> 2014-08-28 11:08:32.954301 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode want_to_read 0,1,2,3,4,5,6,7 available_chunks 0,1,3,5,7
   -94> 2014-08-28 11:08:32.954309 7f23a85d9700 20 ErasureCodeLRC: minimum_to_decode minimum = 0,1,3,5,7

