Bug #14009
FAILED assert(old_size == total_chunk_size) in 0.94.5 / strange file size
0%
Description
Three of our OSD are failing with
2015-12-06 09:44:22.271873 7f47091fc700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f47\
091fc700 time 2015-12-06 09:44:22.267329
osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_chunk_size)
during recovery
Related issues
Associated revisions
osd: log inconsistent shard sizes
When an OSD asserts because it cannot recover from an unexpected erasure
coded shard size, the object needs to be manually fixed (i.e. the shard
must be removed so it can be reconstructed). Unfortunately the assert
does not display the name of the object and it is not displayed unless
the OSD logs are >= 10, which is uncommon on an actual cluster because
it creates log files that are too big.
The problem has been fixed in infernalis as part of
https://github.com/ceph/ceph/pull/5173 and backporting it is non
trivial.
The error condition is modified to make it easy for the system
administrator to identify which object is causing the crash and what
steps must be taken to fix the issue.
http://tracker.ceph.com/issues/14009 Refs: #14009
Signed-off-by: Loic Dachary <loic@dachary.org>
History
#1 Updated by Laurent GUERBY over 8 years ago
2015-12-06 09:44:22.271873 7f47091fc700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f47\ 091fc700 time 2015-12-06 09:44:22.267329 osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_chunk_size) ceph version 0.94.5-164-gbf9e1b6 (bf9e1b6307692fdc50465a64590d83e3d7015c9d) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc5a9b] 2: (ECUtil::decode(ECUtil::stripe_info_t const&, std::tr1::shared_ptr<ceph::ErasureCodeInterface>&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list> > >&, std::map<int, ceph::buff\ er::list*, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list*> > >&)+0xcde) [0xac33de] 3: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::l\ ist> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph:\ :buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x7c0) [0xa3ee40] 4: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x121) [0xa4fa01] 5: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0xa3f7a9] 6: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0xa31e53] 7: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x96d) [0xa3645d] 8: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x17e) [0xa3e2ae] 9: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x167) [0x843fb7] 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3bd) [0x6a1aed] 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x331) [0x6a2011] 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85f) [0xbb55ef] 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbb7710] 14: (()+0x8182) [0x7f474f559182] 15: (clone()+0x6d) [0x7f474dac447d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#2 Updated by Laurent GUERBY over 8 years ago
OSD crashing the most often is 30. The other two are 9 and 33.
Various log files:
http://guerby.org/ftp/ooops-30.log.xz
debug osd 20
http://guerby.org/ftp/ceph-osd.30.log.xz
http://guerby.org/ftp/ceph-osd.9.log
http://guerby.org/ftp/ceph-osd.33.log
crush map
http://tracker.ceph.com/attachments/download/2112/cm-20151207.txt
dump
http://tracker.ceph.com/attachments/download/2113/ceph-osd-dump-20151207.txt
http://tracker.ceph.com/attachments/download/2114/ceph-pg-dump-20151207.txt.gz
#3 Updated by Laurent GUERBY over 8 years ago
There is a strange file on disk described here
http://tracker.ceph.com/issues/13972#note-32
http://tracker.ceph.com/issues/13972#note-33
rados get on this object immediately crashes the OSD.
norecover allows the cluster to function
#4 Updated by Loïc Dachary over 8 years ago
COPYING THE INITIAL REPORT¶
#5 Updated by Loïc Dachary over 8 years ago
We started seeing this tonight on our ceph 0.94.5 cluster tonight, RBD over EC pools (one 4+1, one 8+2), ubuntu trusty, 3 OSD died with:
2015-12-06 09:44:22.271873 7f47091fc700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f47\
091fc700 time 2015-12-06 09:44:22.267329
osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_chunk_size)
ceph version 0.94.5-164-gbf9e1b6 (bf9e1b6307692fdc50465a64590d83e3d7015c9d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc5a9b]
2: (ECUtil::decode(ECUtil::stripe_info_t const&, std::tr1::shared_ptr<ceph::ErasureCodeInterface>&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list> > >&, std::map<int, ceph::buff\
er::list*, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list*> > >&)+0xcde) [0xac33de]
3: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::l\
ist> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph:\
:buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x7c0) [0xa3ee40]
4: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x121) [0xa4fa01]
5: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0xa3f7a9]
6: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0xa31e53]
7: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x96d) [0xa3645d]
8: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x17e) [0xa3e2ae]
9: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x167) [0x843fb7]
10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3bd) [0x6a1aed]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x331) [0x6a2011]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85f) [0xbb55ef]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbb7710]
14: (()+0x8182) [0x7f474f559182]
15: (clone()+0x6d) [0x7f474dac447d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#6 Updated by Loïc Dachary over 8 years ago
The three OSD keep having the same error a short time after restart, log with ms = 1 objecter = 20 on osd.30 is attached.
attach didn't work, here is the URL
http://guerby.org/ftp/ooops-30.log.xz
Note: only the last restart has detailed debug (previous didn't have the ms):
1> 2015-12-06 10:12:42.364429 7f37c06fe700 5 - op tracker -- seq: 24552, time: 2015-12-06 10:12:42.364429, event: done, op: MOSDPGPushReply(87.35s0 226867 [PushReplyOp(2a31b0b5/rbd_data.1cecf133ca1a5bd.00000000000c74c6/head//87),PushReplyOp(fd31b0b5/rbd_data.1cecf133ca1a5bd.000000000000a5f1/head//87),PushReplyOp(a841b0b5/rbd_data.1c25f6c2eb141f2.0000000000055752/head//87)])
0> 2015-12-06 10:12:42.365322 7f37c2f03700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f37c2f03700 time 2015-12-06 10:12:42.362228
osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_chunk_size)
Looks like the object causing the issue is rbd_data.1b2de3e52bdda2a.00000000000f16f4 in 8+2 ec pool
grep 7fea58c08700 /var/log/ceph/ceph-osd.30.log ... -239> 2015-12-06 12:16:39.851518 7fea58c08700 5 -- op tracker -- seq: 10397, time: 2015-12-06 12:16:39.851518, event: reached_pg, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380) -238> 2015-12-06 12:16:39.851537 7fea58c08700 5 -- op tracker -- seq: 10397, time: 2015-12-06 12:16:39.851537, event: started, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380) -237> 2015-12-06 12:16:39.851544 7fea58c08700 5 -- op tracker -- seq: 10397, time: 2015-12-06 12:16:39.851544, event: done, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380) -223> 2015-12-06 12:16:39.856672 7fea58c08700 5 -- op tracker -- seq: 10399, time: 2015-12-06 12:16:39.856672, event: reached_pg, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380) -222> 2015-12-06 12:16:39.856692 7fea58c08700 5 -- op tracker -- seq: 10399, time: 2015-12-06 12:16:39.856692, event: started, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380) -221> 2015-12-06 12:16:39.856699 7fea58c08700 5 -- op tracker -- seq: 10399, time: 2015-12-06 12:16:39.856699, event: done, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380) -202> 2015-12-06 12:16:39.859786 7fea58c08700 5 -- op tracker -- seq: 10401, time: 2015-12-06 12:16:39.859786, event: reached_pg, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0)) -201> 2015-12-06 12:16:39.859813 7fea58c08700 5 -- op tracker -- seq: 10401, time: 2015-12-06 12:16:39.859813, event: done, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0)) -188> 2015-12-06 12:16:39.860797 7fea58c08700 5 -- op tracker -- seq: 10403, time: 2015-12-06 12:16:39.860796, event: reached_pg, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0)) -187> 2015-12-06 12:16:39.860823 7fea58c08700 5 -- op tracker -- seq: 10403, time: 2015-12-06 12:16:39.860823, event: done, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0)) -4> 2015-12-06 12:16:40.062663 7fea58c08700 5 -- op tracker -- seq: 10417, time: 2015-12-06 12:16:40.062663, event: reached_pg, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0)) 0> 2015-12-06 12:16:40.068089 7fea58c08700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7fea58c08700 time 2015-12-06 12:16:40.062709 root@g8:~# ceph pg dump|grep 87.1c dumped all in format plain 87.1c 41685 0 41685 208425 0 174146019328 3448 3448 active+undersized+degraded+remapped 2015-12-06 12:19:28.900553 226852'44959 227394:123178 [NONE,33,47,37,24,16,22,0,9,13] 33 [NONE,33,9,37,24,16,22,8,13,0] 33 179906'30225 2015-11-10 12:58:04.433918 179906'30225 2015-11-10 12:58:04.433918 root@g8:~# ceph osd map ec8p2 rbd_data.1b2de3e52bdda2a.00000000000f16f4 osdmap e227399 pool 'ec8p2' (87) object 'rbd_data.1b2de3e52bdda2a.00000000000f16f4' -> pg 87.64fad1c (87.1c) -> up ([NONE,33,47,37,24,16,22,0,9,13], p33) acting ([NONE,33,9,37,24,16,22,8,13,0], p33) pool 87 'ec8p2' erasure size 10 min_size 8 crush_ruleset 4 object_hash rjenkins pg_num 128 pgp_num 128 last_change 151534 lfor 151534 flags hashpspool tiers 88 read_tier 88 write_tier 88 stripe_width 4096 pool 88 'ec8p2c' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 152194 flags hashpspool,incomplete_clones tier_of 87 cache_mode writeback target_bytes 200000000000 target_objects 10000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0 root@g8:~# ceph osd map ec8p2c rbd_data.1b2de3e52bdda2a.00000000000f16f4 osdmap e227399 pool 'ec8p2c' (88) object 'rbd_data.1b2de3e52bdda2a.00000000000f16f4' -> pg 88.64fad1c (88.1c) -> up ([16,26,27], p16) acting ([16,26,27], p16)
#7 Updated by Loïc Dachary over 8 years ago
With:
ceph osd set norecover
The 3 OSDs do not crash.
But of course no recovery is happening so we're kind of stuck.
#8 Updated by Loïc Dachary over 8 years ago
debug osd = 20 log on osd.30, 300 MByte :
http://guerby.org/ftp/ceph-osd.30.log.xz
I did update ceph.conf, restart osd.30, ceph osd unset norecover, waited a while (may be one or two restart by upstart), ceph osd set norecover, edit again ceph.conf and restart.
#9 Updated by Loïc Dachary over 8 years ago
For the record osd 30 is the one crashing the most frequently, OSD 33 and 9 are also crashing quickly, other OSD are not crashing.
Per Loic request OSD 9 and 33 (the two other crashing) non debug logs:
http://guerby.org/ftp/ceph-osd.9.log
http://guerby.org/ftp/ceph-osd.33.log
Each has a few crashes in it.
Extract of osd.30 debug log near the first crash by grepping 7fbb209ab700
2015-12-07 10:53:15.289820 7fbb209ab700 10 osd.30 228065 dequeue_op 0x16d88900 prio 127 cost 0 latency 0.185466 MOSDECSubOpReadReply(87.1cs0 228065 ECSubReadReply(tid=512, attrs_read=0)) v1 pg pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] 2015-12-07 10:53:15.289834 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_message: MOSDECSubOpReadReply(87.1cs0 228065 ECSubReadReply(tid=512, attrs_read=0)) v1 2015-12-07 10:53:15.289842 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=512, attrs_read=0) 2015-12-07 10:53:15.289853 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply readop complete: ReadOp(tid=512, to_read={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=read_request_t(to_read=[0,8388608,0], need=0(9),8(7),9(2),13(8), want_attrs=1)}, complete={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=read_result_t(r=0, errors={}, attrs={_=buffer::list(len=276, 2015-12-07 10:53:15.289875 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: returned ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87 (0, 8388608, [0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288]) 2015-12-07 10:53:15.289889 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: [2,1048576, 7,524288, 8,524288, 9,524288] 2015-12-07 10:53:15.293099 7fbb209ab700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7fbb209ab700 time 2015-12-07 10:53:15.289898
Second crash:
2015-12-07 10:54:50.671509 7f32ea436700 10 osd.30 228074 dequeue_op 0x13ae24000 prio 127 cost 0 latency 0.002933 MOSDECSubOpReadReply(87.1cs0 228074 ECSubReadReply(tid=354, attrs_read=1)) v1 pg pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] 2015-12-07 10:54:50.671620 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_message: MOSDECSubOpReadReply(87.1cs0 228074 ECSubReadReply(tid=354, attrs_read=1)) v1 2015-12-07 10:54:50.671649 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=354, attrs_read=1) 2015-12-07 10:54:50.671709 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply readop complete: ReadOp(tid=354, to_read={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=read_request_t(to_read=[0,8388608,0], need=0(9),8(7),9(2),13(8), want_attrs=1)}, complete={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=read_result_t(r=0, errors={}, attrs={_=buffer::list(len=276, 2015-12-07 10:54:50.671851 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: returned ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87 (0, 8388608, [0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288]) 2015-12-07 10:54:50.671966 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: [2,1048576, 7,524288, 8,524288, 9,524288] 2015-12-07 10:54:50.675197 7f32ea436700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f32ea436700 time 2015-12-07 10:54:50.672079
#10 Updated by Loïc Dachary over 8 years ago
Missing part after len=276, (it's on two lines) :
buffer::ptr(524538~276 0x1434e80fa in raw 0x143468000 len 524939 nref 4) ),hinfo_key=buffer::list(len=58, buffer::ptr(524831~58 0x1434e821f in raw 0x143468000 len 524939 nref 4) ),snapset=buffer::list(len=31, buffer::ptr(524904~31 0x1434e8268 in raw 0x143468000 len 524939 nref 4) )}, returned=(0, 8388608, [0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288])}, priority=1, obj_to_source={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=0(9),8(7),9(2),13(8)}, source_to_obj={0(9)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,8(7)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,9(2)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,13(8)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87}, in_progress=)
Second crash:
buffer::ptr(524538~276 0x1412280fa in raw 0x1411a8000 len 524939 nref 5) ),hinfo_key=buffer::list(len=58, buffer::ptr(524831~58 0x14122821f in raw 0x1411a8000 len 524939 nref 5) ),snapset=buffer::list(len=31, buffer::ptr(524904~31 0x141228268 in raw 0x1411a8000 len 524939 nref 5) )}, returned=(0, 8388608, [0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288])}, priority=1, obj_to_source={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=0(9),8(7),9(2),13(8)}, source_to_obj={0(9)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,8(7)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,9(2)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,13(8)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87}, in_progress=)
#11 Updated by Loïc Dachary over 8 years ago
[0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288])
[2,1048576, 7,524288, 8,524288, 9,524288]
=> one piece is bigger than the others with 1048576 vs 524288
and ECUtil::decode expect all parts to be equal via assert(i->second.length() == total_data_size) and
total_data_size = to_decode.begin()->second.length() so same as the first length if I'm reading correctly
Now I don't know how one piece can be bigger than the others, if it's ok or not.
#12 Updated by Loïc Dachary over 8 years ago
Per Loic request:
root@g9:~# ceph osd erasure-code-profile ls default ec4p1profile ec8p2profile root@g9:~# ceph osd erasure-code-profile get ec8p2profile directory=/usr/lib/ceph/erasure-code k=8 m=2 plugin=jerasure ruleset-failure-domain=host technique=reed_sol_van
#13 Updated by Loïc Dachary over 8 years ago
root@g3:~# ceph osd map ec8p2 rbd_data.1c25f6c2eb141f2.00000000000f90bf osdmap e228088 pool 'ec8p2' (87) object 'rbd_data.1c25f6c2eb141f2.00000000000f90bf' -> pg 87.ab4fad1c (87.1c) -> up ([30,33,47,37,24,16,22,0,9,13], p30) acting ([30,33,9,37,24,16,22,8,13,0], p30)
Found something strange about 1c25f6c2eb141f2.00000000000f90bf object on osd.9 :
root@g6:/var/lib/ceph/osd/ceph-9/current/87.1cs2_head# ls -l './DIR_C/DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' -rw-r--r-- 1 root root 274878431232 Nov 16 23:01 ./DIR_C/DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2 root@g6:/var/lib/ceph/osd/ceph-9/current/87.1cs2_head# du -hs './DIR_C/DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' 516K ./DIR_C/DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2
Note the "274878431232" size from ls, not matching the 516K from du.
On osd.30 :
root@g8:/var/lib/ceph/osd/ceph-30/current/87.1cs0_head/DIR_C# ls -l './DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_0' -rw-r--r-- 1 root root 524288 Oct 11 05:34 ./DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_0 root@g8:/var/lib/ceph/osd/ceph-30/current/87.1cs0_head/DIR_C# du -hs './DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_0' 516K ./DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_0
#14 Updated by Loïc Dachary over 8 years ago
It looks like the abnormal sized file is containing data for the first 524288 bytes then only zeroes.
dd if='rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' of=/tmp/a2 bs=524288 count=1 dd if='rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' of=/tmp/a2 bs=524288 count=2 root@g6:/tmp# od -a a1|tail -10 1777560 ; 7 etx ` T E 3 ff sp dc2 esc @ b R T 8 1777600 p 9 - x D R nak 4 K 8 ! del g e I ~ 1777620 rs t dle E us k C b O R z sub W w e dc4 1777640 i V eot l \ 3 ` so cr i @ b C " ] @ 1777660 Y c so - c c G sub j m 1 5 soh Y : ( 1777700 nul f p dc2 x y us P ^ T ^ t Y I b ] 1777720 stx h F bel q U A dc3 U L C 8 1 N , nul 1777740 k 1 W ? sp etx ~ U stx # N o etx dc1 X 1 1777760 bs ^ = n x & gs i n < enq 2 u sp h x 2000000 root@g6:/tmp# od -a a2|tail -10 1777620 rs t dle E us k C b O R z sub W w e dc4 1777640 i V eot l \ 3 ` so cr i @ b C " ] @ 1777660 Y c so - c c G sub j m 1 5 soh Y : ( 1777700 nul f p dc2 x y us P ^ T ^ t Y I b ] 1777720 stx h F bel q U A dc3 U L C 8 1 N , nul 1777740 k 1 W ? sp etx ~ U stx # N o etx dc1 X 1 1777760 bs ^ = n x & gs i n < enq 2 u sp h x 2000000 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul * 4000000
#15 Updated by Loïc Dachary over 8 years ago
As requested by Loic:
- ceph osd pool get ec8p2 erasure_code_profile
erasure_code_profile: ec8p2profile
- ceph osd erasure-code-profile get ec8p2profile
directory=/usr/lib/ceph/erasure-code
k=8
m=2
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van
#16 Updated by Loïc Dachary over 8 years ago
xattr just in case.
root@g6:/var/lib/ceph/osd/ceph-9/current/87.1cs2_head/DIR_C/DIR_1/DIR_D/DIR_A# xattr -l 'rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' user.cephos.spill_out: 0000 31 00 1. user.ceph._: 0000 0F 08 0E 01 00 00 04 03 4A 00 00 00 00 00 00 00 ........J....... 0010 29 00 00 00 72 62 64 5F 64 61 74 61 2E 31 63 32 )...rbd_data.1c2 0020 35 66 36 63 32 65 62 31 34 31 66 32 2E 30 30 30 5f6c2eb141f2.000 0030 30 30 30 30 30 30 30 30 66 39 30 62 66 FE FF FF 00000000f90bf... 0040 FF FF FF FF FF 1C AD 4F AB 00 00 00 00 00 57 00 .......O......W. 0050 00 00 00 00 00 00 06 03 1C 00 00 00 57 00 00 00 ............W... 0060 00 00 00 00 FF FF FF FF 00 00 00 00 00 00 00 00 ................ 0070 FF FF FF FF FF FF FF FF 00 00 00 00 C4 41 00 00 .............A.. 0080 00 00 00 00 65 58 01 00 00 00 00 00 00 00 00 00 ....eX.......... 0090 00 00 00 00 02 02 15 00 00 00 04 22 00 00 00 00 ...........".... 00A0 00 00 00 C5 F0 15 00 00 00 00 00 FA 39 01 00 00 ............9... 00B0 00 40 00 00 00 00 00 FF D1 19 56 9E 02 1A 32 02 .@........V...2. 00C0 02 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 91 ................ 00F0 33 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3............... 0100 34 00 00 00 BB D8 19 56 02 15 17 05 47 70 F8 94 4......V....Gp.. 0110 FF FF FF FF .... user.ceph.hinfo_key: 0000 01 01 34 00 00 00 00 00 08 00 00 00 00 00 0A 00 ..4............. 0010 00 00 9B 71 16 B1 C3 ED D6 70 BB 6F 77 83 D5 C1 ...q.....p.ow... 0020 55 86 CC 85 32 1B F1 31 E3 4B 7E 70 0E 28 AA 10 U...2..1.K~p.(.. 0030 E8 8C 40 F0 79 0D 32 BA FF 52 ..@.y.2..R
#17 Updated by Loïc Dachary over 8 years ago
I think it is safe for you to remove the corrupted shard (thanks to Kefu for clarifying this ;-). It will be rebuilt by decode, using the other shards. The bug is that a corrupted shard should not reach this point in the code path, it should be treated as invalid earlier on. This is a rare occurrence because such corruptions are supposed to be caught and fixed by deep scrubbing (it is my understanding that you disabled it some time ago). I also think David Zafman made it so a shard with an invalid size cannot reach this point in the code path but I don't remember exactly.
#18 Updated by Loïc Dachary over 8 years ago
- File ceph-osd-dump-20151207.txt View added
- File ceph-pg-dump-20151207.txt.gz added
#19 Updated by Loïc Dachary over 8 years ago
- Related to Bug #12000: OSD: EC reads do not correctly validate checksums and data contents added
#20 Updated by Loïc Dachary over 8 years ago
- File cm-20151207.txt View added
#21 Updated by Mehdi Abaakouk over 8 years ago
I have workaround our issue by stopping the three OSDs that crash, and truncate this object to the right size:
cd /var/lib/ceph/osd/ceph-9/current/87.1cs2_head/DIR_C/DIR_1/DIR_D/DIR_A/
cp 'rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' '/root/sileht-save-rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2'
dd if='/root/sileht-save-rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' of='rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' count=1 bs=524288
Then I have restart the three OSDs, the cluster now recovers without crashing.
I have successfully retrieved the object that was impacted by this through rados.
#22 Updated by Samuel Just over 8 years ago
- Status changed from New to Can't reproduce
Reopen if we get more information about how these objects came to be the wrong size.
#23 Updated by Laurent GUERBY over 8 years ago
The bug is about the OSDs that shouldn't crash in this case where only one shard in a EC 8+2 setting is corrupted in a specific way, here file with correct data but with sparse zone of zero at the end (probably via a erroneous sparse write). EC should detect this condition and repair the file all by itself.
#24 Updated by Loïc Dachary over 8 years ago
I think the problem is fixed in infernalis via https://github.com/ceph/ceph/pull/5173 . It also is my understanding that backporting the necessary fix to hammer would be difficult and that the risk of regressions associated with it is high. Should someone run into the same problem, the suggested workaround (either recover the object in the same way you did or just remove the file so it is reconstructed) is less disruptive.
Sam or David may comment more if I misunderstood the tradeoff.
#25 Updated by Loïc Dachary over 8 years ago
- Copied to Backport #14087: FAILED assert(old_size == total_chunk_size) in 0.94.5 / strange file size added