Project

General

Profile

Bug #14009

FAILED assert(old_size == total_chunk_size) in 0.94.5 / strange file size

Added by Laurent GUERBY over 5 years ago. Updated over 5 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Three of our OSD are failing with

2015-12-06 09:44:22.271873 7f47091fc700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f47\
091fc700 time 2015-12-06 09:44:22.267329
osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_chunk_size)

during recovery

ceph-osd-dump-20151207.txt View (14.3 KB) Loïc Dachary, 12/07/2015 05:26 PM

ceph-pg-dump-20151207.txt.gz (643 KB) Loïc Dachary, 12/07/2015 05:27 PM

cm-20151207.txt View (6.62 KB) Loïc Dachary, 12/07/2015 05:29 PM


Related issues

Related to Ceph - Bug #12000: OSD: EC reads do not correctly validate checksums and data contents Resolved 06/12/2015
Copied to Ceph - Backport #14087: FAILED assert(old_size == total_chunk_size) in 0.94.5 / strange file size Resolved

Associated revisions

Revision f04e0075 (diff)
Added by Loic Dachary over 5 years ago

osd: log inconsistent shard sizes

When an OSD asserts because it cannot recover from an unexpected erasure
coded shard size, the object needs to be manually fixed (i.e. the shard
must be removed so it can be reconstructed). Unfortunately the assert
does not display the name of the object and it is not displayed unless
the OSD logs are >= 10, which is uncommon on an actual cluster because
it creates log files that are too big.

The problem has been fixed in infernalis as part of
https://github.com/ceph/ceph/pull/5173 and backporting it is non
trivial.

The error condition is modified to make it easy for the system
administrator to identify which object is causing the crash and what
steps must be taken to fix the issue.

http://tracker.ceph.com/issues/14009 Refs: #14009

Signed-off-by: Loic Dachary <>

History

#1 Updated by Laurent GUERBY over 5 years ago

2015-12-06 09:44:22.271873 7f47091fc700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f47\
091fc700 time 2015-12-06 09:44:22.267329
osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_chunk_size)

ceph version 0.94.5-164-gbf9e1b6 (bf9e1b6307692fdc50465a64590d83e3d7015c9d)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc5a9b]
 2: (ECUtil::decode(ECUtil::stripe_info_t const&, std::tr1::shared_ptr&lt;ceph::ErasureCodeInterface&gt;&, std::map&lt;int, ceph::buffer::list, std::less&lt;int&gt;, std::allocator&lt;std::pair&lt;int const, ceph::buffer::list&gt; > >&, std::map&lt;int, ceph::buff\
er::list*, std::less&lt;int&gt;, std::allocator&lt;std::pair&lt;int const, ceph::buffer::list*&gt; > >&)+0xcde) [0xac33de]
 3: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple&lt;unsigned long, unsigned long, std::map&lt;pg_shard_t, ceph::buffer::list, std::less&lt;pg_shard_t&gt;, std::allocator&lt;std::pair&lt;pg_shard_t const, ceph::buffer::l\
ist> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional&lt;std::map&lt;std::string, ceph:\
:buffer::list, std::less&lt;std::string&gt;, std::allocator&lt;std::pair&lt;std::string const, ceph::buffer::list&gt; > > >, RecoveryMessages*)+0x7c0) [0xa3ee40]
 4: (OnRecoveryReadComplete::finish(std::pair&lt;RecoveryMessages*, ECBackend::read_result_t&&gt;&)+0x121) [0xa4fa01]
 5: (GenContext&lt;std::pair&lt;RecoveryMessages*, ECBackend::read_result_t&&gt;&>::complete(std::pair&lt;RecoveryMessages*, ECBackend::read_result_t&&gt;&)+0x9) [0xa3f7a9]
 6: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0xa31e53]
 7: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x96d) [0xa3645d]
 8: (ECBackend::handle_message(std::tr1::shared_ptr&lt;OpRequest&gt;)+0x17e) [0xa3e2ae]
 9: (ReplicatedPG::do_request(std::tr1::shared_ptr&lt;OpRequest&gt;&, ThreadPool::TPHandle&)+0x167) [0x843fb7]
 10: (OSD::dequeue_op(boost::intrusive_ptr&lt;PG&gt;, std::tr1::shared_ptr&lt;OpRequest&gt;, ThreadPool::TPHandle&)+0x3bd) [0x6a1aed]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x331) [0x6a2011]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85f) [0xbb55ef]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbb7710]
 14: (()+0x8182) [0x7f474f559182]
 15: (clone()+0x6d) [0x7f474dac447d]
 NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.

#3 Updated by Laurent GUERBY over 5 years ago

There is a strange file on disk described here

http://tracker.ceph.com/issues/13972#note-32
http://tracker.ceph.com/issues/13972#note-33

rados get on this object immediately crashes the OSD.

norecover allows the cluster to function

#4 Updated by Loïc Dachary over 5 years ago

COPYING THE INITIAL REPORT

#5 Updated by Loïc Dachary over 5 years ago

We started seeing this tonight on our ceph 0.94.5 cluster tonight, RBD over EC pools (one 4+1, one 8+2), ubuntu trusty, 3 OSD died with:

2015-12-06 09:44:22.271873 7f47091fc700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f47\
091fc700 time 2015-12-06 09:44:22.267329
osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_chunk_size)

ceph version 0.94.5-164-gbf9e1b6 (bf9e1b6307692fdc50465a64590d83e3d7015c9d)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc5a9b]
2: (ECUtil::decode(ECUtil::stripe_info_t const&, std::tr1::shared_ptr&lt;ceph::ErasureCodeInterface&gt;&, std::map&lt;int, ceph::buffer::list, std::less&lt;int&gt;, std::allocator&lt;std::pair&lt;int const, ceph::buffer::list&gt; > >&, std::map&lt;int, ceph::buff\
er::list*, std::less&lt;int&gt;, std::allocator&lt;std::pair&lt;int const, ceph::buffer::list*&gt; > >&)+0xcde) [0xac33de]
3: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple&lt;unsigned long, unsigned long, std::map&lt;pg_shard_t, ceph::buffer::list, std::less&lt;pg_shard_t&gt;, std::allocator&lt;std::pair&lt;pg_shard_t const, ceph::buffer::l\
ist> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional&lt;std::map&lt;std::string, ceph:\
:buffer::list, std::less&lt;std::string&gt;, std::allocator&lt;std::pair&lt;std::string const, ceph::buffer::list&gt; > > >, RecoveryMessages*)+0x7c0) [0xa3ee40]
4: (OnRecoveryReadComplete::finish(std::pair&lt;RecoveryMessages*, ECBackend::read_result_t&&gt;&)+0x121) [0xa4fa01]
5: (GenContext&lt;std::pair&lt;RecoveryMessages*, ECBackend::read_result_t&&gt;&>::complete(std::pair&lt;RecoveryMessages*, ECBackend::read_result_t&&gt;&)+0x9) [0xa3f7a9]
6: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0xa31e53]
7: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x96d) [0xa3645d]
8: (ECBackend::handle_message(std::tr1::shared_ptr&lt;OpRequest&gt;)+0x17e) [0xa3e2ae]
9: (ReplicatedPG::do_request(std::tr1::shared_ptr&lt;OpRequest&gt;&, ThreadPool::TPHandle&)+0x167) [0x843fb7]
10: (OSD::dequeue_op(boost::intrusive_ptr&lt;PG&gt;, std::tr1::shared_ptr&lt;OpRequest&gt;, ThreadPool::TPHandle&)+0x3bd) [0x6a1aed]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x331) [0x6a2011]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85f) [0xbb55ef]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbb7710]
14: (()+0x8182) [0x7f474f559182]
15: (clone()+0x6d) [0x7f474dac447d]
NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.

#6 Updated by Loïc Dachary over 5 years ago

The three OSD keep having the same error a short time after restart, log with ms = 1 objecter = 20 on osd.30 is attached.

attach didn't work, here is the URL

http://guerby.org/ftp/ooops-30.log.xz

Note: only the last restart has detailed debug (previous didn't have the ms):

1> 2015-12-06 10:12:42.364429 7f37c06fe700  5 - op tracker -- seq: 24552, time: 2015-12-06 10:12:42.364429, event: done, op: MOSDPGPushReply(87.35s0 226867 [PushReplyOp(2a31b0b5/rbd_data.1cecf133ca1a5bd.00000000000c74c6/head//87),PushReplyOp(fd31b0b5/rbd_data.1cecf133ca1a5bd.000000000000a5f1/head//87),PushReplyOp(a841b0b5/rbd_data.1c25f6c2eb141f2.0000000000055752/head//87)])
0> 2015-12-06 10:12:42.365322 7f37c2f03700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map&lt;int, ceph::buffer::list&gt;&, std::map&lt;int, ceph::buffer::list*&gt;&)' thread 7f37c2f03700 time 2015-12-06 10:12:42.362228
osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_chunk_size)

Looks like the object causing the issue is rbd_data.1b2de3e52bdda2a.00000000000f16f4 in 8+2 ec pool

grep 7fea58c08700 /var/log/ceph/ceph-osd.30.log
...
  -239> 2015-12-06 12:16:39.851518 7fea58c08700  5 -- op tracker -- seq: 10397, time: 2015-12-06 12:16:39.851518, event: reached_pg, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380)
  -238> 2015-12-06 12:16:39.851537 7fea58c08700  5 -- op tracker -- seq: 10397, time: 2015-12-06 12:16:39.851537, event: started, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380)
  -237> 2015-12-06 12:16:39.851544 7fea58c08700  5 -- op tracker -- seq: 10397, time: 2015-12-06 12:16:39.851544, event: done, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380)
  -223> 2015-12-06 12:16:39.856672 7fea58c08700  5 -- op tracker -- seq: 10399, time: 2015-12-06 12:16:39.856672, event: reached_pg, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380)
  -222> 2015-12-06 12:16:39.856692 7fea58c08700  5 -- op tracker -- seq: 10399, time: 2015-12-06 12:16:39.856692, event: started, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380)
  -221> 2015-12-06 12:16:39.856699 7fea58c08700  5 -- op tracker -- seq: 10399, time: 2015-12-06 12:16:39.856699, event: done, op: pg_scan(digest 87.1cs0 64fad1c/rbd_data.1b2de3e52bdda2a.00000000000f16f4/head//87-MAX e 227380/227380)
  -202> 2015-12-06 12:16:39.859786 7fea58c08700  5 -- op tracker -- seq: 10401, time: 2015-12-06 12:16:39.859786, event: reached_pg, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0))
  -201> 2015-12-06 12:16:39.859813 7fea58c08700  5 -- op tracker -- seq: 10401, time: 2015-12-06 12:16:39.859813, event: done, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0))
  -188> 2015-12-06 12:16:39.860797 7fea58c08700  5 -- op tracker -- seq: 10403, time: 2015-12-06 12:16:39.860796, event: reached_pg, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0))
  -187> 2015-12-06 12:16:39.860823 7fea58c08700  5 -- op tracker -- seq: 10403, time: 2015-12-06 12:16:39.860823, event: done, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0))
    -4> 2015-12-06 12:16:40.062663 7fea58c08700  5 -- op tracker -- seq: 10417, time: 2015-12-06 12:16:40.062663, event: reached_pg, op: MOSDECSubOpReadReply(87.1cs0 227380 ECSubReadReply(tid=51, attrs_read=0))
     0> 2015-12-06 12:16:40.068089 7fea58c08700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7fea58c08700 time 2015-12-06 12:16:40.062709

root@g8:~# ceph pg dump|grep 87.1c
dumped all in format plain
87.1c    41685    0    41685    208425    0    174146019328    3448    3448    active+undersized+degraded+remapped    2015-12-06 12:19:28.900553    226852'44959    227394:123178    [NONE,33,47,37,24,16,22,0,9,13]    33    [NONE,33,9,37,24,16,22,8,13,0]    33    179906'30225    2015-11-10 12:58:04.433918    179906'30225    2015-11-10 12:58:04.433918
root@g8:~# ceph osd map ec8p2 rbd_data.1b2de3e52bdda2a.00000000000f16f4
osdmap e227399 pool 'ec8p2' (87) object 'rbd_data.1b2de3e52bdda2a.00000000000f16f4' -> pg 87.64fad1c (87.1c) -> up ([NONE,33,47,37,24,16,22,0,9,13], p33) acting ([NONE,33,9,37,24,16,22,8,13,0], p33)
pool 87 'ec8p2' erasure size 10 min_size 8 crush_ruleset 4 object_hash rjenkins pg_num 128 pgp_num 128 last_change 151534 lfor 151534 flags hashpspool tiers 88 read_tier 88 write_tier 88 stripe_width 4096
pool 88 'ec8p2c' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 152194 flags hashpspool,incomplete_clones tier_of 87 cache_mode writeback target_bytes 200000000000 target_objects 10000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
root@g8:~# ceph osd map ec8p2c rbd_data.1b2de3e52bdda2a.00000000000f16f4
osdmap e227399 pool 'ec8p2c' (88) object 'rbd_data.1b2de3e52bdda2a.00000000000f16f4' -> pg 88.64fad1c (88.1c) -> up ([16,26,27], p16) acting ([16,26,27], p16)

#7 Updated by Loïc Dachary over 5 years ago

With:

ceph osd set norecover

The 3 OSDs do not crash.

But of course no recovery is happening so we're kind of stuck.

#8 Updated by Loïc Dachary over 5 years ago

debug osd = 20 log on osd.30, 300 MByte :

http://guerby.org/ftp/ceph-osd.30.log.xz

I did update ceph.conf, restart osd.30, ceph osd unset norecover, waited a while (may be one or two restart by upstart), ceph osd set norecover, edit again ceph.conf and restart.

#9 Updated by Loïc Dachary over 5 years ago

For the record osd 30 is the one crashing the most frequently, OSD 33 and 9 are also crashing quickly, other OSD are not crashing.

Per Loic request OSD 9 and 33 (the two other crashing) non debug logs:

http://guerby.org/ftp/ceph-osd.9.log
http://guerby.org/ftp/ceph-osd.33.log

Each has a few crashes in it.

Extract of osd.30 debug log near the first crash by grepping 7fbb209ab700

2015-12-07 10:53:15.289820 7fbb209ab700 10 osd.30 228065 dequeue_op 0x16d88900 prio 127 cost 0 latency 0.185466 MOSDECSubOpReadReply(87.1cs0 228065 ECSubReadReply(tid=512, attrs_read=0)) v1 pg pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling]
2015-12-07 10:53:15.289834 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_message: MOSDECSubOpReadReply(87.1cs0 228065 ECSubReadReply(tid=512, attrs_read=0)) v1
2015-12-07 10:53:15.289842 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=512, attrs_read=0)
2015-12-07 10:53:15.289853 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply readop complete: ReadOp(tid=512, to_read={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=read_request_t(to_read=[0,8388608,0], need=0(9),8(7),9(2),13(8), want_attrs=1)}, complete={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=read_result_t(r=0, errors={}, attrs={_=buffer::list(len=276,
2015-12-07 10:53:15.289875 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: returned ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87 (0, 8388608, [0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288])
2015-12-07 10:53:15.289889 7fbb209ab700 10 osd.30 pg_epoch: 228065 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228065 n=41685 ec=88152 les/c 228065/181936 228060/228060/228054) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228060 pi=181856-228059/1322 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: [2,1048576, 7,524288, 8,524288, 9,524288]
2015-12-07 10:53:15.293099 7fbb209ab700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7fbb209ab700 time 2015-12-07 10:53:15.289898

Second crash:

2015-12-07 10:54:50.671509 7f32ea436700 10 osd.30 228074 dequeue_op 0x13ae24000 prio 127 cost 0 latency 0.002933 MOSDECSubOpReadReply(87.1cs0 228074 ECSubReadReply(tid=354, attrs_read=1)) v1 pg pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling]
2015-12-07 10:54:50.671620 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_message: MOSDECSubOpReadReply(87.1cs0 228074 ECSubReadReply(tid=354, attrs_read=1)) v1
2015-12-07 10:54:50.671649 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=354, attrs_read=1)
2015-12-07 10:54:50.671709 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_sub_read_reply readop complete: ReadOp(tid=354, to_read={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=read_request_t(to_read=[0,8388608,0], need=0(9),8(7),9(2),13(8), want_attrs=1)}, complete={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=read_result_t(r=0, errors={}, attrs={_=buffer::list(len=276,
2015-12-07 10:54:50.671851 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: returned ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87 (0, 8388608, [0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288])
2015-12-07 10:54:50.671966 7f32ea436700 10 osd.30 pg_epoch: 228074 pg[87.1cs0( v 227628'44961 (180913'41511,227628'44961] local-les=228074 n=41685 ec=88152 les/c 228074/181936 228072/228072/228068) [30,33,47,37,24,16,22,0,9,13]/[30,33,9,37,24,16,22,8,13,0] r=0 lpr=228072 pi=181856-228071/1326 rops=1 bft=0(7),9(8),13(9),47(2) crt=226852'44959 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: [2,1048576, 7,524288, 8,524288, 9,524288]
2015-12-07 10:54:50.675197 7f32ea436700 -1 osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7f32ea436700 time 2015-12-07 10:54:50.672079

#10 Updated by Loïc Dachary over 5 years ago

Missing part after len=276, (it's on two lines) :

    buffer::ptr(524538~276 0x1434e80fa in raw 0x143468000 len 524939 nref 4)
),hinfo_key=buffer::list(len=58,
    buffer::ptr(524831~58 0x1434e821f in raw 0x143468000 len 524939 nref 4)
),snapset=buffer::list(len=31,
    buffer::ptr(524904~31 0x1434e8268 in raw 0x143468000 len 524939 nref 4)
)}, returned=(0, 8388608, [0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288])}, priority=1, obj_to_source={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=0(9),8(7),9(2),13(8)}, source_to_obj={0(9)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,8(7)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,9(2)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,13(8)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87}, in_progress=)

Second crash:

    buffer::ptr(524538~276 0x1412280fa in raw 0x1411a8000 len 524939 nref 5)
),hinfo_key=buffer::list(len=58,
    buffer::ptr(524831~58 0x14122821f in raw 0x1411a8000 len 524939 nref 5)
),snapset=buffer::list(len=31,
    buffer::ptr(524904~31 0x141228268 in raw 0x1411a8000 len 524939 nref 5)
)}, returned=(0, 8388608, [0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288])}, priority=1, obj_to_source={ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87=0(9),8(7),9(2),13(8)}, source_to_obj={0(9)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,8(7)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,9(2)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87,13(8)=ab4fad1c/rbd_data.1c25f6c2eb141f2.00000000000f90bf/head//87}, in_progress=)

#11 Updated by Loïc Dachary over 5 years ago

[0(9),524288, 8(7),524288, 9(2),1048576, 13(8),524288])
[2,1048576, 7,524288, 8,524288, 9,524288]

=> one piece is bigger than the others with 1048576 vs 524288
and ECUtil::decode expect all parts to be equal via assert(i->second.length() == total_data_size) and
total_data_size = to_decode.begin()->second.length() so same as the first length if I'm reading correctly

Now I don't know how one piece can be bigger than the others, if it's ok or not.

#12 Updated by Loïc Dachary over 5 years ago

Per Loic request:

root@g9:~# ceph osd erasure-code-profile ls
default
ec4p1profile
ec8p2profile

root@g9:~# ceph osd erasure-code-profile get ec8p2profile
directory=/usr/lib/ceph/erasure-code
k=8
m=2
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van

#13 Updated by Loïc Dachary over 5 years ago

root@g3:~# ceph osd map ec8p2 rbd_data.1c25f6c2eb141f2.00000000000f90bf
osdmap e228088 pool 'ec8p2' (87) object 'rbd_data.1c25f6c2eb141f2.00000000000f90bf' -> pg 87.ab4fad1c (87.1c) -> up ([30,33,47,37,24,16,22,0,9,13], p30) acting ([30,33,9,37,24,16,22,8,13,0], p30)

Found something strange about 1c25f6c2eb141f2.00000000000f90bf object on osd.9 :

root@g6:/var/lib/ceph/osd/ceph-9/current/87.1cs2_head# ls -l './DIR_C/DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2'
-rw-r--r-- 1 root root 274878431232 Nov 16 23:01 ./DIR_C/DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2
root@g6:/var/lib/ceph/osd/ceph-9/current/87.1cs2_head# du -hs './DIR_C/DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2'
516K    ./DIR_C/DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2

Note the "274878431232" size from ls, not matching the 516K from du.

On osd.30 :

root@g8:/var/lib/ceph/osd/ceph-30/current/87.1cs0_head/DIR_C# ls -l './DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_0'
-rw-r--r-- 1 root root 524288 Oct 11 05:34 ./DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_0
root@g8:/var/lib/ceph/osd/ceph-30/current/87.1cs0_head/DIR_C# du -hs './DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_0'
516K    ./DIR_1/DIR_D/DIR_A/rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_0

#14 Updated by Loïc Dachary over 5 years ago

It looks like the abnormal sized file is containing data for the first 524288 bytes then only zeroes.

dd if='rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' of=/tmp/a2 bs=524288 count=1
dd if='rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' of=/tmp/a2 bs=524288 count=2

root@g6:/tmp# od -a a1|tail -10
1777560   ;   7 etx   `   T   E   3  ff  sp dc2 esc   @   b   R   T   8
1777600   p   9   -   x   D   R nak   4   K   8   ! del   g   e   I   ~
1777620  rs   t dle   E  us   k   C   b   O   R   z sub   W   w   e dc4
1777640   i   V eot   l   \   3   `  so  cr   i   @   b   C   "   ]   @
1777660   Y   c  so   -   c   c   G sub   j   m   1   5 soh   Y   :   (
1777700 nul   f   p dc2   x   y  us   P   ^   T   ^   t   Y   I   b   ]
1777720 stx   h   F bel   q   U   A dc3   U   L   C   8   1   N   , nul
1777740   k   1   W   ?  sp etx   ~   U stx   #   N   o etx dc1   X   1
1777760  bs   ^   =   n   x   &  gs   i   n   < enq   2   u  sp   h   x
2000000
root@g6:/tmp# od -a a2|tail -10
1777620  rs   t dle   E  us   k   C   b   O   R   z sub   W   w   e dc4
1777640   i   V eot   l   \   3   `  so  cr   i   @   b   C   "   ]   @
1777660   Y   c  so   -   c   c   G sub   j   m   1   5 soh   Y   :   (
1777700 nul   f   p dc2   x   y  us   P   ^   T   ^   t   Y   I   b   ]
1777720 stx   h   F bel   q   U   A dc3   U   L   C   8   1   N   , nul
1777740   k   1   W   ?  sp etx   ~   U stx   #   N   o etx dc1   X   1
1777760  bs   ^   =   n   x   &  gs   i   n   < enq   2   u  sp   h   x
2000000 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
4000000

#15 Updated by Loïc Dachary over 5 years ago

As requested by Loic:

  1. ceph osd pool get ec8p2 erasure_code_profile
    erasure_code_profile: ec8p2profile
  1. ceph osd erasure-code-profile get ec8p2profile
    directory=/usr/lib/ceph/erasure-code
    k=8
    m=2
    plugin=jerasure
    ruleset-failure-domain=host
    technique=reed_sol_van

#16 Updated by Loïc Dachary over 5 years ago

xattr just in case.

root@g6:/var/lib/ceph/osd/ceph-9/current/87.1cs2_head/DIR_C/DIR_1/DIR_D/DIR_A# xattr -l 'rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2'
user.cephos.spill_out:
0000   31 00                                              1.

user.ceph._:
0000   0F 08 0E 01 00 00 04 03 4A 00 00 00 00 00 00 00    ........J.......
0010   29 00 00 00 72 62 64 5F 64 61 74 61 2E 31 63 32    )...rbd_data.1c2
0020   35 66 36 63 32 65 62 31 34 31 66 32 2E 30 30 30    5f6c2eb141f2.000
0030   30 30 30 30 30 30 30 30 66 39 30 62 66 FE FF FF    00000000f90bf...
0040   FF FF FF FF FF 1C AD 4F AB 00 00 00 00 00 57 00    .......O......W.
0050   00 00 00 00 00 00 06 03 1C 00 00 00 57 00 00 00    ............W...
0060   00 00 00 00 FF FF FF FF 00 00 00 00 00 00 00 00    ................
0070   FF FF FF FF FF FF FF FF 00 00 00 00 C4 41 00 00    .............A..
0080   00 00 00 00 65 58 01 00 00 00 00 00 00 00 00 00    ....eX..........
0090   00 00 00 00 02 02 15 00 00 00 04 22 00 00 00 00    ..........."....
00A0   00 00 00 C5 F0 15 00 00 00 00 00 FA 39 01 00 00    ............9...
00B0   00 40 00 00 00 00 00 FF D1 19 56 9E 02 1A 32 02    .@........V...2.
00C0   02 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
00D0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
00E0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 91    ................
00F0   33 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00    3...............
0100   34 00 00 00 BB D8 19 56 02 15 17 05 47 70 F8 94    4......V....Gp..
0110   FF FF FF FF                                        ....

user.ceph.hinfo_key:
0000   01 01 34 00 00 00 00 00 08 00 00 00 00 00 0A 00    ..4.............
0010   00 00 9B 71 16 B1 C3 ED D6 70 BB 6F 77 83 D5 C1    ...q.....p.ow...
0020   55 86 CC 85 32 1B F1 31 E3 4B 7E 70 0E 28 AA 10    U...2..1.K~p.(..
0030   E8 8C 40 F0 79 0D 32 BA FF 52                      ..@.y.2..R

#17 Updated by Loïc Dachary over 5 years ago

I think it is safe for you to remove the corrupted shard (thanks to Kefu for clarifying this ;-). It will be rebuilt by decode, using the other shards. The bug is that a corrupted shard should not reach this point in the code path, it should be treated as invalid earlier on. This is a rare occurrence because such corruptions are supposed to be caught and fixed by deep scrubbing (it is my understanding that you disabled it some time ago). I also think David Zafman made it so a shard with an invalid size cannot reach this point in the code path but I don't remember exactly.

#19 Updated by Loïc Dachary over 5 years ago

  • Related to Bug #12000: OSD: EC reads do not correctly validate checksums and data contents added

#20 Updated by Loïc Dachary over 5 years ago

#21 Updated by Mehdi Abaakouk over 5 years ago

I have workaround our issue by stopping the three OSDs that crash, and truncate this object to the right size:

cd /var/lib/ceph/osd/ceph-9/current/87.1cs2_head/DIR_C/DIR_1/DIR_D/DIR_A/
cp 'rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' '/root/sileht-save-rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2'

dd if='/root/sileht-save-rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' of='rbd\udata.1c25f6c2eb141f2.00000000000f90bf__head_AB4FAD1C__57_ffffffffffffffff_2' count=1 bs=524288

Then I have restart the three OSDs, the cluster now recovers without crashing.

I have successfully retrieved the object that was impacted by this through rados.

#22 Updated by Samuel Just over 5 years ago

  • Status changed from New to Can't reproduce

Reopen if we get more information about how these objects came to be the wrong size.

#23 Updated by Laurent GUERBY over 5 years ago

The bug is about the OSDs that shouldn't crash in this case where only one shard in a EC 8+2 setting is corrupted in a specific way, here file with correct data but with sparse zone of zero at the end (probably via a erroneous sparse write). EC should detect this condition and repair the file all by itself.

#24 Updated by Loïc Dachary over 5 years ago

I think the problem is fixed in infernalis via https://github.com/ceph/ceph/pull/5173 . It also is my understanding that backporting the necessary fix to hammer would be difficult and that the risk of regressions associated with it is high. Should someone run into the same problem, the suggested workaround (either recover the object in the same way you did or just remove the file so it is reconstructed) is less disruptive.

Sam or David may comment more if I misunderstood the tradeoff.

#25 Updated by Loïc Dachary over 5 years ago

  • Copied to Backport #14087: FAILED assert(old_size == total_chunk_size) in 0.94.5 / strange file size added

Also available in: Atom PDF