Bug #12200

Updated by David Zafman almost 9 years ago

 
Corruption of an EC pool shard crashes the OSD during a deep scrub. Specifically, one shard file is larger than expected: the copy on osd4 is 51201 bytes, while the other two shards are 51200 bytes:

 <pre> 
 $ find dev -name '*foo*' -ls 
 279010     56 -rw-r--r--     1 dzafman    dzafman       51201 Jul    1 16:22 dev/osd4/current/1.6s1_head/foo__head_7FC1F406__1_ffffffffffffffff_1 
 279011     56 -rw-r--r--     1 dzafman    dzafman       51200 Jul    1 16:12 dev/osd2/current/1.6s2_head/foo__head_7FC1F406__1_ffffffffffffffff_2 
 279009     56 -rw-r--r--     1 dzafman    dzafman       51200 Jul    1 16:12 dev/osd3/current/1.6s0_head/foo__head_7FC1F406__1_ffffffffffffffff_0 
 </pre> 
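The size comparison above can be scripted. The following is a minimal sketch, not part of Ceph: the helper name `find_odd_shards` is hypothetical, and it assumes that the most common size among the shards is the expected one (the OSD itself compares against the size recorded in the object's hash info rather than taking a majority vote). The paths and sizes are copied from the `find` output.

```python
from collections import Counter


def find_odd_shards(sizes):
    """Return the shard paths whose size differs from the most common size.

    `sizes` maps a shard file path to its size in bytes. Treating the most
    common size as "expected" is an assumption made for this sketch.
    """
    if not sizes:
        return []
    expected, _count = Counter(sizes.values()).most_common(1)[0]
    return sorted(path for path, size in sizes.items() if size != expected)


# Sizes copied from the find output in this report.
shard_sizes = {
    "dev/osd4/current/1.6s1_head/foo__head_7FC1F406__1_ffffffffffffffff_1": 51201,
    "dev/osd2/current/1.6s2_head/foo__head_7FC1F406__1_ffffffffffffffff_2": 51200,
    "dev/osd3/current/1.6s0_head/foo__head_7FC1F406__1_ffffffffffffffff_0": 51200,
}

for path in find_odd_shards(shard_sizes):
    print("unexpected size:", path, shard_sizes[path])
```

With only three shards a majority vote is enough to single out the osd4 copy; with a tie between sizes the sketch simply picks one of the tied sizes as the reference and is not meaningful.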

 <pre> 
 2015-07-01 16:22:40.244766 7f3fa17ea700 10 osd.4 26 dequeue_op 0x7f3fb402cfe0 prio 127 cost 0 latency 0.000140 replica scrub(pg: 1.6s1,from:0'0,to:20'1,epoch:26,start:0//0//-1,end:MAX,chunky:1,deep:1,seed:4294967295,version:6) v6 pg pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] 
 2015-07-01 16:22:40.244799 7f3fa17ea700 10 osd.4 pg_epoch: 26 pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] handle_message: replica scrub(pg: 1.6s1,from:0'0,to:20'1,epoch:26,start:0//0//-1,end:MAX,chunky:1,deep:1,seed:4294967295,version:6) v6 
 2015-07-01 16:22:40.244814 7f3fa17ea700    7 osd.4 pg_epoch: 26 pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] replica_scrub 
 2015-07-01 16:22:40.244824 7f3fa17ea700 10 osd.4 pg_epoch: 26 pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map_chunk [0//0//-1,MAX)    seed 4294967295 
 2015-07-01 16:22:40.244835 7f3fa17ea700 10 filestore(/home/dzafman/ceph/src/dev/osd4) collection_list_partial: 1.6s1_head 
 2015-07-01 16:22:40.244843 7f3fa17ea700 20 _collection_list_partial 0//0//-1 32-64 ls.size 0 
 2015-07-01 16:22:40.244937 7f3fa17ea700 20    prefixes 60000000,604F1CF7 
 2015-07-01 16:22:40.244947 7f3fa17ea700 20 filestore(/home/dzafman/ceph/src/dev/osd4) objects: [6//head//1/ffffffffffffffff/1,7fc1f406/foo/head//1/ffffffffffffffff/1] 
 2015-07-01 16:22:40.244957 7f3fa17ea700 10 osd.4 pg_epoch: 26 pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] be_scan_list scanning 1 objects deeply 
 2015-07-01 16:22:40.245000 7f3fa17ea700 10 filestore(/home/dzafman/ceph/src/dev/osd4) stat 1.6s1_head/7fc1f406/foo/head//1/ffffffffffffffff/1 = 0 (size 51201) 
 2015-07-01 16:22:40.245014 7f3fa17ea700 15 filestore(/home/dzafman/ceph/src/dev/osd4) getattrs 1.6s1_head/7fc1f406/foo/head//1/ffffffffffffffff/1 
 2015-07-01 16:22:40.245075 7f3fa17ea700 20 filestore(/home/dzafman/ceph/src/dev/osd4) fgetattrs 36 getting '_' 
 2015-07-01 16:22:40.245085 7f3fa17ea700 20 filestore(/home/dzafman/ceph/src/dev/osd4) fgetattrs 36 getting 'hinfo_key' 
 2015-07-01 16:22:40.245205 7f3fa17ea700 10 filestore(/home/dzafman/ceph/src/dev/osd4) getattrs 1.6s1_head/7fc1f406/foo/head//1/ffffffffffffffff/1 = 0 
 2015-07-01 16:22:40.245214 7f3fa17ea700 15 filestore(/home/dzafman/ceph/src/dev/osd4) read 1.6s1_head/7fc1f406/foo/head//1/ffffffffffffffff/1 0~524288 
 2015-07-01 16:22:40.245280 7f3fa17ea700 10 filestore(/home/dzafman/ceph/src/dev/osd4) FileStore::read 1.6s1_head/7fc1f406/foo/head//1/ffffffffffffffff/1 0~51201/524288 
 2015-07-01 16:22:40.245301 7f3fa17ea700    0 osd.4 pg_epoch: 26 pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] _scan_list    7fc1f406/foo/head//1 got -5 on read, read_error 
 2015-07-01 16:22:40.245323 7f3fa17ea700 10 osd.4 pg_epoch: 26 pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] get_hash_info: Getting attr on 7fc1f406/foo/head//1 
 2015-07-01 16:22:40.245337 7f3fa17ea700 10 osd.4 pg_epoch: 26 pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] get_hash_info: not in cache 7fc1f406/foo/head//1 
 2015-07-01 16:22:40.245379 7f3fa17ea700 10 filestore(/home/dzafman/ceph/src/dev/osd4) stat 1.6s1_head/7fc1f406/foo/head//1/ffffffffffffffff/1 = 0 (size 51201) 
 2015-07-01 16:22:40.245386 7f3fa17ea700 10 osd.4 pg_epoch: 26 pg[1.6s1( v 20'1 (0'0,20'1] local-les=26 n=1 ec=19 les/c 26/26 25/25/25) [3,4,2] r=1 lpr=25 pi=19-24/4 luod=0'0 crt=0'0 lcod 0'0 active] get_hash_info: found on disk, size 51201 
 2015-07-01 16:22:40.245397 7f3fa17ea700 15 filestore(/home/dzafman/ceph/src/dev/osd4) getattr 1.6s1_head/7fc1f406/foo/head//1/ffffffffffffffff/1 'hinfo_key' 
 2015-07-01 16:22:40.245413 7f3fa17ea700 10 filestore(/home/dzafman/ceph/src/dev/osd4) getattr 1.6s1_head/7fc1f406/foo/head//1/ffffffffffffffff/1 'hinfo_key' = 30 
 2015-07-01 16:22:40.261902 7f3fa17ea700 -1 osd/ECBackend.cc: In function 'ECUtil::HashInfoRef ECBackend::get_hash_info(const hobject_t&)' thread 7f3fa17ea700 time 2015-07-01 16:22:40.245421 
 osd/ECBackend.cc: 1482: FAILED assert(hinfo.get_total_chunk_size() == (uint64_t)st.st_size) 

  ceph version 9.0.1-1111-g075fb9f (075fb9f9e07f5a97bda4f8a4a23cba4df5bc826d) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x1a0342b] 
  2: (ECBackend::get_hash_info(hobject_t const&)+0x65c) [0x180040a] 
  3: (ECBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x43c) [0x180255e] 
  4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x444) [0x16e599a] 
  5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x3a9) [0x1590fb5] 
  6: (PG::replica_scrub(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x63d) [0x1591e63] 
  7: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xa32) [0x1624320] 
  8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x47f) [0x1391075] 
 </pre> 
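The failed assertion compares the total chunk size recorded in the object's hash info with the size that `stat` reports for the shard file. The sketch below only illustrates that comparison in Python and shows how a mismatch could be reported as an error code instead of aborting the process. It is not the Ceph implementation or a confirmed fix: `check_shard_size` is a hypothetical helper, and the recorded size of 51200 is an assumption based on the size of the two intact shards, since the report does not show the recorded value.

```python
import errno


def check_shard_size(recorded_total_chunk_size, on_disk_size):
    """Compare the recorded chunk size with the on-disk size (hypothetical helper).

    Returns 0 when the sizes agree and -EIO when they differ, so that a
    caller could flag the object as inconsistent instead of asserting.
    """
    if recorded_total_chunk_size != on_disk_size:
        return -errno.EIO
    return 0


# 51200 is assumed from the intact shards; 51201 is the osd4 size in the log.
print(check_shard_size(51200, 51200))
print(check_shard_size(51200, 51201))
```

Returning `-EIO` mirrors the `got -5 on read, read_error` line in the log, where the read of the oversized shard had already been flagged as an error before `get_hash_info` asserted.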
