
Bug #23701

oi (object_info_t) size mismatch with real object size when an old write arrives after truncate

Added by Peng Xie almost 6 years ago. Updated almost 6 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Tiering
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
jewel
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rest
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Currently, in our test environment (jewel: cephfs + cache tier + EC pool), we found several OSD coredumps
with the following stack:

#0 0x00007f1de1a3123b in raise () from /lib64/libpthread.so.0
#1 0x0000000001e1b823 in reraise_fatal (signum=6) at /var/ws/yunfeithrottle-mh82-ceph/src/global/signal_handler.cc:71
#2 0x0000000001e1bb36 in handle_fatal_signal (signum=6) at /var/ws/yunfeithrottle-mh82-ceph/src/global/signal_handler.cc:133
#3 <signal handler called>
#4 0x00007f1ddf6071d7 in raise () from /lib64/libc.so.6
#5 0x00007f1ddf6088c8 in abort () from /lib64/libc.so.6
#6 0x0000000001f85141 in ceph::__ceph_assert_fail (assertion=0x22df1c1 "cursor.data_complete", file=0x22dc510 "/var/ws/yunfeithrottle-mh82-ceph/src/osd/ReplicatedPG.cc", line=7053, 
func=0x22e8500 <ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*, ceph::buffer::list::iterator&, OSDOp&, std::shared_ptr<ObjectContext>&, bool)::__PRETTY_FUNCTION__> "int ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*, ceph::buffer::list::iterator&, OSDOp&, ObjectContextRef&, bool)") at /var/ws/yunfeithrottle-mh82-ceph/src/common/assert.cc:78
#7 0x00000000019e8126 in ReplicatedPG::fill_in_copy_get (this=0xec23000, ctx=0x17a2c700, bp=..., osd_op=..., obc=warning: RTTI symbol not found for class 'std::_Sp_counted_deleter<ObjectContext*, SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::Cleanup, std::allocator<int>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_deleter<ObjectContext*, SharedLRU<hobject_t, ObjectContext, hobject_t::ComparatorWithDefault, std::hash<hobject_t> >::Cleanup, std::allocator<int>, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr (count 5, weak 1) 0x197a4000, classic=false)
at /var/ws/yunfeithrottle-mh82-ceph/src/osd/ReplicatedPG.cc:7053 <<<<--- this is the failing assert(cursor.data_complete);

#8 0x00000000019dbb2d in ReplicatedPG::do_osd_ops (this=0xec23000, ctx=0x17a2c700, ops=std::vector of length 2, capacity 2 = {...}) at /var/ws/yunfeithrottle-mh82-ceph/src/osd/ReplicatedPG.cc:5886
#9 0x00000000019e48a9 in ReplicatedPG::prepare_transaction (this=0xec23000, ctx=0x17a2c700) at /var/ws/yunfeithrottle-mh82-ceph/src/osd/ReplicatedPG.cc:6596
#10 0x00000000019c0f41 in ReplicatedPG::execute_ctx (this=0xec23000, ctx=0x17a2c700) at /var/ws/yunfeithrottle-mh82-ceph/src/osd/ReplicatedPG.cc:2938
#11 0x00000000019b84c3 in ReplicatedPG::do_op (this=0xec23000, op=std::shared_ptr (count 5, weak 0) 0x192c7100) at /var/ws/yunfeithrottle-mh82-ceph/src/osd/ReplicatedPG.cc:2138
#12 0x00000000019b23d1 in ReplicatedPG::do_request (this=0xec23000, op=std::shared_ptr (count 5, weak 0) 0x192c7100, handle=...) at /var/ws/yunfeithrottle-mh82-ceph/src/osd/ReplicatedPG.cc:1497
#13 0x00000000017db9fd in OSD::dequeue_op (this=0xddee000, pg=..., op=std::shared_ptr (count 5, weak 0) 0x192c7100, handle=...) at /var/ws/yunfeithrottle-mh82-ceph/src/osd/OSD.cc:8825
....

(gdb) p oi
$1 = (object_info_t &) @0x197a4000: {soid = {oid = {name = "10009991c9a.00000001"}, snap = {val = 18446744073709551614}, hash = 414438447, max = false, nibblewise_key_cache = 4065147777, 
hash_reverse_bits = 4096511256, static POOL_META = -1, static POOL_TEMP_START = -2, pool = 16, nspace = "", key = ""}, version = {version = 4526697, epoch = 17470, __pad = 0}, prior_version = {
version = 4526696, epoch = 17470, __pad = 0}, user_version = 4526697, last_reqid = {name = {_type = 8 '\b', _num = 9563028, static TYPE_MON = 1, static TYPE_MDS = 2, static TYPE_OSD = 4, 
static TYPE_CLIENT = 8, static NEW = -1}, tid = 696357, inc = 0}, size = 4063232, mtime = {tv = {tv_sec = 1523554374, tv_nsec = 504726366}}, local_mtime = {tv = {tv_sec = 1523554374, 
tv_nsec = 504743968}}, flags = object_info_t::FLAG_DIRTY, snaps = std::vector of length 0, capacity 0, truncate_seq = 2, truncate_size = 0, watchers = std::map with 0 elements, 
data_digest = 4294967295, omap_digest = 4294967295}
(gdb) p oi.size
$2 = 4063232
(gdb) p oi.truncate_size
$3 = 0

The core dump was due to a mismatch between the object's oi.size (4063232) and its real size (0) on the backend disk.
After analyzing the details, we found that the mismatch is introduced by the following event sequence in the OSD code path:
step-1. a truncate op comes first, and the object is truncated to length 0
step-2. an old write issued before the truncate (its truncate_seq is smaller than that of step-1) arrives; it finds that the
write offset > oi.size, so it writes nothing. However, the following function still updates the size (a standalone sketch illustrating this appears after the snippet):
           ReplicatedPG::write_update_size_and_usage() {
                 .....
                 if (force_changesize || offset + length > oi.size) {
                         // in our case offset > oi.size and length is 0, so oi.size is falsely updated
                         uint64_t new_size = offset + length;
                         delta_stats.num_bytes -= oi.size;
                         delta_stats.num_bytes += new_size;
                         oi.size = new_size;
                 }
           }
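To make the failure mode concrete outside of the OSD, here is a minimal standalone sketch. FakeObjectInfo and update_size are hypothetical simplified stand-ins for object_info_t and write_update_size_and_usage(), and the guard shown is only one possible way to avoid the bogus size bump, not necessarily what the pull request referenced below does:

// Standalone sketch (hypothetical simplified types) of the size-update bug:
// a truncate to 0 followed by an old write clamped to length 0 still bumps
// oi.size to offset + length unless it is guarded.
#include <cstdint>
#include <iostream>

struct FakeObjectInfo {          // stand-in for object_info_t
  uint64_t size = 0;
  uint64_t truncate_seq = 0;
};

// Mirrors the size logic of write_update_size_and_usage() quoted above.
void update_size(FakeObjectInfo &oi, uint64_t offset, uint64_t length,
                 bool guard_zero_length_past_eof) {
  if (guard_zero_length_past_eof && length == 0 && offset > oi.size)
    return;                      // possible guard: the old write wrote nothing
  if (offset + length > oi.size)
    oi.size = offset + length;   // unguarded path falsely grows oi.size
}

int main() {
  FakeObjectInfo oi;
  oi.truncate_seq = 2;           // step-1: the truncate has already been applied
  oi.size = 0;

  // step-2: old write (truncate_seq 0) at offset 4063232, clamped to length 0
  const uint64_t offset = 4063232, length = 0;

  FakeObjectInfo buggy = oi, guarded = oi;
  update_size(buggy, offset, length, false);
  update_size(guarded, offset, length, true);

  std::cout << "buggy oi.size = " << buggy.size         // 4063232 (mismatch)
            << ", guarded oi.size = " << guarded.size   // 0
            << std::endl;
  return 0;
}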

I propose a fix in the pull request linked in comment #1 below.
My detailed analysis and debug logs are pasted below for reference:

[root@xt3 ceph]# ceph osd map cache_data 10009991c9a.00000001 --cluster xtao
osdmap e17719 pool 'cache_data' (16) object '10009991c9a.00000001' -> pg 16.18b3d42f (16.2f) -> up ([73,2], p73) acting ([73,2], p73)

go to osd.73 pg 16.2f 's backend

[root@xt3 09c6a5ac-3abb-4cee-af53-695345752fc3]# cd current/16.2f_head/
[root@xt3 16.2f_head]# find . |grep 10009991c9a
./DIR_F/DIR_2/DIR_4/DIR_D/10009991c9a.00000001__head_18B3D42F__10
[root@xt3 16.2f_head]# 
[root@xt3 16.2f_head]# 
[root@xt3 16.2f_head]# stat ./DIR_F/DIR_2/DIR_4/DIR_D/10009991c9a.00000001__head_18B3D42F__10
File: ‘./DIR_F/DIR_2/DIR_4/DIR_D/10009991c9a.00000001__head_18B3D42F__10’
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 832h/2098d Inode: 136179398 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2018-04-13 01:48:59.722509861 +0800
Modify: 2018-04-13 01:32:55.514851226 +0800
Change: 2018-04-13 01:32:57.314873008 +0800
Birth: -

[root@xt3 16.2f_head]# attr -q -g "ceph._" ./DIR_F/DIR_2/DIR_4/DIR_D/10009991c9a.00000001__head_18B3D42F__10 > /tmp/oi
[root@xt3 16.2f_head]# attr -q -g "ceph._@1" ./DIR_F/DIR_2/DIR_4/DIR_D/10009991c9a.00000001__head_18B3D42F__10 >> /tmp/oi
[root@xt3 16.2f_head]# ceph-dencoder import /tmp/oi type object_info_t decode dump_json
{
"oid": {
"oid": "10009991c9a.00000001",
"key": "",
"snapid": -2,
"hash": 414438447,
"max": 0,
"pool": 16,
"namespace": "" 
},
"version": "17470'4526697",
"prior_version": "17470'4526696",
"last_reqid": "client.9563028.0:696357",
"user_version": 4526697,
"size": 4063232,             <<<--- oi.size doesn't match truncate_size
"mtime": "2018-04-13 01:32:54.504726",
"local_mtime": "2018-04-13 01:32:54.504743",
"lost": 0,
"flags": 4,
"snaps": [],
"truncate_seq": 2,
"truncate_size": 0,
"data_digest": 4294967295,
"omap_digest": 4294967295,
"watchers": {}
}

...
The fundamental reason for this assert failure is: when the cache-tier OSD tries to flush the object, it first reads the object from
the cache-tier pool backend, and during the read it finds that the persistent backend object is smaller than oi.size suggests.
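To illustrate why that leads to the assert in frame #7 above, here is a simplified, hypothetical single-pass model of the flush read. data_offset and data_complete stand in for the cursor fields visible in the backtrace; the real fill_in_copy_get() is chunked and more involved:

// Simplified model of the cache-tier flush read described above: the cursor
// only becomes "data_complete" once it has read oi.size bytes, but the backend
// file is 0 bytes long, so the final assert fires.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t oi_size   = 4063232;  // what object_info_t claims
  const uint64_t real_size = 0;        // what stat() on the backend file shows

  uint64_t data_offset = 0;            // stand-in for cursor.data_offset
  bool data_complete = false;          // stand-in for cursor.data_complete

  // One pass of the read: ask the backend for the remaining bytes.
  const uint64_t want = oi_size - data_offset;
  const uint64_t got  = (data_offset < real_size)
                            ? std::min(want, real_size - data_offset)
                            : 0;       // EOF: nothing left on disk
  data_offset += got;
  if (data_offset == oi_size)
    data_complete = true;              // never reached: got == 0

  std::cout << "read " << got << " of " << want << " bytes" << std::endl;
  assert(data_complete);               // aborts, like assert(cursor.data_complete)
  return 0;
}

Compiled with assertions enabled, this aborts just like the OSD does in frame #7.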

I added more debug prints in the write and truncate code paths:

ReplicatedPG::do_osd_ops() {
  .........
  case CEPH_OSD_OP_WRITE:
    if (seq && (seq > op.extent.truncate_seq) &&
        (op.extent.offset + op.extent.length > oi.size)) {
      // old write, arrived after trimtrunc
      op.extent.length = (op.extent.offset > oi.size ? 0 : oi.size - op.extent.offset);
      dout(0) << " old truncate_seq " << op.extent.truncate_seq << "  current " << seq
              << " for oi " << oi
              << " op.extent.offset " << op.extent.offset
              << " , adjusting write length to " << op.extent.length << dendl;
      .......
    }

  ......
  case CEPH_OSD_OP_TRUNCATE:
    .......
    // at the end of this TRUNCATE case branch, add the log
    dout(0) << " truncate seq " << op.extent.truncate_seq << "  current " << oi.truncate_seq
            << " op.extent.truncate_size " << op.extent.truncate_size
            << ", truncating oi is " << oi
            << " oi size " << oi.size << dendl;
}
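For concreteness, here is a tiny worked example (a standalone sketch, not OSD code) plugging in the values that appear in the logs below: the old write's offset is 4063232, and per the "dirty s 3801088" field in the second log line, oi.size appears to be 3801088 when that write arrives:

// Worked example of the two steps above (the WRITE clamp and the size bump),
// using the values from the logs below.
#include <cstdint>
#include <iostream>

int main() {
  uint64_t oi_size = 3801088;   // oi.size when the old write arrives
  uint64_t offset  = 4063232;   // op.extent.offset of the old write

  // CEPH_OSD_OP_WRITE clamp for a write older than the current truncate_seq:
  uint64_t length = (offset > oi_size) ? 0 : oi_size - offset;  // -> 0

  // write_update_size_and_usage() size update:
  if (offset + length > oi_size)
    oi_size = offset + length;  // -> 4063232, even though nothing was written

  std::cout << "adjusted write length = " << length
            << ", resulting oi.size = " << oi_size << std::endl;
  return 0;
}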

The output logs are shown below:

2018-04-13 01:32:55.513241 7f1dbb731700 0 osd.73 pg_epoch: 17470 pg[16.2f( v 17470'4526677 (17470'4523614,17470'4526677] local-les=17464 n=1931 ec=430 les/c/f 17464/17464/1875 17463/17463/17463) [73,2] r=0 lpr=17463 crt=17470'4526675 lcod 17470'4526676 mlcod 17470'4526676 active+clean] truncate seq 2 , current 2 op.extent.truncate_size 0, truncating oi is 16:f42bcd18:::10009991c9a.00000001:head(17470'4526665 client.9563028.0:695195 dirty s 0 uv 4526665) oi size 0
<<<--- the truncate comes first; the object is truncated to length 0

2018-04-13 01:32:57.307058 7f1db6727700 0 osd.73 pg_epoch: 17470 pg[16.2f( v 17470'4526696 (17470'4523614,17470'4526696] local-les=17464 n=1929 ec=430 les/c/f 17464/17464/1875 17463/17463/17463) [73,2] r=0 lpr=17463 luod=17470'4526695 crt=17470'4526689 lcod 17470'4526694 mlcod 17470'4526694 active+clean] old truncate_seq 0  current 2 for oi 16:f42bcd18:::10009991c9a.00000001:head(17470'4526696 client.9563028.0:696204 dirty s 3801088 uv 4526696) op.extent.offset 4063232 , adjusting write length to 0
<<<<--- the old write (offset 4063232, which is beyond oi.size), issued before the truncate, falsely updates
<<<<< the oi size to 4063232 in the function ReplicatedPG::write_update_size_and_usage().

2018-04-13 01:32:57.307166 7f1db6727700 0 osd.73 pg_epoch: 17470 pg[16.2f( v 17470'4526696 (17470'4523614,17470'4526696] local-les=17464 n=1929 ec=430 les/c/f 17464/17464/1875 17463/17463/17463) [73,2] r=0 lpr=17463 luod=17470'4526695 crt=17470'4526689 lcod 17470'4526694 mlcod 17470'4526694 active+clean] oi 16:f42bcd18:::10009991c9a.00000001:head(17470'4526696 client.9563028.0:696204 dirty s 3801088 uv 4526696) oi size will be 4063232  

<<<<< this output comes from the debug print I added in the following:

ReplicatedPG::write_update_size_and_usage() {
  .....
  if (force_changesize || offset + length > oi.size) {
    uint64_t new_size = offset + length;
    if (0 == length) {
      dout(0) << "oi " << oi << " oi size will be " << new_size << dendl;
    }
    .....
  }
  .....
}

History

#1 Updated by Peng Xie almost 6 years ago

here is my pull request to fix this problem
https://github.com/ceph/ceph/pull/21408

#2 Updated by Kefu Chai almost 6 years ago

  • Description updated (diff)

#3 Updated by Kefu Chai almost 6 years ago

  • Description updated (diff)

#4 Updated by Kefu Chai almost 6 years ago

  • Target version deleted (v14.0.0)
  • Affected Versions v10.2.9 added
  • Affected Versions deleted (v10.2.11)

I think this issue only exists in jewel.

#5 Updated by Kefu Chai almost 6 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Peng Xie

#6 Updated by Kefu Chai almost 6 years ago

  • Backport set to jewel
