Bug #6004
Status: Closed
osdc/ObjectCacher.cc: 738: FAILED assert(bh->length() <= start+(loff_t)length-opos)
Description
-2> 2013-08-15 13:39:58.197531 7fafc20f6780 1 -- 10.214.132.23:0/1893 --> 10.214.133.21:6800/17860 -- osd_op(client.4111.0:13306 10000000004.00000000 [read 272515~63272 [4390@272515]] 0.17c5e6c5 e4) v4 -- ?+0 0x2659480 con 0x22feb00
-1> 2013-08-15 13:39:58.198790 7fafbc323700 1 -- 10.214.132.23:0/1893 <== osd.3 10.214.133.21:6800/17860 13304 ==== osd_op_reply(13305 10000000004.00000000 [read 269475~13149 [4389@480722]] ondisk = 0) v4 ==== 119+0+13149 (2376228507 0 0) 0x2631800 con 0x22feb00
0> 2013-08-15 13:39:58.200283 7fafbc323700 -1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_read_finish(int64_t, sobject_t, loff_t, uint64_t, ceph::bufferlist&, int, bool)' thread 7fafbc323700 time 2013-08-15 13:39:58.198853
osdc/ObjectCacher.cc: 738: FAILED assert(bh->length() <= start+(loff_t)length-opos)

ceph version 0.67-125-g4422f21 (4422f21a6586467a63ce6841552d0f60aa849cf1)
1: (ObjectCacher::bh_read_finish(long, sobject_t, long, unsigned long, ceph::buffer::list&, int, bool)+0x138a) [0x70fe5a]
2: (ObjectCacher::C_ReadFinish::finish(int)+0x6f) [0x71be9f]
3: (Context::complete(int)+0x9) [0x577549]
4: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe4b) [0x72b15b]
5: (Client::ms_dispatch(Message*)+0x8b) [0x576f6b]
6: (DispatchQueue::entry()+0x549) [0x765569]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0x645d4d]
8: (()+0x7e9a) [0x7fafc1a9ee9a]
9: (clone()+0x6d) [0x7fafc003bccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
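For orientation, here is a minimal sketch of the invariant that fires, reconstructed from the assert text and the backtrace; the surrounding loop and the BufferHead shape are assumptions, not the actual Ceph source:

    // Sketch only: bh_read_finish() walks the buffer heads that overlap the
    // OSD reply [start, start+length); a bh positioned at opos must fit
    // entirely within the bytes the reply still has left, otherwise the
    // assert at ObjectCacher.cc:738 trips.
    #include <cassert>
    #include <cstdint>

    typedef int64_t loff_t;  // local stand-in for Ceph's loff_t

    struct BufferHead {
      loff_t start;            // object offset this bh covers
      uint64_t len;            // bytes in this bh
      loff_t length() const { return (loff_t)len; }
    };

    // opos is the current position within the reply, advancing bh by bh.
    void check_bh_fits(const BufferHead *bh, loff_t start, uint64_t length,
                       loff_t opos) {
      // After a racing truncate, the cached bh can be longer than the data
      // the OSD actually returned for it, and this assertion fails.
      assert(bh->length() <= start + (loff_t)length - opos);
    }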
The job was:
ubuntu@teuthology:/a/teuthology-2013-08-15_01:01:02-fs-next-testing-basic-plana/107782$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 68d04c9a036bea12d76fdd51e926aef771e8e041
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      client:
        debug client: 10
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1
  s3tests:
    branch: next
  workunit:
    sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/fsx.sh
teuthology_branch: next
Updated by Sage Weil over 10 years ago
- Status changed from New to 12
looks like a read vs truncate race...
-133> 2013-08-15 13:39:58.052646 7fafc20f6780 1 -- 10.214.132.23:0/1893 --> 10.214.133.21:6800/17860 -- osd_op(client.4111.0:13305 10000000004.00000000 [read 269475~13149 [4389@480722]] 0.17c5e6c5 e4) v4 -- ?+0 0x2659900 con 0x22feb00
-117> 2013-08-15 13:39:58.060214 7fafc20f6780 1 -- 10.214.132.23:0/1893 --> 10.214.133.21:6808/17863 -- client_request(client.4111:8502 setattr size=272515 #10000000004) v1 -- ?+0 0x2720500 con 0x22e6dc0
-85> 2013-08-15 13:39:58.090006 7fafbc323700 10 client.4111 _invalidate_inode_cache 10000000004.head(ref=5 cap_refs={1024=1,2048=0,4096=0,8192=1} open={3=1} mode=100664 size=272515 mtime=2013-08-15 13:39:58.051708 caps=pAsxLsXsxFcb(0=pAsxLsXsxFcb) objectset[10000000004 ts 4390/480722 objects 1 dirty_or_tx 274561] parents=0x2376000 0x2336900) 272515~208207
-84> 2013-08-15 13:39:58.090058 7fafbc323700 10 client.4111 truncate_size 480722 -> 272515
-2> 2013-08-15 13:39:58.197531 7fafc20f6780 1 -- 10.214.132.23:0/1893 --> 10.214.133.21:6800/17860 -- osd_op(client.4111.0:13306 10000000004.00000000 [read 272515~63272 [4390@272515]] 0.17c5e6c5 e4) v4 -- ?+0 0x2659480 con 0x22feb00
-1> 2013-08-15 13:39:58.198790 7fafbc323700 1 -- 10.214.132.23:0/1893 <== osd.3 10.214.133.21:6800/17860 13304 ==== osd_op_reply(13305 10000000004.00000000 [read 269475~13149 [4389@480722]] ondisk = 0) v4 ==== 119+0+13149 (2376228507 0 0) 0x2631800 con 0x22feb00
0> 2013-08-15 13:39:58.200283 7fafbc323700 -1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_read_finish(int64_t, sobject_t, loff_t, uint64_t, ceph::bufferlist&, int, bool)' thread 7fafbc323700 time 2013-08-15 13:39:58.198853
osdc/ObjectCacher.cc: 738: FAILED assert(bh->length() <= start+(loff_t)length-opos)
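To make the race concrete, here is the arithmetic from the log above (the offsets come straight from the log; how the surviving bytes map onto buffer heads at crash time is an assumption):

    // Reconstructing the race from the log: read 269475~13149 was issued
    // while truncate_size was still 480722; the setattr then shrank the
    // file and _invalidate_inode_cache dropped everything from 272515 on,
    // before the read reply arrived.
    #include <cstdint>
    #include <cstdio>

    int main() {
      int64_t read_off = 269475;   // osd_op 13305: read 269475~13149
      int64_t read_len = 13149;    // reply carries all 13149 bytes
      int64_t new_size = 272515;   // setattr size=272515 landed in between

      // The reply covers [269475, 282624), but only [269475, 272515) can
      // still match a cached buffer head after the invalidation:
      int64_t surviving = new_size - read_off;            // 3040 bytes
      int64_t orphaned  = read_off + read_len - new_size; // 10109 bytes
      printf("surviving=%lld orphaned=%lld\n",
             (long long)surviving, (long long)orphaned);
      return 0;
    }

So bh_read_finish() finds more reply data than the (now-trimmed) cache expects, and the assert fires.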
Updated by Sage Weil over 10 years ago
- Status changed from 12 to Fix Under Review
Updated by Zheng Yan over 10 years ago
Sage Weil wrote:
looks like a read vs truncate race...
[...]
Looks like the client releases the Fr cap too early, or the client does not get the Fr cap for readahead.
Updated by Sage Weil over 10 years ago
Zheng Yan wrote:
Sage Weil wrote:
looks like a read vs truncate race...
[...]

Looks like the client releases the Fr cap too early, or the client does not get the Fr cap for readahead.
Crap, I didn't save the logs. But looking at the code we do take an RD cap ref in _read(), so my guess is that this particular read was triggered by readahead (which means we don't wait for it to complete).
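As a sketch of the pattern being described — all names below are illustrative stand-ins, not Ceph's real Client API — a synchronous read pins the file-read (Fr) capability for its whole lifetime, while a readahead is fired and forgotten:

    #include <cstdio>

    struct Inode { int rd_cap_refs = 0; };

    void get_cap_ref(Inode *in) { in->rd_cap_refs++; }  // pin Fr
    void put_cap_ref(Inode *in) { in->rd_cap_refs--; }  // unpin Fr

    void sync_read(Inode *in) {
      get_cap_ref(in);   // _read() takes an RD cap ref...
      // ... issue the OSD read and wait for the reply ...
      put_cap_ref(in);   // ...so a truncate cannot sneak in mid-read.
    }

    void readahead(Inode *in) {
      // A readahead is not waited on; if no Fr ref is held for it (or it
      // is dropped early), a racing truncate can proceed while the read
      // is still in flight.
      // ... issue the OSD read and return immediately ...
      (void)in;
    }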
I'm not sure if librbd would handle this case either (where a truncate happens mid-read), although presumably the consuming VM would never issue such an IO. Either way, I think this fix is correct from the perspective of the ObjectCacher...
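One plausible shape for such a fix — purely a guess at the idea, since the patch itself is not shown in this ticket — is to stop asserting and instead let bh_read_finish() handle only the prefix of a bh that the reply still covers:

    #include <cstdint>

    typedef int64_t loff_t;  // local stand-in for Ceph's loff_t

    struct BufferHead { loff_t start; uint64_t len; };

    // How many bytes of bh does the reply [start, start+length) actually
    // cover, starting at opos? A racing truncate can make this smaller
    // than bh->len, which previously tripped the assert.
    uint64_t covered_bytes(const BufferHead *bh, loff_t start,
                           uint64_t length, loff_t opos) {
      loff_t remaining = start + (loff_t)length - opos;
      if (remaining <= 0)
        return 0;  // reply exhausted; leave the rest of the bh missing
      return (uint64_t)remaining < bh->len ? (uint64_t)remaining : bh->len;
    }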
Updated by Sage Weil over 10 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Sage Weil over 10 years ago
- Status changed from Pending Backport to Resolved