Project

General

Profile

Actions

Bug #6004

closed

osdc/ObjectCacher.cc: 738: FAILED assert(bh->length() <= start+(loff_t)length-opos)

Added by Sage Weil over 10 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

    -2> 2013-08-15 13:39:58.197531 7fafc20f6780  1 -- 10.214.132.23:0/1893 --> 10.214.133.21:6800/17860 -- osd_op(client.4111.0:13306 10000000004.00000000 [read 272515~63272 [4390@272515]] 0.17c5e6c5 e4) v4 -- ?+0 0x2659480 con 0x22feb00
    -1> 2013-08-15 13:39:58.198790 7fafbc323700  1 -- 10.214.132.23:0/1893 <== osd.3 10.214.133.21:6800/17860 13304 ==== osd_op_reply(13305 10000000004.00000000 [read 269475~13149 [4389@480722]] ondisk = 0) v4 ==== 119+0+13149 (2376228507 0 0) 0x2631800 con 0x22feb00
     0> 2013-08-15 13:39:58.200283 7fafbc323700 -1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_read_finish(int64_t, sobject_t, loff_t, uint64_t, ceph::bufferlist&, int, bool)' thread 7fafbc323700 time 2013-08-15 13:39:58.198853
osdc/ObjectCacher.cc: 738: FAILED assert(bh->length() <= start+(loff_t)length-opos)

 ceph version 0.67-125-g4422f21 (4422f21a6586467a63ce6841552d0f60aa849cf1)
 1: (ObjectCacher::bh_read_finish(long, sobject_t, long, unsigned long, ceph::buffer::list&, int, bool)+0x138a) [0x70fe5a]
 2: (ObjectCacher::C_ReadFinish::finish(int)+0x6f) [0x71be9f]
 3: (Context::complete(int)+0x9) [0x577549]
 4: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe4b) [0x72b15b]
 5: (Client::ms_dispatch(Message*)+0x8b) [0x576f6b]
 6: (DispatchQueue::entry()+0x549) [0x765569]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x645d4d]
 8: (()+0x7e9a) [0x7fafc1a9ee9a]
 9: (clone()+0x6d) [0x7fafc003bccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

job was
ubuntu@teuthology:/a/teuthology-2013-08-15_01:01:02-fs-next-testing-basic-plana/107782$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 68d04c9a036bea12d76fdd51e926aef771e8e041
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      client:
        debug client: 10
      mds:
        debug mds: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    - wrongly marked me down
    sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1
  s3tests:
    branch: next
  workunit:
    sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/fsx.sh
teuthology_branch: next
Actions #1

Updated by Sage Weil over 10 years ago

  • Status changed from New to 12

looks like a read vs truncate race...

  -133> 2013-08-15 13:39:58.052646 7fafc20f6780  1 -- 10.214.132.23:0/1893 --> 10.214.133.21:6800/17860 -- osd_op(client.4111.0:13305 10000000004.00000000 [read 269475~13149 [4389@480722]] 0.17c5e6c5 e4) v4 -- ?+0 0x2659900 con 0x22feb00

  -117> 2013-08-15 13:39:58.060214 7fafc20f6780  1 -- 10.214.132.23:0/1893 --> 10.214.133.21:6808/17863 -- client_request(client.4111:8502 setattr size=272515 #10000000004) v1 -- ?+0 0x2720500 con 0x22e6dc0

   -85> 2013-08-15 13:39:58.090006 7fafbc323700 10 client.4111 _invalidate_inode_cache 10000000004.head(ref=5 cap_refs={1024=1,2048=0,4096=0,8192=1} open={3=1} mode=100664 size=272515 mtime=2013-08-15 13:39:58.051708 caps=pAsxLsXsxFcb(0=pAsxLsXsxFcb) objectset[10000000004 ts 4390/480722 objects 1 dirty_or_tx 274561] parents=0x2376000 0x2336900) 272515~208207
   -84> 2013-08-15 13:39:58.090058 7fafbc323700 10 client.4111 truncate_size 480722 -> 272515

    -2> 2013-08-15 13:39:58.197531 7fafc20f6780  1 -- 10.214.132.23:0/1893 --> 10.214.133.21:6800/17860 -- osd_op(client.4111.0:13306 10000000004.00000000 [read 272515~63272 [4390@272515]] 0.17c5e6c5 e4) v4 -- ?+0 0x2659480 con 0x22feb00

    -1> 2013-08-15 13:39:58.198790 7fafbc323700  1 -- 10.214.132.23:0/1893 <== osd.3 10.214.133.21:6800/17860 13304 ==== osd_op_reply(13305 10000000004.00000000 [read 269475~13149 [4389@480722]] ondisk = 0) v4 ==== 119+0+13149 (2376228507 0 0) 0x2631800 con 0x22feb00
     0> 2013-08-15 13:39:58.200283 7fafbc323700 -1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_read_finish(int64_t, sobject_t, loff_t, uint64_t, ceph::bufferlist&, int, bool)' thread 7fafbc323700 time 2013-08-15 13:39:58.198853
osdc/ObjectCacher.cc: 738: FAILED assert(bh->length() <= start+(loff_t)length-opos)

Actions #2

Updated by Sage Weil over 10 years ago

  • Status changed from 12 to Fix Under Review
Actions #3

Updated by Zheng Yan over 10 years ago

Sage Weil wrote:

looks like a read vs truncate race...
[...]

looks like client releases Fr cap too early or client does not get Fr cap for readahead

Actions #4

Updated by Sage Weil over 10 years ago

Zheng Yan wrote:

Sage Weil wrote:

looks like a read vs truncate race...
[...]

looks like client releases Fr cap too early or client does not get Fr cap for readahead

Crap, I didn't save the logs. but looking at the code we do take an RD cap ref in _read(), so my guess is that this particular read was triggered by readahead (which means we don't wait for it to complete).

I'm not sure if librbd would handle this case either (where a truncate happens mid-read), although presumaly the consuming VM would never issue such an IO. Either way, I think this fix is correct from the perspective of the ObjectCacher...

Actions #5

Updated by Zheng Yan over 10 years ago

the fix looks good

Actions #6

Updated by Sage Weil over 10 years ago

  • Assignee set to Josh Durgin
Actions #7

Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by Sage Weil over 10 years ago

  • Status changed from Pending Backport to Resolved
Actions #9

Updated by Greg Farnum almost 8 years ago

  • Component(FS) Client added
Actions

Also available in: Atom PDF