Bug #6207: Found incorrect object contents - Ceph - Ceph

Actions

Copy link

Bug #6207

closed

Found incorrect object contents

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Samuel Just

Category:

OSD

Target version:

% Done:

Source:

Q/A

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

2013-09-01T21:45:47.962 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]: Found incorrect object contents ObjNum: 148 snap: 45 seqnum: 148 prefix: plana3710235-OID: 45 snap 45
2013-09-01T21:45:47.962 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]: , expected ObjNum: 326 snap: 107 seqnum: 326 prefix: plana3710235-OID: 45 snap 107
2013-09-01T21:45:47.962 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  oid 45
2013-09-01T21:45:47.962 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]: Object 45 contents ObjNum: 148 snap: 45 seqnum: 148 prefix: plana3710235-OID: 45 snap 45
2013-09-01T21:45:47.962 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  corrupt
2013-09-01T21:45:47.962 INFO:teuthology.task.rados.rados.0.out:[10.214.131.3]: incorrect buffer at pos 4
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]: ./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7f0facd77700 time 2013-09-01 21:46:00.402859
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]: ./test/osd/RadosModel.h: 1053: FAILED assert(0)
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  ceph version 0.67.2-22-ga708c8a (a708c8ab52e5b1476405a1f817c23b8845fbaab3)
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x1421) [0x414ef1]
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  2: (librados::C_AioComplete::finish(int)+0x1d) [0x7f0fb2b08a4d]
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  3: (Context::complete(int)+0xa) [0x7f0fb2ae94fa]
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  4: (Finisher::finisher_thread_entry()+0x1c0) [0x7f0fb2b8cd10]
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  5: (()+0x7e9a) [0x7f0fb276ae9a]
2013-09-01T21:45:47.963 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  6: (clone()+0x6d) [0x7f0fb1f81ccd]
2013-09-01T21:45:47.964 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-09-01T21:45:47.964 INFO:teuthology.task.rados.rados.0.err:[10.214.131.3]: terminate called after throwing an instance of 'ceph::FailedAssertion'

ubuntu@teuthology:/a/teuthology-2013-09-01_20:00:14-rados-dumpling-testing-basic-plana/16572$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 263cbbcaf605e359a46e30889595d82629f82080
machine_type: plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: dumpling
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject internal delays: 0.002
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon min osdmap epochs: 2
      osd:
        osd map cache size: 1
    fs: ext4
    log-whitelist:
    - slow request
    sha1: a708c8ab52e5b1476405a1f817c23b8845fbaab3
  ceph-deploy:
    branch:
      dev: dumpling
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: a708c8ab52e5b1476405a1f817c23b8845fbaab3
  s3tests:
    branch: master
  workunit:
    sha1: a708c8ab52e5b1476405a1f817c23b8845fbaab3
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    chance_test_map_discontinuity: 0.5
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: dumpling

Actions

Copy link

Updated by Ian Colle over 10 years ago

Assignee set to Samuel Just

Actions

Copy link

Updated by Sage Weil over 10 years ago

full logs here: ubuntu@teuthology:/var/lib/teuthworker/archive/sage-bug-6222-b/20456

Actions

Copy link

Updated by Samuel Just over 10 years ago

The problem offset appears not to have changed since the prior successful read, and kern.log had:

2013-09-04T08:57:47.637749-07:00 plana40 kernel: [16102.849264] device fsid 09558d82-c22f-4b2d-94c2-03dd7b0fc4c2 devid 1 transid 7 /dev/sdd
2013-09-04T08:57:47.657682-07:00 plana40 kernel: [16102.871685] btrfs flagging fs with big metadata feature
2013-09-04T08:57:47.887731-07:00 plana40 kernel: [16103.098433] device fsid e568d361-7658-4543-8122-f60136cd4955 devid 1 transid 7 /dev/sdb
2013-09-04T08:57:47.897670-07:00 plana40 kernel: [16103.112006] btrfs flagging fs with big metadata feature
2013-09-04T08:57:48.127733-07:00 plana40 kernel: [16103.334890] device fsid 6336ab6e-c306-44a5-9379-7ee01160d270 devid 1 transid 7 /dev/sdc
2013-09-04T08:57:48.137675-07:00 plana40 kernel: [16103.348147] btrfs flagging fs with big metadata feature
2013-09-04T08:57:48.922264-07:00 plana40 kernel: [16103.934210]
2013-09-04T08:57:48.922280-07:00 plana40 kernel: [16103.935720] =====================================
2013-09-04T08:57:48.922331-07:00 plana40 kernel: [16103.940439] [ BUG: bad unlock balance detected! ]
2013-09-04T08:57:48.922333-07:00 plana40 kernel: [16103.945159] 3.11.0-rc5-ceph-00074-g263cbbc #1 Not tainted
2013-09-04T08:57:48.922335-07:00 plana40 kernel: [16103.950574] -------------------------------------
2013-09-04T08:57:48.922337-07:00 plana40 kernel: [16103.955294] ceph-osd/29077 is trying to release lock (sb_internal) at:
2013-09-04T08:57:48.922340-07:00 plana40 kernel: [16103.961885] [<ffffffffa012efd8>] btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs]
2013-09-04T08:57:48.922342-07:00 plana40 kernel: [16103.969574] but there are no more locks to release!
2013-09-04T08:57:48.922343-07:00 plana40 kernel: [16103.974469]
2013-09-04T08:57:48.922347-07:00 plana40 kernel: [16103.974469] other info that might help us debug this:
2013-09-04T08:57:48.922350-07:00 plana40 kernel: [16103.981030] no locks held by ceph-osd/29077.
2013-09-04T08:57:48.922367-07:00 plana40 kernel: [16103.985316]
2013-09-04T08:57:48.922369-07:00 plana40 kernel: [16103.985316] stack backtrace:
2013-09-04T08:57:48.922371-07:00 plana40 kernel: [16103.989704] CPU: 1 PID: 29077 Comm: ceph-osd Not tainted 3.11.0-rc5-ceph-00074-g263cbbc #1
2013-09-04T08:57:48.922373-07:00 plana40 kernel: [16103.998002] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
2013-09-04T08:57:48.922375-07:00 plana40 kernel: [16104.005519] ffffffffa012efd8 ffff8802247e1c68 ffffffff81642d85 0000000000000007
2013-09-04T08:57:48.922377-07:00 plana40 kernel: [16104.012989] ffff88000732bf20 ffff8802247e1c98 ffffffff810ab89e 0000000000000050
2013-09-04T08:57:48.922381-07:00 plana40 kernel: [16104.021586] 0000000000000246 ffff88017485edd0 ffffffffa012efd8 ffff8802247e1ce8
2013-09-04T08:57:48.922383-07:00 plana40 kernel: [16104.029059] Call Trace:
2013-09-04T08:57:48.922386-07:00 plana40 kernel: [16104.031535] [<ffffffffa012efd8>] ? btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs]
2013-09-04T08:57:48.922388-07:00 plana40 kernel: [16104.039491] [<ffffffff81642d85>] dump_stack+0x46/0x58
2013-09-04T08:57:48.922390-07:00 plana40 kernel: [16104.044651] [<ffffffff810ab89e>] print_unlock_imbalance_bug+0xfe/0x110
2013-09-04T08:57:48.922393-07:00 plana40 kernel: [16104.051300] [<ffffffffa012efd8>] ? btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs]
2013-09-04T08:57:48.922395-07:00 plana40 kernel: [16104.059253] [<ffffffff810aeafe>] lock_release+0x15e/0x220
2013-09-04T08:57:48.922397-07:00 plana40 kernel: [16104.064771] [<ffffffffa012efff>] btrfs_commit_transaction_async+0x1ef/0x2c0 [btrfs]
2013-09-04T08:57:48.922402-07:00 plana40 kernel: [16104.072552] [<ffffffff810af6f5>] ? trace_hardirqs_on_caller+0x105/0x1d0
2013-09-04T08:57:48.922404-07:00 plana40 kernel: [16104.079289] [<ffffffffa015d907>] btrfs_ioctl_start_sync+0x47/0xc0 [btrfs]
2013-09-04T08:57:48.922406-07:00 plana40 kernel: [16104.086201] [<ffffffffa0162575>] btrfs_ioctl+0xe55/0x1af0 [btrfs]
2013-09-04T08:57:48.922409-07:00 plana40 kernel: [16104.092404] [<ffffffff8164ab2b>] ? _raw_spin_unlock+0x2b/0x40
2013-09-04T08:57:48.922411-07:00 plana40 kernel: [16104.098256] [<ffffffff810b40b5>] ? do_futex+0xa45/0xbb0
2013-09-04T08:57:48.922413-07:00 plana40 kernel: [16104.103590] [<ffffffff8119cf0c>] ? fget_light+0x3c/0x130
2013-09-04T08:57:48.922415-07:00 plana40 kernel: [16104.109008] [<ffffffff811922a6>] do_vfs_ioctl+0x96/0x560
2013-09-04T08:57:48.922418-07:00 plana40 kernel: [16104.114425] [<ffffffff8119cf6e>] ? fget_light+0x9e/0x130
2013-09-04T08:57:48.922420-07:00 plana40 kernel: [16104.119842] [<ffffffff8119cf0c>] ? fget_light+0x3c/0x130
2013-09-04T08:57:48.922422-07:00 plana40 kernel: [16104.125260] [<ffffffff81192801>] SyS_ioctl+0x91/0xb0
2013-09-04T08:57:48.922425-07:00 plana40 kernel: [16104.130331] [<ffffffff813338fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
2013-09-04T08:57:48.922427-07:00 plana40 kernel: [16104.136794] [<ffffffff81653782>] system_call_fastpath+0x16/0x1b
2013-09-04T09:04:22.067668-07:00 plana40 kernel: [16497.521078] BTRFS debug (device sdc): unlinked 1 orphans
2013-09-04T09:08:06.457718-07:00 plana40 kernel: [16722.042874] BTRFS debug (device sdc): unlinked 1 orphans
2013-09-04T09:08:11.702927-07:00 plana40 kernel: [16727.295732] BTRFS debug (device sdc): unlinked 1 orphans
2013-09-04T09:08:17.032895-07:00 plana40 kernel: [16732.628876] BTRFS debug (device sdc): unlinked 1 orphans
2013-09-04T09:08:22.167754-07:00 plana40 kernel: [16737.762392] BTRFS debug (device sdc): unlinked 1 orphans

Actions

http://fpaste.org/39993/79377326/ contains the output on osd.0 during backfill that concerns me if you don't want to go dig through the full logs to start with. I'm quite certain that osd.5, while it's gotten log events, has never seen the full plana046077-199 object.

Actions

Copy link

i think we should implement opportunistic crcs in FileStore.

any time we do a write, we record in an xattr the crc for any complete blocks (4k, 64k, whatever you configure). on any read, we verify crcs that we happen to have.

this would be pretty easy, and very useful for debugging (and/or the paranoid) to identify fs corruption. would need to disable the read-side checks during journal replay since the xattr could be out of sync with file data.

Actions

Copy link

#14