Bug #2508


osdc/ObjectCacher.cc:761: void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int): Assertion `ob->last_commit_tid < tid' failed.

Added by Simon Frerichs almost 12 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Category:
OSD
% Done:

0%

Source:
Q/A

Description

Hi,

we've been seeing random KVM VPS crashes with the following error:

kvm: osdc/ObjectCacher.cc:761: void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int): Assertion `ob->last_commit_tid < tid' failed.

Seems to be similar to #1490, which is closed.

Ceph Version: ceph version 0.47.2-4-ge868b44 (e868b44b3959a71c731f4ec9ff9773dead6dfcb5)
KVM/QEMU Version: QEMU emulator version 1.0 (qemu-kvm-1.0+dfsg-11, Debian), Copyright (c) 2003-2008 Fabrice Bellard
Kernel: 3.2.0-2-amd64 (Debian Wheezy Kernel)

Actions #1

Updated by Josh Durgin almost 12 years ago

  • Priority changed from Normal to High

Hi Simon,

If this is at all reproducible, could you try setting osd_min_pg_log_entries higher on all your osds, say 5000, and see if it still happens? We might want to raise the default from 1000.

If that doesn't fix it, we'd probably need logs of it happening on both the client and osd. On the client side, 'debug ms = 1' and 'debug objecter = 10'. On the osd side (unfortunately there's no way to know which osd is required, so it'd be needed on all of them) 'debug ms = 1' and 'debug osd = 20'.
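For reference, those log settings correspond to a ceph.conf fragment along these lines (a sketch; it assumes the standard `[client]`/`[osd]` section layout):

```ini
; client side
[client]
    debug ms = 1
    debug objecter = 10

; osd side (needed on all of them)
[osd]
    debug ms = 1
    debug osd = 20
```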

Thanks!
Josh

Actions #2

Updated by Simon Frerichs almost 12 years ago

Hi Josh,

I've increased osd_min_pg_log_entries to 5000. Let's see if it fixes the problem.
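For anyone following along, that change maps to a ceph.conf fragment like this (a sketch, assuming standard config syntax):

```ini
[osd]
    osd min pg log entries = 5000
```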

Simon

Actions #3

Updated by Sage Weil almost 12 years ago

  • Subject changed from Random KVM Crash - kvm: osdc/ObjectCacher.cc:761: void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int): Assertion `ob->last_commit_tid < tid' failed. to osdc/ObjectCacher.cc:761: void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int): Assertion `ob->last_commit_tid < tid' failed.
  • Category set to msgr
  • Status changed from New to 12
  • Priority changed from High to Urgent

I just reproduced this several times on master using socket failure injection. This is probably socket errors breaking the messenger's ordering constraints.

Good news is it appears to be easy to reproduce, and probably not related to the OSD's dup op detection window!

Actions #4

Updated by Sage Weil almost 12 years ago

  • Target version set to v0.50
Actions #5

Updated by Sage Weil almost 12 years ago

  • Category changed from msgr to OSD
  • Assignee set to Samuel Just

Easy to reproduce with:


interactive-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      client:
        debug ms: 20
        debug objecter: 20
        debug objectcacher: 20
      osd:
        debug ms: 1
        debug osd: 20
        debug filestore: 20
    fs: btrfs
    log-whitelist:
    - slow request
    branch: master
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/blogbench.sh

Actions #6

Updated by Samuel Just almost 12 years ago

  • Status changed from 12 to Resolved

645ac87b84ae348483c7d52b1968af5f44378dea

Actions #7

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from Resolved to Closed
  • Source changed from Community (user) to Q/A

Recent logs: ubuntu@teuthology: /a/teuthology-2012-09-24_19:00:29-regression-master-testing-gcov/29465

2012-09-25T19:46:18.284 INFO:teuthology.orchestra.run.err:osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int)' thread 7fd64d7fa700 time 2012-09-25 19:46:18.262015
2012-09-25T19:46:18.284 INFO:teuthology.orchestra.run.err:osdc/ObjectCacher.cc: 764: FAILED assert(ob->last_commit_tid < tid)
2012-09-25T19:46:18.298 INFO:teuthology.orchestra.run.err: ceph version 0.51-702-g3ca3cd4 (commit:3ca3cd4c6cd6b96ed4b207196ea3a3f3b95bc1fe)
2012-09-25T19:46:18.299 INFO:teuthology.orchestra.run.err: 1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned long, unsigned long, int)+0xfeb) [0x7fd658c21cbb]
2012-09-25T19:46:18.299 INFO:teuthology.orchestra.run.err: 2: (ObjectCacher::C_WriteCommit::finish(int)+0x9b) [0x7fd658c2c92b]
2012-09-25T19:46:18.299 INFO:teuthology.orchestra.run.err: 3: (Context::complete(int)+0x12) [0x7fd658bc9842]
2012-09-25T19:46:18.299 INFO:teuthology.orchestra.run.err: 4: (librbd::C_Request::finish(int)+0xa1) [0x7fd658c08981]
2012-09-25T19:46:18.299 INFO:teuthology.orchestra.run.err: 5: (Context::complete(int)+0x12) [0x7fd658bc9842]
2012-09-25T19:46:18.299 INFO:teuthology.orchestra.run.err: 6: (librbd::rados_req_cb(void*, void*)+0x5c) [0x7fd658be243c]
2012-09-25T19:46:18.299 INFO:teuthology.orchestra.run.err: 7: (librados::C_AioSafe::finish(int)+0x20) [0x7fd657d73970]
2012-09-25T19:46:18.300 INFO:teuthology.orchestra.run.err: 8: (Finisher::finisher_thread_entry()+0x340) [0x7fd657df1ae0]
2012-09-25T19:46:18.300 INFO:teuthology.orchestra.run.err: 9: (Finisher::FinisherThread::entry()+0x15) [0x7fd657d5cba5]
2012-09-25T19:46:18.300 INFO:teuthology.orchestra.run.err: 10: (Thread::_entry_func(void*)+0x12) [0x7fd657f70b62]
2012-09-25T19:46:18.300 INFO:teuthology.orchestra.run.err: 11: (()+0x7e9a) [0x7fd657360e9a]
2012-09-25T19:46:18.300 INFO:teuthology.orchestra.run.err: 12: (clone()+0x6d) [0x7fd6576684bd]
2012-09-25T19:46:18.300 INFO:teuthology.orchestra.run.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2012-09-25T19:46:18.301 INFO:teuthology.orchestra.run.err:terminate called after throwing an instance of 'ceph::FailedAssertion'
ubuntu@teuthology:/a/teuthology-2012-09-24_19:00:29-regression-master-testing-gcov/29465$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 1c17c1aacab817f4621baeee15707ee992c6410a
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
      global:
        ms inject socket failures: 5000
    coverage: true
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 3ca3cd4c6cd6b96ed4b207196ea3a3f3b95bc1fe
  s3tests:
    branch: master
  workunit:
    sha1: 3ca3cd4c6cd6b96ed4b207196ea3a3f3b95bc1fe
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana19.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC0IExVUuQEnwBzUbN6+jgEb/qLLsSDFLAe0OPv+R3Q3uUZn+4QjB+FQ3sMMEwaGpEhUfbpWn3xZbCmi4NngTsgPDjImYJPMeMecaxvVXqAJt4IM6gdBN0415lrKXtbXaGBJCmeFFDB+xN+JN6Qhk8DWN62DnJB8MS5vKn9u0S3HvtGzY/QrnutT3AX9I4097isbTepLFRC4n2CoAC9srXaxAprFgLgOIYHm386B3W7yK3yfIImWofvZxYpPCsr/7ws8DSgdRX9eDucGQfY2WcYKCEIgZclGPBhExm7q/ahHUqYKrzWY3RD93AuXJVgOJk5Yp8C1Ryx8vyJkAxBW4nb
  ubuntu@plana35.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtCEUU7ZjJxMBOYq9dlPZppwlahU15OIq1a/xqI0YIpAF7ZLf1eis04s6Kzvl45ooKP2J6DW598Lr9zTtFb6j9RoAkgIbIOkWZmLQjeZesi//7B8uXBw7niNzU1iBI83K8dPL2cwTsgwXXtoaD03pW9HCv4ADhH12/CsfoqNPfql7ze4JlI07seHmUqkBuBFxLVgdZDXb+Cjdg2sQnMS5qZo0OGBp2yfBla01Pk8V1sHGg39miF0EDJZyIOL2ziZOcm1f02WVkUAD8MMOtfQbf33DT/ya4Th20YtRMQxNLkUYG6rxcjutzILFKDr7ugbvOwtBhVS7qplqLWAwHdi0p
  ubuntu@plana36.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCe7CpJbnd7W2/n42TTTjDArnVkyZfbRANfmkdgfDM+6AYg6qd9wUhes6LP++eMvhuM96Sz5W4380o8OME0cguG1LkkADbm8pQbPAPZwF1Fj28YxgZKpc2PTPsF+sjOujC+AaXaQ82ffSkLL0oElKZgAiFEGCytSdUNFHZxjztDIOoWlt7kylQCy4sJCEbND8JFwFfeGyyePvMl3CNdbnR7H5GuyIx70iglLBO/XFwArjeOUZ/FboRZWOBivpZQf9IMy8k2rQetzxTyugd7cTVdq1G5N5NeHpbQfv286G2oDaZj1HT252jDF04UP083zMxH1W9gmOoUKzIhl+iXaNLZ
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 20000
ubuntu@teuthology:/a/teuthology-2012-09-24_19:00:29-regression-master-testing-gcov/29465$ cat summary.yaml 
ceph-sha1: 3ca3cd4c6cd6b96ed4b207196ea3a3f3b95bc1fe
client.0-kernel-sha1: 1c17c1aacab817f4621baeee15707ee992c6410a
description: collection:rbd-thrash clusters:6-osd-3-machine.yaml fs:xfs.yaml msgr-failures:few.yaml
  thrashers:default.yaml workloads:rbd_fsx_cache_writeback.yaml
duration: 3020.0398390293121
failure_reason: 'Command crashed: ''CEPH_CONF=/tmp/cephtest/ceph.conf LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib
  /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage
  /tmp/cephtest/binary/usr/local/bin/test_librbd_fsx -d -W -R -p 100 -P /tmp/cephtest/archive
  -t 1 -l 1073741824 -S 0 -N 20000 pool_client.0 image_client.0'''
flavor: gcov
mds.a-kernel-sha1: 1c17c1aacab817f4621baeee15707ee992c6410a
mon.a-kernel-sha1: 1c17c1aacab817f4621baeee15707ee992c6410a
owner: scheduled_teuthology@teuthology
success: false
Actions #8

Updated by Sage Weil over 11 years ago

  • Status changed from Closed to In Progress
  • Target version changed from v0.50 to v0.54a
Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to Resolved

False alarm; this is librbd with caching enabled, so the tids are generated artificially by LibrbdWriteback, not by the Objecter. This is a different bug.

Actions #10

Updated by Sage Weil over 11 years ago

Sage Weil wrote:

False alarm; this is librbd with caching enabled, so the tids are generated artificially by LibrbdWriteback, not by the Objecter. This is a different bug.

FWIW, the other bug was in ObjectCacher: not all waiters on rx were woken up when we learn an object doesn't exist. This could lead to op reordering when a later read succeeds immediately but the old one is still waiting on actual reads.
