Bug #807: unconnected_watchers assert fail

Added by Josh Durgin about 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: Normal
Assignee: Sage Weil
Category: OSD
Target version: v0.25
% Done: 0%

Description

cosd crashed when running qemu off an rbd image called kvm213_root:

2011-02-15 14:38:11.760298 7fbd401de710 osd0 10 dequeue_op osd_op(client4398.0:9 kvm213_root.rbd [watch 1~0] 3.6045) v1 pg pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded], 0 more pending
2011-02-15 14:38:11.760344 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] do_op osd_op(client4398.0:9 kvm213_root.rbd [watch 1~0] 3.6045) v1
2011-02-15 14:38:11.760426 7fbd401de710 filestore(/more/cephdata/dev/osd0) getattr /more/cephdata/dev/osd0/current/3.45_head/kvm213_root.rbd_head '_' = 222
2011-02-15 14:38:11.760453 7fbd401de710 filestore(/more/cephdata/dev/osd0) getattr /more/cephdata/dev/osd0/current/3.45_head/kvm213_root.rbd_head 'snapset' = 26
2011-02-15 14:38:11.760475 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] get_snapset_context kvm213_root.rbd 0 -> 1
2011-02-15 14:38:11.760507 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] get_object_context kvm213_root.rbd/head read kvm213_root.rbd/head(10'18 client4398.0:3 wrlock_by=unknown0.0:0)
2011-02-15 14:38:11.760522 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] find_object_context kvm213_root.rbd @head
2011-02-15 14:38:11.760537 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] do_op: object kvm213_root.rbd/head has oi of kvm213_root.rbd/head(10'18 client4398.0:3 wrlock_by=unknown0.0:0)
2011-02-15 14:38:11.760550 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] do_op mode is idle(wr=0)
2011-02-15 14:38:11.760570 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] do_op mode now rmw(wr=0)
2011-02-15 14:38:11.760591 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] do_op kvm213_root.rbd/head [watch 1~0] ov 10'18 av 10'19 snapc 0=[] snapset 0=[]:[]+head
2011-02-15 14:38:11.760604 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded]  taking ondisk_read_lock
2011-02-15 14:38:11.760621 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] do_osd_op kvm213_root.rbd/head [watch 1~0]
2011-02-15 14:38:11.760634 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] do_osd_op  watch 1~0
2011-02-15 14:38:11.760648 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] make_writeable kvm213_root.rbd/head snapset=0=[]:[]+head  snapc=0=[]
2011-02-15 14:38:11.760661 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] make_writeable kvm213_root.rbd/head done, snapset=0=[]:[]+head
2011-02-15 14:38:11.760675 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] oi.user_version=8'3 is_modify=1
2011-02-15 14:38:11.760688 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] watch: ctx->obc=0x3a886c0 cookie=1 oi.version=18 ctx->at_version=10'19
2011-02-15 14:38:11.760700 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded] watch: oi.user_version=3
2011-02-15 14:38:11.760717 7fbd401de710 osd0 10 pg[3.45( v 10'18 (0'0,10'18] n=8 ec=2 les=6 5/5/5) [0] r=0 mlcod 10'18 active+clean+degraded]  removed watch watch(cookie 1 30s) by client4398
2011-02-15 14:38:11.761217 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&, ceph::bufferlist&)', in thread 7fbd401de710
2011-02-15 14:38:11.761227 osd/ReplicatedPG.cc:1619: FAILED assert(obc->unconnected_watchers.count(entity))
2011-02-15 14:38:11.761239 ceph version 0.25~rc (commit:0c97d79056ba982f571ef8e720c9d488e3982f81)
2011-02-15 14:38:11.781235 1: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&, ceph::buffer::list&)+0xa67f) [0x4b672f]
2011-02-15 14:38:11.781266 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x6f) [0x4b6bdf]
2011-02-15 14:38:11.781277 3: (ReplicatedPG::do_op(MOSDOp*)+0xa55) [0x4b8385]
2011-02-15 14:38:11.781286 4: (OSD::dequeue_op(PG*)+0x34d) [0x4ff51d]
2011-02-15 14:38:11.781295 5: (ThreadPool::worker()+0x5e4) [0x609264]
2011-02-15 14:38:11.781304 6: (ThreadPool::WorkThread::entry()+0xd) [0x5229cd]
2011-02-15 14:38:11.781313 7: (Thread::_entry_func(void*)+0xa) [0x4823ea]
2011-02-15 14:38:11.781328 8: (()+0x68ba) [0x7fbd4d1538ba]
2011-02-15 14:38:11.781337 9: (clone()+0x6d) [0x7fbd4bdde02d]
2011-02-15 14:38:11.781344 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2011-02-15 14:38:11.781350 2011-02-15 14:38:11.781490 *** Caught signal (Aborted) **
2011-02-15 14:38:11.781500 in thread 7fbd401de710
2011-02-15 14:38:11.782226 ceph version 0.25~rc (commit:0c97d79056ba982f571ef8e720c9d488e3982f81)
2011-02-15 14:38:11.782237 1: /usr/bin/cosd() [0x61efcc]
2011-02-15 14:38:11.782247 2: (()+0xef60) [0x7fbd4d15bf60]
2011-02-15 14:38:11.782256 3: (gsignal()+0x35) [0x7fbd4bd41165]
2011-02-15 14:38:11.782264 4: (abort()+0x180) [0x7fbd4bd43f70]
2011-02-15 14:38:11.782274 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fbd4c5d4dc5]
2011-02-15 14:38:11.782283 6: (()+0xcb166) [0x7fbd4c5d3166]
2011-02-15 14:38:11.782297 7: (()+0xcb193) [0x7fbd4c5d3193]
2011-02-15 14:38:11.782306 8: (()+0xcb28e) [0x7fbd4c5d328e]
2011-02-15 14:38:11.782317 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x2f2) [0x607bf2]
2011-02-15 14:38:11.782331 a: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&, ceph::buffer::list&)+0xa67f) [0x4b672f]
2011-02-15 14:38:11.782340 b: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x6f) [0x4b6bdf]
2011-02-15 14:38:11.782349 c: (ReplicatedPG::do_op(MOSDOp*)+0xa55) [0x4b8385]
2011-02-15 14:38:11.782359 d: (OSD::dequeue_op(PG*)+0x34d) [0x4ff51d]
2011-02-15 14:38:11.782368 e: (ThreadPool::worker()+0x5e4) [0x609264]
2011-02-15 14:38:11.782376 f: (ThreadPool::WorkThread::entry()+0xd) [0x5229cd]
2011-02-15 14:38:11.782385 10: (Thread::_entry_func(void*)+0xa) [0x4823ea]
2011-02-15 14:38:11.782394 11: (()+0x68ba) [0x7fbd4d1538ba]
2011-02-15 14:38:11.782403 12: (clone()+0x6d) [0x7fbd4bdde02d]

The full log is at pudgy:/more/cephdata/out/osd.0_qemu_watch

#1

Updated by Josh Durgin about 13 years ago

Here's roughly what I did that led to this (I'm not sure exactly when cosd crashed); a sketch of the client-side watch involved follows the list:
- imported two images using rbd tool
- tried to boot vm with one image as the root disk and one as /home
- killed qemu, since the VM wasn't booting
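
For context, here is a rough sketch of that client-side watch (a hedged illustration, not qemu's actual code: it uses the legacy librados C++ watch API, the pool name is an assumption, and error handling is elided). An rbd client watches the image header object, here kvm213_root.rbd, so killing qemu means unwatch() never runs and the OSD is left holding a registered-but-disconnected watcher:

// Hedged illustration only: register a watch on the rbd image header
// object the way a client such as qemu's rbd driver would.
#include <rados/librados.hpp>
#include <cstdint>

struct HeaderWatcher : public librados::WatchCtx {
  void notify(uint8_t opcode, uint64_t ver, librados::bufferlist& bl) {
    // header changed (e.g. a snapshot was taken); a real client
    // would re-read the header here
  }
};

int main() {
  librados::Rados rados;
  rados.init(NULL);               // default client name (assumption)
  rados.conf_read_file(NULL);
  rados.connect();

  librados::IoCtx io;
  rados.ioctx_create("rbd", io);  // pool name assumed

  HeaderWatcher watcher;
  uint64_t cookie;
  io.watch("kvm213_root.rbd", 0, &cookie, &watcher);  // register the watch

  // ... VM runs; a kill -9 here means the unwatch() below never
  // happens, leaving a watcher the OSD must treat as unconnected ...

  io.unwatch("kvm213_root.rbd", cookie);              // clean exit path
  rados.shutdown();
  return 0;
}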

#2

Updated by Sage Weil about 13 years ago

  • Category set to OSD
  • Status changed from New to Resolved
  • Assignee set to Sage Weil
  • Target version set to v0.25

The (violated) invariant is that any item in oi.watchers must be either in obc->watchers or obc->unconnected_watchers.
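
Spelled out as code (hypothetical names standing in for the real ObjectContext/object_info_t fields, not the actual OSD structures):

// Sketch of the invariant only; watcher entities are simplified to strings.
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

struct WatchInfo { uint64_t cookie; uint32_t timeout_seconds; };

struct ObjectContextSketch {
  std::map<std::string, WatchInfo> oi_watchers;  // persisted with the object (oi.watchers)
  std::set<std::string> watchers;                // in-memory: watchers with a live session
  std::set<std::string> unconnected_watchers;    // in-memory: watchers whose session dropped

  // Every persisted watcher must appear in one of the two in-memory
  // sets; the crash above is the unconnected side of this check failing.
  void check_invariant() const {
    for (const auto& w : oi_watchers)
      assert(watchers.count(w.first) ||
             unconnected_watchers.count(w.first));
  }
};

In the log above, the op handler had just removed client4398's watch and, per the assert text, expected that entity to be in unconnected_watchers; it was in neither set, so osd/ReplicatedPG.cc:1619 fired.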

The get_object_context loader wasn't populating the unconnected_watchers list properly! Fixed by 3b03c5a.
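
In sketch form (continuing the hypothetical ObjectContextSketch above; the real change is in commit 3b03c5a), the fix amounts to seeding unconnected_watchers from the persisted watcher list when a context is loaded, since no client sessions are attached yet at that point:

// On load, every watcher recorded in oi.watchers starts out
// unconnected until a client session re-registers it. This is the
// seeding step the loader was missing.
void on_load(ObjectContextSketch& obc) {
  for (const auto& w : obc.oi_watchers)
    obc.unconnected_watchers.insert(w.first);  // previously skipped
  obc.check_invariant();  // now holds before any client reconnects
}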
