Bug #1974
osd: radosmodel crash on thrashing
0%
Description
2012-01-23T15:04:11.650 INFO:teuthology.task.rados.rados.0.out:waiting_on = 2 2012-01-23T15:04:11.650 INFO:teuthology.task.rados.rados.0.out:Writing sepia548599-225 from 0 to 56 tid 1 ranges are [0~56,755983~740579] 2012-01-23T15:04:12.176 INFO:teuthology.task.rados.rados.0.out:Writing sepia548599-225 from 755983 to 1496562 tid 2 ranges are [0~56,755983~740579] 2012-01-23T15:04:13.689 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia548599-417 2012-01-23T15:04:13.838 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)', in thread '7fe0007f6700' 2012-01-23T15:04:13.838 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 524: FAILED assert(!old_value.deleted()) 2012-01-23T15:04:13.850 INFO:teuthology.task.thrashosds.thrasher:in_osds: [4, 3, 1, 0, 2] out_osds: [5] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4] 2012-01-23T15:04:13.851 INFO:teuthology.task.thrashosds.ceph_manager:waiting till clean 2012-01-23T15:04:13.851 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /t mp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json' 2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: ceph version 0.40-213-g7ce544e (commit:7ce544e640d45e901ef67e8268c963c958a66eff) 2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: 1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x5a6) [0x410ab6] 2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: 2: (TestOp::finish(TestOp::CallbackInfo*)+0x12) [0x41c5f2] 2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: 3: (read_callback(void*, void*)+0x16) [0x41c616] 2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: 4: (librados::RadosClient::C_aio_Ack::finish(int)+0xf4) [0x7fe0030a9e94] 2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 5: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1415) [0x7fe0030c3f65] 2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 6: (librados::RadosClient::_dispatch(Message*)+0x124) [0x7fe0030936a4] 2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 7: (librados::RadosClient::ms_dispatch(Message*)+0x43) [0x7fe003093733] 2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 8: (SimpleMessenger::dispatch_entry()+0xa1a) [0x7fe003192a9a] 2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 9: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7fe0030a1d9c] 2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: 10: (Thread::_entry_func(void*)+0x12) [0x7fe003209e02] 2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: 11: (()+0x7971) [0x7fe002deb971] 2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: 12: (clone()+0x6d) [0x7fe00262b92d] 2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: ceph version 0.40-213-g7ce544e (commit:7ce544e640d45e901ef67e8268c963c958a66eff) 2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: 1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x5a6) [0x410ab6] 2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 2: (TestOp::finish(TestOp::CallbackInfo*)+0x12) [0x41c5f2] 2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 3: (read_callback(void*, void*)+0x16) [0x41c616] 2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 4: (librados::RadosClient::C_aio_Ack::finish(int)+0xf4) [0x7fe0030a9e94] 2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 5: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1415) [0x7fe0030c3f65] 2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 6: (librados::RadosClient::_dispatch(Message*)+0x124) [0x7fe0030936a4] 2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 7: (librados::RadosClient::ms_dispatch(Message*)+0x43) [0x7fe003093733] 2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 8: (SimpleMessenger::dispatch_entry()+0xa1a) [0x7fe003192a9a] 2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 9: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7fe0030a1d9c] 2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 10: (Thread::_entry_func(void*)+0x12) [0x7fe003209e02] 2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 11: (()+0x7971) [0x7fe002deb971] 2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 12: (clone()+0x6d) [0x7fe00262b92d] 2012-01-23T15:04:13.859 INFO:teuthology.task.rados.rados.0.err:terminate called after throwing an instance of 'ceph::FailedAssertion'
/var/lib/teuthworker/archive/nightly_coverage_2012-01-23-b/8774
Associated revisions
osd: flush on activate
PG::activate() can make lots of changes, most notably clean_up_local()
which deletes lots of local objects. Those changes need to be flushed
to the fs before we start servicing requests or else we risk processing a
client read on those objects.
Fixes: #1974
Signed-off-by: Sage Weil <sage@newdream.net>
History
#1 Updated by Sage Weil almost 12 years ago
(gdb) bt #0 0x00007fe91b12aba5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007fe91b12e6b0 in abort () at abort.c:92 #2 0x00007fe91b74b6bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6 #3 0x00007fe91b749906 in ?? () from /usr/lib/libstdc++.so.6 #4 0x00007fe91b749933 in std::terminate() () from /usr/lib/libstdc++.so.6 #5 0x00007fe91b749a3e in __cxa_throw () from /usr/lib/libstdc++.so.6 #6 0x00007fe91bf5ff7d in ceph::__ceph_assert_fail (assertion=0x446dad "0", file=0x446d1b "./test/osd/RadosModel.h", line=368, func=0x4474c0 "virtual void WriteOp::_finish(TestOp::CallbackInfo*)") at common/assert.cc:70 #7 0x000000000042de6e in WriteOp::_finish (this=0x7fe910088a00, info=0x7fe9101f8fa0) at ./test/osd/RadosModel.h:368 #8 0x0000000000446a7c in TestOp::finish (this=0x7fe910088a00, info=0x7fe9101f8fa0) at test/osd/RadosModel.cc:20 #9 0x0000000000446aeb in write_callback (comp=0x7fe9100884d0, arg=0x7fe9100035f0) at test/osd/RadosModel.cc:34 #10 0x00007fe91be5c034 in librados::RadosClient::C_aio_Ack::finish (this=0x7fe9100062c0, r=0) at librados.cc:657 #11 0x00007fe91be89859 in Objecter::handle_osd_op_reply (this=0x1581060, m=0x16e5c00) at osdc/Objecter.cc:1241 #12 0x00007fe91be42d82 in librados::RadosClient::_dispatch (this=0x157cbf0, m=0x16e5c00) at librados.cc:1069 #13 0x00007fe91be42b82 in librados::RadosClient::ms_dispatch (this=0x157cbf0, m=0x16e5c00) at librados.cc:1039 #14 0x00007fe91bf83287 in Messenger::ms_deliver_dispatch (this=0x15801e0, m=0x16e5c00) at msg/Messenger.h:102 #15 0x00007fe91bf6cb64 in SimpleMessenger::dispatch_entry (this=0x15801e0) at msg/SimpleMessenger.cc:364 #16 0x00007fe91be559b6 in SimpleMessenger::DispatchThread::entry (this=0x1580668) at msg/SimpleMessenger.h:545 #17 0x00007fe91bfd5899 in Thread::_entry_func (arg=0x1580668) at common/Thread.cc:41 #18 0x00007fe91b99d971 in start_thread (arg=<value optimized out>) at pthread_create.c:304 #19 0x00007fe91b1dd92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #20 0x0000000000000000 in ?? () (gdb) f 7 #7 0x000000000042de6e in WriteOp::_finish (this=0x7fe910088a00, info=0x7fe9101f8fa0) at ./test/osd/RadosModel.h:368 368 ./test/osd/RadosModel.h: No such file or directory. in ./test/osd/RadosModel.h
this core is still sitting in /tmp/cephtest on sepia63 for now
#2 Updated by Sage Weil almost 12 years ago
another one:
INFO:teuthology.task.thrashosds.thrasher:in_osds: [1, 2, 4, 0] out_osds: [5, 3] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4] INFO:teuthology.task.thrashosds.thrasher:choose_action: min_in 2 min_out 0 min_live 2 min_dead 0 INFO:teuthology.task.thrashosds.thrasher:Adding osd 5 INFO:teuthology.task.rados.rados.0.out:946: Reading 292 INFO:teuthology.task.rados.rados.0.out:947: Reading 311 INFO:teuthology.task.rados.rados.0.out:948: Reading 393 INFO:teuthology.task.rados.rados.0.out:949: Writing 262 current snap is 0 INFO:teuthology.task.rados.rados.0.out:waiting_on = 3 INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-262 from 0 to 57 tid 1 ranges are [0~57,444394~685464,1586930~57464] INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-262 from 444394 to 1129858 tid 2 ranges are [0~57,444394~685464,1586930~57464] INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-262 from 1586930 to 1644394 tid 3 ranges are [0~57,444394~685464,1586930~57464] INFO:teuthology.task.rados.rados.0.out:950: Writing 269 current snap is 0 INFO:teuthology.task.rados.rados.0.out:waiting_on = 2 INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-269 from 0 to 57 tid 1 ranges are [0~57,540647~657268] INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fe917400700 time 2012-01-28 20:07:51.141077 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 524: FAILED assert(!old_value.deleted()) INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-269 from 540647 to 1197915 tid 2 ranges are [0~57,540647~657268] INFO:teuthology.task.rados.rados.0.err: ceph version 0.41-14-g5e16974 (commit:5e16974c291e78f5a7cdc3c4f680e7817cff3cca) INFO:teuthology.task.rados.rados.0.err: 1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x387) [0x40ad77] INFO:teuthology.task.rados.rados.0.err: 2: (librados::RadosClient::C_aio_Ack::finish(int)+0xa4) [0x7fe919ca2d64] INFO:teuthology.task.rados.rados.0.err: 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf69) [0x7fe919cb27e9] INFO:teuthology.task.rados.rados.0.err: 4: (librados::RadosClient::_dispatch(Message*)+0x6c) [0x7fe919c923dc] INFO:teuthology.task.rados.rados.0.err: 5: (librados::RadosClient::ms_dispatch(Message*)+0x33) [0x7fe919c92463] INFO:teuthology.task.rados.rados.0.err: 6: (SimpleMessenger::dispatch_entry()+0x883) [0x7fe919d418e3] INFO:teuthology.task.rados.rados.0.err: 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x7fe919c9c1ac] INFO:teuthology.task.rados.rados.0.err: 8: (()+0x7971) [0x7fe9199f5971] INFO:teuthology.task.rados.rados.0.err: 9: (clone()+0x6d) [0x7fe91923592d] INFO:teuthology.task.rados.rados.0.err: ceph version 0.41-14-g5e16974 (commit:5e16974c291e78f5a7cdc3c4f680e7817cff3cca) INFO:teuthology.task.rados.rados.0.err: 1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x387) [0x40ad77] INFO:teuthology.task.rados.rados.0.err: 2: (librados::RadosClient::C_aio_Ack::finish(int)+0xa4) [0x7fe919ca2d64] INFO:teuthology.task.rados.rados.0.err: 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf69) [0x7fe919cb27e9] INFO:teuthology.task.rados.rados.0.err: 4: (librados::RadosClient::_dispatch(Message*)+0x6c) [0x7fe919c923dc] INFO:teuthology.task.rados.rados.0.err: 5: (librados::RadosClient::ms_dispatch(Message*)+0x33) [0x7fe919c92463] INFO:teuthology.task.rados.rados.0.err: 6: (SimpleMessenger::dispatch_entry()+0x883) [0x7fe919d418e3] INFO:teuthology.task.rados.rados.0.err: 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x7fe919c9c1ac] INFO:teuthology.task.rados.rados.0.err: 8: (()+0x7971) [0x7fe9199f5971] INFO:teuthology.task.rados.rados.0.err: 9: (clone()+0x6d) [0x7fe91923592d]
#3 Updated by Sage Weil almost 12 years ago
- Assignee set to Samuel Just
#4 Updated by Samuel Just almost 12 years ago
Looks like in some cases we are finding objects which should have been deleted.
#5 Updated by Anonymous almost 12 years ago
Summary: An object was deleted, but after a recovery was found to be back ... which is almost surely indicative of a bug in the recovery process.
#6 Updated by Sage Weil almost 12 years ago
Just hit this:
- clean_up_local removed an object (due it a 'delete' log entry)
- a read came in and read it before the filestore deleted it.
Well, it was a bit more complex than that; there was some recovery going on, but that's the gist of the bug.
Adding an osr.flush() near the end of PG::activate() to ensure all this crap is on disk before we handling client ops or do any recovery.
#7 Updated by Sage Weil almost 12 years ago
- Status changed from New to Resolved