
Bug #1974

osd: radosmodel crash on thrashing

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


2012-01-23T15:04:11.650 INFO:teuthology.task.rados.rados.0.out:waiting_on = 2
2012-01-23T15:04:11.650 INFO:teuthology.task.rados.rados.0.out:Writing sepia548599-225 from 0 to 56 tid 1 ranges are [0~56,755983~740579]
2012-01-23T15:04:12.176 INFO:teuthology.task.rados.rados.0.out:Writing sepia548599-225 from 755983 to 1496562 tid 2 ranges are [0~56,755983~740579]
2012-01-23T15:04:13.689 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia548599-417
2012-01-23T15:04:13.838 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)', in thread '7fe0007f6700'
2012-01-23T15:04:13.838 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 524: FAILED assert(!old_value.deleted())
2012-01-23T15:04:13.850 INFO:teuthology.task.thrashosds.thrasher:in_osds:  [4, 3, 1, 0, 2]  out_osds:  [5] dead_osds:  [] live_osds:  [1, 0, 3, 2, 5, 4]
2012-01-23T15:04:13.851 INFO:teuthology.task.thrashosds.ceph_manager:waiting till clean
2012-01-23T15:04:13.851 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json'
2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: ceph version 0.40-213-g7ce544e (commit:7ce544e640d45e901ef67e8268c963c958a66eff)
2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: 1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x5a6) [0x410ab6]
2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: 2: (TestOp::finish(TestOp::CallbackInfo*)+0x12) [0x41c5f2]
2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: 3: (read_callback(void*, void*)+0x16) [0x41c616]
2012-01-23T15:04:13.854 INFO:teuthology.task.rados.rados.0.err: 4: (librados::RadosClient::C_aio_Ack::finish(int)+0xf4) [0x7fe0030a9e94]
2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 5: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1415) [0x7fe0030c3f65]
2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 6: (librados::RadosClient::_dispatch(Message*)+0x124) [0x7fe0030936a4]
2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 7: (librados::RadosClient::ms_dispatch(Message*)+0x43) [0x7fe003093733]
2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 8: (SimpleMessenger::dispatch_entry()+0xa1a) [0x7fe003192a9a]
2012-01-23T15:04:13.855 INFO:teuthology.task.rados.rados.0.err: 9: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7fe0030a1d9c]
2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: 10: (Thread::_entry_func(void*)+0x12) [0x7fe003209e02]
2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: 11: (()+0x7971) [0x7fe002deb971]
2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: 12: (clone()+0x6d) [0x7fe00262b92d]
2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: ceph version 0.40-213-g7ce544e (commit:7ce544e640d45e901ef67e8268c963c958a66eff)
2012-01-23T15:04:13.856 INFO:teuthology.task.rados.rados.0.err: 1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x5a6) [0x410ab6]
2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 2: (TestOp::finish(TestOp::CallbackInfo*)+0x12) [0x41c5f2]
2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 3: (read_callback(void*, void*)+0x16) [0x41c616]
2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 4: (librados::RadosClient::C_aio_Ack::finish(int)+0xf4) [0x7fe0030a9e94]
2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 5: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1415) [0x7fe0030c3f65]
2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 6: (librados::RadosClient::_dispatch(Message*)+0x124) [0x7fe0030936a4]
2012-01-23T15:04:13.857 INFO:teuthology.task.rados.rados.0.err: 7: (librados::RadosClient::ms_dispatch(Message*)+0x43) [0x7fe003093733]
2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 8: (SimpleMessenger::dispatch_entry()+0xa1a) [0x7fe003192a9a]
2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 9: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7fe0030a1d9c]
2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 10: (Thread::_entry_func(void*)+0x12) [0x7fe003209e02]
2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 11: (()+0x7971) [0x7fe002deb971]
2012-01-23T15:04:13.858 INFO:teuthology.task.rados.rados.0.err: 12: (clone()+0x6d) [0x7fe00262b92d]
2012-01-23T15:04:13.859 INFO:teuthology.task.rados.rados.0.err:terminate called after throwing an instance of 'ceph::FailedAssertion'

/var/lib/teuthworker/archive/nightly_coverage_2012-01-23-b/8774

Associated revisions

Revision 359dfb99 (diff)
Added by Sage Weil over 9 years ago

osd: flush on activate

PG::activate() can make lots of changes, most notably clean_up_local()
which deletes lots of local objects. Those changes need to be flushed
to the fs before we start servicing requests or else we risk processing a
client read on those objects.

Fixes: #1974
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil over 9 years ago


(gdb) bt
#0  0x00007fe91b12aba5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fe91b12e6b0 in abort () at abort.c:92
#2  0x00007fe91b74b6bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007fe91b749906 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007fe91b749933 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007fe91b749a3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00007fe91bf5ff7d in ceph::__ceph_assert_fail (assertion=0x446dad "0", file=0x446d1b "./test/osd/RadosModel.h", line=368, 
    func=0x4474c0 "virtual void WriteOp::_finish(TestOp::CallbackInfo*)") at common/assert.cc:70
#7  0x000000000042de6e in WriteOp::_finish (this=0x7fe910088a00, info=0x7fe9101f8fa0) at ./test/osd/RadosModel.h:368
#8  0x0000000000446a7c in TestOp::finish (this=0x7fe910088a00, info=0x7fe9101f8fa0) at test/osd/RadosModel.cc:20
#9  0x0000000000446aeb in write_callback (comp=0x7fe9100884d0, arg=0x7fe9100035f0) at test/osd/RadosModel.cc:34
#10 0x00007fe91be5c034 in librados::RadosClient::C_aio_Ack::finish (this=0x7fe9100062c0, r=0) at librados.cc:657
#11 0x00007fe91be89859 in Objecter::handle_osd_op_reply (this=0x1581060, m=0x16e5c00) at osdc/Objecter.cc:1241
#12 0x00007fe91be42d82 in librados::RadosClient::_dispatch (this=0x157cbf0, m=0x16e5c00) at librados.cc:1069
#13 0x00007fe91be42b82 in librados::RadosClient::ms_dispatch (this=0x157cbf0, m=0x16e5c00) at librados.cc:1039
#14 0x00007fe91bf83287 in Messenger::ms_deliver_dispatch (this=0x15801e0, m=0x16e5c00) at msg/Messenger.h:102
#15 0x00007fe91bf6cb64 in SimpleMessenger::dispatch_entry (this=0x15801e0) at msg/SimpleMessenger.cc:364
#16 0x00007fe91be559b6 in SimpleMessenger::DispatchThread::entry (this=0x1580668) at msg/SimpleMessenger.h:545
#17 0x00007fe91bfd5899 in Thread::_entry_func (arg=0x1580668) at common/Thread.cc:41
#18 0x00007fe91b99d971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#19 0x00007fe91b1dd92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#20 0x0000000000000000 in ?? ()
(gdb) f 7
#7  0x000000000042de6e in WriteOp::_finish (this=0x7fe910088a00, info=0x7fe9101f8fa0) at ./test/osd/RadosModel.h:368
368     ./test/osd/RadosModel.h: No such file or directory.
        in ./test/osd/RadosModel.h

this core is still sitting in /tmp/cephtest on sepia63 for now

#2 Updated by Sage Weil over 9 years ago

another one:

INFO:teuthology.task.thrashosds.thrasher:in_osds:  [1, 2, 4, 0]  out_osds:  [5, 3] dead_osds:  [] live_osds:  [1, 0, 3, 2, 5, 4]
INFO:teuthology.task.thrashosds.thrasher:choose_action: min_in 2 min_out 0 min_live 2 min_dead 0
INFO:teuthology.task.thrashosds.thrasher:Adding osd 5
INFO:teuthology.task.rados.rados.0.out:946: Reading 292
INFO:teuthology.task.rados.rados.0.out:947: Reading 311
INFO:teuthology.task.rados.rados.0.out:948: Reading 393
INFO:teuthology.task.rados.rados.0.out:949: Writing 262 current snap is 0
INFO:teuthology.task.rados.rados.0.out:waiting_on = 3
INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-262 from 0 to 57 tid 1 ranges are [0~57,444394~685464,1586930~57464]
INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-262 from 444394 to 1129858 tid 2 ranges are [0~57,444394~685464,1586930~57464]
INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-262 from 1586930 to 1644394 tid 3 ranges are [0~57,444394~685464,1586930~57464]
INFO:teuthology.task.rados.rados.0.out:950: Writing 269 current snap is 0
INFO:teuthology.task.rados.rados.0.out:waiting_on = 2
INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-269 from 0 to 57 tid 1 ranges are [0~57,540647~657268]
INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fe917400700 time 2012-01-28 20:07:51.141077
INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 524: FAILED assert(!old_value.deleted())
INFO:teuthology.task.rados.rados.0.out:Writing sepia4713453-269 from 540647 to 1197915 tid 2 ranges are [0~57,540647~657268]
INFO:teuthology.task.rados.rados.0.err: ceph version 0.41-14-g5e16974 (commit:5e16974c291e78f5a7cdc3c4f680e7817cff3cca)
INFO:teuthology.task.rados.rados.0.err: 1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x387) [0x40ad77]
INFO:teuthology.task.rados.rados.0.err: 2: (librados::RadosClient::C_aio_Ack::finish(int)+0xa4) [0x7fe919ca2d64]
INFO:teuthology.task.rados.rados.0.err: 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf69) [0x7fe919cb27e9]
INFO:teuthology.task.rados.rados.0.err: 4: (librados::RadosClient::_dispatch(Message*)+0x6c) [0x7fe919c923dc]
INFO:teuthology.task.rados.rados.0.err: 5: (librados::RadosClient::ms_dispatch(Message*)+0x33) [0x7fe919c92463]
INFO:teuthology.task.rados.rados.0.err: 6: (SimpleMessenger::dispatch_entry()+0x883) [0x7fe919d418e3]
INFO:teuthology.task.rados.rados.0.err: 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x7fe919c9c1ac]
INFO:teuthology.task.rados.rados.0.err: 8: (()+0x7971) [0x7fe9199f5971]
INFO:teuthology.task.rados.rados.0.err: 9: (clone()+0x6d) [0x7fe91923592d]
INFO:teuthology.task.rados.rados.0.err: ceph version 0.41-14-g5e16974 (commit:5e16974c291e78f5a7cdc3c4f680e7817cff3cca)
INFO:teuthology.task.rados.rados.0.err: 1: (ReadOp::_finish(TestOp::CallbackInfo*)+0x387) [0x40ad77]
INFO:teuthology.task.rados.rados.0.err: 2: (librados::RadosClient::C_aio_Ack::finish(int)+0xa4) [0x7fe919ca2d64]
INFO:teuthology.task.rados.rados.0.err: 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf69) [0x7fe919cb27e9]
INFO:teuthology.task.rados.rados.0.err: 4: (librados::RadosClient::_dispatch(Message*)+0x6c) [0x7fe919c923dc]
INFO:teuthology.task.rados.rados.0.err: 5: (librados::RadosClient::ms_dispatch(Message*)+0x33) [0x7fe919c92463]
INFO:teuthology.task.rados.rados.0.err: 6: (SimpleMessenger::dispatch_entry()+0x883) [0x7fe919d418e3]
INFO:teuthology.task.rados.rados.0.err: 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x7fe919c9c1ac]
INFO:teuthology.task.rados.rados.0.err: 8: (()+0x7971) [0x7fe9199f5971]
INFO:teuthology.task.rados.rados.0.err: 9: (clone()+0x6d) [0x7fe91923592d]

#3 Updated by Sage Weil over 9 years ago

  • Assignee set to Samuel Just

#4 Updated by Samuel Just over 9 years ago

Looks like in some cases we are finding objects which should have been deleted.

#5 Updated by Anonymous over 9 years ago

Summary: An object was deleted, but after a recovery it was found to be back, which is almost surely indicative of a bug in the recovery process.

#6 Updated by Sage Weil over 9 years ago

Just hit this:

- clean_up_local removed an object (due to a 'delete' log entry)
- a read came in and read it before the filestore deleted it.

Well, it was a bit more complex than that; there was some recovery going on, but that's the gist of the bug.

Adding an osr.flush() near the end of PG::activate() to ensure all this crap is on disk before we handle client ops or do any recovery.

#7 Updated by Sage Weil over 9 years ago

  • Status changed from New to Resolved
