Bug #13258

closed

osd Take turns to the down ("FAILED assert(0 == "unexpected error")")

Added by wj rong over 8 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: upgrade/hammer-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

-2> 2015-09-28 12:28:28.010597 7f577fbae700 -1 osd.4 pg_epoch: 389 pg[3.f( v 239'35446 (200'32446,239'35446] local-les=385 n=604 ec=26 les/c 385/385 388/388/386) [5] r=-1 lpr=388 pi=381-387/3 crt=239'35446 inactive NOTIFY] on_flushed: found objects in the temp collection: [], crashing now
-1> 2015-09-28 12:28:28.010594 7f57805af700 -1 osd.4 pg_epoch: 389 pg[3.17( v 239'32172 (200'29172,239'32172] local-les=385 n=661 ec=26 les/c 385/385 388/388/388) [5] r=-1 lpr=388 pi=381-387/3 crt=239'32172 inactive NOTIFY] on_flushed: found objects in the temp collection: [], crashing now
0> 2015-09-28 12:28:28.101488 7f5790dc3700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f5790dc3700 time 2015-09-28 12:28:27.936933
os/FileStore.cc: 2715: FAILED assert(0 == "unexpected error")

ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x1960) [0x7af7b0]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x7b87e4]
3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x29f) [0x7b8a9f]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xb64526]
5: (ThreadPool::WorkThread::entry()+0x10) [0xb66170]
6: /lib64/libpthread.so.0() [0x3c4be07a51]
7: (clone()+0x6d) [0x3c4bae89ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
1: (ReplicatedBackend::on_flushed()+0x121) [0x9c7261]
2: (ReplicatedPG::on_flushed()+0x1d4) [0x8ad794]
3: (PG::RecoveryState::Started::react(PG::FlushedEvt const&)+0x47) [0x8184b7]
4: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x28b) [0x88dbdb]
5: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x199) [0x88a639]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x87fe0b]
7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x32f) [0x84328f]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x37c) [0x66dd5c]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x16) [0x6be3e6]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xb64526]
11: (ThreadPool::WorkThread::entry()+0x10) [0xb66170]
12: /lib64/libpthread.so.0() [0x3c4be07a51]
13: (clone()+0x6d) [0x3c4bae89ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #1

Updated by Samuel Just over 8 years ago

Can you reproduce that crash with
debug osd = 20
debug filestore = 20
debug ms = 1?
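
For anyone following up, a minimal sketch of how these levels could be set in ceph.conf on the affected host. Placing them under the [osd] section is an assumption, since the request above does not say where to set them:

```ini
[osd]
# Debug levels requested above; the OSD must be restarted to pick up ceph.conf changes.
debug osd = 20
debug filestore = 20
debug ms = 1
```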

Actions #2

Updated by Samuel Just over 8 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Yuri Weinstein over 8 years ago

  • Status changed from Need More Info to New

Run: http://pulpito.ceph.com/teuthology-2015-10-16_17:10:01-upgrade:hammer-x-infernalis-distro-basic-vps/
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-10-16_17:10:01-upgrade:hammer-x-infernalis-distro-basic-vps/1110589/teuthology.log

2015-10-16T22:48:50.403 INFO:tasks.rados.rados.0.vpm119.stdout:update_object_version oid 34 v 316 (ObjNum 436 snap 155 seq_num 436) dirty exists
2015-10-16T22:48:50.403 INFO:tasks.rados.rados.0.vpm119.stdout:update_object_version oid 21 v 794 (ObjNum 298 snap 109 seq_num 298) dirty exists
2015-10-16T22:48:50.403 INFO:tasks.rados.rados.0.vpm119.stdout:update_object_version oid 38 v 318 (ObjNum 175 snap 60 seq_num 175) dirty exists
2015-10-16T22:48:50.403 INFO:tasks.rados.rados.0.vpm119.stdout:1280:  finishing write tid 6 to cloud10578-16
2015-10-16T22:48:50.403 INFO:tasks.rados.rados.0.vpm119.stdout:update_object_version oid 16 v 423 (ObjNum 493 snap 178 seq_num 493) dirty exists
2015-10-16T22:48:50.404 INFO:tasks.rados.rados.0.vpm119.stdout:1280:  left oid 16 (ObjNum 493 snap 178 seq_num 493)
2015-10-16T22:48:50.692 INFO:teuthology.orchestra.run.vpm139.stderr:os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7febbe36d880 time 2015-10-16 22:48:49.615779
2015-10-16T22:48:50.692 INFO:teuthology.orchestra.run.vpm139.stderr:os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
2015-10-16T22:48:50.692 INFO:teuthology.orchestra.run.vpm139.stderr: ceph version 0.94.3-293-g5764e23 (5764e233e56be08a59ffe6292f6fba9a76288aee)
2015-10-16T22:48:50.692 INFO:teuthology.orchestra.run.vpm139.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbd56f5]
2015-10-16T22:48:50.692 INFO:teuthology.orchestra.run.vpm139.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa6e) [0x98cd6e]
2015-10-16T22:48:50.693 INFO:teuthology.orchestra.run.vpm139.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x991d64]
2015-10-16T22:48:50.693 INFO:teuthology.orchestra.run.vpm139.stderr: 4: (JournalingObjectStore::journal_replay(unsigned long)+0x5db) [0x9aa4fb]
2015-10-16T22:48:50.693 INFO:teuthology.orchestra.run.vpm139.stderr: 5: (FileStore::mount()+0x3720) [0x97c510]
2015-10-16T22:48:50.693 INFO:teuthology.orchestra.run.vpm139.stderr: 6: (main()+0x158f) [0x64aa3f]
2015-10-16T22:48:50.693 INFO:teuthology.orchestra.run.vpm139.stderr: 7: (__libc_start_main()+0xf5) [0x7febb8ef6af5]
2015-10-16T22:48:50.693 INFO:teuthology.orchestra.run.vpm139.stderr: 8: ceph-objectstore-tool() [0x667379]
2015-10-16T22:48:50.694 INFO:teuthology.orchestra.run.vpm139.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #5

Updated by Yuri Weinstein over 8 years ago

  • Subject changed from osd Take turns to the down to osd Take turns to the down ("FAILED assert(0 == "unexpected error")")
  • Release set to infernalis
  • ceph-qa-suite upgrade/hammer-x added
Actions #6

Updated by Yuri Weinstein over 8 years ago

  • Priority changed from High to Urgent
  • Source changed from other to Q/A
Actions #7

Updated by Samuel Just over 8 years ago

  • Status changed from New to Can't reproduce

I don't have logs from that OSD, so I'm marking this one as Can't reproduce; I would need logging to investigate (and the teuthology failure is probably unrelated to the original report). That assert could be caused by a disk error.
