Bug #23120

closed

OSDs continuously crash during recovery

Added by Oliver Freyermuth about 6 years ago. Updated over 5 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have several OSDs continuously crashing during recovery. This is Luminous 12.2.3.

 ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable)
 1: (()+0xa3c591) [0x55b3e5a85591]
 2: (()+0xf5e0) [0x7f8c237ca5e0]
 3: (gsignal()+0x37) [0x7f8c227f31f7]
 4: (abort()+0x148) [0x7f8c227f48e8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55b3e5ac4664]
 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1487) [0x55b3e5997a27]
 7: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3a0) [0x55b3e5998a70]
 8: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x65) [0x55b3e5708a85]
 9: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x631) [0x55b3e5828191]
 10: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x327) [0x55b3e5838b27]
 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55b3e573d680]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x55b3e56a900c]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x55b3e552ef29]
 14: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55b3e57abad7]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x55b3e555d99e]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x55b3e5aca009]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b3e5acbfa0]
 18: (()+0x7e25) [0x7f8c237c2e25]
 19: (clone()+0x6d) [0x7f8c228b634d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
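As the note says, the anonymous frames (e.g. `(()+0xa3c591)`) are raw offsets into the executable and need the binary or its disassembly to be interpreted. A minimal sketch of how such an offset could be resolved, assuming binutils and the matching `ceph-debuginfo-12.2.3` package are installed (the `ceph-osd` path below is illustrative, not taken from the report):

```shell
BIN=/usr/bin/ceph-osd    # the crashed executable (assumed install path)
OFF=0xa3c591             # offset from frame 1: "(()+0xa3c591)"
# Map the offset to a function and source line; -C demangles C++ symbols,
# -f prints the enclosing function, -i expands inlined call chains:
addr2line -Cfie "$BIN" "$OFF"
# Full source-annotated disassembly, as suggested in the crash note:
# objdump -rdS "$BIN" > ceph-osd.dis
```

Without the matching debuginfo, `addr2line` prints `??` for the function and source location, so the debuginfo package version must match the crashing build exactly.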

This is using the officially released RPMs.

I've uploaded the logfile of one such OSD as:
ca0a29ae-0993-4faa-be4d-9ba2f7d6f905

The cluster will likely be recreated soon, since the system is now broken anyway, so please let me know quickly if more info is needed.

