Bug #21470

closed

Ceph OSDs crashing in BlueStore::queue_transactions() using EC after applying fix

Added by Bob Bobington over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): BlueStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is a copy of http://tracker.ceph.com/issues/21314, which was marked as resolved. The problem persists after applying the linked fixes. I tried to say so on that issue but received no response.

I've set up a cluster with the following configuration:
  • Single node, Arch Linux, using a build from the official v12.2.0 source release, with the fixes from http://tracker.ceph.com/issues/21171 applied (specifically, https://github.com/ceph/ceph/pull/17352)
  • Loopback networking
  • 4 dm-crypted BlueStore OSDs
  • 1 mon, 1 mgr, 1 mds
  • 1 CephFS, mounted with the kernel driver, written to with rsync
  • CephFS has a 256 PG k=2 m=1 erasure coded data pool and a 64 PG size=2 replicated metadata pool

After a few hours (roughly 6), my OSDs consistently crash like so:

2017-09-17 00:23:28.154916 7f3b2e1fe700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had timed out after 15
2017-09-17 00:23:28.154920 7f3b2e1fe700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had suicide timed out after 150
2017-09-17 00:23:28.161089 7f3b181b4700 -1 *** Caught signal (Aborted) **
 in thread 7f3b181b4700 thread_name:tp_osd_tp

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (()+0x9a6f48) [0x55688345ef48]
 2: (()+0x117e0) [0x7f3b338527e0]
 3: (pthread_cond_wait()+0x1fd) [0x7f3b3384e1ad]
 4: (Throttle::_wait(long)+0x33c) [0x55688349a26c]
 5: (Throttle::get(long, long)+0x2a2) [0x55688349b032]
 6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x105b) [0x55688336293b]
 7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x68) [0x55688309dd08]
 8: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x5568831dc3b3]
 9: (ECBackend::try_reads_to_commit()+0xa7b) [0x5568831e9c5b]
 10: (ECBackend::check_ops()+0x1c) [0x5568831ec52c]
 11: (ECBackend::start_rmw(ECBackend::Op*, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&)+0x1e0a) [0x5568831f685a]
 12: (ECBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x4a1) [0x5568831f7751]
 13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x810) [0x55688303f800]
 14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x1238) [0x556883084538]
 15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x389e) [0x55688308864e]
 16: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe17) [0x556883046e27]
 17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x556882ed5d07]
 18: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x556883158ffa]
 19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ae7) [0x556882ef9b97]
 20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x5568834a9612]
 21: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5568834acb80]
 22: (()+0x7049) [0x7f3b33848049]
 23: (clone()+0x3f) [0x7f3b32cd7f0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
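
To pick the suicide timeout out of a longer OSD log, something like the following could be used. This is only a minimal sketch: the regular expression is based solely on the two heartbeat_map lines quoted above, and find_suicide_timeouts is a hypothetical helper name, not part of Ceph.

```python
import re

# Pattern derived only from the heartbeat_map lines quoted in this report;
# other Ceph versions may format these lines differently (assumption).
HEARTBEAT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<suicide>suicide )?timed out after (?P<seconds>\d+)"
)


def find_suicide_timeouts(lines):
    """Return (pool, thread, seconds) for every suicide-timeout line."""
    hits = []
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m and m.group("suicide"):
            hits.append((m.group("pool"), m.group("thread"), int(m.group("seconds"))))
    return hits


# The two heartbeat lines from the log above.
sample = [
    "2017-09-17 00:23:28.154916 7f3b2e1fe700  1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f3b181b4700' had timed out after 15",
    "2017-09-17 00:23:28.154920 7f3b2e1fe700  1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f3b181b4700' had suicide timed out after 150",
]

for pool, thread, seconds in find_suicide_timeouts(sample):
    print(pool, thread, seconds)
```

The thread id it reports (0x7f3b181b4700) is the same thread that appears in the "Caught signal (Aborted)" line, i.e. the tp_osd_tp worker that was blocked in Throttle::get().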

After a few such crashes, the OSDs instead refuse to start at all, like so:

2017-09-17 00:04:39.184048 7fe65bd00700 -1 src/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe65bd00700 time 2017-09-17 00:04:39.179157
src/ceph/src/os/bluestore/BlueStore.cc: 9290: FAILED assert(0 == "unexpected error")

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x560dd45a6ba5]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xf1c) [0x560dd4461aec]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x4f8) [0x560dd4463dd8]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x68) [0x560dd419fd08]
 5: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x560dd42de3b3]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x321) [0x560dd42f5a51]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x560dd41dcc47]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63e) [0x560dd414864e]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x560dd3fd7d07]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x560dd425affa]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ae7) [0x560dd3ffbb97]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x560dd45ab612]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560dd45aeb80]
 14: (()+0x7049) [0x7fe675b91049]
 15: (clone()+0x3f) [0x7fe675020f0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this
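
Both backtraces go through ECBackend::handle_sub_write() into BlueStore::queue_transactions(). A rough way to check which frames two traces share is sketched below. The frame regex is based only on the frame lines in this report, frame_names and shared_frames are hypothetical helper names, and the sample strings contain just a few frames copied from the traces above.

```python
import re

# Matches lines such as " 5: (Throttle::get(long, long)+0x2a2) [0x55688349b032]".
# Derived only from the frames in this report (assumption about the format).
FRAME_RE = re.compile(r"^\s*\d+: \((?P<sym>.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]\s*$")


def frame_names(trace_text):
    """Extract function names (argument lists dropped) from a backtrace."""
    names = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if not m:
            continue
        # Keep everything before the first "(", i.e. drop the argument list.
        name = m.group("sym").split("(", 1)[0]
        if name:  # unnamed frames like "(()+0x7049)" yield an empty name
            names.append(name)
    return names


def shared_frames(trace_a, trace_b):
    """Names that occur in both traces, in the order they appear in trace_a."""
    in_b = set(frame_names(trace_b))
    return [n for n in frame_names(trace_a) if n in in_b]


# A few frames copied from the first (suicide timeout) backtrace.
trace_a = """
 5: (Throttle::get(long, long)+0x2a2) [0x55688349b032]
 6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x105b) [0x55688336293b]
 8: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x5568831dc3b3]
 23: (clone()+0x3f) [0x7f3b32cd7f0f]
"""

# A few frames copied from the second (failed assert) backtrace.
trace_b = """
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xf1c) [0x560dd4461aec]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x4f8) [0x560dd4463dd8]
 5: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x560dd42de3b3]
 15: (clone()+0x3f) [0x7fe675020f0f]
"""

print(shared_frames(trace_a, trace_b))
```

Splitting at the first "(" is a simplification: it also truncates names like operator(), which is acceptable for a quick comparison but not for exact symbol matching.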

Related issues 1 (0 open, 1 closed)

Related to bluestore - Bug #22957: [bluestore]bstore_kv_final thread seems deadlock (Duplicate, 02/08/2018)
