Bug #21470
Ceph OSDs crashing in BlueStore::queue_transactions() using EC after applying fix
Status: closed
% Done: 0%
Source: Community (user)
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): BlueStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
This is a copy of http://tracker.ceph.com/issues/21314, which was marked as resolved. The problem is not resolved after applying the linked fixes; I tried to say so on that issue but received no response.
I've set up a cluster with the following configuration:
- Single node, Arch Linux, using a build from the official v12.2.0 source release, with the fixes from http://tracker.ceph.com/issues/21171 applied (specifically, https://github.com/ceph/ceph/pull/17352)
- Loopback networking
- 4 dm-crypted Bluestore OSDs
- 1 mon, 1 mgr, 1 mds
- 1 CephFS, mounted with the kernel driver, written to with rsync
- CephFS has a 256 PG k=2 m=1 erasure coded data pool and a 64 PG size=2 replicated metadata pool
After a few hours (roughly 6), my OSDs consistently crash like so:
2017-09-17 00:23:28.154916 7f3b2e1fe700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had timed out after 15
2017-09-17 00:23:28.154920 7f3b2e1fe700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had suicide timed out after 150
2017-09-17 00:23:28.161089 7f3b181b4700 -1 *** Caught signal (Aborted) **
 in thread 7f3b181b4700 thread_name:tp_osd_tp
 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (()+0x9a6f48) [0x55688345ef48]
 2: (()+0x117e0) [0x7f3b338527e0]
 3: (pthread_cond_wait()+0x1fd) [0x7f3b3384e1ad]
 4: (Throttle::_wait(long)+0x33c) [0x55688349a26c]
 5: (Throttle::get(long, long)+0x2a2) [0x55688349b032]
 6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x105b) [0x55688336293b]
 7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x68) [0x55688309dd08]
 8: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x5568831dc3b3]
 9: (ECBackend::try_reads_to_commit()+0xa7b) [0x5568831e9c5b]
 10: (ECBackend::check_ops()+0x1c) [0x5568831ec52c]
 11: (ECBackend::start_rmw(ECBackend::Op*, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&)+0x1e0a) [0x5568831f685a]
 12: (ECBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x4a1) [0x5568831f7751]
 13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x810) [0x55688303f800]
 14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x1238) [0x556883084538]
 15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x389e) [0x55688308864e]
 16: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe17) [0x556883046e27]
 17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x556882ed5d07]
 18: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x556883158ffa]
 19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ae7) [0x556882ef9b97]
 20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x5568834a9612]
 21: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5568834acb80]
 22: (()+0x7049) [0x7f3b33848049]
 23: (clone()+0x3f) [0x7f3b32cd7f0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
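In the log above, the thread reported by heartbeat_map (0x7f3b181b4700) is the same tp_osd_tp thread that aborts while waiting in Throttle::_wait, first exceeding the 15 second grace period and then the 150 second suicide timeout. As a rough aid for checking other OSD logs for the same pattern, here is a minimal Python sketch that pulls those heartbeat events out of log text. The regular expression is an assumption based only on the two lines quoted in this report, and the names `HEARTBEAT_RE` and `find_heartbeat_timeouts` are hypothetical helpers, not part of Ceph.

```python
import re

# Pattern inferred from the heartbeat_map lines quoted in this report
# (an assumption, not an official Ceph log format specification), e.g.:
#   ... heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had suicide timed out after 150
HEARTBEAT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<suicide>suicide )?timed out after (?P<seconds>\d+)"
)


def find_heartbeat_timeouts(log_text):
    """Return one dict per heartbeat timeout event found in log_text."""
    events = []
    for match in HEARTBEAT_RE.finditer(log_text):
        events.append({
            "pool": match.group("pool"),
            "thread": match.group("thread"),
            # True when the line reports a suicide timeout rather than the grace timeout.
            "suicide": match.group("suicide") is not None,
            "seconds": int(match.group("seconds")),
        })
    return events


# The two heartbeat lines from the crash log in this report.
sample = (
    "2017-09-17 00:23:28.154916 7f3b2e1fe700 1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f3b181b4700' had timed out after 15\n"
    "2017-09-17 00:23:28.154920 7f3b2e1fe700 1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f3b181b4700' had suicide timed out after 150\n"
)

for event in find_heartbeat_timeouts(sample):
    print(event)
```

Because the sketch uses `finditer` over the whole text, it also works on logs where several entries have been pasted onto a single line.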
After a few crashes they instead refuse to start entirely, like so:
2017-09-17 00:04:39.184048 7fe65bd00700 -1 src/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe65bd00700 time 2017-09-17 00:04:39.179157
src/ceph/src/os/bluestore/BlueStore.cc: 9290: FAILED assert(0 == "unexpected error")
 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x560dd45a6ba5]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xf1c) [0x560dd4461aec]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x4f8) [0x560dd4463dd8]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x68) [0x560dd419fd08]
 5: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x560dd42de3b3]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x321) [0x560dd42f5a51]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x560dd41dcc47]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63e) [0x560dd414864e]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x560dd3fd7d07]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x560dd425affa]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ae7) [0x560dd3ffbb97]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x560dd45ab612]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560dd45aeb80]
 14: (()+0x7049) [0x7fe675b91049]
 15: (clone()+0x3f) [0x7fe675020f0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this