Bug #21470

closed

Ceph OSDs crashing in BlueStore::queue_transactions() using EC after applying fix

Added by Bob Bobington over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
BlueStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is a copy of http://tracker.ceph.com/issues/21314, which was marked as resolved. It's not actually resolved after applying the linked fixes; I tried to say so on that issue but received no response.

I've set up a cluster with the following configuration:
  • Single node, Arch Linux, using a build from the official v12.2.0 source release, with the fixes from http://tracker.ceph.com/issues/21171 applied (specifically, https://github.com/ceph/ceph/pull/17352)
  • Loopback networking
  • 4 dm-crypted Bluestore OSDs
  • 1 mon, 1 mgr, 1 mds
  • 1 CephFS, mounted with the kernel driver, written to with rsync
  • CephFS has a 256 PG k=2 m=1 erasure coded data pool and a 64 PG size=2 replicated metadata pool (a sketch of the pool setup commands follows below)
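For context, a pool layout like the one above is typically created with commands along these lines. This is only a sketch (the pool names, the filesystem name, and the --force flag are assumptions; the PG counts mirror the description):

ceph osd erasure-code-profile set myprofile plugin=jerasure k=2 m=1 crush-failure-domain=osd
ceph osd pool create cephfs_ecdata 256 256 erasure myprofile
ceph osd pool set cephfs_ecdata allow_ec_overwrites true
ceph osd pool create cephfs_metadata 64 64
ceph osd pool set cephfs_metadata size 2
# an erasure-coded default data pool may require --force here
ceph fs new cephfs cephfs_metadata cephfs_ecdata --force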

Given a few hours (~6ish), my OSDs consistently crash like so:

2017-09-17 00:23:28.154916 7f3b2e1fe700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had timed out after 15
2017-09-17 00:23:28.154920 7f3b2e1fe700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had suicide timed out after 150
2017-09-17 00:23:28.161089 7f3b181b4700 -1 *** Caught signal (Aborted) **
 in thread 7f3b181b4700 thread_name:tp_osd_tp

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (()+0x9a6f48) [0x55688345ef48]
 2: (()+0x117e0) [0x7f3b338527e0]
 3: (pthread_cond_wait()+0x1fd) [0x7f3b3384e1ad]
 4: (Throttle::_wait(long)+0x33c) [0x55688349a26c]
 5: (Throttle::get(long, long)+0x2a2) [0x55688349b032]
 6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x105b) [0x55688336293b]
 7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x68) [0x55688309dd08]
 8: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x5568831dc3b3]
 9: (ECBackend::try_reads_to_commit()+0xa7b) [0x5568831e9c5b]
 10: (ECBackend::check_ops()+0x1c) [0x5568831ec52c]
 11: (ECBackend::start_rmw(ECBackend::Op*, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&)+0x1e0a) [0x5568831f685a]
 12: (ECBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x4a1) [0x5568831f7751]
 13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x810) [0x55688303f800]
 14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x1238) [0x556883084538]
 15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x389e) [0x55688308864e]
 16: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe17) [0x556883046e27]
 17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x556882ed5d07]
 18: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x556883158ffa]
 19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ae7) [0x556882ef9b97]
 20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x5568834a9612]
 21: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5568834acb80]
 22: (()+0x7049) [0x7f3b33848049]
 23: (clone()+0x3f) [0x7f3b32cd7f0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

After a few crashes they instead refuse to start entirely, like so:

2017-09-17 00:04:39.184048 7fe65bd00700 -1 src/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe65bd00700 time 2017-09-17 00:04:39.179157
src/ceph/src/os/bluestore/BlueStore.cc: 9290: FAILED assert(0 == "unexpected error")

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x560dd45a6ba5]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xf1c) [0x560dd4461aec]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x4f8) [0x560dd4463dd8]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x68) [0x560dd419fd08]
 5: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x560dd42de3b3]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x321) [0x560dd42f5a51]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x560dd41dcc47]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63e) [0x560dd414864e]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x560dd3fd7d07]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x560dd425affa]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ae7) [0x560dd3ffbb97]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x560dd45ab612]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560dd45aeb80]
 14: (()+0x7049) [0x7fe675b91049]
 15: (clone()+0x3f) [0x7fe675020f0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this

Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #22957: [bluestore] bstore_kv_final thread seems deadlock (Duplicate) 02/08/2018

Actions #1

Updated by Bob Bobington over 6 years ago

Oh, forgot to add that I've tried the workarounds on the related issues. Adding this to my ceph.conf makes no difference:

bluestore throttle bytes = 0
bluestore deferred throttle bytes = 0
Actions #2

Updated by Sage Weil over 6 years ago

  • Status changed from New to Need More Info
  • Assignee set to Sage Weil
  • Priority changed from Normal to Urgent

Okay, thanks for confirming that the #21171 fix is applied. Can you reproduce with debug bluestore = 20 and then ceph-post-file the resulting log file? Thanks!
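For example (the osd id, log path, and description string below are only illustrative):

ceph tell osd.* injectargs '--debug-bluestore 20'
# reproduce the crash, then upload the log of the crashed OSD
ceph-post-file -d 'bug 21470 osd log' /var/log/ceph/ceph-osd.3.log

Alternatively, debug bluestore = 20 can be set under [osd] in ceph.conf followed by an OSD restart.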

Actions #3

Updated by Bob Bobington over 6 years ago

Done. Took less than an hour and happened on two OSDs. Uploaded one of them:

ceph-post-file: 6e0ed6ab-1528-428d-aae9-b62697a9a90b

Actions #4

Updated by Roy Hooper over 6 years ago

With a similar but slightly different setup, this same crash happened to me.

Installed via ceph-deploy install --release luminous

3 servers comprising 3 mon, 2 mgr, 2 mds, 4 osd
1 CephFS (kernel driver), written to with rsync

CephFS had a 40 PG k=2 m=1 erasure coded data pool and a 16 PG size=2 replicated metadata pool

Setup for the pool was as follows:

ceph osd erasure-code-profile set myprofile plugin=jerasure k=2 m=1 crush-failure-domain=osd
ceph osd pool create cephfs_ecdata 40 40 erasure myprofile
ceph osd pool create cephfs_metadata 16 16
ceph osd pool set cephfs_metadata size 2
ceph osd pool set cephfs_ecdata allow_ec_overwrites true

All other settings were defaults from

ceph-deploy osd create --zap-disk --bluestore --fs-type xfs osd01:/dev/sdb osd01:/dev/sdc osd02:/dev/sdb osd02:/dev/sdc

Actions #5

Updated by Sage Weil over 6 years ago

  • Status changed from Need More Info to 7

Ok, that's kind of embarrassing; I think the fix is pretty simple. Can you please test out this branch?

wip-21470-test
https://shaman.ceph.com/builds/ceph/wip-21470-test/

Actions #6

Updated by Bob Bobington over 6 years ago

I'm not on a Debian or Red Hat derivative; is there a Git repository I can get the source from, or a tarball you can link to? I couldn't find a branch with that name on GitHub or git.ceph.com.

Actions #8

Updated by Bob Bobington over 6 years ago

No luck. I applied 1918c57c7c6304875501f4f4b04b9c82834395a3 from the aforementioned repo to my copy of the official Luminous release and I'm still getting crashes.

ceph-post-file: 2cf20528-2be3-49c8-a645-4666e06c2538

I'll also note that prior to it happening, all OSDs had pegged a CPU core each (and the survivors continued to).

If it's necessary to clone the whole repo though, let me know.

Actions #9

Updated by Bob Bobington over 6 years ago

While I'm not intimately familiar with threaded programming, I'm okay with general C++. Could you possibly explain what exactly is happening?

My basic understanding right now is that:
  • There's a throttle for the number of deferred bytes that can be queued at any given time.
  • queue_transactions is basically the top entry point for BlueStore.
  • queue_transactions first checks whether there are enough free bytes right now; if there are, hurrah, it can continue and do nothing else.
  • If there are not enough bytes available from the throttle, queue_transactions increments deferred_aggressive (its purpose is currently unclear), calls deferred_try_submit (also unclear), notifies kv_cond (ditto), then blocks until it can get bytes from the throttle.

This last case is the problem, as the throttle never gives the thread any bytes and it just waits until Ceph gives up and kills the OSD, which suggests that something is reserving bytes then not putting them back for some reason.
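In other words, my mental model of the throttle is roughly the following (a minimal sketch of the pattern only, not BlueStore's actual Throttle code): get() parks the calling tp_osd_tp thread on a condition variable until someone returns bytes with put(), so if the completion path that should call put() never runs, the thread waits until the heartbeat suicide timeout kills the OSD.

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Minimal sketch of a byte throttle (illustrative only, not BlueStore's Throttle).
class SimpleThrottle {
  std::mutex m;
  std::condition_variable cv;
  int64_t max;       // budget, e.g. "bluestore throttle deferred bytes"
  int64_t used = 0;  // bytes currently reserved

public:
  explicit SimpleThrottle(int64_t max_bytes) : max(max_bytes) {}

  // Reserve `bytes`; blocks while the budget is exhausted.
  void get(int64_t bytes) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return used + bytes <= max; });  // where the op thread hangs
    used += bytes;
  }

  // Return `bytes` once the deferred IO completes. If this is never called,
  // every later get() blocks forever and the heartbeat eventually kills the OSD.
  void put(int64_t bytes) {
    std::lock_guard<std::mutex> l(m);
    used -= bytes;
    cv.notify_all();
  }
};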

It appears that the only place bytes are put back in the throttle is _deferred_aio_finish, so something must either be blocking it or it's not being called.

So my guess from your most recent fix is that you thought something holding a throttle reservation was waiting for kv_cond, and that by notifying it the reservation would be released and Throttle.get() wouldn't block?

Is there a particular reason you thought this? Is there a way to read which threads are blocking on what from the logs or something?

I'm going to try enabling TSAN in cmake and see if it outputs anything helpful.

Also I got one of the previous workarounds working. Setting "bluestore throttle deferred bytes = 0" and "bluestore throttle bytes = 0" seems to prevent the crashes. Previously I was using "bluestore deferred throttle bytes" from http://tracker.ceph.com/issues/21180#note-2.
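For reference, the working workaround corresponds to the following in ceph.conf (putting it in the [osd] section is my assumption of the usual placement; setting the limits to 0 effectively disables the corresponding throttles so get() never blocks):

[osd]
bluestore throttle bytes = 0
bluestore throttle deferred bytes = 0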

Actions #10

Updated by Bob Bobington over 6 years ago

TSAN unfortunately just caused the OSDs to core dump instantly. I'll see if I can find another way to look for threading issues.

Actions #11

Updated by Sage Weil over 6 years ago

From that log I've narrowed the problem down to this line

2017-09-29 19:52:00.874146 7fea08fa9700 20 bluestore(/var/lib/ceph/osd/ceph-3) _deferred_aio_finish queuing async deferred_try_submit

which tries to schedule async work that never happens. Not sure why the finisher isn't working... as far as I can tell, none of the other calls to queue the async call work either; we just got lucky because other threads did the queuing.
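To be clear about what "queuing async" means here: the finisher is essentially a worker thread draining a queue of completion callbacks. A minimal sketch of that pattern (illustrative only, not Ceph's actual Finisher class) is below; if the worker never runs the queued callback, the deferred_try_submit it wraps never executes, and the throttle bytes are never returned.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Illustrative finisher-style worker (not Ceph's Finisher implementation).
class MiniFinisher {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::function<void()>> q;
  bool stop = false;
  std::thread worker;

  void run() {
    std::unique_lock<std::mutex> l(m);
    while (!stop) {
      if (q.empty()) {
        cv.wait(l);
        continue;
      }
      auto fn = std::move(q.front());
      q.pop_front();
      l.unlock();
      fn();        // e.g. the queued deferred_try_submit; if this never runs,
      l.lock();    // the deferred work (and the throttle put) is lost
    }
  }

public:
  MiniFinisher() : worker([this] { run(); }) {}
  ~MiniFinisher() {
    { std::lock_guard<std::mutex> l(m); stop = true; }
    cv.notify_all();
    worker.join();
  }

  // Schedule a callback to run asynchronously on the worker thread.
  void queue(std::function<void()> fn) {
    { std::lock_guard<std::mutex> l(m); q.push_back(std::move(fn)); }
    cv.notify_one();
  }
};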

Actions #12

Updated by Sage Weil over 6 years ago

I've pushed another patch to the same branch. Can you give it a try?

Actions #13

Updated by Bob Bobington over 6 years ago

This time I couldn't apply your changes to the original Luminous source release so I pulled the entire Git branch and built from that.

I've been running for ~6h now and haven't seen a crash yet, so I think there's a decent chance it's fixed. I'll leave it running overnight and update this issue again in the morning.

Actions #14

Updated by Bob Bobington over 6 years ago

Yep, left it running an entire night and wrote 1.5TB without crashing. Seems to be fixed. Thanks!

Actions #15

Updated by Sage Weil over 6 years ago

  • Status changed from 7 to Fix Under Review
  • Backport set to luminous

https://github.com/ceph/ceph/pull/18118

Thanks, Bob! Please let me know if you see it fail. This should be included in 12.2.2.

Actions #16

Updated by Sage Weil over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #17

Updated by Sage Weil over 6 years ago

  • Status changed from Pending Backport to Resolved
Actions #18

Updated by Sage Weil about 6 years ago

  • Related to Bug #22957: [bluestore]bstore_kv_final thread seems deadlock added