Bug #21470

closed

Ceph OSDs crashing in BlueStore::queue_transactions() using EC after applying fix

Added by Bob Bobington over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
BlueStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is a copy of http://tracker.ceph.com/issues/21314, which was marked as resolved. It's not actually resolved after applying the linked fixes; I tried to say so on that issue but received no response.

I've set up a cluster with the following configuration:
  • Single node, Arch Linux, using a build from the official v12.2.0 source release, with the fixes from http://tracker.ceph.com/issues/21171 applied (specifically, https://github.com/ceph/ceph/pull/17352)
  • Loopback networking
  • 4 dm-crypted Bluestore OSDs
  • 1 mon, 1 mgr, 1 mds
  • 1 CephFS, mounted with the kernel driver, written to with rsync
  • CephFS has a 256 PG k=2 m=1 erasure coded data pool and a 64 PG size=2 replicated metadata pool (a sketch of the pool setup commands follows below)
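For context, a pool layout like the one above is typically created with commands along these lines. This is only a sketch (the pool names, the filesystem name, and the --force flag are assumptions; the PG counts mirror the description):

ceph osd erasure-code-profile set myprofile plugin=jerasure k=2 m=1 crush-failure-domain=osd
ceph osd pool create cephfs_ecdata 256 256 erasure myprofile
ceph osd pool set cephfs_ecdata allow_ec_overwrites true
ceph osd pool create cephfs_metadata 64 64
ceph osd pool set cephfs_metadata size 2
# an erasure-coded default data pool may require --force here
ceph fs new cephfs cephfs_metadata cephfs_ecdata --force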

Given a few hours (~6ish), my OSDs consistently crash like so:

2017-09-17 00:23:28.154916 7f3b2e1fe700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had timed out after 15
2017-09-17 00:23:28.154920 7f3b2e1fe700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3b181b4700' had suicide timed out after 150
2017-09-17 00:23:28.161089 7f3b181b4700 -1 *** Caught signal (Aborted) **
 in thread 7f3b181b4700 thread_name:tp_osd_tp

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (()+0x9a6f48) [0x55688345ef48]
 2: (()+0x117e0) [0x7f3b338527e0]
 3: (pthread_cond_wait()+0x1fd) [0x7f3b3384e1ad]
 4: (Throttle::_wait(long)+0x33c) [0x55688349a26c]
 5: (Throttle::get(long, long)+0x2a2) [0x55688349b032]
 6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x105b) [0x55688336293b]
 7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x68) [0x55688309dd08]
 8: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x5568831dc3b3]
 9: (ECBackend::try_reads_to_commit()+0xa7b) [0x5568831e9c5b]
 10: (ECBackend::check_ops()+0x1c) [0x5568831ec52c]
 11: (ECBackend::start_rmw(ECBackend::Op*, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&)+0x1e0a) [0x5568831f685a]
 12: (ECBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x4a1) [0x5568831f7751]
 13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x810) [0x55688303f800]
 14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x1238) [0x556883084538]
 15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x389e) [0x55688308864e]
 16: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe17) [0x556883046e27]
 17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x556882ed5d07]
 18: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x556883158ffa]
 19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ae7) [0x556882ef9b97]
 20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x5568834a9612]
 21: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5568834acb80]
 22: (()+0x7049) [0x7f3b33848049]
 23: (clone()+0x3f) [0x7f3b32cd7f0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

After a few crashes they instead refuse to start entirely, like so:

2017-09-17 00:04:39.184048 7fe65bd00700 -1 src/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe65bd00700 time 2017-09-17 00:04:39.179157
src/ceph/src/os/bluestore/BlueStore.cc: 9290: FAILED assert(0 == "unexpected error")

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x560dd45a6ba5]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xf1c) [0x560dd4461aec]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x4f8) [0x560dd4463dd8]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x68) [0x560dd419fd08]
 5: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x963) [0x560dd42de3b3]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x321) [0x560dd42f5a51]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x560dd41dcc47]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63e) [0x560dd414864e]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x560dd3fd7d07]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x560dd425affa]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ae7) [0x560dd3ffbb97]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x560dd45ab612]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560dd45aeb80]
 14: (()+0x7049) [0x7fe675b91049]
 15: (clone()+0x3f) [0x7fe675020f0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this

Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #22957: [bluestore] bstore_kv_final thread seems deadlock (Duplicate) 02/08/2018

Actions #1

Updated by Bob Bobington over 6 years ago

Oh, forgot to add that I've tried the workarounds on the related issues. Adding this to my ceph.conf makes no difference:

bluestore throttle bytes = 0
bluestore deferred throttle bytes = 0
Actions #2

Updated by Sage Weil over 6 years ago

  • Status changed from New to Need More Info
  • Assignee set to Sage Weil
  • Priority changed from Normal to Urgent

Okay, thanks for confirming that the #21171 fix is applied. Can you reproduce with debug bluestore = 20 and then ceph-post-file the resulting log file? Thanks!
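For example (the osd id, log path, and description string below are only illustrative):

ceph tell osd.* injectargs '--debug-bluestore 20'
# reproduce the crash, then upload the log of the crashed OSD
ceph-post-file -d 'bug 21470 osd log' /var/log/ceph/ceph-osd.3.log

Alternatively, debug bluestore = 20 can be set under [osd] in ceph.conf followed by an OSD restart.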

Actions #3

Updated by Bob Bobington over 6 years ago

Done. Took less than an hour and happened on two OSDs. Uploaded one of them:

ceph-post-file: 6e0ed6ab-1528-428d-aae9-b62697a9a90b

Actions #4

Updated by Roy Hooper over 6 years ago

With a similar but slightly different setup, this same crash happened to me.

Installed via ceph-deploy install --release luminous

3 servers comprising 3 mon, 2 mgr, 2 mds, 4 osd
1 CephFS (kernel driver), written to with rsync

CephFS had a 40 PG k=2 m=1 erasure coded data pool and a 16 PG size=2 replicated metadata pool

Setup for the pool was as follows:

ceph osd erasure-code-profile set myprofile plugin=jerasure k=2 m=1 crush-failure-domain=osd
ceph osd pool create cephfs_ecdata 40 40 erasure myprofile
ceph osd pool create cephfs_metadata 16 16
ceph osd pool set cephfs_metadata size 2
ceph osd pool set cephfs_ecdata allow_ec_overwrites true

All other settings were defaults from

ceph-deploy osd create --zap-disk --bluestore --fs-type xfs osd01:/dev/sdb osd01:/dev/sdc osd02:/dev/sdb osd02:/dev/sdc

Actions #5

Updated by Sage Weil over 6 years ago

  • Status changed from Need More Info to 7

Ok, that's kind of embarrassing; I think the fix is pretty simple. Can you please test out this branch?

wip-21470-test
https://shaman.ceph.com/builds/ceph/wip-21470-test/

Actions #6

Updated by Bob Bobington over 6 years ago

I'm not on a Debian or Red Hat derivative; is there a Git repository I can get the source from, or a tarball you can link to? I couldn't find a branch with that name on GitHub or git.ceph.com.

Actions #8

Updated by Bob Bobington over 6 years ago

No luck. I applied 1918c57c7c6304875501f4f4b04b9c82834395a3 from the aforementioned repo to my copy of the official Luminous release and I'm still getting crashes.

ceph-post-file: 2cf20528-2be3-49c8-a645-4666e06c2538

I'll also note that prior to it happening, all OSDs had pegged a CPU core each (and the survivors continued to).

If it's necessary to clone the whole repo though, let me know.

Actions #9

Updated by Bob Bobington over 6 years ago

While I'm not intimately familiar with threaded programming, I'm okay with general C++. Could you possibly explain what exactly is happening?

My basic understanding right now is that:
  • There's a throttle for the number of deferred bytes that can be queued at any given time.
  • queue_transactions is basically the top entry point for BlueStore.
  • queue_transactions first checks whether there are enough free bytes right now; if there are, hurrah, it can continue and do nothing else.
  • If there are not enough bytes available from the throttle, queue_transactions increments deferred_aggressive (its purpose is currently unclear), calls deferred_try_submit (also unclear), notifies kv_cond (ditto), then blocks until it can get bytes from the throttle.

This last case is the problem, as the throttle never gives the thread any bytes and it just waits until Ceph gives up and kills the OSD, which suggests that something is reserving bytes then not putting them back for some reason.
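In other words, my mental model of the throttle is roughly the following (a minimal sketch of the pattern only, not BlueStore's actual Throttle code): get() parks the calling tp_osd_tp thread on a condition variable until someone returns bytes with put(), so if the completion path that should call put() never runs, the thread waits until the heartbeat suicide timeout kills the OSD.

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Minimal sketch of a byte throttle (illustrative only, not BlueStore's Throttle).
class SimpleThrottle {
  std::mutex m;
  std::condition_variable cv;
  int64_t max;       // budget, e.g. "bluestore throttle deferred bytes"
  int64_t used = 0;  // bytes currently reserved

public:
  explicit SimpleThrottle(int64_t max_bytes) : max(max_bytes) {}

  // Reserve `bytes`; blocks while the budget is exhausted.
  void get(int64_t bytes) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return used + bytes <= max; });  // where the op thread hangs
    used += bytes;
  }

  // Return `bytes` once the deferred IO completes. If this is never called,
  // every later get() blocks forever and the heartbeat eventually kills the OSD.
  void put(int64_t bytes) {
    std::lock_guard<std::mutex> l(m);
    used -= bytes;
    cv.notify_all();
  }
};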

It appears that the only place bytes are put back in the throttle is _deferred_aio_finish, so something must either be blocking it or it's not being called.

So my guess from your most recent fix is that you thought something holding a throttle reservation was waiting for kv_cond, and that by notifying it the reservation would be released and Throttle.get() wouldn't block?

Is there a particular reason you thought this? Is there a way to read which threads are blocking on what from the logs or something?

I'm going to try enabling TSAN in cmake and see if it outputs anything helpful.

Also I got one of the previous workarounds working. Setting "bluestore throttle deferred bytes = 0" and "bluestore throttle bytes = 0" seems to prevent the crashes. Previously I was using "bluestore deferred throttle bytes" from http://tracker.ceph.com/issues/21180#note-2.
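For reference, the working workaround corresponds to the following in ceph.conf (putting it in the [osd] section is my assumption of the usual placement; setting the limits to 0 effectively disables the corresponding throttles so get() never blocks):

[osd]
bluestore throttle bytes = 0
bluestore throttle deferred bytes = 0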

Actions #10

Updated by Bob Bobington over 6 years ago

TSAN unfortunately just caused the OSDs to core dump instantly. I'll see if I can find another way to look for threading issues.

Actions #11

Updated by Sage Weil over 6 years ago

From that log I've narrowed the problem down to this line

2017-09-29 19:52:00.874146 7fea08fa9700 20 bluestore(/var/lib/ceph/osd/ceph-3) _deferred_aio_finish queuing async deferred_try_submit

which tries to schedule async work that never happens. Not sure why the finisher isn't working... as far as I can tell, none of the other calls to queue the async call work either; we just got lucky because other threads did the queuing.
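To be clear about what "queuing async" means here: the finisher is essentially a worker thread draining a queue of completion callbacks. A minimal sketch of that pattern (illustrative only, not Ceph's actual Finisher class) is below; if the worker never runs the queued callback, the deferred_try_submit it wraps never executes, and the throttle bytes are never returned.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Illustrative finisher-style worker (not Ceph's Finisher implementation).
class MiniFinisher {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::function<void()>> q;
  bool stop = false;
  std::thread worker;

  void run() {
    std::unique_lock<std::mutex> l(m);
    while (!stop) {
      if (q.empty()) {
        cv.wait(l);
        continue;
      }
      auto fn = std::move(q.front());
      q.pop_front();
      l.unlock();
      fn();        // e.g. the queued deferred_try_submit; if this never runs,
      l.lock();    // the deferred work (and the throttle put) is lost
    }
  }

public:
  MiniFinisher() : worker([this] { run(); }) {}
  ~MiniFinisher() {
    { std::lock_guard<std::mutex> l(m); stop = true; }
    cv.notify_all();
    worker.join();
  }

  // Schedule a callback to run asynchronously on the worker thread.
  void queue(std::function<void()> fn) {
    { std::lock_guard<std::mutex> l(m); q.push_back(std::move(fn)); }
    cv.notify_one();
  }
};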

Actions #12

Updated by Sage Weil over 6 years ago

I've pushed another patch to the same branch. Can you give it a try?

Actions #13

Updated by Bob Bobington over 6 years ago

This time I couldn't apply your changes to the original Luminous source release so I pulled the entire Git branch and built from that.

I've been running for ~6h now and haven't seen a crash yet, so I think there's a decent chance it's fixed. I'll leave it running overnight and update this issue again in the morning.

Actions #14

Updated by Bob Bobington over 6 years ago

Yep, left it running an entire night and wrote 1.5TB without crashing. Seems to be fixed. Thanks!

Actions #15

Updated by Sage Weil over 6 years ago

  • Status changed from 7 to Fix Under Review
  • Backport set to luminous

https://github.com/ceph/ceph/pull/18118

Thanks, Bob! Please let me know if you see it fail. This should be included in 12.2.2.

Actions #16

Updated by Sage Weil over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #17

Updated by Sage Weil over 6 years ago

  • Status changed from Pending Backport to Resolved
Actions #18

Updated by Sage Weil about 6 years ago

  • Related to Bug #22957: [bluestore]bstore_kv_final thread seems deadlock added