Project

General

Profile

Bug #16662

bluestore asserts if I/O size > 4MB when using SSDs

Added by Kevan Rehm about 1 year ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
07/12/2016
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

bluestore immediately asserts for any I/O size larger than 4 MB when using SSDs, it is a solid failure. I didn't find any traceback matches by googling. Here is a debug_bluestore=20 trace plus the traceback:

18> 2016-07-12 15:37:03.372859 7f476f9a0700 10 osd.3 pg_epoch: 314 pg[0.2f1( v 314'3678 (279'600,314'3678] local-les=262 n=1062 ec=122 les/c/f 262/262/0 261/261/261) [3] r=0 lpr=261 luod=279'3677 lua=279'3677 crt=279'3675 lcod 279'3676 mlcod 279'3676 active+clean] add_log_entry 314'3678 (0'0) modify   0:8f62f59d:::benchmark_data_alpha1-p200_32531_object7:head by client.8893.0:8 2016-07-12 15:37:03.356212
-17> 2016-07-12 15:37:03.372876 7f476f9a0700 10 osd.3 pg_epoch: 314 pg[0.2f1( v 314'3678 (279'600,314'3678] local-les=262 n=1062 ec=122 les/c/f 262/262/0 261/261/261) [3] r=0 lpr=261 luod=279'3677 lua=279'3677 crt=279'3676 lcod 279'3676 mlcod 279'3676 active+clean] append_log: trimming to 279'3676 entries 279'3676 (279'2605) delete 0:8f7f3b75:::benchmark_data_alpha1-d103_4600_object1121735:head by client.8615.0:2246250 2016-07-12 05:04:12.100640
-16> 2016-07-12 15:37:03.372903 7f476f9a0700 5 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, divergent_priors: 0, writeout_from: 314'3678, trimmed:
-15> 2016-07-12 15:37:03.372922 7f476f9a0700 10 bluestore(/var/lib/ceph/osd/ceph-3) queue_transactions existing 0x7f47a1474df0 osr(0.2f1 0x7f479cb8e960)
-14> 2016-07-12 15:37:03.372929 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3) _txc_create osr 0x7f47a1474df0 = 0x7f47b0eaec80 seq 3308
-13> 2016-07-12 15:37:03.372945 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3).collection(0.2f1_head) get_onode oid #0:8f62f59d:::benchmark_data_alpha1-p200_32531_object7:head# key '-
'0x80000000000000008f62f59d'.!=benchmark_data_alpha1-p200_32531_object7!'0xfffffffffffffffeffffffffffffffff
-12> 2016-07-12 15:37:03.372982 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3).collection(0.2f1_head) r -2 v.len 0
-11> 2016-07-12 15:37:03.372990 7f476f9a0700 15 bluestore(/var/lib/ceph/osd/ceph-3) _write 0.2f1_head #0:8f62f59d:::benchmark_data_alpha1-p200_32531_object7:head# 0x0~800000
-10> 2016-07-12 15:37:03.372995 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3) _assign_nid 63444
-9> 2016-07-12 15:37:03.372998 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3) _do_write #0:8f62f59d:::benchmark_data_alpha1-p200_32531_object7:head# 0x0~800000 - have 0x0 (0) bytes in 0 lextents
-8> 2016-07-12 15:37:03.373002 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3) _do_write prefer csum_order 12 comp_blob_size 0x40000
-7> 2016-07-12 15:37:03.373004 7f476f9a0700 10 bluestore(/var/lib/ceph/osd/ceph-3) _do_write_big 0x0~800000 max_blob_len 0x800000 compress 0
-6> 2016-07-12 15:37:03.373007 7f476f9a0700 20 bluestore.BufferSpace(0x7f47a21757e8 in 0x7f479cbbe540) _discard 0x0~800000
-5> 2016-07-12 15:37:03.373011 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3) _do_write_big lex 0x0: 0x0~800000->1
-4> 2016-07-12 15:37:03.373015 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3) _do_write_big blob 1:blob([] clen 0x0 ref_map(0x0~800000=1))
-3> 2016-07-12 15:37:03.373018 7f476f9a0700 20 bluestore(/var/lib/ceph/osd/ceph-3) _do_alloc_write txc 0x7f47b0eaec80 1 blobs
-2> 2016-07-12 15:37:03.373020 7f476f9a0700 10 bitmapalloc:reserve instance 139945553645952 num_used 10848020 total 195328000
-1> 2016-07-12 15:37:03.373023 7f476f9a0700 10 bitmapalloc:allocate instance 139945553645952 want_size 0x800000 alloc_unit 0x1000 hint 0x0
0> 2016-07-12 15:37:03.377685 7f47731a7700 -1 os/bluestore/BlueStore.cc: In function 'int BlueStore::_do_alloc_write(BlueStore::TransContext*, BlueStore::WriteContext*)' thread 7f47731a7700 time 2016-07-12 15:37:03.368821
os/bluestore/BlueStore.cc: 5988: FAILED assert(r == 0)
ceph version 11.0.0-196-g85bb43e (85bb43e111692989d2296a389ce45377d2297d6f)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f479212de35]
2: (BlueStore::_do_alloc_write(BlueStore::TransContext*, BlueStore::WriteContext*)+0x10af) [0x7f4791d48c4f]
3: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)+0x3e9) [0x7f4791d69479]
4: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)+0x10f) [0x7f4791d6a0bf]
5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1127) [0x7f4791d6e297]
6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x41a) [0x7f4791d7017a]
7: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x8c) [0x7f4791bffc4c]
8: (ReplicatedBackend::submit_transaction(hobject_t const&, eversion_t const&, std::unique_ptr<PGBackend::PGTransaction, std::default_delete<PGBackend::PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, std::shared_ptr<OpRequest>)+0x94d) [0x7f4791c3b67d]
9: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, ReplicatedPG::OpContext*)+0x6af) [0x7f4791b8ba5f]
10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xeea) [0x7f4791bdfeda]
11: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x2843) [0x7f4791be3853]
12: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x777) [0x7f4791b9e957]
13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x7f4791a4d0ad]
14: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x6d) [0x7f4791a4d2fd]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x864) [0x7f4791a6e884]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x947) [0x7f4792112147]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f47921142a0]
18: (()+0x7dc5) [0x7f478ffd7dc5]
19: (clone()+0x6d) [0x7f478e1efced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I think I have it figured out. The bug is in
src/os/bluestore/BlueStore.cc in BlueStore::_do_alloc_write(). At around
line 5982 there is a loop that allocates blocks to satisfy the write
request:

while (final_length > 0) {
bluestore_pextent_t e;
uint32_t l;
uint64_t want = max_alloc_size ? MIN(final_length, max_alloc_size)
: final_length;
int r = alloc->allocate(want, min_alloc_size, hint,
e.offset, &l);
assert(r 0);
...

We don't have 'bluestore_allocator' set to anything in ceph.conf, so it
defaults to 'bitmap'. We don't have 'bluestore_max_alloc_size' set to
anything either, and the default behavior is to set 'max_alloc_size' to 0.
And we don't have 'bluestore_min_alloc_size' set to anything, so the
default is to set it to 0. When the OSD starts up in routine
_set_alloc_sizes(), it sees that bluestore_min_alloc is 0, and so it sets
min_alloc_size to the value of 'bluestore_min_alloc_size_ssd' since we are
using SSDs. That variable is also not set in our ceph.conf, and so it
defaults to 4096.

In BlueFS::_open_alloc() the BitMapAllocator() gets instantiated with
block_size set to min_alloc_size, which is 4096. That value gets stored
in the member variable m_block_size.

As a result, allocate() is called with want_size 8388608 and alloc_unit == 4096, and the member variable m_block_size is also 4096.

allocate() computes nblks to be 2048, then it calls routine
m_bit_alloc->alloc_blocks_res(). That routine calls
check_input(num_blocks), which returns 0 because 2048 > BITMAP_SPAN_SIZE,
which is 1024. alloc_blocks_res() then returns 0, causing allocate() to
return -ENOSPC, which causes the assert above to fire.

If you set bluestore_max_alloc_size to 4MB in ceph.conf, then the problem goes away. The code in the _do_alloc_write() loop above should ensure that 'want' cannot get set to a value larger than what the bitmap allocator can support.

History

#1 Updated by Sage Weil 4 months ago

  • Status changed from New to Resolved

https://github.com/ceph/ceph/pull/14844

This appears to ahve been fixed sinc ethen. added a test case above to verify!

Also available in: Atom PDF