Bug #16981
OSD crashes under moderate write load when using bluestore
Status:
Won't Fix
Priority:
Low
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
jewel, osd, crash
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Description
Hi,
While testing out bluestore, one of our OSDs suddenly crashed with the log output below.
After a restart the OSD was able to come back up. So far we haven't been able to reproduce the crash reliably, but I decided to file it as a bug anyway.
I will update the report if we manage to reproduce it again.
OSD output:
Aug 10 16:16:51 stor01 ceph-osd[863226]: terminate called after throwing an instance of 'ceph::buffer::bad_alloc'
Aug 10 16:16:51 stor01 ceph-osd[863226]: what(): buffer::bad_alloc
Aug 10 16:16:51 stor01 ceph-osd[863226]: *** Caught signal (Aborted) **
Aug 10 16:16:51 stor01 ceph-osd[863226]: in thread 7f62feb80700 thread_name:ms_pipe_read
Aug 10 16:16:51 stor01 ceph-osd[863226]: terminate called recursively
Our OSD config is currently:
[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = noatime,largeio,inode64,swalloc
osd journal size = 2600
osd op threads = 4
osd objectstore = bluestore
bluestore block path = /dev/disk/by-partlabel/osd-device-$id-block
bluestore bluefs = false
bluestore fsck on mount = true
bluestore block db path = /var/lib/ceph/osd/$cluster-$id/block.db
bluestore block db create = true
bluestore block wal path = /var/lib/ceph/osd/$cluster-$id/block.wal
bluestore block wal create = true
bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3
osd recovery delay start = 10
osd recovery threads = 2
#osd recovery max active = 2
Updated by Yuri Gorshkov over 7 years ago
We have started seeing other OSDs crash with the same symptoms. It seems to be related to OSD memory usage.
Our environment: CentOS 7, jewel binaries from the official Ceph repos.
Logs for the failed OSD are below:
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: terminate called after throwing an instance of 'ceph::buffer::bad_alloc'
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: what(): buffer::bad_alloc
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: *** Caught signal (Aborted) **
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: in thread 7f3f3e6f9700 thread_name:tp_osd_tp
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 1: (()+0x91341a) [0x7f3f5bc7d41a]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 2: (()+0xf100) [0x7f3f59cb3100]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 3: (gsignal()+0x37) [0x7f3f582755f7]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 4: (abort()+0x148) [0x7f3f58276ce8]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f3f58b7a9d5]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 6: (()+0x5e946) [0x7f3f58b78946]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 7: (()+0x5e973) [0x7f3f58b78973]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 8: (()+0x5eb93) [0x7f3f58b78b93]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 9: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x26d) [0x7f3f5bd8740d]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 10: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1f3) [0x7f3f5bd88043]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 11: (KernelDevice::aio_write(unsigned long, ceph::buffer::list&, IOContext*, bool)+0x29f) [0x7f3f5baf2c9f]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 12: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::bu
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 13: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buff
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 14: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1127) [0x7f3f5ba05317]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 15: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, Threa
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 16: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x8c) [0x7f3f5b89c4fc]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 17: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xc2a) [0x7f3f5b8e3d5a]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 18: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x3e3) [0x7f3f5b8e4703]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 19: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x100) [0x7f3f5b83d810]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 20: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x7f3f5b6f2a8d]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 21: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x7f3f5b6f2cdd]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 22: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x7f3f5b6f7809]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 23: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x7f3f5bd6d557]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 24: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f3f5bd6f4c0]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 25: (()+0x7dc5) [0x7f3f59cabdc5]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 26: (clone()+0x6d) [0x7f3f58336ced]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 2016-08-10 16:53:39.112928 7f3f3e6f9700 -1 *** Caught signal (Aborted) **
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: in thread 7f3f3e6f9700 thread_name:tp_osd_tp
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Sage Weil about 7 years ago
- Status changed from New to Won't Fix
If you see anything similar on kraken or master, please reopen!