Bug #16981

Closed

OSD crashes under moderate write load when using bluestore

Added by Yuri Gorshkov over 7 years ago. Updated about 7 years ago.

Status:
Won't Fix
Priority:
Low
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
jewel, osd, crash
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi.

While testing out bluestore, one of our OSDs crashed suddenly, emitting the logs below. After a restart the OSD was able to come up again. So far we haven't been able to reproduce the bug reliably, but I decided to file it anyway.

I will update the bug report if we happen to reproduce this again.

OSD output:

Aug 10 16:16:51 stor01 ceph-osd[863226]: terminate called after throwing an instance of 'ceph::buffer::bad_alloc'
Aug 10 16:16:51 stor01 ceph-osd[863226]: what():  buffer::bad_alloc
Aug 10 16:16:51 stor01 ceph-osd[863226]: *** Caught signal (Aborted) **
Aug 10 16:16:51 stor01 ceph-osd[863226]: in thread 7f62feb80700 thread_name:ms_pipe_read
Aug 10 16:16:51 stor01 ceph-osd[863226]: terminate called recursively
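
For reference: the "terminate called" lines come from the C++ runtime. When an exception escapes a thread without a handler, std::terminate() runs and calls abort(), which is the SIGABRT behind "Caught signal (Aborted)". A minimal standalone sketch of that pattern, purely illustrative and not Ceph code:

#include <new>
#include <thread>

// An exception that escapes a thread invokes std::terminate(), which
// calls abort() -> SIGABRT, matching the "terminate called ... Caught
// signal (Aborted)" pattern in the OSD log above.
int main() {
    std::thread t([] {
        throw std::bad_alloc();  // uncaught -> std::terminate -> abort
    });
    t.join();  // the process aborts before this join completes
}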

Our OSD config currently looks like this:

[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = noatime,largeio,inode64,swalloc
osd journal size = 2600
osd op threads    = 4

osd objectstore = bluestore
bluestore block path = /dev/disk/by-partlabel/osd-device-$id-block
bluestore bluefs = false
bluestore fsck on mount = true
bluestore block db path = /var/lib/ceph/osd/$cluster-$id/block.db
bluestore block db create = true
bluestore block wal path = /var/lib/ceph/osd/$cluster-$id/block.wal
bluestore block wal create = true
bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3

osd recovery delay start = 10
osd recovery threads = 2
#osd recovery max active = 2

Actions #1

Updated by Yuri Gorshkov over 7 years ago

We started seeing other OSDs crash with the same symptoms. It seems to be related to OSD memory usage.

Our environment: CentOS 7, jewel binaries taken from official Ceph repos.
Logs for the failed OSD are below:

Aug 10 16:53:39 mstor01 ceph-osd[2352550]: terminate called after throwing an instance of 'ceph::buffer::bad_alloc'
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: what():  buffer::bad_alloc
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: *** Caught signal (Aborted) **
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: in thread 7f3f3e6f9700 thread_name:tp_osd_tp
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 1: (()+0x91341a) [0x7f3f5bc7d41a]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 2: (()+0xf100) [0x7f3f59cb3100]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 3: (gsignal()+0x37) [0x7f3f582755f7]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 4: (abort()+0x148) [0x7f3f58276ce8]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f3f58b7a9d5]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 6: (()+0x5e946) [0x7f3f58b78946]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 7: (()+0x5e973) [0x7f3f58b78973]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 8: (()+0x5eb93) [0x7f3f58b78b93]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 9: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x26d) [0x7f3f5bd8740d]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 10: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1f3) [0x7f3f5bd88043]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 11: (KernelDevice::aio_write(unsigned long, ceph::buffer::list&, IOContext*, bool)+0x29f) [0x7f3f5baf2c9f]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 12: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::bu
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 13: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buff
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 14: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1127) [0x7f3f5ba05317]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 15: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, Threa
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 16: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x8c) [0x7f3f5b89c4fc]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 17: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xc2a) [0x7f3f5b8e3d5a]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 18: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x3e3) [0x7f3f5b8e4703]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 19: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x100) [0x7f3f5b83d810]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 20: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x7f3f5b6f2a8d]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 21: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x7f3f5b6f2cdd]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 22: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x7f3f5b6f7809]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 23: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x7f3f5bd6d557]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 24: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f3f5bd6f4c0]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 25: (()+0x7dc5) [0x7f3f59cabdc5]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 26: (clone()+0x6d) [0x7f3f58336ced]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 2016-08-10 16:53:39.112928 7f3f3e6f9700 -1 *** Caught signal (Aborted) **
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: in thread 7f3f3e6f9700 thread_name:tp_osd_tp
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 1: (()+0x91341a) [0x7f3f5bc7d41a]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 2: (()+0xf100) [0x7f3f59cb3100]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 3: (gsignal()+0x37) [0x7f3f582755f7]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 4: (abort()+0x148) [0x7f3f58276ce8]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f3f58b7a9d5]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 6: (()+0x5e946) [0x7f3f58b78946]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 7: (()+0x5e973) [0x7f3f58b78973]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 8: (()+0x5eb93) [0x7f3f58b78b93]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 9: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x26d) [0x7f3f5bd8740d]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 10: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1f3) [0x7f3f5bd88043]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 11: (KernelDevice::aio_write(unsigned long, ceph::buffer::list&, IOContext*, bool)+0x29f) [0x7f3f5baf2c9f]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 12: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::bu
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 13: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buff
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 14: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1127) [0x7f3f5ba05317]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 15: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, Threa
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 16: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x8c) [0x7f3f5b89c4fc]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 17: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xc2a) [0x7f3f5b8e3d5a]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 18: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x3e3) [0x7f3f5b8e4703]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 19: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x100) [0x7f3f5b83d810]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 20: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x7f3f5b6f2a8d]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 21: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x7f3f5b6f2cdd]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 22: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x7f3f5b6f7809]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 23: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x7f3f5bd6d557]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 24: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f3f5bd6f4c0]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 25: (()+0x7dc5) [0x7f3f59cabdc5]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: 26: (clone()+0x6d) [0x7f3f58336ced]
Aug 10 16:53:39 mstor01 ceph-osd[2352550]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
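
Frames 9-11 show where the allocation fails: KernelDevice::aio_write() rebuilds the buffer list to the required alignment, and ceph::buffer::create_aligned() throws when it cannot obtain aligned memory, which fits the memory-pressure theory above. A rough sketch of that allocation pattern (an assumption about the internals, not the actual Ceph source):

#include <cstdlib>
#include <new>

// Assumed shape of the failing step in frame 9 (illustrative only):
// an aligned allocation that throws when memory cannot be obtained.
static char* create_aligned(std::size_t len, std::size_t align) {
    void* p = nullptr;
    // posix_memalign() reports failure via its return value (e.g. ENOMEM)
    if (::posix_memalign(&p, align, len) != 0)
        throw std::bad_alloc();  // analogous to ceph::buffer::bad_alloc
    return static_cast<char*>(p);
}

int main() {
    char* buf = create_aligned(4096, 4096);  // e.g. a page-aligned 4 KiB block
    ::free(buf);
    return 0;
}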

Actions #2

Updated by Sage Weil about 7 years ago

  • Status changed from New to Won't Fix

If you see anything similar on kraken or master, please reopen!
