Bug #18648

Bluestore: OSD crash during soak test

Added by Muthusamy Muthiah over 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Installed version: v11.2.0 with BlueStore
Platform: 3 × SGI nodes with 40 disks (6 TB) each and EC 2+1
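
For context, an EC 2+1 data pool on such a cluster is normally backed by an erasure-code profile with k=2, m=1. A minimal sketch of the setup (profile name, pool name and PG counts below are placeholders, not taken from this cluster):

# 2+1 erasure-code profile: 2 data chunks + 1 coding chunk per object
ceph osd erasure-code-profile set ec-2-1 k=2 m=1
# erasure-coded pool using that profile (PG counts are placeholders)
ceph osd pool create ecpool 1024 1024 erasure ec-2-1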

Issue:

OSDs were installed with 4 partitions, with separate db and wal partitions (all default values).
A mixed load (different object sizes, generated by our client) was applied to the cluster. When storage utilization crosses 5%, OSDs crash (at most 5 OSDs at a time); this was reproduced twice.
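
A minimal sketch of how such an OSD layout is prepared, assuming the kraken-era ceph-disk --block.db / --block.wal options and placeholder device names (not taken from the attached log):

# BlueStore OSD: data on sdb, RocksDB db and wal on separate devices
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sdc --block.wal /dev/sdd
ceph-disk activate /dev/sdb1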

Two types of crashes were seen; the attached log contains information on both.

1. 2017-01-22 10:27:11.481303 7f972f978700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/msg/async/AsyncConnection.cc: In function 'void AsyncConnection::tick(uint64_t)' thread 7f972f978700 time 2017-01-22 10:27:11.478891
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/msg/async/AsyncConnection.cc: 2483: FAILED assert(last_tick_id == id)

ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f9736256b35]
2: (AsyncConnection::tick(unsigned long)+0x516) [0x7f9736439e26]
3: (EventCenter::process_time_events()+0x1ed) [0x7f97362e0c0d]
4: (EventCenter::process_events(int)+0x521) [0x7f97362e24d1]
5: (()+0xb53cca) [0x7f97362e4cca]
6: (()+0xb5220) [0x7f9732968220]
7: (()+0x7dc5) [0x7f97331e9dc5]
8: (clone()+0x6d) [0x7f97320d021d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
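
As the NOTE says, pinning the assert down to source lines needs the matching binary; a disassembly of the installed ceph-osd can be produced roughly like this (install path assumes the el7 RPM):

# annotated disassembly of the exact ceph-osd build that crashed
objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump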

2. /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/os/bluestore/BlueFS.cc: 1779: FAILED assert(0 == "allocate failed... wtf")

ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fae561d4b35]
2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, mempool::pool_allocator<(mempool::pool_index_t)8, bluefs_extent_t> >*)+0x68e) [0x7fae55fe9b9e]
3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xda9) [0x7fae55ff0a49]
4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x168) [0x7fae55ff1d18]
5: (BlueRocksWritableFile::Sync()+0x62) [0x7fae5600afc2]
6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x149) [0x7fae561126a9]
7: (rocksdb::WritableFileWriter::Sync(bool)+0xb0) [0x7fae56113320]
8: (rocksdb::BuildTable(std::string const&, rocksdb::Env*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::EnvOptions const&, rocksdb::TableCache*, rocksdb::InternalIterator*, std::unique_ptr<rocksdb::InternalIterator, std::default_delete<rocksdb::InternalIterator> >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::string const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::CompressionType, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int)+0xec2) [0x7fae5614ac32]
9: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0x8e2) [0x7fae560543e2]
10: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool)+0x193b) [0x7fae5605658b]
11: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x866) [0x7fae56057386]
12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xd4a) [0x7fae5605842a]
13: (rocksdb::DB::Open(rocksdb::Options const&, std::string const&, rocksdb::DB**)+0x186) [0x7fae560596b6]
14: (RocksDBStore::do_open(std::ostream&, bool)+0x412) [0x7fae55f8d6e2]
15: (BlueStore::_open_db(bool)+0x933) [0x7fae55f408c3]
16: (BlueStore::mount()+0x416) [0x7fae55f4bd96]
17: (OSD::init()+0x283) [0x7fae55b73703]
18: (main()+0x2cda) [0x7fae55aa6f4a]
19: (__libc_start_main()+0xf5) [0x7fae51f79b15]
20: (()+0x413da9) [0x7fae55b22da9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
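
This second assert means BlueFS could not allocate space on the db device for the RocksDB files. While an OSD is still up, its db/wal usage can be read from the bluefs perf counters over the admin socket; a quick check, assuming the default socket path (OSD id 15 matches the attached log):

# BlueFS space usage for the db and wal partitions
ceph daemon osd.15 perf dump | python -m json.tool | grep -E '(db|wal)_(total|used)_bytes'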

We are also validating the cluster with the default 2 partitions, with db and wal on the same block device as the data; this configuration seems more stable than having separate db/wal partitions.
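
For comparison, the default 2-partition layout keeps db and wal on the data device and needs no extra flags; roughly (device name is a placeholder):

# BlueStore OSD with db and wal colocated on the data device (defaults)
ceph-disk prepare --bluestore /dev/sdb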


Files

ceph-osd.15.zip (374 KB) - OSD log with crash logs - Muthusamy Muthiah, 01/24/2017 07:05 AM
#1

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Resolved

This was fixed shortly after the Kraken release.
