Bug #21932

OSD crash on boot with assert caused by Bluefs on flush write

Added by Pawel Stefanski about 1 year ago. Updated 12 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 10/26/2017
Due date:
% Done: 0%
Source: Community (user)
Tags: bluestore, crash, osd
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

After a network outage, some nodes in the cluster were hard-restarted and came up with one or more crashing OSD instances. All affected OSDs show the same issue.

The version is Luminous (12.2.1), installed from Ubuntu packages.

BlueStore is configured with a separate WAL/DB device (NVMe).

The OSD can't be started: it crashes immediately after startup.

root@ceph10:/var/log/ceph# /usr/bin/ceph-osd -f --cluster ceph --id 106 --setuser ceph --setgroup ceph
starting osd.106 at - osd_data /var/lib/ceph/osd/ceph-106 /var/lib/ceph/osd/ceph-106/journal
tcmalloc: large alloc 1497374720 bytes == 0x55f2b67ee000 @  0x7f237e9ce1e1 0x7f237d751499 0x7f237d752833 0x55f24de2f359 0x55f24de20a48 0x55f24de22394 0x55f24de2375c 0x55f24de24ff1 0x55f24da064df 0x55f24d992623 0x55f24d9c2de7 0x55f24d9c240e 0x55f24d53acff 0x55f24d44d1f8 0x7f237cd69830 0x55f24d4d8a59 (nil)
tcmalloc: large alloc 1313308672 bytes == 0x55f25d6a6000 @  0x7f237e9ce1e1 0x7f237d751499 0x7f237d75200b 0x55f24de66e59 0x55f24de20c4a 0x55f24de22394 0x55f24de2375c 0x55f24de24ff1 0x55f24da064df 0x55f24d992623 0x55f24d9c2de7 0x55f24d9c240e 0x55f24d53acff 0x55f24d44d1f8 0x7f237cd69830 0x55f24d4d8a59 (nil)
/build/ceph-12.2.1/src/include/buffer.h: In function 'void ceph::buffer::list::prepare_iov(VectorT*) const [with VectorT = boost::container::small_vector<iovec, 4ul>]' thread 7f237f8fce00 time 2017-10-26 00:53:40.482785
/build/ceph-12.2.1/src/include/buffer.h: 882: FAILED assert(_buffers.size() <= 1024)
 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55f24db033f2]
 2: (KernelDevice::aio_write(unsigned long, ceph::buffer::list&, IOContext*, bool)+0x1598) [0x55f24daa8a78]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x9c0) [0x55f24da832c0]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x124) [0x55f24da84cf4]
 5: (BlueRocksWritableFile::Flush()+0x3d) [0x55f24da9ae6d]
 6: (rocksdb::WritableFileWriter::Flush()+0x24c) [0x55f24ded33ac]
 7: (rocksdb::WritableFileWriter::Sync(bool)+0x3e) [0x55f24ded45ce]
 8: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::EnvOptions const&, rocksdb::TableCache*, rocksdb::InternalIterator*, std::unique_ptr<rocksdb::InternalIterator, std::default_delete<rocksdb::InternalIterator> >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::CompressionType, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int)+0x190f) [0x55f24def304f]
 9: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xad9) [0x55f24de1edd9]
 10: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool)+0x17ec) [0x55f24de20e2c]
 11: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x8c4) [0x55f24de22394]
 12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xedc) [0x55f24de2375c]
 13: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x6b1) [0x55f24de24ff1]
 14: (RocksDBStore::do_open(std::ostream&, bool)+0x8ff) [0x55f24da064df]
 15: (BlueStore::_open_db(bool)+0xf73) [0x55f24d992623]
 16: (BlueStore::fsck(bool)+0x3e7) [0x55f24d9c2de7]
 17: (BlueStore::_mount(bool)+0x1ee) [0x55f24d9c240e]
 18: (OSD::init()+0x3df) [0x55f24d53acff]
 19: (main()+0x2eb8) [0x55f24d44d1f8]
 20: (__libc_start_main()+0xf0) [0x7f237cd69830]
 21: (_start()+0x29) [0x55f24d4d8a59]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-10-26 00:53:40.490798 7f237f8fce00 -1 /build/ceph-12.2.1/src/include/buffer.h: In function 'void ceph::buffer::list::prepare_iov(VectorT*) const [with VectorT = boost::container::small_vector<iovec, 4ul>]' thread 7f237f8fce00 time 2017-10-26 00:53:40.482785
/build/ceph-12.2.1/src/include/buffer.h: 882: FAILED assert(_buffers.size() <= 1024)
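
The assert in the trace rejects a buffer list holding more than 1024 segments at the point where `prepare_iov` builds the `iovec` array for `KernelDevice::aio_write`. As an illustration only, and not the change made for this ticket, the Python sketch below shows two generic ways to keep a scatter-gather write under such a limit: splitting the segments into batches, or joining them into one contiguous buffer. The names `MAX_SEGMENTS`, `split_segments`, and `coalesce` are hypothetical and do not come from Ceph.

```python
from typing import Iterator, List, Sequence

# Limit taken from the assert in the log: _buffers.size() <= 1024.
MAX_SEGMENTS = 1024


def split_segments(
    segments: Sequence[bytes], limit: int = MAX_SEGMENTS
) -> Iterator[List[bytes]]:
    """Yield consecutive batches holding at most `limit` segments each."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, len(segments), limit):
        yield list(segments[start:start + limit])


def coalesce(segments: Sequence[bytes]) -> bytes:
    """Alternative approach: copy all segments into one contiguous buffer."""
    return b"".join(segments)


if __name__ == "__main__":
    # Hypothetical write made of 2500 small segments (not a figure from this report).
    segments = [bytes([i % 256]) * 4 for i in range(2500)]
    batches = list(split_segments(segments))
    print([len(batch) for batch in batches])  # [1024, 1024, 452]
```

Batching avoids copying the data but turns one write into several. Coalescing keeps a single write at the cost of a copy, which can be large for allocations of the size seen in the `tcmalloc` lines above.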


Related issues

Related to bluestore - Bug #22066: bluestore osd asserts repeatedly with ceph-12.2.1/src/include/buffer.h: 882: FAILED assert(_buffers.size() <= 1024) (Duplicate, 11/07/2017)
Related to bluestore - Bug #22115: OSD SIGABRT on bluestore_prefer_deferred_size = 104857600: assert(_buffers.size() <= 1024) (Duplicate, 11/13/2017)
Copied to Ceph - Backport #22193: luminous: OSD crash on boot with assert caused by Bluefs on flush write (Resolved)

History

#1 Updated by jianpeng ma about 1 year ago

Could you add debug bluestore = 20 to /etc/ceph/ceph.conf and paste the resulting log output? Thanks!
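
For reference, the requested setting might look like the fragment below. Placing it under an `[osd]` section is an assumption, since the comment does not name a section. The OSD has to be started again for the extra output to appear in its log.

```ini
# Assumed placement: the [osd] section of the cluster's ceph.conf.
[osd]
# Level 20 is the most verbose debug level; expect a much larger OSD log.
debug bluestore = 20
```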

#2 Updated by Kefu Chai about 1 year ago

  • Backport set to luminous

#3 Updated by Kefu Chai about 1 year ago

  • Status changed from New to Pending Backport

#4 Updated by Pawel Stefanski about 1 year ago

Kefu Chai wrote:

https://github.com/ceph/ceph/pull/18828

Thank you so much!

Unfortunately, we can't test this patch: all nodes with the failed OSDs have been redeployed. We will run another round of hard-reset tests in the future.

#5 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #22193: luminous: OSD crash on boot with assert caused by Bluefs on flush write added

#6 Updated by Sage Weil 12 months ago

  • Related to Bug #22066: bluestore osd asserts repeatedly with ceph-12.2.1/src/include/buffer.h: 882: FAILED assert(_buffers.size() <= 1024) added

#7 Updated by Sage Weil 12 months ago

  • Related to Bug #22115: OSD SIGABRT on bluestore_prefer_deferred_size = 104857600: assert(_buffers.size() <= 1024) added

#8 Updated by Nathan Cutler 12 months ago

  • Status changed from Pending Backport to Resolved
