Bug #51684 (closed)

OSD crashes after update to 16.2.4

Added by Jérôme Poulin almost 3 years ago. Updated over 2 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Regression: No
Severity: 1 - critical

Description

After we updated from 14.2.0 to 16.2.4, all OSDs crashed with the backtrace below after about an hour, and we had to rebuild them. Now, one week later, the rolling crashes are happening again.

Luckily, there's a workaround: all I need to do is start the OSD using:
ceph-osd --setuser ceph --setgroup ceph -i 7 -d --bluefs_allocator=bitmap --bluestore_allocator=bitmap
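
For anyone hitting the same crash, a persistent variant of this workaround (a sketch only; it assumes the cluster has the centralized config database available, and the same options can equally go into the [osd] section of ceph.conf) would be:

ceph config set osd bluestore_allocator bitmap
ceph config set osd bluefs_allocator bitmap
# then restart the affected OSDs, e.g. systemctl restart ceph-osd@7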

ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7f8186143890]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19c) [0x56552b5efb6e]
5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x56552b5efcf8]
6: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x1bac) [0x56552bc6e19c]
7: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9e) [0x56552bc6e89e]
8: (BlueRocksWritableFile::Sync()+0x6c) [0x56552bc9926c]
9: (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x4e) [0x56552c150cd8]
10: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x212) [0x56552c33aa6c]
11: (rocksdb::WritableFileWriter::Sync(bool)+0x177) [0x56552c33a47d]
12: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xe9d) [0x56552c496893]
13: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0x613) [0x56552c1f18d5]
14: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1b1e) [0x56552c1f0476]
15: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0x15f7) [0x56552c1ed879]
16: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >, rocksdb::DB*, bool, bool)+0x709) [0x56552c1f2ee9]
17: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >, rocksdb::DB*)+0x61) [0x56552c1f21c1]
18: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd81) [0x56552c0f8fd1]
19: (BlueStore::_open_db(bool, bool, bool)+0x406) [0x56552bb6cd96]
20: (BlueStore::_open_db_and_around(bool, bool)+0x56b) [0x56552bbd636b]
21: (BlueStore::_mount()+0x161) [0x56552bbd8cc1]
22: (OSD::init()+0x4d1) [0x56552b6a3191]
23: main()
24: __libc_start_main()
25: _start()

Files

ceph-osd.0.log.7z (161 KB), uploaded by Jérôme Poulin, 07/15/2021 04:17 PM

Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #50656: bluefs _allocate unable to allocate, though enough free (Resolved, assigned to Igor Fedotov)

Actions #1

Updated by Jérôme Poulin almost 3 years ago

For reference, here is the point at which we upgraded: https://tracker.ceph.com/issues/47883#note-5

Actions #2

Updated by Igor Fedotov almost 3 years ago

Would you please share the OSD log with the crash?

Am I correct that you managed to work around the issue by switching to the bitmap allocator?

Actions #3

Updated by Jérôme Poulin almost 3 years ago

Yes, that is correct; switching to the bitmap allocator allows the OSD to restart and recover.

Here's the full log for the day, including before, during, and after the crash. Search for

0> 2021-07-15T08:46:58.535-0400 7f8187fddf00 -1 *** Caught signal (Aborted) **

Actions #4

Updated by Igor Fedotov almost 3 years ago

Jérôme Poulin wrote:

Yes, that is correct; switching to the bitmap allocator allows the OSD to restart and recover.

Here's the full log for the day, including before, during, and after the crash. Search for

0> 2021-07-15T08:46:58.535-0400 7f8187fddf00 -1 *** Caught signal (Aborted) **

OK, thanks!
Could you also please make a free-block dump for this OSD (osd.0) via:
ceph-bluestore-tool --path <path-to-osd> --allocator block free-dump
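
For example, with the OSD stopped and assuming the default data path (and that the dump is written to stdout), this would look roughly like:

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --allocator block free-dump > osd.0-free-dump.json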

Actions #5

Updated by Jérôme Poulin almost 3 years ago

I removed the first "block:" line and ran it through jq -c to compact it before compressing, but it's still too big for the 1 MB limit, so here's a link to my Google Drive.

https://drive.google.com/file/d/1yrCxJMdG2Zqq-e88X5Uh66NT6a5gqR5h/view?usp=sharing
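
For reference, the compaction described above would look roughly like this (filenames are placeholders, and it assumes the extra "block:" header is the first line of the dump):

tail -n +2 osd.0-free-dump.json | jq -c . > osd.0-free-dump.min.json
7z a osd.0-free-dump.min.7z osd.0-free-dump.min.json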

Actions #6

Updated by Igor Fedotov almost 3 years ago

Jérôme,

I think you've missed one more fix, for unexpected ENOSPC in the hybrid allocator (https://tracker.ceph.com/issues/50656), which has been available since 16.2.5.

I'll double-check that against your free block dump a bit later, but I'm pretty sure that's the culprit.

Actions #7

Updated by Jérôme Poulin almost 3 years ago

Then you can go ahead and close this issue; we'll proceed with the upgrade in the coming week. Thanks.
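
After the upgrade, a quick way to confirm that every OSD is actually running a release containing the fix (16.2.5 or later) is to check the versions reported by the cluster, for example:

ceph versions
ceph tell osd.* version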

Actions #8

Updated by Igor Fedotov over 2 years ago

  • Status changed from New to Duplicate
Actions #9

Updated by Igor Fedotov over 2 years ago

  • Related to Bug #50656: bluefs _allocate unable to allocate, though enough free added