Bug #62125 (closed): bluestore/bluefs: bluefs enospc while osd start

Added by yite gu 9 months ago. Updated 8 months ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

    -4> 2023-07-24T11:26:16.935+0000 7fa045f42200  1 bluefs _allocate unable to allocate 0x90000 on bdev 1, allocator name block, allocator type hybrid, capacity 0xb488400000, block size 0x1000, free 0xad7156000, fragmentation 0.402739, allocated 0x0
    -3> 2023-07-24T11:26:16.935+0000 7fa045f42200 -1 bluefs _allocate allocation failed, needed 0x80d5b
    -2> 2023-07-24T11:26:16.935+0000 7fa045f42200 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x80d5b
    -1> 2023-07-24T11:26:16.947+0000 7fa045f42200 -1 /root/rpmbuild/BUILD/ceph-16.2.13-1/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fa045f42200 time 2023-07-24T11:26:16.935860+0000
/root/rpmbuild/BUILD/ceph-16.2.13-1/src/os/bluestore/BlueFS.cc: 2810: ceph_abort_msg("bluefs enospc")

 ceph version 16.2.13-1 (4165348e9832868203044cb5561f34995fe29e82) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x55a18ce8cf9a]
 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x55a18d596921]
 3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55a18d596c00]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x32) [0x55a18d5a83c2]
 5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55a18d5c216b]
 6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55a18da6294f]
 7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x55a18db7469a]
 8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x55a18db75af0]
 9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x55a18dc915c6]
 10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x55a18dc91f0c]
 11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x55a18dc9260c]
 12: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x55a18dc9269d]
 13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x55a18dc95b08]
 14: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55a18dc405e5]
 15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x55a18daa5295]
 16: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1c2e) [0x55a18daa79ce]
 17: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0xae8) [0x55a18daa8d28]
 18: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x59d) [0x55a18daa2a4d]
 19: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x15) [0x55a18daa3de5]
 20: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10c1) [0x55a18da1b211]
 21: (BlueStore::_open_db(bool, bool, bool)+0x8c7) [0x55a18d48b977]
 22: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x55a18d4fa9c7]
 23: (BlueStore::_mount()+0x204) [0x55a18d4fd854]
 24: (OSD::init()+0x380) [0x55a18cfc7bc0]
 25: main()

Related issues: 2 (0 open, 2 closed)

Related to bluestore - Backport #58589: pacific: OSD is unable to allocate free space for BlueFS (Resolved, Igor Fedotov)
Is duplicate of bluestore - Bug #53466: OSD is unable to allocate free space for BlueFS (Resolved, Igor Fedotov)

Actions #1

Updated by Adam Kupczyk 9 months ago

  • Status changed from New to Need More Info

capacity 0xb488400000, block size 0x1000, free 0xad7156000, fragmentation 0.402739
free / capacity = 6%
It is a known deficiency in Pacific.

Please consider upgrading to Quincy, or wait for https://github.com/ceph/ceph/pull/52212 to be reviewed/tested/merged.
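
For reference, the 6% figure can be reproduced directly from the hex fields in the log line quoted above; a minimal sketch of the arithmetic (plain Python, not Ceph code):

    # Values copied from the "bluefs _allocate unable to allocate" log line above.
    capacity = 0xb488400000   # bdev capacity in bytes (~722 GiB)
    free     = 0xad7156000    # free bytes reported by the allocator (~43 GiB)
    needed   = 0x80d5b        # size of the failed allocation request

    print(f"free / capacity = {free / capacity:.1%}")    # ~6.0%, matching the figure above
    print(f"failed request  = {needed:#x} ({needed} bytes)")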

Actions #2

Updated by yite gu 9 months ago

Adam Kupczyk wrote:

capacity 0xb488400000, block size 0x1000, free 0xad7156000, fragmentation 0.402739
free / capacity = 6%
It is a known deficiency in Pacific.

Please consider upgrading to Quincy, or wait for https://github.com/ceph/ceph/pull/52212 to be reviewed/tested/merged.

The second case:

2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.693740845s, txc = 0x558c2dc4e700
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.692392349s, txc = 0x558c65341880
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.693435669s, txc = 0x558bc2c1ca80
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.692977428s, txc = 0x558c84d7ce00
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.692403316s, txc = 0x558c5d41a000
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.690338135s, txc = 0x558c99798e00
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.690275192s, txc = 0x558c75713180
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.677726746s, txc = 0x558c2ad62700
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.656172752s, txc = 0x558c47f40a80
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.690862179s, txc = 0x558c5c3e1180
2023-08-03T08:04:12.104+0000 7f66a189f700  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.549236774s, txc = 0x558c5ab32e00
2023-08-03T08:04:16.201+0000 7f669c895700  1 bluefs _allocate unable to allocate 0x400000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x6fc7d000000, block size 0x1000, free 0x16fb6e25000, fragmentation 0.99712, allocated 0x330000
2023-08-03T08:04:16.201+0000 7f669c895700 -1 bluefs _allocate allocation failed, needed 0x400000
2023-08-03T08:04:16.207+0000 7f669c895700 -1 /root/rpmbuild/BUILD/ceph-16.2.13-2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f669c895700 time 2023-08-03T08:04:16.201734+0000
/root/rpmbuild/BUILD/ceph-16.2.13-2/src/os/bluestore/BlueFS.cc: 2591: FAILED ceph_assert(r == 0)

 ceph version 16.2.13-2 (b869fb212a4b3a722a7f6aec9ff42b56aa06ba96) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x558b8869e88e]
 2: ceph-osd(+0x583aa8) [0x558b8869eaa8]
 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x19ab) [0x558b88dab0bb]
 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x322) [0x558b88dabb42]
 5: (BlueRocksWritableFile::Sync()+0x6c) [0x558b88dd516c]
 6: (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x558b89274e4f]
 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x402) [0x558b893865c2]
 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x558b89387c08]
 9: (rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x309) [0x558b892885a9]
 10: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x2629) [0x558b89291149]
 11: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21) [0x558b89291341]
 12: (RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x84) [0x558b89230354]
 13: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x9a) [0x558b89230d5a]
 14: (BlueStore::_kv_sync_thread()+0x30d8) [0x558b88d17e28]
 15: (BlueStore::KVSyncThread::entry()+0x11) [0x558b88d3f721]
 16: /lib64/libpthread.so.0(+0x81ca) [0x7f66b15491ca]
 17: clone()

Then, when the OSD restarts, it aborts at src/os/bluestore/BlueFS.cc:2810: ceph_abort_msg("bluefs enospc").

Actions #3

Updated by yite gu 9 months ago

yite gu wrote:

Adam Kupczyk wrote:

capacity 0xb488400000, block size 0x1000, free 0xad7156000, fragmentation 0.402739
free / capacity = 6%
It is a known deficiency in Pacific.

Please consider upgrading to Quincy, or wait for https://github.com/ceph/ceph/pull/52212 to be reviewed/tested/merged.

The second case:
[...]
Then, when the OSD restarts, it aborts at src/os/bluestore/BlueFS.cc:2810: ceph_abort_msg("bluefs enospc").

capacity 0x6fc7d000000, block size 0x1000, free 0x16fb6e25000, fragmentation 0.99712
free / capacity = 20%

Actions #4

Updated by Igor Fedotov 9 months ago

Most likely you're facing https://tracker.ceph.com/issues/53466
This has been resolved (to some degree) in Quincy, and a Pacific backport is pending review: https://github.com/ceph/ceph/pull/52212

So you have the following options to work around the issue:
1) Upgrade to Quincy
2) Make a custom Pacific build with https://github.com/ceph/ceph/pull/52212 on top
3) Set bluefs_shared_alloc_size to 32768, which will hopefully provide some temporary relief, as BlueFS will need 32K contiguous chunks rather than 64K ones. There is no guarantee that the OSD won't eventually reach a state where even this isn't enough, and unlimited alloc size downsizing might ultimately cause more severe damage. That's what PR #52212 actually fixes.
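
For option 3, a minimal sketch of how the workaround could be applied. The option name and the 32768 value come from this thread; the per-OSD ceph.conf section and the osd.0 id are illustrative assumptions - the same setting can also be applied at runtime with "ceph config set osd.<id> bluefs_shared_alloc_size 32768", followed by an OSD restart so BlueFS picks it up:

    # ceph.conf fragment - apply only to the affected ("broken") OSD, not cluster-wide
    [osd.0]                            # hypothetical OSD id; target only the failing OSD
    bluefs_shared_alloc_size = 32768   # Pacific default is 65536 (64K)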

Actions #5

Updated by Igor Fedotov 9 months ago

  • Related to Bug #53466: OSD is unable to allocate free space for BlueFS added
Actions #6

Updated by yite gu 9 months ago

Igor Fedotov wrote:

Most likely you're facing https://tracker.ceph.com/issues/53466
This has been resolved (to some degree) in Quincy, and a Pacific backport is pending review: https://github.com/ceph/ceph/pull/52212

So you have the following options to work around the issue:
1) Upgrade to Quincy
2) Make a custom Pacific build with https://github.com/ceph/ceph/pull/52212 on top
3) Set bluefs_shared_alloc_size to 32768, which will hopefully provide some temporary relief, as BlueFS will need 32K contiguous chunks rather than 64K ones. There is no guarantee that the OSD won't eventually reach a state where even this isn't enough, and unlimited alloc size downsizing might ultimately cause more severe damage. That's what PR #52212 actually fixes.

This problem seems to occur when upgrading from Octopus to Pacific, because it showed up after I performed the upgrade. I think this is very dangerous.

Actions #7

Updated by Igor Fedotov 9 months ago

True, Pacific got a 4K allocation unit for the main device, which causes high disk fragmentation in some scenarios. Bringing BlueFS 4K allocation unit support (PR #52212) is the first step towards fixing the problem. The new allocation strategy implementation (https://github.com/ceph/ceph/pull/52489) is another one.

Actions #8

Updated by yite gu 9 months ago

1. Could you tell me the root cause of the problem? I am not familiar with this part of the code, but I am reading it.
2. Do I only need to upgrade from Pacific to Quincy to solve this problem?
3. https://github.com/ceph/ceph/pull/52212 contains too many commits (38). Could you pick out the important ones for me? I only need the commits that solve this problem.

Actions #9

Updated by yite gu 9 months ago

I need to make a custom Pacific build as soon as possible: I upgraded 5 clusters, and 2 of them have already hit this problem.
Igor, I am very anxious and afraid of losing data now.

Actions #10

Updated by Igor Fedotov 9 months ago

yite gu wrote:

1. Could you tell me the root cause of the problem? I am not familiar with this part of the code, but I am reading it.

BlueFS uses 64K contiguous extents to keep DB/WAL data, and high disk fragmentation might prevent it from allocating more such chunks when DB/WAL is collocated on the main device, even though the reported free space is high enough. Given the fragmentation ratio you've shared, I believe that's what you've got.

2. Do I only need to upgrade from Pacific to Quincy to solve this problem?

In fact, upgrading to Quincy (with 4K BlueFS support enabled) primarily works around the problem - it just removes the 64K-long extent requirement from BlueFS and allows it to operate on highly fragmented disks. The fragmentation itself is not improved this way, though, and it might cause some negative performance impact as well. That's what the new allocation strategy is designed for - but it's still under review/testing at the moment, so not ready for production yet.

3. https://github.com/ceph/ceph/pull/52212 contains too many commits (38). Could you pick out the important ones for me? I only need the commits that solve this problem.

Unfortunately, practically everything from 52212 is needed, as 4K BlueFS support was implemented on top of other BlueFS changes which are currently hard to cut off.
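
To make the root-cause explanation above concrete, here is a toy sketch (plain Python with a made-up free-space map, not Ceph code): a device can report plenty of free 4K blocks while no 64K contiguous extent exists, so an allocation of that size still fails.

    # Toy free-space map: 4K blocks with every other block allocated
    # (~50% free overall, but maximally fragmented).
    BLOCK = 4096
    free_map = [i % 2 == 0 for i in range(1024)]   # True = free block

    def largest_contiguous_free(blocks):
        best = run = 0
        for is_free in blocks:
            run = run + 1 if is_free else 0
            best = max(best, run)
        return best * BLOCK

    total_free = sum(free_map) * BLOCK
    print(f"total free bytes:    {total_free}")                          # 2 MiB free in total
    print(f"largest free extent: {largest_contiguous_free(free_map)}")   # only 4096 bytes
    # A request for a 64K (or even 32K) contiguous extent fails despite ample free space.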

Actions #11

Updated by Igor Fedotov 9 months ago

yite gu wrote:

I need to make a custom Pacific build as soon as possible: I upgraded 5 clusters, and 2 of them have already hit this problem.
Igor, I am very anxious and afraid of losing data now.

Using bluefs_shared_alloc_size = 32768 is a proven short-term(!) workaround, so you might want to start with it first.
Just please do not apply it in bulk to every OSD; use it for the "broken" ones only.

And keep in mind that you'll need a better long-term solution one day. On a positive note, I've never heard of OSDs equipped with a 32K alloc unit suffering from the same problem - it's rather a theoretical concern so far. Not many clusters have got such tuning, though...

Actions #12

Updated by yite gu 9 months ago

Igor Fedotov wrote:

yite gu wrote:

1. Could you tell me the root cause of the problem? I am not familiar with this part of the code, but I am reading it.

BlueFS uses 64K contiguous extents to keep DB/WAL data, and high disk fragmentation might prevent it from allocating more such chunks when DB/WAL is collocated on the main device, even though the reported free space is high enough. Given the fragmentation ratio you've shared, I believe that's what you've got.

I got it.

2. Do I only need to upgrade from Pacific to Quincy to solve this problem?

In fact, upgrading to Quincy (with 4K BlueFS support enabled) primarily works around the problem - it just removes the 64K-long extent requirement from BlueFS and allows it to operate on highly fragmented disks. The fragmentation itself is not improved this way, though, and it might cause some negative performance impact as well. That's what the new allocation strategy is designed for - but it's still under review/testing at the moment, so not ready for production yet.

3. https://github.com/ceph/ceph/pull/52212 contains too many commits (38). Could you pick out the important ones for me? I only need the commits that solve this problem.

Unfortunately, practically everything from 52212 is needed, as 4K BlueFS support was implemented on top of other BlueFS changes which are currently hard to cut off.

Can we merge this PR today? I will backport it to my branch.

Actions #13

Updated by Igor Fedotov 9 months ago

yite gu wrote:

[...]

Can we merge this PR today? I will backport it to my branch.

Not if you're talking about merging to upstream Ceph. This has to pass review and QA.

But IMHO it's ready from the developer's point of view - which might not be correct, as we all know... So, given that disclaimer, you can make a custom build and start testing it - with some caution if this is performed in a real production environment, e.g. upgrade a single node only and let it bake for a while.

Actions #14

Updated by yite gu 9 months ago

Igor Fedotov wrote:

yite gu wrote:

[...]

Can we merge this PR today? I will backport it to my branch.

Not if you're talking about merging to upstream Ceph. This has to pass review and QA.

I mean merging into the Pacific branch; I will then backport it from Pacific to v16.2.13.

But IMHO it's ready from the developer's point of view - which might not be correct, as we all know... So, given that disclaimer, you can make a custom build and start testing it - with some caution if this is performed in a real production environment, e.g. upgrade a single node only and let it bake for a while.

Actions #15

Updated by Igor Fedotov 9 months ago

yite gu wrote:

[...]

I mean merging into the Pacific branch; I will then backport it from Pacific to v16.2.13.

Adam has just approved this PR, but it has to run through QA prior to merging, which may take some time (e.g. a week or two) depending on the QA team's schedule/load/priorities. So no, this won't be merged today.

You can fork the repo and do the merge in your private branch, though, then build the code from that branch. Not that I'm strongly recommending this route - IMO downsizing bluefs_shared_alloc_size should be good enough for quite a while.

Actions #16

Updated by yite gu 9 months ago

Igor Fedotov wrote:

yite gu wrote:

I need to make a custom Pacific build as soon as possible: I upgraded 5 clusters, and 2 of them have already hit this problem.
Igor, I am very anxious and afraid of losing data now.

Using bluefs_shared_alloc_size = 32768 is a proven short-term(!) workaround, so you might want to start with it first.
Just please do not apply it in bulk to every OSD; use it for the "broken" ones only.

bluefs_shared_alloc_size defaults to 65536 in Pacific. I changed it to 32768 and then started a problem OSD; the start was successful.

And keep in mind that you'll need a better long-term solution one day. On a positive note, I've never heard of OSDs equipped with a 32K alloc unit suffering from the same problem - it's rather a theoretical concern so far. Not many clusters have got such tuning, though...

Actions #17

Updated by yite gu 9 months ago

Igor Fedotov wrote:

[...]

Adam has just approved this PR, but it has to run through QA prior to merging, which may take some time (e.g. a week or two) depending on the QA team's schedule/load/priorities. So no, this won't be merged today.

You can fork the repo and do the merge in your private branch, though, then build the code from that branch. Not that I'm strongly recommending this route - IMO downsizing bluefs_shared_alloc_size should be good enough for quite a while.

OK, I will downsize bluefs_shared_alloc_size and wait a week or two for you.

Actions #18

Updated by yite gu 9 months ago

Igor Fedotov wrote:

True, Pacific got a 4K allocation unit for the main device, which causes high disk fragmentation in some scenarios. Bringing BlueFS 4K allocation unit support (https://github.com/ceph/ceph/pull/52212) is the first step towards fixing the problem. The new allocation strategy implementation (https://github.com/ceph/ceph/pull/52489) is another one.

Which PR changed the allocation unit to 4K in Pacific?

Actions #19

Updated by Igor Fedotov 9 months ago

This one, if you're asking about the main device rather than BlueFS:
https://github.com/ceph/ceph/pull/34588/

Actions #20

Updated by Igor Fedotov 9 months ago

  • Related to Backport #58589: pacific: OSD is unable to allocate free space for BlueFS added
Actions #21

Updated by yite gu 9 months ago

Igor Fedotov wrote:

This one, if you're asking about the main device rather than BlueFS:
https://github.com/ceph/ceph/pull/34588/

My main device is an NVMe SSD, and bluestore_min_alloc_size_ssd is 4K in both Octopus and Pacific, so a Pacific build that includes this PR should not hit this problem, right?

Actions #22

Updated by yite gu 9 months ago

yite gu wrote:

Igor Fedotov wrote:

This one, if you're asking about the main device rather than BlueFS:
https://github.com/ceph/ceph/pull/34588/

My main device is an NVMe SSD, and bluestore_min_alloc_size_ssd is 4K in both Octopus and Pacific, so a Pacific build that includes this PR should not hit this problem, right?

You removed bluestore_bluefs_min_ratio and bluestore_bluefs_gift_ratio after Octopus, so BlueFS no longer has its own independent space when DB/WAL is collocated with the main device.

Actions #23

Updated by yite gu 9 months ago

So after the upgrade BlueFS allocates space from the BlueStore allocator, and that is why this problem occurs, right?

Actions #24

Updated by Igor Fedotov 9 months ago

yite gu wrote:

So after the upgrade BlueFS allocates space from the BlueStore allocator, and that is why this problem occurs, right?

IMO that's only partly correct. Having a 4K allocation unit for the main device is another part of the story: it fragments the free space too drastically, and this could impact the gifting mechanics as well.

Actions #25

Updated by yite gu 9 months ago

Igor Fedotov wrote:

yite gu wrote:

So after the upgrade BlueFS allocates space from the BlueStore allocator, and that is why this problem occurs, right?

IMO that's only partly correct. Having a 4K allocation unit for the main device is another part of the story: it fragments the free space too drastically, and this could impact the gifting mechanics as well.

So this is not only an upgrade problem: a newly deployed Pacific cluster can also develop high disk fragmentation, which then prevents BlueFS from allocating space, because BlueFS still uses 64K allocation units.

Actions #26

Updated by Igor Fedotov 8 months ago

  • Related to deleted (Bug #53466: OSD is unable to allocate free space for BlueFS)
Actions #27

Updated by Igor Fedotov 8 months ago

  • Is duplicate of Bug #53466: OSD is unable to allocate free space for BlueFS added
Actions #28

Updated by Igor Fedotov 8 months ago

  • Status changed from Need More Info to Duplicate