
Bug #47883

bluefs _allocate failed to allocate bdev 1 and 2, cause ceph_assert(r == 0)

Added by chunsong feng 3 months ago. Updated about 2 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
BlueStore
Backport:
octopus, nautilus
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

2020-10-17T18:02:17.658+0800 ffff7eaac980 1 bluefs _allocate failed to allocate 0x4d0000 on bdev 1, free 0x19c2108000; fallback to bdev 2
2020-10-17T18:02:17.668+0800 ffff7eaac980 1 bluefs _allocate unable to allocate 0x4d0000 on bdev 2, free 0xffffffffffffffff
2020-10-17T18:02:17.678+0800 ffff7eaac980 -1 /home/rpmbuild/BUILD/ceph-16.0.0-4421-gcfeea548ad/src/os/bluestore/BlueFS.cc: In function 'void BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)' thread ffff7eaac980 time 2020-10-17T18:02:17.676262+0800
/home/rpmbuild/BUILD/ceph-16.0.0-4421-gcfeea548ad/src/os/bluestore/BlueFS.cc: 2198: FAILED ceph_assert(r == 0)

ceph version 16.0.0-4421-gcfeea548ad (971ac7aa5739e2dbc0df1b7c8b10b17653df1037) pacific (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x154) [0xaaaae34238e8]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xaaaae3423ad0]
3: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x16d0) [0xaaaae3a42518]
4: (BlueRocksWritableFile::Sync()+0x80) [0xaaaae3a538f0]
5: (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x2c) [0xaaaae3e8e734]
6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x3a0) [0xaaaae3f7c498]
7: (rocksdb::WritableFileWriter::Sync(bool)+0x90) [0xaaaae3f7d8b8]
8: (rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x274) [0xaaaae3e9f444]
9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x1ccc) [0xaaaae3ea611c]
10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x30) [0xaaaae3ea63c0]
11: (RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x94) [0xaaaae3e5081c]
12: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x98) [0xaaaae3e510c8]
13: (BlueStore::_kv_sync_thread()+0x22dc) [0xaaaae39a1b04]
14: (BlueStore::KVSyncThread::entry()+0x10) [0xaaaae39c69e0]
15: /lib64/libpthread.so.0(+0x77f8) [0xffffa32e77f8]
16: /lib64/libc.so.6(+0xd0edc) [0xffffa2e40edc]

2020-10-17T18:02:17.678+0800 ffff7eaac980 -1 *** Caught signal (Aborted) **
in thread ffff7eaac980 thread_name:bstore_kv_sync

ceph version 16.0.0-4421-gcfeea548ad (971ac7aa5739e2dbc0df1b7c8b10b17653df1037) pacific (dev)
1: _kernel_rt_sigreturn()
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a4) [0xaaaae3423938]
5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xaaaae3423ad0]
6: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x16d0) [0xaaaae3a42518]
7: (BlueRocksWritableFile::Sync()+0x80) [0xaaaae3a538f0]
8: (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x2c) [0xaaaae3e8e734]
9: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x3a0) [0xaaaae3f7c498]
10: (rocksdb::WritableFileWriter::Sync(bool)+0x90) [0xaaaae3f7d8b8]

ceph-osd.34.rar (345 KB) chunsong feng, 10/19/2020 01:38 AM


Related issues

Copied to bluestore - Backport #48478: octopus: bluefs _allocate failed to allocate bdev 1 and 2,cause ceph_assert(r == 0) In Progress
Copied to bluestore - Backport #48479: nautilus: bluefs _allocate failed to allocate bdev 1 and 2,cause ceph_assert(r == 0) Resolved

History

#1 Updated by chunsong feng 3 months ago

The problem reproduces within 10 minutes of running a test with highly concurrent random writes larger than 32 KB.

#2 Updated by chunsong feng 3 months ago

Tested with Allocator=stupid and hybrid respectively.
When creating the OSDs, only a data device is used; block.wal and block.db are not configured. The total data space is 200 GB. When the allocation fails, there is still about 100 GB of free space, and allocation still fails after the OSD is restarted. Is this a problem caused by fragmentation?
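Fragmentation can indeed produce exactly this symptom. The following toy model (an illustration only, not the actual BlueFS/allocator code; the 64 KiB allocation unit is an assumption) shows how a contiguous request can fail even though total free space is ample:

```python
# Toy first-fit allocator: a request fails when no single free extent is
# large enough, even though the summed free space is far larger.
# (Illustration only -- not the actual BlueFS/StupidAllocator code.)

ALLOC_UNIT = 0x10000  # 64 KiB; a typical bluefs allocation unit (assumption)

def allocate(free_extents, want):
    """Return the offset of the first free extent that can hold `want`
    bytes, or None if no single extent is large enough."""
    for off, length in sorted(free_extents.items()):
        if length >= want:
            return off
    return None

# ~10 GiB free in total, but fragmented into isolated 64 KiB extents
free = {i * 2 * ALLOC_UNIT: ALLOC_UNIT for i in range(160_000)}
total_free = sum(free.values())

print(hex(total_free))           # plenty of free space overall
print(allocate(free, 0x4d0000))  # the 0x4d0000 request from the log -> None
```

Here a 0x4d0000 (~4.9 MiB) contiguous request fails despite ~10 GiB of total free space, mirroring the "failed to allocate 0x4d0000 ... free 0x19c2108000" line in the log.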

#3 Updated by Igor Fedotov 3 months ago

Hi Feng,

you're using the master Ceph branch, right?
Am I understanding correctly that the OSD is unable to restart after this failure, with the same "failed to allocate" log output? If so, could you please set "debug bluestore" to 10 and collect that failing startup log.

And one more experiment if possible please:
1) Start your benchmark against clean OSDs
2) Monitor fragmentation for a couple of OSDs using "watch -n 5 ceph tell osd.N bluestore allocator fragmentation block"

chunsong feng wrote:

Test Allocator=stupid and hybrid respectively.

Do you mean you tried both stupid and hybrid with the same results?

When creating the OSDs, only a data device is used; block.wal and block.db are not configured. The total data space is 200 GB. When the allocation fails, there is still about 100 GB of free space, and allocation still fails after the OSD is restarted. Is this a problem caused by fragmentation?

#4 Updated by Igor Fedotov 3 months ago

  • Project changed from Ceph to bluestore
  • Category deleted (Objecter)

#5 Updated by Igor Fedotov 3 months ago

Also, prior to redeploying the cluster/OSDs, you might want to check the current fragmentation rating using:
ceph-bluestore-tool --path <path-to-osd> --allocator block free-score
ceph-bluestore-tool --path <path-to-osd> --allocator block free-dump
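If free-dump output is available, a short script can summarize it. This sketch assumes the dump is JSON containing an "extents" list of hex offset/length strings; check the actual output format of your Ceph version before relying on it:

```python
import json

def summarize_free_dump(dump_text):
    """Return (total free bytes, largest single free extent) from a
    free-dump-style JSON document (format assumed, see note above)."""
    lengths = [int(e["length"], 16) for e in json.loads(dump_text)["extents"]]
    return sum(lengths), max(lengths)

# Hypothetical sample in the assumed format
sample = json.dumps({"extents": [
    {"offset": "0x100000", "length": "0x10000"},
    {"offset": "0x400000", "length": "0x4d0000"},
]})

total, largest = summarize_free_dump(sample)
print(hex(total), hex(largest))  # 0x4e0000 0x4d0000
```

Comparing the largest single extent against the failing request size (0x4d0000 in the log above) shows at a glance whether fragmentation alone explains an allocation failure.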

#6 Updated by chunsong feng 3 months ago

Igor Fedotov wrote:

Hi Feng,

you're using the master Ceph branch, right?

right

Am I understanding correctly that the OSD is unable to restart after this failure, with the same "failed to allocate" log output? If so, could you please set "debug bluestore" to 10 and collect that failing startup log.

Right, the OSD is unable to restart after this failure, with the same "failed to allocate" log output.

And one more experiment if possible please:
1) Start your benchmark against clean OSDs
2) Monitor fragmentation for a couple of OSDs using "watch -n 5 ceph tell osd.N bluestore allocator fragmentation block"

Test Allocator=stupid and hybrid respectively.

Do you mean you tried both stupid and hybrid with the same results?

stupid and hybrid give the same results. This issue does not occur in versions 15.2.5 and 14.2.11.

When creating the OSDs, only a data device is used; block.wal and block.db are not configured. The total data space is 200 GB. When the allocation fails, there is still about 100 GB of free space, and allocation still fails after the OSD is restarted. Is this a problem caused by fragmentation?

#7 Updated by Neha Ojha 3 months ago

  • Status changed from New to Need More Info

#8 Updated by Martin Verges about 2 months ago

We triggered the bug on a production cluster whose workload involves a lot of small files and objects.
The affected host was updated from 14.2.9 to 14.2.12 and then 14.2.15; both newer versions hit this bug, but only on 1 out of 13 OSDs (1 out of 3 NVMe OSDs).

The workaround that resolved the issue was destroying and recreating the OSD.

Please see the output of "free-score" and "free-dump", as well as the crash log.
Unfortunately the output is over 700 MB (over 40 MB even zipped), so I can't upload it to the tracker. The logs are available at https://nextcloud.croit.io/s/WZt9QPrEJTkEZMk

#9 Updated by Igor Fedotov about 2 months ago

  • Status changed from Need More Info to Fix Under Review
  • Assignee set to Igor Fedotov
  • Backport set to octopus, nautilus
  • Pull request ID set to 38438

#10 Updated by Igor Fedotov about 2 months ago

The issue is caused by a bug in the avl/hybrid allocators. The workaround is to switch back to the bitmap allocator.
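For reference, the switch back to bitmap can be made in ceph.conf (a sketch; apply globally or per-OSD and restart the affected OSDs for it to take effect):

```ini
[osd]
# Revert to the bitmap allocator (the pre-14.2.11 default)
bluestore_allocator = bitmap
```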

#11 Updated by Igor Fedotov about 2 months ago

Well, my previous comment is valid for Martin's issue. It's still unclear to me whether it applies to the original report, due to the lack of info.

#12 Updated by Martin Verges about 2 months ago

Just as additional information: the allocator is already bitmap.

#13 Updated by Igor Fedotov about 2 months ago

Martin Verges wrote:

Just as additional information: the allocator is already bitmap.

Do you mean it WAS bitmap when the issue occurred or it has been switched to bitmap since then?

#14 Updated by Kefu Chai about 2 months ago

  • Status changed from Fix Under Review to Pending Backport

#15 Updated by Martin Verges about 2 months ago

Igor Fedotov wrote:

Do you mean it WAS bitmap when the issue occurred or it has been switched to bitmap since then?

Yes, it was, and we/the customer never changed it to anything else.

#16 Updated by Igor Fedotov about 2 months ago

The default allocator was changed from bitmap to hybrid starting with v14.2.11.
Hence, unless you had made custom settings, the allocator type was changed for you automatically during the upgrade to v14.2.11+.

Does that make sense, or am I missing something?

#17 Updated by Martin Verges about 2 months ago

OK, sorry, I didn't know that. Yes, then it changed while installing the .12 release.

#18 Updated by Igor Fedotov about 2 months ago

  • Copied to Backport #48478: octopus: bluefs _allocate failed to allocate bdev 1 and 2,cause ceph_assert(r == 0) added

#19 Updated by Igor Fedotov about 2 months ago

  • Copied to Backport #48479: nautilus: bluefs _allocate failed to allocate bdev 1 and 2,cause ceph_assert(r == 0) added
