Project

General

Profile

Bug #45519

OSD asserts during block allocation for BlueFS

Added by Aleksei Zakharov over 1 year ago. Updated 11 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all.
We use ceph as the rados object storage for small (<=4K) objects. We have about 2 billions of objects in one pool. We do not use separate WAL/DB partitions, instead full NVMe drives are used by OSDs. To use bluestore space effectively we've set up:

bluestore_min_alloc_size_ssd 4096
bluestore_allocator bitmap
bluefs_allocator bitmap

A few days ago we've caught a problem - some osd's started aborting with message:

    "os_version_id": "16.04", 
    "assert_condition": "abort", 
    "utsname_release": "4.15.0-54-generic", 
    "os_name": "Ubuntu", 
    "entity_name": "osd.6", 
    "assert_file": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc", 
    "timestamp": "2020-05-09 08:05:57.367574Z", 
    "process_name": "ceph-osd", 
    "utsname_machine": "x86_64", 
    "assert_line": 2269, 
    "utsname_sysname": "Linux", 
    "os_version": "16.04.6 LTS (Xenial Xerus)", 
    "os_id": "ubuntu", 
    "assert_thread_name": "rocksdb:low7", 
    "utsname_version": "#58~16.04.1-Ubuntu SMP Mon Jun 24 13:21:41 UTC 2019", 
    "backtrace": [
        "(()+0x11390) [0x7fc585c8b390]", 
        "(gsignal()+0x38) [0x7fc5851b6428]", 
        "(abort()+0x16a) [0x7fc5851b802a]", 
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b0) [0x85394b]", 
        "(BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1e63) [0xe585d3]", 
        "(BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0xe588ec]", 
        "(BlueRocksWritableFile::Flush()+0x3d) [0xe7be9d]", 
        "(rocksdb::WritableFileWriter::Flush()+0x45f) [0x14cb23f]", 
        "(rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x14cc50b]", 
        "(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::CompactionRangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x1b91) [0x151d7c1]", 
        "(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0xa8c) [0x151eb2c]", 
        "(rocksdb::CompactionJob::Run()+0x2ec) [0x1520c1c]", 
        "(rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xbbd) [0x134c64d]", 
        "(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x12a) [0x13544fa]", 
        "(rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x1354bd7]", 
        "(rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) [0x15652a4]", 
        "(rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x56) [0x1565436]", 
        "/usr/bin/ceph-osd() [0x1661f2f]", 
        "(()+0x76ba) [0x7fc585c816ba]", 
        "(clone()+0x6d) [0x7fc58528841d]" 
    ], 
    "utsname_hostname": "meta101", 
    "assert_msg": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fc57536b700 time 2020-05-09 11:05:57.327807\n/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: 2269: ceph_abort_msg(\"bluefs enospc\")\n", 
    "crash_id": "2020-05-09_08:05:57.367574Z_bcfca643-be35-4449-9f90-46b8980b96be", 
    "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)", 
    "ceph_version": "14.2.7" 

We found that `bluestore bluefs available` showed available from bluestore space as 0:

{
    "dev": {
        "device": "BDEV_DB",
        "free": 768016384
    },
    "available_from_bluestore": 0
}

OSDs were failing and restarting.

We also saw that the bluestore fragmentation was high, but we're not sure if it is related to the problem:

ceph daemon osd.46 bluestore allocator score block
{
    "fragmentation_rating": 0.92772435210430959
}

Some time after, one of the `bad` OSDs failed and couldn't start with another assert:

{
    "os_version_id": "16.04",
    "assert_condition": "h->file->fnode.ino != 1",
    "utsname_release": "4.15.0-54-generic",
    "os_name": "Ubuntu",
    "entity_name": "osd.14",
    "assert_file": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc",
    "timestamp": "2020-05-12 12:24:20.755439Z",
    "process_name": "ceph-osd",
    "utsname_machine": "x86_64",
    "assert_line": 2261,
    "utsname_sysname": "Linux",
    "os_version": "16.04.6 LTS (Xenial Xerus)",
    "os_id": "ubuntu",
    "assert_thread_name": "ceph-osd",
    "utsname_version": "#58~16.04.1-Ubuntu SMP Mon Jun 24 13:21:41 UTC 2019",
    "backtrace": [
        "(()+0x11390) [0x7f46b4572390]",
        "(gsignal()+0x38) [0x7f46b3a9d428]",
        "(abort()+0x16a) [0x7f46b3a9f02a]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x853221]",
        "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x8533ab]",
        "(BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1cdf) [0xe5844f]",
        "(BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0xe588ec]",
        "(BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x86e) [0xe5960e]",
        "(BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x2fb) [0xe6931b]",
        "(BlueFS::sync_metadata()+0x215) [0xe6ad65]",
        "(BlueFS::umount()+0xea) [0xe6b08a]",
        "(BlueStore::_close_bluefs()+0xd) [0xd25d0d]",
        "(BlueStore::_open_db_and_around(bool)+0x157) [0xd68987]",
        "(BlueStore::_mount(bool, bool)+0x584) [0xdb83b4]",
        "(OSD::init()+0x3b7) [0x901b17]",
        "(main()+0x283a) [0x85921a]",
        "(__libc_start_main()+0xf0) [0x7f46b3a88830]",
        "(_start()+0x29) [0x88d569]" 
    ],
    "utsname_hostname": "meta102",
    "assert_msg": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f46b68d7c00 time 2020-05-12 15:24:20.743666\n/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: 2261: FAILED ceph_assert(h->file->fnode.ino != 1)\n",
    "crash_id": "2020-05-12_12:24:20.755439Z_00b51d40-536f-4124-973f-c5bf8d874bb2",
    "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)",
    "ceph_version": "14.2.7" 
}

After setting `bluefs_alloc_size` to `4096` we saw that `available_from_bluestore` became normal (much greater than 0). After that there were no new osd's failing. We also set `bluefs_shared_alloc_size` to `4096`.

No new osd's were failing after that, but OSD's that failed once were failing occasionally. And after few restarts they were unable to start.

We found that the way to start osd is to set for failed one:

bluestore_allocator stupid
bluefs_allocator stupid

After it started up, it is possible to stop daemon, set allocators back to `bitmap` and start daemon. They work fine without fails after that.

For now we have one OSD that doesn't start up with `ceph_assert(h->file->fnode.ino != 1)` message, it is out of crush map, so it is possible to collect any debug information if needed.

All this situation looks like the bitmap allocator bug.

It seems related to https://tracker.ceph.com/issues/40741.
It is not clear to me: why did `bluefs_alloc_size` affected `available_from_bluestore`, but not `bluefs_shared_alloc_size` in this setup (with one device for OSD)? In what scenario do we need to use the second one?

Do we need to check some extra allocator options when have this kind of workload - billions of very small objects?
We saw that when backfill starts (after turn on the balancer module, or change an osd weight), latency starts growing. And latency starts falling back to normal only when rebalance has already been finished some time ago. I'm not quite sure this is related to this issue, if 100% not - we'll open a new issue after collecting some more details on it.


Related issues

Related to bluestore - Bug #24639: [segfault] segfault in BlueFS::read Can't reproduce 06/23/2018

History

#1 Updated by Igor Fedotov over 1 year ago

Alexei, it looks like your OSDs have pretty fragmented free space (and presumably quite high utilization) which results in inability for bitmap allocator to allocate continuous extent(s) to gift to bluefs. Stupid allocator is probably a bit better in this respect and managed to find some.
Hence some points/recommendations here:
1) This is a "by-design" behavior for bitmap allocator - it might be too expensive to do better search for continuous extents when fragmentation is that high. So I doubt this will be fixed anytime soon.
2) Octopus has got avl allocator which finally evolved to Hybrid one in Pacific - currently a default there. Perhaps this will be backported to Octopus one day. Both avl and hybrid allocators should rather be on par with stupid one behavior in regard to high fragmentation/continuous allocation.
3) IMO given high fragmentation your cluster is under the risk of outage one day anyway. Can't suggest anything
other than extending available space or redeploying OSDs which will defragment the space.
4) As for bluefs_alloc_size/bluefs_shared_alloc_size discrepance for available_from_bluestore it looks like a bug - the command handler unconditionally use bluefs_alloc_size parameter irrespectively of disk layout. See StoreHook::call() implementation in BlueFS.cc. Worth another ticket IMO.

#2 Updated by Aleksei Zakharov over 1 year ago

Igor, thanks for your answers!

If I understand right: high bluestore fragmentation makes it impossible to allocate continuous blocks for bluefs. Options `bluefs_alloc_size` and `bluefs_shared_alloc_size` make it possible to allocate bunches of small blocks instead. If this is true, than what kind of issues could it cause? For me it looks like allocations will become slower, but they already are slow in our case as I understand. Will it lead to slower rocksdb operations?

In this particular pool there won't be any objects with size higher than 4K, so it looks like high fragmentation is not a problem here. Or am i missing something? Are there any mechanisms that need only continuous blocks to be allocated?

bluefs_[shared_]alloc_size control allocations for BlueFS and hence important for RocskDB only - they have nothing about main device and objects kept there.
IIRC reducing these parameters showed some performance degradation for RocksDB and hence BlueStore. So I wouldn't recommend to going this way if possible.

Another question about fragmentation causes. We have only 4 hosts which were deployed at one time. And fragmentation on different OSDs is different:

{
    "fragmentation_rating": 0.50179903491497091
}
{
    "fragmentation_rating": 0.57473651276276927
}
{
    "fragmentation_rating": 0.60326654132330093
}
{
    "fragmentation_rating": 0.86433947322218041
}
{
    "fragmentation_rating": 0.87496707526029704
}
{
    "fragmentation_rating": 0.58650867923660477
}
{
    "fragmentation_rating": 0.59876916782058998
}
{
    "fragmentation_rating": 0.86563561688160573
}
{
    "fragmentation_rating": 0.75144924677648284
}
{
    "fragmentation_rating": 0.55282099491022552
}
{
    "fragmentation_rating": 0.56002388459475105
}
{
    "fragmentation_rating": 0.92897663238862038
}

Does this mean, that we cannot predict the fragmentation rating at all? I don't see any external reasons why it differs between OSDs.
Also, it seems we need to take this statistics from ceph-mgr.

OSDs could experience different access pattern (e.g. some back actively used PGs while others don't) and hence rating fluctuation is OK. But I presume that higher fragmentation ratio has some correlation with amount of free space for an OSD. It's easier to fragment smaller space. Doesn't OSDs with higher fragmentation have higher space utilization?

#3 Updated by Aleksei Zakharov over 1 year ago

We also see some new OSDs asserting:

   -2> 2020-05-13 16:48:08.404 7f3951239700  5 prioritycache tune_memory target: 6442450944 mapped: 6028156928 unmapped: 1375698944 heap: 7403855872 old mem: 4443165402 new mem: 4457802392
    -1> 2020-05-13 16:48:08.684 7f395a599700 -1 /build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f395a599700 time 2020-05-13 16:48:08.655619
/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: 2261: FAILED ceph_assert(h->file->fnode.ino != 1)

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x8531d0]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x8533ab]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1cdf) [0xe5844f]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0xe588ec]
 5: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x86e) [0xe5960e]
 6: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9c) [0xe5ba7c]
 7: (BlueRocksWritableFile::Sync()+0x63) [0xe7bf53]
 8: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x3e9) [0x14c9459]
 9: (rocksdb::WritableFileWriter::Sync(bool)+0x376) [0x14cc846]
 10: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::CompactionRangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x1b91) [0x151d7c1]
 11: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0xa8c) [0x151eb2c]
 12: (rocksdb::CompactionJob::Run()+0x2ec) [0x1520c1c]
 13: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xbbd) [0x134c64d]
 14: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x12a) [0x13544fa]
 15: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x1354bd7]
 16: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) [0x15652a4]
 17: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x56) [0x1565436]
 18: /usr/bin/ceph-osd() [0x1661f2f]
 19: (()+0x76ba) [0x7f39736c06ba]
 20: (clone()+0x6d) [0x7f3972cc741d]

This message is not clear and doesn't look like an allocator bug at first glance.

This looks like another issue indeed. Which could be caused by above one(s) or may be there is some root issue common to them all.

Comment in the code prior to the assertion states that we should't get to that point for ino=1 (i.e. bluefs log):

// we should never run out of log space here; see the min runway check
// in _flush_and_sync_log.

But it looks like this assumption is broken.
Could you please set debug_bluefs log to 20 and collect the log for OSD startup?

#4 Updated by Aleksei Zakharov over 1 year ago

It look like someone has edited my previous message:)

Yes, this check in _flush_and_sync_log doesn't work.
I've tried to look at this issue with gdb:
In _flush_and_sync_log

(gdb) p runway
$94 = 5238784

In _flush_range:

(gdb) p log_writer.pos + log_writer.buffer._len
$112 = 426749952
(gdb) p ((File*)h->file)->fnode.allocated - log_writer.pos +log_writer.buffer._len
$116 = 15831040
(gdb) p offset + length
$113 = 426749952

It defenetely doesn't allocate space in _flush_and_sync_log, the result of the 'if (runway < (int64_t)cct->_conf->bluefs_min_log_runway)' is false. It looks like it tries to allocate memory in sync_metadata:

2020-05-10 14:15:14.880 7f35d2856c00 10 bluefs sync_metadata - no pending log events
2020-05-10 14:15:14.880 7f35d2856c00 10 bluefs _should_compact_log current 0x18ce1000 expected 115000 ratio 366.791
2020-05-10 14:15:14.880 7f35d2856c00 10 bluefs _compact_log_async
2020-05-10 14:15:14.880 7f35d2856c00 10 bluefs _compact_log_async old_log_jump_to 0x18de0000 need 0x191e0000
2020-05-10 14:15:14.880 7f35d2856c00 10 bluefs _allocate len 0x400000 from 0
2020-05-10 14:15:14.880 7f35d2856c00 10 bluefs _allocate len 0x400000 from 1
2020-05-10 14:15:14.880 7f35d2856c00  1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x1c0000; fallback to bdev 2
2020-05-10 14:15:14.880 7f35d2856c00 10 bluefs _allocate len 0x400000 from 2
2020-05-10 14:15:14.880 7f35d2856c00  1 bluefs _allocate unable to allocate 0x400000 on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
2020-05-10 14:15:14.880 7f35d2856c00 10 bluefs _expand_slow_device expanding slow device by 0x15df6b000

These different bdevs also look suspicious: we have only one "bluestore+db+wal" drive.

I've cut the file for one startup (or maybe a bit more), it is above 1.5GB when untared: https://yadi.sk/d/8GYKHbcrJFu5ZQ

#5 Updated by Kefu Chai over 1 year ago

2020-05-27T11:16:47.374 INFO:teuthology.orchestra.run.smithi129.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cen
tos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1957-gec2e515e821/rpm/el8/BUILD/ceph-16.0.0-1957-gec2e515e821/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_r
ange(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7ff8f5ce9700 time 2020-05-27T11:16:47.351261+0000
2020-05-27T11:16:47.375 INFO:teuthology.orchestra.run.smithi129.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cen
tos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1957-gec2e515e821/rpm/el8/BUILD/ceph-16.0.0-1957-gec2e515e821/src/os/bluestore/BlueFS.cc: 2746: ceph_abort_msg("bluefs enos
pc")
2020-05-27T11:16:47.375 INFO:teuthology.orchestra.run.smithi129.stderr: ceph version 16.0.0-1957-gec2e515e821 (ec2e515e821ebea5598e58132d8fdf3883460473) pacific (dev)
2020-05-27T11:16:47.375 INFO:teuthology.orchestra.run.smithi129.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<ch
ar>, std::allocator<char> > const&)+0xe5) [0x7ff8fbe30a46]
2020-05-27T11:16:47.375 INFO:teuthology.orchestra.run.smithi129.stderr: 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1a60) [0x55f663c92df0]
2020-05-27T11:16:47.375 INFO:teuthology.orchestra.run.smithi129.stderr: 3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xf3) [0x55f663c934c3]
2020-05-27T11:16:47.376 INFO:teuthology.orchestra.run.smithi129.stderr: 4: (BlueRocksWritableFile::Flush()+0x41) [0x55f663cba611]
2020-05-27T11:16:47.376 INFO:teuthology.orchestra.run.smithi129.stderr: 5: (rocksdb::LegacyWritableFileWrapper::Flush(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [
0x55f663d7b59f]
2020-05-27T11:16:47.376 INFO:teuthology.orchestra.run.smithi129.stderr: 6: (rocksdb::WritableFileWriter::Flush()+0x22d) [0x55f663e8f5fd]
2020-05-27T11:16:47.376 INFO:teuthology.orchestra.run.smithi129.stderr: 7: (rocksdb::WritableFileWriter::Sync(bool)+0x1f) [0x55f663e8f77f]
2020-05-27T11:16:47.376 INFO:teuthology.orchestra.run.smithi129.stderr: 8: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > co
nst&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::Inter
nalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::al
locator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::Internal
KeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<
rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>,
 std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, roc
ksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*,
 int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0x2ddd) [0x55f663f5d92d]
2020-05-27T11:16:47.377 INFO:teuthology.orchestra.run.smithi129.stderr: 9: (rocksdb::FlushJob::WriteLevel0Table()+0x1003) [0x55f663df2bb3]
2020-05-27T11:16:47.377 INFO:teuthology.orchestra.run.smithi129.stderr: 10: (rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*)+0xcd) [0x55f663df4c4d]
2020-05-27T11:16:47.377 INFO:teuthology.orchestra.run.smithi129.stderr: 11: (rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions cons
t&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::vector<unsigned long, std::allocator<unsigned long> >&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::Lo
gBuffer*, rocksdb::Env::Priority)+0x3c1) [0x55f663d9f091]
2020-05-27T11:16:47.377 INFO:teuthology.orchestra.run.smithi129.stderr: 12: (rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> co
nst&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority)+0x49b) [0x55f663da1dcb]
2020-05-27T11:16:47.377 INFO:teuthology.orchestra.run.smithi129.stderr: 13: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReas
on*, rocksdb::Env::Priority)+0xaf5) [0x55f663da2e95]
2020-05-27T11:16:47.378 INFO:teuthology.orchestra.run.smithi129.stderr: 14: (rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority)+0xc8) [0x55f663da9548]
2020-05-27T11:16:47.378 INFO:teuthology.orchestra.run.smithi129.stderr: 15: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x24a) [0x55f663f34cba]
2020-05-27T11:16:47.378 INFO:teuthology.orchestra.run.smithi129.stderr: 16: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x5d) [0x55f663f34e3d]
2020-05-27T11:16:47.378 INFO:teuthology.orchestra.run.smithi129.stderr: 17: (()+0xc2b23) [0x7ff8fab53b23]

rados/objectstore/{backends/objectstore.yaml supported-random-distro$/{rhel_8.yaml}}

/a/kchai-2020-05-27_06:49:52-rados-wip-kefu-testing-2020-05-27-1240-distro-basic-smithi/5094314

#6 Updated by Neha Ojha over 1 year ago

  • Priority changed from Normal to Urgent
2020-05-28T13:12:08.551 INFO:teuthology.orchestra.run.smithi186.stdout:[ RUN      ] ObjectStore/StoreTestSpecificAUSize.ExcessiveFragmentation/2
2020-05-28T13:12:20.161 INFO:teuthology.orchestra.run.smithi186.stdout:written 104857600
2020-05-28T13:12:33.316 INFO:teuthology.orchestra.run.smithi186.stdout:written 209715200
2020-05-28T13:12:38.655 INFO:teuthology.orchestra.run.smithi186.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1967-g2a01d33/rpm/el8/BUILD/ceph-16.0.0-1967-g2a01d33/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f3dca7df700 time 2020-05-28T13:12:38.650207+0000
2020-05-28T13:12:38.656 INFO:teuthology.orchestra.run.smithi186.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1967-g2a01d33/rpm/el8/BUILD/ceph-16.0.0-1967-g2a01d33/src/os/bluestore/BlueFS.cc: 2746: ceph_abort_msg("bluefs enospc")
2020-05-28T13:12:38.656 INFO:teuthology.orchestra.run.smithi186.stderr: ceph version 16.0.0-1967-g2a01d33 (2a01d3330fdf6c2b0908352bc9de48f0d6be148c) pacific (dev)
2020-05-28T13:12:38.656 INFO:teuthology.orchestra.run.smithi186.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f3dd3122a46]
2020-05-28T13:12:38.656 INFO:teuthology.orchestra.run.smithi186.stderr: 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1a60) [0x5634eb10edf0]
2020-05-28T13:12:38.656 INFO:teuthology.orchestra.run.smithi186.stderr: 3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xf3) [0x5634eb10f4c3]
2020-05-28T13:12:38.657 INFO:teuthology.orchestra.run.smithi186.stderr: 4: (BlueRocksWritableFile::Flush()+0x41) [0x5634eb136611]
2020-05-28T13:12:38.657 INFO:teuthology.orchestra.run.smithi186.stderr: 5: (rocksdb::LegacyWritableFileWrapper::Flush(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x5634eb1f759f]
2020-05-28T13:12:38.657 INFO:teuthology.orchestra.run.smithi186.stderr: 6: (rocksdb::WritableFileWriter::Flush()+0x22d) [0x5634eb30b5fd]
2020-05-28T13:12:38.657 INFO:teuthology.orchestra.run.smithi186.stderr: 7: (rocksdb::WritableFileWriter::Sync(bool)+0x1f) [0x5634eb30b77f]
2020-05-28T13:12:38.658 INFO:teuthology.orchestra.run.smithi186.stderr: 8: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0x2ddd) [0x5634eb3d992d]
2020-05-28T13:12:38.658 INFO:teuthology.orchestra.run.smithi186.stderr: 9: (rocksdb::FlushJob::WriteLevel0Table()+0x1003) [0x5634eb26ebb3]
2020-05-28T13:12:38.659 INFO:teuthology.orchestra.run.smithi186.stderr: 10: (rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*)+0xcd) [0x5634eb270c4d]
2020-05-28T13:12:38.659 INFO:teuthology.orchestra.run.smithi186.stderr: 11: (rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::vector<unsigned long, std::allocator<unsigned long> >&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority)+0x3c1) [0x5634eb21b091]
2020-05-28T13:12:38.659 INFO:teuthology.orchestra.run.smithi186.stderr: 12: (rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority)+0x49b) [0x5634eb21ddcb]
2020-05-28T13:12:38.659 INFO:teuthology.orchestra.run.smithi186.stderr: 13: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority)+0xaf5) [0x5634eb21ee95]
2020-05-28T13:12:38.659 INFO:teuthology.orchestra.run.smithi186.stderr: 14: (rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority)+0xc8) [0x5634eb225548]
2020-05-28T13:12:38.660 INFO:teuthology.orchestra.run.smithi186.stderr: 15: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x24a) [0x5634eb3b0cba]
2020-05-28T13:12:38.660 INFO:teuthology.orchestra.run.smithi186.stderr: 16: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x5d) [0x5634eb3b0e3d]
2020-05-28T13:12:38.660 INFO:teuthology.orchestra.run.smithi186.stderr: 17: (()+0xc2b23) [0x7f3dd1e4eb23]
2020-05-28T13:12:38.660 INFO:teuthology.orchestra.run.smithi186.stderr: 18: (()+0x82de) [0x7f3dddafe2de]
2020-05-28T13:12:38.660 INFO:teuthology.orchestra.run.smithi186.stderr: 19: (clone()+0x43) [0x7f3dd152b133]

/a/yuriw-2020-05-28_02:23:45-rados-wip-yuri-master_5.27.20-distro-basic-smithi/5098043/

#7 Updated by Simon Leinen over 1 year ago

We have similar issues on our cluster. ~1 billion objects in EC (8+3) pools, 540 OSDs, Nautilus 14.2.8. No tuning of minimum allocation. Data is on spinning disks, and our SSDs are small so we have a lot of spillover of RocksDB data to spinning disks as well. For the past couple of days, many OSDs have terminated with

/build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED ceph_assert(h->file->fnode.ino != 1)

For some of them, we have been unable to bring them up. Offline compaction of the K/V store fails at the same assertion. The cluster is severely degraded with customer impact.

Any helpful tips would be appreciated. At this stage we're happy to test patches.

#8 Updated by Igor Fedotov over 1 year ago

@Neha, @Kefu - made another ticket for QA test case failure, IMO unrelated.

https://tracker.ceph.com/issues/45788

#9 Updated by Aleksei Zakharov over 1 year ago

We added some new OSD's. During backfill process we were affected by another issue(https://tracker.ceph.com/issues/45765#change-166975).

About 1/4 of PG's is moved to other hosts, but we see that fragmentation rating is not decreasing. Opposite, it is increasing a bit.

Turning on stupid allocator with "bad" OSD turned into an impossibility to start this OSD some time after. It looks like it is a bad idea to leave the OSD work with stupid allocator.

#10 Updated by Igor Fedotov over 1 year ago

Aleksei Zakharov wrote:

Turning on stupid allocator with "bad" OSD turned into an impossibility to start this OSD some time after. It looks like it is a bad idea to leave the OSD work with stupid allocator.

Are you referring to the OSD from comment #4 or it's a different one?
If not - could you please share OSD startup log....

#11 Updated by Igor Fedotov over 1 year ago

For the sake of record the following thread has some helpful info on assertion: h->file->fnode.ino != 1

I did some analysis on log from comment #4 in this this ticket - it has pretty the same issue - huge alocation from main device needs too much space in BlueFS log and overflows its runaway. High main disk fragmentation is the primarily trigger for that.

#12 Updated by Aleksei Zakharov over 1 year ago

Igor Fedotov wrote:

Aleksei Zakharov wrote:

Turning on stupid allocator with "bad" OSD turned into an impossibility to start this OSD some time after. It looks like it is a bad idea to leave the OSD work with stupid allocator.

Are you referring to the OSD from comment #4 or it's a different one?
If not - could you please share OSD startup log....

This is a different OSD, but issue looks the same. I'm trying to get startup log with debug_bluefs=20/20 now, it is going to be about 5GiB when compressed.

#13 Updated by Aleksei Zakharov over 1 year ago

Aleksei Zakharov wrote:

Igor Fedotov wrote:

Aleksei Zakharov wrote:

Turning on stupid allocator with "bad" OSD turned into an impossibility to start this OSD some time after. It looks like it is a bad idea to leave the OSD work with stupid allocator.

Are you referring to the OSD from comment #4 or it's a different one?
If not - could you please share OSD startup log....

This is a different OSD, but issue looks the same. I'm trying to get startup log with debug_bluefs=20/20 now, it is going to be about 5GiB when compressed.

Here's the startup log: https://166924.selcdn.ru/links/osd.30.log.tar.gz

#14 Updated by Igor Fedotov over 1 year ago

OK, here is a backtrace from the OSD.30 log

ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
1: (()+0x11390) [0x7f0f4b186390]
2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsign
ed long, ceph::buffer::v14_2_0::list*, char*)+0x4e5) [0xe4e405]
3: (BlueFS::_replay(bool, bool)+0x489) [0xe4f3f9]
4: (BlueFS::mount()+0x239) [0xe62269]
5: (BlueStore::_open_bluefs(bool)+0x41) [0xd50a91]
6: (BlueStore::_open_db(bool, bool, bool)+0xa1c) [0xd51e0c]
7: (BlueStore::_open_db_and_around(bool)+0x44) [0xd68874]

Which looks similar to https://tracker.ceph.com/issues/24639
I'll provide more throughout analysis a bit later. But it looks like another incarnation of the same bluefs log runaway overflow issue which results in "/build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED ceph_assert(h->file->fnode.ino != 1" assertion.

#15 Updated by Igor Fedotov over 1 year ago

  • Related to Bug #24639: [segfault] segfault in BlueFS::read added

#16 Updated by Aleksei Zakharov over 1 year ago

We faced the "FAILED ceph_assert(h->file->fnode.ino != 1)" issue on the osds with very low fragmentation rating:
"fragmentation_rating": 0.26475483800667898

I assume then, that the fragmentation is not the cause here.
Part of the log before failing:

2020-07-25 11:08:22.297 7f6f489d5700  4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 1737] Generated table #84353: 73533 keys, 54382343 bytes
2020-07-25 11:08:22.297 7f6f489d5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1595664502298797, "cf_name": "default", "job": 1737, "event": "table_file_creation", "file_number": 84353, "file_size": 54382343, "table_properties": {"data_size": 53688970, "index_size": 508637, "filter_size": 183877, "raw_key_size": 13696105, "raw_average_key_size": 186, "raw_value_size": 40504676, "raw_average_value_size": 550, "num_data_blocks": 12972, "num_entries": 73533, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2020-07-25 11:08:22.673 7f6f489d5700  4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 1737] Generated table #84354: 73535 keys, 54384067 bytes
2020-07-25 11:08:22.673 7f6f489d5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1595664502675661, "cf_name": "default", "job": 1737, "event": "table_file_creation", "file_number": 84354, "file_size": 54384067, "table_properties": {"data_size": 53689835, "index_size": 509496, "filter_size": 183877, "raw_key_size": 13696614, "raw_average_key_size": 186, "raw_value_size": 40506039, "raw_average_value_size": 550, "num_data_blocks": 12985, "num_entries": 73535, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2020-07-25 11:08:22.961 7f6f489d5700  1 bluefs _allocate failed to allocate 0x33dd4c5 on bdev 1, free 0x2bc0000; fallback to bdev 2
2020-07-25 11:08:22.961 7f6f489d5700  1 bluefs _allocate unable to allocate 0x33dd4c5 on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7c7b000~1000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7c82000~1000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7c86000~1000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7c9a000~1000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7ca4000~1000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7cae000~1000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7cb1000~2000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7cb5000~1000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7cba000~1000
2020-07-25 11:08:23.997 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0x4e7cdd000~1000
....
2020-07-25 11:08:46.563 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0xcc59d5000~2000
2020-07-25 11:08:46.563 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0xcc59d9000~1000
2020-07-25 11:08:46.563 7f6f489d5700  1 bluefs _add_block_extent bdev 1 0xcc59e0000~1000
2020-07-25 11:08:47.079 7f6f489d5700 -1 /build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f6f489d5700 time 2020-07-25 11:08:47.013006
/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: 2261: FAILED ceph_assert(h->file->fnode.ino != 1)

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x5584cd542e4c]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5584cd543027]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1cdf) [0x5584cdb4983f]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0x5584cdb49cdc]
 5: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x86e) [0x5584cdb4a9fe]
 6: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9c) [0x5584cdb4ce6c]
 7: (BlueRocksWritableFile::Sync()+0x63) [0x5584cdb6d343]
 8: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x3e9) [0x5584ce1ba869]
 9: (rocksdb::WritableFileWriter::Sync(bool)+0x376) [0x5584ce1bdc56]
 10: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::CompactionRangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x1b91) [0x5584ce20ebd1]
 11: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7e0) [0x5584ce20fc90]
 12: (rocksdb::CompactionJob::Run()+0x2ec) [0x5584ce21202c]
 13: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xbbd) [0x5584ce03da1d]
 14: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x12a) [0x5584ce0458ca]
 15: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5584ce045fa7]
 16: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) [0x5584ce2566b4]
 17: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x56) [0x5584ce256846]
 18: (()+0xbd6df) [0x7f6f645e16df]
 19: (()+0x76db) [0x7f6f64efe6db]
 20: (clone()+0x3f) [0x7f6f63c9e88f]

2020-07-25 11:08:47.087 7f6f489d5700 -1 *** Caught signal (Aborted) **
 in thread 7f6f489d5700 thread_name:rocksdb:low30

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (()+0x12890) [0x7f6f64f09890]
 2: (gsignal()+0xc7) [0x7f6f63bbbe97]
 3: (abort()+0x141) [0x7f6f63bbd801]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x5584cd542e9d]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5584cd543027]
 6: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1cdf) [0x5584cdb4983f]
 7: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0x5584cdb49cdc]
 8: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x86e) [0x5584cdb4a9fe]
 9: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9c) [0x5584cdb4ce6c]
 10: (BlueRocksWritableFile::Sync()+0x63) [0x5584cdb6d343]
 11: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x3e9) [0x5584ce1ba869]
 12: (rocksdb::WritableFileWriter::Sync(bool)+0x376) [0x5584ce1bdc56]
 13: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::CompactionRangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x1b91) [0x5584ce20ebd1]
 14: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7e0) [0x5584ce20fc90]
 15: (rocksdb::CompactionJob::Run()+0x2ec) [0x5584ce21202c]
 16: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xbbd) [0x5584ce03da1d]
 17: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x12a) [0x5584ce0458ca]
 18: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5584ce045fa7]
 19: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) [0x5584ce2566b4]
 20: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x56) [0x5584ce256846]
 21: (()+0xbd6df) [0x7f6f645e16df]
 22: (()+0x76db) [0x7f6f64efe6db]
 23: (clone()+0x3f) [0x7f6f63c9e88f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

After that it cannot start with bitmap allocator showing the trace above.
We switched the osds to stupid allocator, and they work for now. OSD's are utilized for ~60%.

We have 4 OSDs failed in ten minutes with this problem. 3 of them were able to start with stupid allocator, but one of them doesn't. All of them were added to rebalance data and help to reduce fragmentation.
I've run

CEPH_ARGS="--bluefs-shared-alloc-size 4096 --bluestore_allocator stupid --bluefs_allocator stupid --debug_bluefs 20/20" ceph-kvstore-tool bluestore-kv  /var/lib/ceph/osd/ceph-668/ compact 2>&1
for the borken OSD to get log attached here: https://yadi.sk/d/RjOt0bqCy7cw6g
As you can see, even this command is unable to work here.

This issue looks more and more dangerous. What else can we provide to help to find the root cause of this?

#17 Updated by Igor Fedotov over 1 year ago

Alexei,
I presume you have plenty of "bluefs _add_block_extent bdev..." lines immediately prior to the assertion. And it was bitmap allocator which was in-effect at that moment. If so to both having that many lines denotes pretty small allocations of 4K-8K which implicitly indicates high fragmentation. Well - this might be "local" fragmentation, i.e. space is highly fragmented just locally and allocator happens to step into this area. Overall fragmentation ratio might still be fine though. Using stupid allocator (or preferably hybrid one for future releases) would help in this case.

Unfortunately sometimes "h->file->fnode.ino != 1" might result in a subsequent bluefs log reply error. That's what also happened to osd.30 in comment #13.
See https://github.com/ceph/ceph/pull/35473 for the fix to prevent data corruption
and https://github.com/ceph/ceph/pull/36121 to recover - don't know how effective it is though...

What I don't like in your log (https://yadi.sk/d/RjOt0bqCy7cw6g) are pretty long extent lists for bluefs files. E.g.

BlueFS log file:
-5> 2020-07-25 14:53:30.807 7fa473a2ed80 10 bluefs _read h 0x55b9d1b76260 0x2ecc9000~731000 from file(ino 1 size 0x2ecc8000 mtime 0.000000 bdev 0 allocated 2f059000 e
xtents [1:0x9b01ead000~59000,1:0x9b01980000~1ea000,1:0x9b01c97000~216000,1:0x3842c71000~2a000,1:0x3842e43000~7000,1:0x3885bb7000~e1000,1:0x38adb58000~68000,1:0x38add66000
~111000,1:0x5412d8f000~3c000,1:0x5413340000~105000,
.....

KV WAL file:
-10> 2020-07-25 14:53:30.807 7fa473a2ed80 20 bluefs _replay 0x2ecc6000: op_file_update file(ino 73754 size 0xde59d96 mtime 2020-07-25 14:13:34.111964 bdev 0 allocate
d de5a000 extents [1:0x9b0197f000~1000,1:0x9b01f06000~1000,1:0x38427a8000~10000,1:0x384298d000~64000,1:0x38429f3000~5000,1:0x3842a09000~59000,1:0x3842c0a000~67000,1:0x541
3511000~1af000,1:0x541942c000~118000,1:0x5419acd000~2000,1:0x543e50f000~9c000,1:0x543ec00000~1ba000,1:0x543f6cf000~12d000,1:0x543f9cf000~31000,1:0x543fb21000~4a000,1:0x543fd1c000~1a000, ....

This is likely to be caused by either allocation fragmentation or low bluefs_shared_alloc_size or both. And it results in large file entries in bluefs log and hence raises the chances to overflow this log's runaway. You might want to increase bluefs_max_log_runway value to workaround but please use it with care - it might reveal some additional hidden issues... So long-term I'd start fighting that large extent lists. Upgrading to recent versions might be a good first step on this way...

BTW what was the rationale to set bluefs_shared_alloc_size to 4K? Well - this might provide some more space when dealing with almost full device but generally I'd recommend to avoid very high device's filling anyway.

#18 Updated by Aleksei Zakharov over 1 year ago

Igor, thanks for the answer,
bluefs_shared_alloc_size was set because we saw a problem with bluefs allocation:

ceph daemon osd.46 bluestore bluefs available
{
    "dev": {
        "device": "BDEV_DB",
        "free": 1776680960
    },
    "available_from_bluestore": 0
}
ceph daemon osd.46 bluestore bluefs available 4096
{
    "dev": {
        "device": "BDEV_DB",
        "free": 1782775808
    },
    "available_from_bluestore": 269867220992
}

OSDs failed with assert "bluefs_enospc":
    "os_version_id": "16.04", 
    "assert_condition": "abort", 
    "utsname_release": "4.15.0-54-generic", 
    "os_name": "Ubuntu", 
    "entity_name": "osd.6", 
    "assert_file": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc", 
    "timestamp": "2020-05-09 08:05:57.367574Z", 
    "process_name": "ceph-osd", 
    "utsname_machine": "x86_64", 
    "assert_line": 2269, 
    "utsname_sysname": "Linux", 
    "os_version": "16.04.6 LTS (Xenial Xerus)", 
    "os_id": "ubuntu", 
    "assert_thread_name": "rocksdb:low7", 
    "utsname_version": "#58~16.04.1-Ubuntu SMP Mon Jun 24 13:21:41 UTC 2019", 
    "backtrace": [
        "(()+0x11390) [0x7fc585c8b390]", 
        "(gsignal()+0x38) [0x7fc5851b6428]", 
        "(abort()+0x16a) [0x7fc5851b802a]", 
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b0) [0x85394b]", 
        "(BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1e63) [0xe585d3]", 
        "(BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0xe588ec]", 
        "(BlueRocksWritableFile::Flush()+0x3d) [0xe7be9d]", 
        "(rocksdb::WritableFileWriter::Flush()+0x45f) [0x14cb23f]", 
        "(rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x14cc50b]", 
        "(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::CompactionRangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x1b91) [0x151d7c1]", 
        "(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0xa8c) [0x151eb2c]", 
        "(rocksdb::CompactionJob::Run()+0x2ec) [0x1520c1c]", 
        "(rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xbbd) [0x134c64d]", 
        "(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x12a) [0x13544fa]", 
        "(rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x1354bd7]", 
        "(rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) [0x15652a4]", 
        "(rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x56) [0x1565436]", 
        "/usr/bin/ceph-osd() [0x1661f2f]", 
        "(()+0x76ba) [0x7fc585c816ba]", 
        "(clone()+0x6d) [0x7fc58528841d]" 
    ], 
    "utsname_hostname": "meta101", 
    "assert_msg": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fc57536b700 time 2020-05-09 11:05:57.327807\n/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: 2269: ceph_abort_msg(\"bluefs enospc\")\n", 
    "crash_id": "2020-05-09_08:05:57.367574Z_bcfca643-be35-4449-9f90-46b8980b96be", 
    "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)", 
    "ceph_version": "14.2.7" 

I probably had to write about this earlier:) because, this is related to the problems described here in the ticket.

#19 Updated by Neha Ojha over 1 year ago

  • Priority changed from Urgent to Normal

At this point, redeploying the OSD might be the best way to go.

#20 Updated by Aleksei Zakharov over 1 year ago

Redeploying the OSD is the only way if it doesn't start, of course. But the problem occurs with different OSDs from time to time, so the cluster needs a lot of human intervention now.

#21 Updated by Mohammed Naser over 1 year ago

I'm running into this heavily which makes it impossible to backfill a new cluster. I'm seeing very high fragmentation ratings on OSDs (70%+) even though they are actually being used at around 20-30% according to Ceph...

Should we perhaps look into monitoring the fragmentation scores of OSDs and.. figure out a way to defragment them? This is using NVMe OSDs as well with a lot of radosgw objects (indexes stored on them).

#22 Updated by Mohammed Naser over 1 year ago

I did a whole bunch of digging today and found out that for some reason, there is a non-trivial amount of fragmentation happening in the cluster, even in brand new OSDs.

For example, osd.398 in this cluster reports a score of 0.81 in fragmentation however...

ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL    %USE  VAR  PGS STATUS TYPE NAME
...
398   ssd   0.34999  1.00000 931 GiB 274 GiB 260 GiB  14 GiB  967 MiB  657 GiB 29.46 0.63  38     up                 osd.398              

This was surfacing a lot in the process of adding new drives to the system. The OSD is not that old...

  File: /var/lib/ceph/osd/ceph-398/ready
  Size: 6             Blocks: 8          IO Block: 4096   regular file
Device: ebh/235d    Inode: 514697      Links: 1
Access: (0600/-rw-------)  Uid: (64045/    ceph)   Gid: (64045/    ceph)
Access: 2020-10-06 21:23:06.216858059 +0000
Modify: 2020-08-21 23:25:11.691805173 +0000
Change: 2020-08-21 23:25:11.703805290 +0000
 Birth: -

So running for something like just over a month and it got that bad?

#23 Updated by Igor Fedotov over 1 year ago

Mohammed Naser wrote:

I did a whole bunch of digging today and found out that for some reason, there is a non-trivial amount of fragmentation happening in the cluster, even in brand new OSDs.

For example, osd.398 in this cluster reports a score of 0.81 in fragmentation however...

[...]

This was surfacing a lot in the process of adding new drives to the system. The OSD is not that old...

[...]

So running for something like just over a month and it got that bad?

Hi Mohammed.

You didn't mentioned which Ceph release you're using and what are bluestore_allocator setting currently is.
But I presume it's bitmap. Perhaps under some use scenarios (something like plenty of overwrites/removals evenly distrubuted over stored data) it might cause such a high fragmentation indeed. I'd recommend to migrate to Hybrid (or at least AVL) allocator. This will avoid going things even worse and has some potential to improve the situation. The latter hardly depends on the use patterns again though so please do not expect much on this way.

#24 Updated by Mohammed Naser over 1 year ago

Thanks for responding, this is actually running Nautilus with the default allocator (so `hybrid`). Perhaps we should look into AVL instead?

https://github.com/ceph/ceph/blob/v14.2.11/src/common/options.cc#L4721-L4725

#25 Updated by Igor Fedotov over 1 year ago

Mohammed Naser wrote:

Thanks for responding, this is actually running Nautilus with the default allocator (so `hybrid`). Perhaps we should look into AVL instead?

https://github.com/ceph/ceph/blob/v14.2.11/src/common/options.cc#L4721-L4725

Most liekly AVL wouldn't provide much improvement since it's Hybrid allocator is basically AVL one coupled with bitmap allocator to avoid to much RAM usage for highly fragmented cases...
Are you using v14.2.11 from the beginning or OSDs suffering from high fragmentation were deployed (and used) with the previous Nautilus sub-releases?

#26 Updated by Mohammed Naser over 1 year ago

Are you using v14.2.11 from the beginning or OSDs suffering from high fragmentation were deployed (and used) with the previous Nautilus sub-releases?

These 'new' OSDs that are suffering from high fragmentation were deployed with 14.2.5 and then upgraded to 14.2.11 afterwards (when we started digging into the issue to see if a newer release also has problems).

#27 Updated by Mohammed Naser over 1 year ago

I just redeployed an OSD that was created at 2020-10-16 16:35:49.954951487 +0000, it is currently Sun Oct 18 15:13:07 UTC 2020. The bluestore allocator score is currently at 0.6336798815977652

Will I just have to live with this because of my high churn in terms of the number of volumes in my RBD (a lot of create/write a lot of data/delete) pattern?

#28 Updated by Igor Fedotov over 1 year ago

You might want to use free-dump command to inspect actual free extents layout:

ceph-bluestore-tool --path <> free-dump

I'd like to take a look at it as well if possible...

Additionally wondering how do you learn fragmentation ratio? There are actually two commands to do that (using a bit different algorithm though) : "score" and "fragmentation".

#29 Updated by Mohammed Naser about 1 year ago

This is what it looks like:

http://sprunge.us/MypV0b

And the high fragmentation is using the score. Sorry, we didn't get back here because we just stopped making cluster changes and now as we're emptying some OSDs, seeing crashes again.

#30 Updated by Mohammed Naser about 1 year ago

Interestingly enough, I was able to get the OSD to start back up with the stupid allocator.. does that help?

#31 Updated by Seena Fallah 11 months ago

I faced the same issue in 14.2.14
My cluster was recovering the degraded pgs and one of my OSDs' db get full! After switching to stupid allocator it raise the db size about 20GB and after that I switch back to bitmap.
This was happening when buckets.index pool was recovering.
This OSD doesn't have a separate db block.
Why didn't it spillover to block data? Should I always monitor a db size and db used to see if it's getting full? If yes how can I grow it? SHouldn't it automatically grow up db size when not using separate db block?

Also available in: Atom PDF