Bug #45519

open

OSD asserts during block allocation for BlueFS

Added by Aleksei Zakharov almost 4 years ago. Updated about 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all.
We use Ceph as RADOS object storage for small (<=4K) objects. We have about 2 billion objects in one pool. We do not use separate WAL/DB partitions; instead, each OSD uses a full NVMe drive. To use the BlueStore space effectively we've set:

bluestore_min_alloc_size_ssd 4096
bluestore_allocator bitmap
bluefs_allocator bitmap
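
For reference, a minimal sketch of how these could be expressed in ceph.conf (assuming an [osd] section; they could equally be applied through the config database):

[osd]
# 4K minimum allocation unit on SSD, bitmap allocators for both BlueStore and BlueFS
bluestore_min_alloc_size_ssd = 4096
bluestore_allocator = bitmap
bluefs_allocator = bitmap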

A few days ago we hit a problem: some OSDs started aborting with the following message:

    "os_version_id": "16.04", 
    "assert_condition": "abort", 
    "utsname_release": "4.15.0-54-generic", 
    "os_name": "Ubuntu", 
    "entity_name": "osd.6", 
    "assert_file": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc", 
    "timestamp": "2020-05-09 08:05:57.367574Z", 
    "process_name": "ceph-osd", 
    "utsname_machine": "x86_64", 
    "assert_line": 2269, 
    "utsname_sysname": "Linux", 
    "os_version": "16.04.6 LTS (Xenial Xerus)", 
    "os_id": "ubuntu", 
    "assert_thread_name": "rocksdb:low7", 
    "utsname_version": "#58~16.04.1-Ubuntu SMP Mon Jun 24 13:21:41 UTC 2019", 
    "backtrace": [
        "(()+0x11390) [0x7fc585c8b390]", 
        "(gsignal()+0x38) [0x7fc5851b6428]", 
        "(abort()+0x16a) [0x7fc5851b802a]", 
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b0) [0x85394b]", 
        "(BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1e63) [0xe585d3]", 
        "(BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0xe588ec]", 
        "(BlueRocksWritableFile::Flush()+0x3d) [0xe7be9d]", 
        "(rocksdb::WritableFileWriter::Flush()+0x45f) [0x14cb23f]", 
        "(rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x14cc50b]", 
        "(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::CompactionRangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x1b91) [0x151d7c1]", 
        "(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0xa8c) [0x151eb2c]", 
        "(rocksdb::CompactionJob::Run()+0x2ec) [0x1520c1c]", 
        "(rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xbbd) [0x134c64d]", 
        "(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x12a) [0x13544fa]", 
        "(rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x1354bd7]", 
        "(rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) [0x15652a4]", 
        "(rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x56) [0x1565436]", 
        "/usr/bin/ceph-osd() [0x1661f2f]", 
        "(()+0x76ba) [0x7fc585c816ba]", 
        "(clone()+0x6d) [0x7fc58528841d]" 
    ], 
    "utsname_hostname": "meta101", 
    "assert_msg": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fc57536b700 time 2020-05-09 11:05:57.327807\n/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: 2269: ceph_abort_msg(\"bluefs enospc\")\n", 
    "crash_id": "2020-05-09_08:05:57.367574Z_bcfca643-be35-4449-9f90-46b8980b96be", 
    "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)", 
    "ceph_version": "14.2.7" 

We found that `bluestore bluefs available` reported the space available from BlueStore as 0:

{
    "dev": {
        "device": "BDEV_DB",
        "free": 768016384
    },
    "available_from_bluestore": 0
}
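
For reference, that output comes from the admin socket, via something like (osd.6 is just an example id):

ceph daemon osd.6 bluestore bluefs available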

OSDs were failing and restarting.

We also saw that BlueStore fragmentation was high, but we're not sure whether it is related to the problem:

ceph daemon osd.46 bluestore allocator score block
{
    "fragmentation_rating": 0.92772435210430959
}

Some time later, one of the "bad" OSDs failed and could not start again, hitting another assert:

{
    "os_version_id": "16.04",
    "assert_condition": "h->file->fnode.ino != 1",
    "utsname_release": "4.15.0-54-generic",
    "os_name": "Ubuntu",
    "entity_name": "osd.14",
    "assert_file": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc",
    "timestamp": "2020-05-12 12:24:20.755439Z",
    "process_name": "ceph-osd",
    "utsname_machine": "x86_64",
    "assert_line": 2261,
    "utsname_sysname": "Linux",
    "os_version": "16.04.6 LTS (Xenial Xerus)",
    "os_id": "ubuntu",
    "assert_thread_name": "ceph-osd",
    "utsname_version": "#58~16.04.1-Ubuntu SMP Mon Jun 24 13:21:41 UTC 2019",
    "backtrace": [
        "(()+0x11390) [0x7f46b4572390]",
        "(gsignal()+0x38) [0x7f46b3a9d428]",
        "(abort()+0x16a) [0x7f46b3a9f02a]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x853221]",
        "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x8533ab]",
        "(BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1cdf) [0xe5844f]",
        "(BlueFS::_flush(BlueFS::FileWriter*, bool)+0x11c) [0xe588ec]",
        "(BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x86e) [0xe5960e]",
        "(BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x2fb) [0xe6931b]",
        "(BlueFS::sync_metadata()+0x215) [0xe6ad65]",
        "(BlueFS::umount()+0xea) [0xe6b08a]",
        "(BlueStore::_close_bluefs()+0xd) [0xd25d0d]",
        "(BlueStore::_open_db_and_around(bool)+0x157) [0xd68987]",
        "(BlueStore::_mount(bool, bool)+0x584) [0xdb83b4]",
        "(OSD::init()+0x3b7) [0x901b17]",
        "(main()+0x283a) [0x85921a]",
        "(__libc_start_main()+0xf0) [0x7f46b3a88830]",
        "(_start()+0x29) [0x88d569]" 
    ],
    "utsname_hostname": "meta102",
    "assert_msg": "/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f46b68d7c00 time 2020-05-12 15:24:20.743666\n/build/ceph-14.2.7/src/os/bluestore/BlueFS.cc: 2261: FAILED ceph_assert(h->file->fnode.ino != 1)\n",
    "crash_id": "2020-05-12_12:24:20.755439Z_00b51d40-536f-4124-973f-c5bf8d874bb2",
    "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)",
    "ceph_version": "14.2.7" 
}

After setting `bluefs_alloc_size` to `4096` we saw that `available_from_bluestore` became normal (much greater than 0). We also set `bluefs_shared_alloc_size` to `4096`.

No new OSDs failed after that, but the OSDs that had already failed once kept failing occasionally, and after a few restarts they were unable to start at all.
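
For reference, a sketch of how these can be applied cluster-wide through the config database (assuming the OSDs pick the values up on restart):

ceph config set osd bluefs_alloc_size 4096
ceph config set osd bluefs_shared_alloc_size 4096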

We found that the way to start a failed OSD is to set the following for it:

bluestore_allocator stupid
bluefs_allocator stupid

After it starts up, it is possible to stop the daemon, set the allocators back to `bitmap`, and start it again. The OSDs work fine without failures after that.
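
For reference, a rough sketch of that recovery sequence (osd.14 is just an example id and the systemd unit name is an assumption; the exact commands depend on the deployment):

# temporarily switch the failed OSD to the stupid allocator and start it
ceph config set osd.14 bluestore_allocator stupid
ceph config set osd.14 bluefs_allocator stupid
systemctl start ceph-osd@14

# once it is up, switch back to bitmap and restart
systemctl stop ceph-osd@14
ceph config set osd.14 bluestore_allocator bitmap
ceph config set osd.14 bluefs_allocator bitmap
systemctl start ceph-osd@14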

For now we have one OSD that does not start up, failing with the `ceph_assert(h->file->fnode.ino != 1)` message. It is out of the CRUSH map, so it is possible to collect any debug information from it if needed.

The whole situation looks like a bug in the bitmap allocator.

It seems related to https://tracker.ceph.com/issues/40741.
What is not clear to me: why did `bluefs_alloc_size` affect `available_from_bluestore`, but `bluefs_shared_alloc_size` did not, in this setup (with one device per OSD)? In what scenario do we need to use the latter?

Do we need to check some extra allocator options when we have this kind of workload, i.e. billions of very small objects?
We also saw that when backfill starts (after turning on the balancer module or changing an OSD weight), latency starts growing, and it only falls back to normal some time after the rebalance has finished. I'm not quite sure this is related to this issue; if it definitely is not, we'll open a new issue after collecting more details on it.


Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #24639: [segfault] segfault in BlueFS::read (Can't reproduce, 2018-06-23)
