Bug #63110

closed

Crash in RocksDBBlueFSVolumeSelector::sub_usage via BlueFS::fsync via WriteToWAL in KVSyncThread

Added by Witold Baryluk 7 months ago. Updated 6 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I got a random crash.

10 storage nodes, 14 HDDs each; 3 mon, 3 mgr, 3 mds, plus 30 rgw. Quite a bit of load.

We had been running Octopus on the whole cluster for at least a year.

We upgraded to Pacific today.

Initially all was good.

A few hours after the upgrade finished, one of the OSD daemons crashed. (The remaining 111 OSD daemons look fine at the moment.)

No issues found in dmesg or smartctl for this node/disk. I can read the disk (e.g. using fdisk or dd).

{
    "assert_condition": "cur >= p.length",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/os/bluestore/BlueStore.h",
    "assert_func": "virtual void RocksDBBlueFSVolumeSelector::sub_usage(void*, const bluefs_fnode_t&)",
    "assert_line": 3870,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/os/bluestore/BlueStore.h: In function 'virtual void RocksDBBlueFSVolumeSelector::sub_usage(void*, const bluefs_fnode_t&)' thread 7f4b82f1f700 time 2023-10-05T13:13:43.560373+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/os/bluestore/BlueStore.h: 3870: FAILED ceph_assert(cur >= p.length)\n",
    "assert_thread_name": "bstore_kv_sync",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7f4b9c6b9cf0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x55dd40c72d0b]",
        "/usr/bin/ceph-osd(+0x584ed4) [0x55dd40c72ed4]",
        "(RocksDBBlueFSVolumeSelector::sub_usage(void*, bluefs_fnode_t const&)+0x16a) [0x55dd412efaaa]",
        "(BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned long)+0x77d) [0x55dd413801cd]",
        "(BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55dd41380670]",
        "(BlueFS::fsync(BlueFS::FileWriter*)+0x18b) [0x55dd4139ca6b]",
        "(BlueRocksWritableFile::Sync()+0x18) [0x55dd413ac768]",
        "(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55dd4184f96f]",
        "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x402) [0x55dd419611c2]",
        "(rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x55dd41962808]",
        "(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x309) [0x55dd418630c9]",
        "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x2629) [0x55dd4186bc69]",
        "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21) [0x55dd4186be61]",
        "(RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x84) [0x55dd4180a644]",
        "(RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x9a) [0x55dd4180b04a]",
        "(BlueStore::_kv_sync_thread()+0x30d8) [0x55dd412edec8]",
        "(BlueStore::KVSyncThread::entry()+0x11) [0x55dd41315b61]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f4b9c6af1ca]",
        "clone()" 
    ],
    "ceph_version": "16.2.14",
    "crash_id": "2023-10-05T13:13:43.571785Z_a25ce619-edb3-4490-bd7c-d55307cbf1f1",
    "entity_name": "osd.304",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "82c7be719cabd69c1cde16b44210ffee7d7c1530c415bf2f9faf1b5601253e00",
    "timestamp": "2023-10-05T13:13:43.571785Z",
    "utsname_hostname": "fooobar03",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.25.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Nov 3 10:29:07 UTC 2021" 
}

Related issues 1 (0 open, 1 closed)

Is duplicate of bluestore - Bug #53907: BlueStore.h: 4148: FAILED ceph_assert(cur >= p.length) (Resolved, Adam Kupczyk)

Actions #1

Updated by Igor Fedotov 7 months ago

Most likely this is a duplicate of https://tracker.ceph.com/issues/53907
The relevant Pacific backport is pending review/QA at the moment; see https://github.com/ceph/ceph/pull/53587

Actions #2

Updated by Igor Fedotov 6 months ago

  • Is duplicate of Bug #53907: BlueStore.h: 4148: FAILED ceph_assert(cur >= p.length) added
Actions #3

Updated by Igor Fedotov 6 months ago

  • Status changed from New to Duplicate
