Bug #63110
Crash in RocksDBBlueFSVolumeSelector::sub_usage via BlueFS::fsync via WriteToWAL in KVSyncThread
Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor
Description
I got a random crash.
The cluster has 10 storage nodes with 14 HDDs each, plus 3 mon, 3 mgr, 3 mds, and 30 rgw daemons, under quite a bit of load.
We had been running Octopus across the whole cluster for at least a year, and upgraded to Pacific today.
Initially all was good.
A few hours after the upgrade finished, one of the OSD daemons crashed (the remaining 111 OSD daemons look fine at the moment).
No issues were found in dmesg or smartctl for this node/disk, and I can read the disk (e.g. with fdisk or dd).
{
  "assert_condition": "cur >= p.length",
  "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/os/bluestore/BlueStore.h",
  "assert_func": "virtual void RocksDBBlueFSVolumeSelector::sub_usage(void*, const bluefs_fnode_t&)",
  "assert_line": 3870,
  "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/os/bluestore/BlueStore.h: In function 'virtual void RocksDBBlueFSVolumeSelector::sub_usage(void*, const bluefs_fnode_t&)' thread 7f4b82f1f700 time 2023-10-05T13:13:43.560373+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/os/bluestore/BlueStore.h: 3870: FAILED ceph_assert(cur >= p.length)\n",
  "assert_thread_name": "bstore_kv_sync",
  "backtrace": [
    "/lib64/libpthread.so.0(+0x12cf0) [0x7f4b9c6b9cf0]",
    "gsignal()",
    "abort()",
    "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x55dd40c72d0b]",
    "/usr/bin/ceph-osd(+0x584ed4) [0x55dd40c72ed4]",
    "(RocksDBBlueFSVolumeSelector::sub_usage(void*, bluefs_fnode_t const&)+0x16a) [0x55dd412efaaa]",
    "(BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned long)+0x77d) [0x55dd413801cd]",
    "(BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55dd41380670]",
    "(BlueFS::fsync(BlueFS::FileWriter*)+0x18b) [0x55dd4139ca6b]",
    "(BlueRocksWritableFile::Sync()+0x18) [0x55dd413ac768]",
    "(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55dd4184f96f]",
    "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x402) [0x55dd419611c2]",
    "(rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x55dd41962808]",
    "(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x309) [0x55dd418630c9]",
    "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x2629) [0x55dd4186bc69]",
    "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21) [0x55dd4186be61]",
    "(RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x84) [0x55dd4180a644]",
    "(RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x9a) [0x55dd4180b04a]",
    "(BlueStore::_kv_sync_thread()+0x30d8) [0x55dd412edec8]",
    "(BlueStore::KVSyncThread::entry()+0x11) [0x55dd41315b61]",
    "/lib64/libpthread.so.0(+0x81ca) [0x7f4b9c6af1ca]",
    "clone()"
  ],
  "ceph_version": "16.2.14",
  "crash_id": "2023-10-05T13:13:43.571785Z_a25ce619-edb3-4490-bd7c-d55307cbf1f1",
  "entity_name": "osd.304",
  "os_id": "centos",
  "os_name": "CentOS Stream",
  "os_version": "8",
  "os_version_id": "8",
  "process_name": "ceph-osd",
  "stack_sig": "82c7be719cabd69c1cde16b44210ffee7d7c1530c415bf2f9faf1b5601253e00",
  "timestamp": "2023-10-05T13:13:43.571785Z",
  "utsname_hostname": "fooobar03",
  "utsname_machine": "x86_64",
  "utsname_release": "4.18.0-305.25.1.el8_4.x86_64",
  "utsname_sysname": "Linux",
  "utsname_version": "#1 SMP Wed Nov 3 10:29:07 UTC 2021"
}
Updated by Igor Fedotov 7 months ago
Most likely this is a duplicate of https://tracker.ceph.com/issues/53907
Relevant Pacific backport is pending review/QA at the moment, see https://github.com/ceph/ceph/pull/53587
Updated by Igor Fedotov 6 months ago
- Is duplicate of Bug #53907: BlueStore.h: 4148: FAILED ceph_assert(cur >= p.length) added