Bug #46804
nautilus: upgrade:nautilus-p2p/nautilus-p2p-stress-split: BlueFS::_flush_and_sync_log ceph_assert(r == 0)
Status: Closed
Description
2020-07-30T21:32:34.927 INFO:tasks.ceph.osd.0.smithi058.stderr:2020-07-30 21:32:34.921 7fd444355700 -1 bluestore(/var/lib/ceph/osd/ceph-0) allocate_bluefs_freespace failed to allocate on 0x3ff80000 min_size 0x400000 > allocated total 0x0 bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 1e5e5c000
2020-07-30T21:32:34.927 INFO:tasks.ceph.osd.0.smithi058.stderr:/build/ceph-14.2.10-254-g76851635289/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7fd444355700 time 2020-07-30 21:32:34.925698
2020-07-30T21:32:34.927 INFO:tasks.ceph.osd.0.smithi058.stderr:/build/ceph-14.2.10-254-g76851635289/src/os/bluestore/BlueFS.cc: 2197: FAILED ceph_assert(r == 0)
2020-07-30T21:32:34.928 INFO:tasks.ceph.osd.0.smithi058.stderr:2020-07-30 21:32:34.921 7fd444355700 -1 bluefs _allocate failed to expand slow device to fit +0x400000
2020-07-30T21:32:34.939 INFO:teuthology.orchestra.run.smithi058.stderr:osd.1: osd_enable_op_tracker = 'true'
2020-07-30T21:32:34.940 INFO:tasks.ceph.osd.0.smithi058.stderr: ceph version 14.2.10-254-g76851635289 (7685163528909e74749288840a480c00c24d4152) nautilus (stable)
2020-07-30T21:32:34.940 INFO:tasks.ceph.osd.0.smithi058.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x562c8121bb14]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x562c8121bcef]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x1a95) [0x562c8182b3c5]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9c) [0x562c8182c6fc]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 5: (BlueRocksWritableFile::Sync()+0x63) [0x562c8184cd23]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x3e9) [0x562c81ebd759]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 7: (rocksdb::WritableFileWriter::Sync(bool)+0x3a6) [0x562c81ec0bf6]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 8: (rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x32c) [0x562c81d2b93c]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x245e) [0x562c81d349be]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x30) [0x562c81d35030]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 11: (RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x81) [0x562c81ce27b1]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 12: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x97) [0x562c81ce3197]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 13: (BlueStore::_kv_sync_thread()+0x205a) [0x562c8178da6a]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 14: (BlueStore::KVSyncThread::entry()+0xd) [0x562c817af9ed]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 15: (()+0x76db) [0x7fd4530686db]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 16: (clone()+0x3f) [0x7fd451e08a3f]
Note that this is an upgrade test, where half of the cluster has been upgraded to the latest nautilus. Also, the test uses objectstore/default, which means the upgraded and non-upgraded OSDs run different allocators.
upgrade:nautilus-p2p/nautilus-p2p-stress-split/{0-cluster/{openstack start} 1-ceph-install/nautilus 1.1.short_pg_log 2-partial-upgrade/firsthalf 3-thrash/default 4-workload/{fsx radosbench rbd-cls rbd-import-export rbd_api readwrite snaps-few-objects} 5-finish-upgrade 7-final-workload/{rbd-python rgw-swift snaps-many-objects} objectstore/default supported-all-distro/ubuntu_latest thrashosds-health}
/a/yuriw-2020-07-30_20:39:04-upgrade:nautilus-p2p-wip-yuri-testing-2020-07-30-1921-nautilus-distro-basic-smithi/5270997
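The failing path can be sketched as follows. This is a hedged, self-contained simulation using mock types (MockAllocator and flush_and_sync_log are illustrative names, not the real Ceph classes): BlueFS must extend its journal during _flush_and_sync_log, and it treats a failed allocation as fatal, which is the "FAILED ceph_assert(r == 0)" in the trace above. The log also shows the characteristic symptom: a large "available" figure (0x1e5e5c000) yet the allocation of min_size 0x400000 fails.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstdint>

// Mock allocator illustrating the failure mode from the log: plenty of
// free space overall, but no contiguous extent of at least min_size.
struct MockAllocator {
  uint64_t free_bytes;      // total free space reported by the allocator
  uint64_t largest_chunk;   // largest contiguous free extent

  // Returns 0 on success, -ENOSPC when no extent >= min_size exists.
  int allocate(uint64_t want, uint64_t min_size, uint64_t* got) {
    if (largest_chunk < min_size)
      return -ENOSPC;
    *got = std::min(want, largest_chunk);
    largest_chunk -= *got;
    free_bytes -= *got;
    return 0;
  }
};

// Stand-in for the failing step: grow the log by min_size 0x400000 and
// return the allocator's result (real BlueFS follows this with
// ceph_assert(r == 0), so -ENOSPC aborts the whole OSD).
int flush_and_sync_log(MockAllocator& alloc) {
  uint64_t got = 0;
  return alloc.allocate(0x400000, 0x400000, &got);
}
```

Under this sketch, the assert is not a shortage of space but a shortage of contiguous space (or an allocator accounting problem) surfacing at a point in BlueFS that has no error-recovery path.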
Updated by Josh Durgin about 3 years ago
A similar ENOSPC issue appeared in failed and dead jobs in 14.2.17 nautilus-p2p testing: https://pulpito.ceph.com/teuthology-2021-03-06_03:49:44-upgrade:nautilus-p2p-nautilus-distro-basic-smithi/ and https://pulpito.ceph.com/yuriw-2021-03-08_19:14:19-upgrade:nautilus-p2p-nautilus-distro-basic-smithi
The only pattern was that the failures occurred only with the hybrid allocator. Some of the jobs crashed in earlier versions (e.g. 14.2.2), before reaching a point where they would be using the hybrid allocator; others got all the way to 14.2.16.
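To rule the hybrid allocator in or out while reproducing, one option is to pin the allocator explicitly rather than rely on the release default. A hedged illustration (these are the standard BlueStore option names, but verify the valid values against your release) in ceph.conf:

```ini
[osd]
# Force a specific allocator instead of the release default
# (valid values include: bitmap, stupid, hybrid).
bluestore_allocator = bitmap
bluefs_allocator = bitmap
```

OSDs pick these up at startup, so the setting must be in place before the upgraded daemons restart.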
2021-03-08 20:31:18.245 7ff70053f700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_alloc_write failed to allocate 0x4000 allocated 0x 0 min_alloc_size 0x4000 available 0x 0
2021-03-08 20:31:18.245 7ff70053f700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_write _do_alloc_write failed with (28) No space left on device
2021-03-08 20:31:18.245 7ff70053f700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _txc_add_transaction error (28) No space left on device not handled on operation 10 (op 0, counting from 0)
2021-03-08 20:31:18.245 7ff70053f700 -1 bluestore(/var/lib/ceph/osd/ceph-0) ENOSPC from bluestore, misconfigured cluster
2021-03-08 20:31:18.245 7ff70053f700  0 _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "write",
            "collection": "meta",
            "oid": "#-1:4b4b4e63:::inc_osdmap.880:0#",
            "length": 1284,
            "offset": 0,
            "bufferlist length": 1284
        },
        {
            "op_num": 1,
            "op_name": "write",
            "collection": "meta",
            "oid": "#-1:a6032935:::osdmap.880:0#",
            "length": 6040,
            "offset": 0,
            "bufferlist length": 6040
        },
        {
            "op_num": 2,
            "op_name": "write",
            "collection": "meta",
            "oid": "#-1:46e858f0:::pg_num_history:head#",
            "length": 406,
            "offset": 0,
            "bufferlist length": 406
        },
        {
            "op_num": 3,
            "op_name": "write",
            "collection": "meta",
            "oid": "#-1:7b3f43c4:::osd_superblock:0#",
            "length": 538,
            "offset": 0,
            "bufferlist length": 538
        }
    ]
}
2021-03-08 20:31:18.324 7ff6ec990700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16-432-g9b829698/rpm/el7/BUILD/ceph-14.2.16-432-g9b829698/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7ff6ec990700 time 2021-03-08 20:31:18.274039
/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16-432-g9b829698/rpm/el7/BUILD/ceph-14.2.16-432-g9b829698/src/os/bluestore/BlueStore.cc: 12384: ceph_abort_msg("unexpected error")
 ceph version 14.2.16-432-g9b82969 (9b829698f3d6269668c9402131bbfba81df6df0f) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x55e75423bb22]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xc97) [0x55e754794d77]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x370) [0x55e7547ac250]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55e754516394]
 5: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xc4b) [0x55e75460cc5b]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x257) [0x55e75461b267]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55e75452cc8a]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5b3) [0x55e7544dcdb3]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) [0x55e75431c462]
 10: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55e7545abc02]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55e75433796f]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e7548f3956]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e7548f6470]
 14: (()+0x7ea5) [0x7ff713182ea5]
 15: (clone()+0x6d) [0x7ff71204596d]
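The "ENOSPC from bluestore, misconfigured cluster" line followed by ceph_abort_msg("unexpected error") reflects how transaction errors are treated: a few error codes have dedicated handling, but ENOSPC does not, so it is fatal. A hedged, self-contained sketch (txc_apply_op is an illustrative simplification, not the real BlueStore::_txc_add_transaction; the abort is modeled as an exception):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <stdexcept>

// Simplified model of per-op error handling in a BlueStore-style
// transaction: success applies the op, a tolerated code skips it, and
// anything unhandled (including ENOSPC) is fatal for the OSD.
bool txc_apply_op(int op_result) {
  if (op_result == 0)
    return true;                  // op applied normally
  if (op_result == -ENOENT)
    return false;                 // example of a tolerated error: op skipped
  if (op_result == -ENOSPC)
    std::fprintf(stderr, "ENOSPC from bluestore, misconfigured cluster\n");
  // Stands in for ceph_abort_msg("unexpected error") in the real code.
  throw std::runtime_error("unexpected error");
}
```

This is why the allocator running dry mid-transaction takes down the daemon rather than returning ENOSPC to the client: by the time the write reaches the objectstore, the cluster's full-ratio safeguards should already have blocked it.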
Updated by Igor Fedotov over 2 years ago
- Status changed from New to Duplicate
Highly likely this is a duplicate of https://tracker.ceph.com/issues/50656, which was fixed in Nautilus v14.2.22.
Closing for now, please feel free to reopen if later releases show the same.
Updated by Igor Fedotov over 2 years ago
- Related to Bug #50656: bluefs _allocate unable to allocate, though enough free added