Bug #46804
nautilus: upgrade:nautilus-p2p/nautilus-p2p-stress-split: BlueFS::_flush_and_sync_log ceph_assert(r == 0)
Status: Closed
Description
2020-07-30T21:32:34.927 INFO:tasks.ceph.osd.0.smithi058.stderr:2020-07-30 21:32:34.921 7fd444355700 -1 bluestore(/var/lib/ceph/osd/ceph-0) allocate_bluefs_freespace failed to allocate on 0x3ff80000 min_size 0x400000 > allocated total 0x0 bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 1e5e5c000
2020-07-30T21:32:34.927 INFO:tasks.ceph.osd.0.smithi058.stderr:/build/ceph-14.2.10-254-g76851635289/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7fd444355700 time 2020-07-30 21:32:34.925698
2020-07-30T21:32:34.927 INFO:tasks.ceph.osd.0.smithi058.stderr:/build/ceph-14.2.10-254-g76851635289/src/os/bluestore/BlueFS.cc: 2197: FAILED ceph_assert(r == 0)
2020-07-30T21:32:34.928 INFO:tasks.ceph.osd.0.smithi058.stderr:2020-07-30 21:32:34.921 7fd444355700 -1 bluefs _allocate failed to expand slow device to fit +0x400000
2020-07-30T21:32:34.939 INFO:teuthology.orchestra.run.smithi058.stderr:osd.1: osd_enable_op_tracker = 'true'
2020-07-30T21:32:34.940 INFO:tasks.ceph.osd.0.smithi058.stderr: ceph version 14.2.10-254-g76851635289 (7685163528909e74749288840a480c00c24d4152) nautilus (stable)
2020-07-30T21:32:34.940 INFO:tasks.ceph.osd.0.smithi058.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x562c8121bb14]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x562c8121bcef]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x1a95) [0x562c8182b3c5]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x9c) [0x562c8182c6fc]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 5: (BlueRocksWritableFile::Sync()+0x63) [0x562c8184cd23]
2020-07-30T21:32:34.941 INFO:tasks.ceph.osd.0.smithi058.stderr: 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x3e9) [0x562c81ebd759]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 7: (rocksdb::WritableFileWriter::Sync(bool)+0x3a6) [0x562c81ec0bf6]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 8: (rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x32c) [0x562c81d2b93c]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x245e) [0x562c81d349be]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x30) [0x562c81d35030]
2020-07-30T21:32:34.942 INFO:tasks.ceph.osd.0.smithi058.stderr: 11: (RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x81) [0x562c81ce27b1]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 12: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x97) [0x562c81ce3197]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 13: (BlueStore::_kv_sync_thread()+0x205a) [0x562c8178da6a]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 14: (BlueStore::KVSyncThread::entry()+0xd) [0x562c817af9ed]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 15: (()+0x76db) [0x7fd4530686db]
2020-07-30T21:32:34.943 INFO:tasks.ceph.osd.0.smithi058.stderr: 16: (clone()+0x3f) [0x7fd451e08a3f]
Note that this is an upgrade test, where half of the cluster has been upgraded to the latest nautilus. Also, the test uses objectstore/default, which means the upgraded and non-upgraded OSDs run different allocators.
upgrade:nautilus-p2p/nautilus-p2p-stress-split/{0-cluster/{openstack start} 1-ceph-install/nautilus 1.1.short_pg_log 2-partial-upgrade/firsthalf 3-thrash/default 4-workload/{fsx radosbench rbd-cls rbd-import-export rbd_api readwrite snaps-few-objects} 5-finish-upgrade 7-final-workload/{rbd-python rgw-swift snaps-many-objects} objectstore/default supported-all-distro/ubuntu_latest thrashosds-health}
/a/yuriw-2020-07-30_20:39:04-upgrade:nautilus-p2p-wip-yuri-testing-2020-07-30-1921-nautilus-distro-basic-smithi/5270997
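The failing path can be sketched as follows. This is a hedged, self-contained simulation using mock types (MockAllocator and flush_and_sync_log are illustrative names, not the real Ceph classes): BlueFS must extend its journal during _flush_and_sync_log, and it treats a failed allocation as fatal, which is the "FAILED ceph_assert(r == 0)" in the trace above. The log also shows the characteristic symptom: a large "available" figure (0x1e5e5c000) yet the allocation of min_size 0x400000 fails.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstdint>

// Mock allocator illustrating the failure mode from the log: plenty of
// free space overall, but no contiguous extent of at least min_size.
struct MockAllocator {
  uint64_t free_bytes;      // total free space reported by the allocator
  uint64_t largest_chunk;   // largest contiguous free extent

  // Returns 0 on success, -ENOSPC when no extent >= min_size exists.
  int allocate(uint64_t want, uint64_t min_size, uint64_t* got) {
    if (largest_chunk < min_size)
      return -ENOSPC;
    *got = std::min(want, largest_chunk);
    largest_chunk -= *got;
    free_bytes -= *got;
    return 0;
  }
};

// Stand-in for the failing step: grow the log by min_size 0x400000 and
// return the allocator's result (real BlueFS follows this with
// ceph_assert(r == 0), so -ENOSPC aborts the whole OSD).
int flush_and_sync_log(MockAllocator& alloc) {
  uint64_t got = 0;
  return alloc.allocate(0x400000, 0x400000, &got);
}
```

Under this sketch, the assert is not a shortage of space but a shortage of contiguous space (or an allocator accounting problem) surfacing at a point in BlueFS that has no error-recovery path.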
Updated by Josh Durgin about 3 years ago
A similar ENOSPC issue appeared in failed and dead jobs in 14.2.17 nautilus-p2p testing: https://pulpito.ceph.com/teuthology-2021-03-06_03:49:44-upgrade:nautilus-p2p-nautilus-distro-basic-smithi/ and https://pulpito.ceph.com/yuriw-2021-03-08_19:14:19-upgrade:nautilus-p2p-nautilus-distro-basic-smithi
The only pattern was that the failures occurred only with the hybrid allocator. Some of the jobs crashed in earlier versions (e.g. 14.2.2), before reaching a point where they would be using the hybrid allocator; others got all the way to 14.2.16.
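To rule the hybrid allocator in or out while reproducing, one option is to pin the allocator explicitly rather than rely on the release default. A hedged illustration (these are the standard BlueStore option names, but verify the valid values against your release) in ceph.conf:

```ini
[osd]
# Force a specific allocator instead of the release default
# (valid values include: bitmap, stupid, hybrid).
bluestore_allocator = bitmap
bluefs_allocator = bitmap
```

OSDs pick these up at startup, so the setting must be in place before the upgraded daemons restart.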
2021-03-08 20:31:18.245 7ff70053f700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_alloc_write failed to allocate 0x4000 allocated 0x 0 min_alloc_size 0x4000 available 0x 0
2021-03-08 20:31:18.245 7ff70053f700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_write _do_alloc_write failed with (28) No space left on device
2021-03-08 20:31:18.245 7ff70053f700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _txc_add_transaction error (28) No space left on device not handled on operation 10 (op 0, counting from 0)
2021-03-08 20:31:18.245 7ff70053f700 -1 bluestore(/var/lib/ceph/osd/ceph-0) ENOSPC from bluestore, misconfigured cluster
2021-03-08 20:31:18.245 7ff70053f700  0 _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "write",
            "collection": "meta",
            "oid": "#-1:4b4b4e63:::inc_osdmap.880:0#",
            "length": 1284,
            "offset": 0,
            "bufferlist length": 1284
        },
        {
            "op_num": 1,
            "op_name": "write",
            "collection": "meta",
            "oid": "#-1:a6032935:::osdmap.880:0#",
            "length": 6040,
            "offset": 0,
            "bufferlist length": 6040
        },
        {
            "op_num": 2,
            "op_name": "write",
            "collection": "meta",
            "oid": "#-1:46e858f0:::pg_num_history:head#",
            "length": 406,
            "offset": 0,
            "bufferlist length": 406
        },
        {
            "op_num": 3,
            "op_name": "write",
            "collection": "meta",
            "oid": "#-1:7b3f43c4:::osd_superblock:0#",
            "length": 538,
            "offset": 0,
            "bufferlist length": 538
        }
    ]
}
2021-03-08 20:31:18.324 7ff6ec990700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16-432-g9b829698/rpm/el7/BUILD/ceph-14.2.16-432-g9b829698/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7ff6ec990700 time 2021-03-08 20:31:18.274039
/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16-432-g9b829698/rpm/el7/BUILD/ceph-14.2.16-432-g9b829698/src/os/bluestore/BlueStore.cc: 12384: ceph_abort_msg("unexpected error")
 ceph version 14.2.16-432-g9b82969 (9b829698f3d6269668c9402131bbfba81df6df0f) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x55e75423bb22]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xc97) [0x55e754794d77]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x370) [0x55e7547ac250]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55e754516394]
 5: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xc4b) [0x55e75460cc5b]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x257) [0x55e75461b267]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55e75452cc8a]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5b3) [0x55e7544dcdb3]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) [0x55e75431c462]
 10: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55e7545abc02]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55e75433796f]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e7548f3956]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e7548f6470]
 14: (()+0x7ea5) [0x7ff713182ea5]
 15: (clone()+0x6d) [0x7ff71204596d]
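The "ENOSPC from bluestore, misconfigured cluster" line followed by ceph_abort_msg("unexpected error") reflects how transaction errors are treated: a few error codes have dedicated handling, but ENOSPC does not, so it is fatal. A hedged, self-contained sketch (txc_apply_op is an illustrative simplification, not the real BlueStore::_txc_add_transaction; the abort is modeled as an exception):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <stdexcept>

// Simplified model of per-op error handling in a BlueStore-style
// transaction: success applies the op, a tolerated code skips it, and
// anything unhandled (including ENOSPC) is fatal for the OSD.
bool txc_apply_op(int op_result) {
  if (op_result == 0)
    return true;                  // op applied normally
  if (op_result == -ENOENT)
    return false;                 // example of a tolerated error: op skipped
  if (op_result == -ENOSPC)
    std::fprintf(stderr, "ENOSPC from bluestore, misconfigured cluster\n");
  // Stands in for ceph_abort_msg("unexpected error") in the real code.
  throw std::runtime_error("unexpected error");
}
```

This is why the allocator running dry mid-transaction takes down the daemon rather than returning ENOSPC to the client: by the time the write reaches the objectstore, the cluster's full-ratio safeguards should already have blocked it.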
Updated by Igor Fedotov over 2 years ago
- Status changed from New to Duplicate
Highly likely this is a duplicate of https://tracker.ceph.com/issues/50656, which was fixed in Nautilus v14.2.22.
Closing for now, please feel free to reopen if later releases show the same.
Updated by Igor Fedotov over 2 years ago
- Related to Bug #50656: bluefs _allocate unable to allocate, though enough free added