Bug #52081
rbd persistent SSD cache crash at retire_entries
Status: Closed (Duplicate)
Description
/root/rpmbuild/BUILD/ceph-16.2.5-8-gb1f52008f42/src/librbd/cache/pwl/ssd/WriteLog.cc: In function 'bool librbd::cache::pwl::ssd::WriteLog<ImageCtxT>::retire_entries(long unsigned int) [with ImageCtxT = librbd::ImageCtx]' thread 7f3655ffb700 time 2021-08-06T15:15:00.113374+0800
/root/rpmbuild/BUILD/ceph-16.2.5-8-gb1f52008f42/src/librbd/cache/pwl/ssd/WriteLog.cc: 611: FAILED ceph_assert((it)->log_entry_index == (control_block_pos + data_length + MIN_WRITE_ALLOC_SSD_SIZE) % this->m_log_pool_size + DATA_RING_BUFFER_OFFSET)
ceph version 16.2.5-8-gb1f52008f42 (b1f52008f422bdda8ea80cf01d9ebdb659eee803) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f36cd38b49c]
2: /usr/lib64/ceph/libceph-common.so.2(+0x2776b6) [0x7f36cd38b6b6]
3: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::retire_entries(unsigned long)+0x140f) [0x7f366c47838f]
4: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::process_work()+0x298) [0x7f366c471a98]
5: (LambdaContext<librbd::cache::pwl::AbstractWriteLog<librbd::ImageCtx>::wake_up()::{lambda(int)#3}>::finish(int)+0x12) [0x7f366c4298d2]
6: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0x148) [0x7f366c42a398]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xe9a) [0x7f36cd48a0ca]
8: (ThreadPool::WorkThread::entry()+0x15) [0x7f36cd48a935]
9: /lib64/libpthread.so.0(+0x82de) [0x7f36d78f52de]
10: clone()
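For context, the failing assertion encodes the ring-buffer layout of the SSD write log: the position recorded in each log entry must equal the position re-derived from the previous entry's control block position and data length. The following toy model is a minimal sketch of that arithmetic, not the Ceph implementation; the constant values are illustrative assumptions (the real ones live under src/librbd/cache/pwl/ssd/).

// Toy model of the invariant asserted at WriteLog.cc:611.
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative values; the names mirror the constants in the assert.
constexpr uint64_t MIN_WRITE_ALLOC_SSD_SIZE = 4096;  // control-block granularity
constexpr uint64_t DATA_RING_BUFFER_OFFSET = 8192;   // data ring starts past the pool root

struct Entry {
  uint64_t log_entry_index;  // on-disk position recorded in the entry
  uint64_t data_length;      // payload bytes that follow the control block
};

int main() {
  const uint64_t pool_size = 1ull << 30;  // 1 GiB cache, for illustration
  std::vector<Entry> log;

  // Append a few 4K entries, deriving each position with the same formula
  // the assert uses (note: "%" binds tighter than "+", so the offset is
  // added after the wraparound).
  uint64_t pos = DATA_RING_BUFFER_OFFSET;
  for (int i = 0; i < 4; ++i) {
    log.push_back({pos, 4096});
    pos = (pos + 4096 + MIN_WRITE_ALLOC_SSD_SIZE) % pool_size
          + DATA_RING_BUFFER_OFFSET;
  }

  // retire_entries() re-derives each successor position from its predecessor
  // and asserts that it matches what the entry recorded; a mismatch means the
  // in-memory log and the on-disk ring have diverged, i.e. corruption.
  for (size_t i = 1; i < log.size(); ++i) {
    uint64_t expected = (log[i - 1].log_entry_index + log[i - 1].data_length
                         + MIN_WRITE_ALLOC_SSD_SIZE) % pool_size
                        + DATA_RING_BUFFER_OFFSET;
    assert(log[i].log_entry_index == expected);
  }
  return 0;
}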
Updated by Ilya Dryomov almost 3 years ago
This seems very similar to https://tracker.ceph.com/issues/49819. What is your workload? Is the crash reproducible?
The ssd mode is being actively worked on in master. Since you seem to have compiled the pacific branch on your own, could you please try master?
Updated by Ilya Dryomov almost 3 years ago
- Related to Bug #49819: [pwl ssd] assert in retire_entries() during QEMU xfstest workload added
Updated by chunsong feng almost 3 years ago
The problem is reproducible in a matter of minutes.
The FIO configuration file is as follows:
[global]
ioengine=rbd
clientname=admin
pool=rbdtest
size=20G
direct=1
bs=4K
ba=4K
numjobs=1
runtime=180
ramp_time=10
log_avg_msec=500
rbdname=image21
thread
time_based
[4K-randwrite]
rw=randwrite
iodepth=1
write_bw_log=4K-randwrite
stonewall
group_reporting
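For completeness, the cache under test is enabled on the client side with settings along these lines; the option names are the standard rbd_persistent_cache_* options, while the path and size values are only illustrative:

[client]
rbd_plugins = pwl_cache
rbd_persistent_cache_mode = ssd
# illustrative path and size; the reporter's actual cache size is what is
# asked about further down in this thread
rbd_persistent_cache_path = /mnt/nvme/rbd-pwl
rbd_persistent_cache_size = 10G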
Updated by chunsong feng almost 3 years ago
The tested version is based on Ceph 16.2.5 with src/librbd/cache replaced by the code from the master branch.
OK, I'll upgrade to the master branch and test it.
Updated by chunsong feng almost 3 years ago
I tested SSD mode using the master branch version; it crashed at:
/root/rpmbuild/BUILD/ceph-17.0.0-6764-g567a4e6b961/src/librbd/cache/pwl/ssd/WriteLog.cc: In function 'void librbd::cache::pwl::ssd::WriteLog<ImageCtxT>::schedule_update_root(std::shared_ptr<librbd::cache::pwl::WriteLogPoolRoot>, Context*) [with ImageCtxT = librbd::ImageCtx]' thread 7fead17fa700 time 2021-08-10T09:45:31.791743+0800
/root/rpmbuild/BUILD/ceph-17.0.0-6764-g567a4e6b961/src/librbd/cache/pwl/ssd/WriteLog.cc: 857: FAILED ceph_assert(is_valid_pool_root(root))
ceph version 17.0.0-6764-g567a4e6b961 (567a4e6b961d581837a2f022d5e8f9ada72f4842) quincy (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x139) [0x7feb3dfb36b2]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7feb3dfb3902]
3: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::schedule_update_root(std::shared_ptr<librbd::cache::pwl::WriteLogPoolRoot>, Context*)+0x2c7) [0x7fead81e25a5]
4: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#3}::operator()(int) const+0x3ad) [0x7fead81e3fab]
5: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#3}>::finish(int)+0x11) [0x7fead81e4071]
6: (Context::complete(int)+0xd) [0x7feb3df8d5a9]
7: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::aio_cache_cb(void*, void*)+0x1c) [0x7fead81da4a6]
8: (KernelDevice::_aio_thread()+0x845) [0x7fead81fdf6b]
9: (KernelDevice::AioCompletionThread::entry()+0x11) [0x7fead820e79d]
10: (Thread::entry_wrapper()+0x43) [0x7feb3df8561f]
11: (Thread::_entry_func(void*)+0xd) [0x7feb3df8563b]
12: /lib64/libpthread.so.0(+0x814a) [0x7feb4021714a]
Updated by Ilya Dryomov almost 3 years ago
This assert was added in https://github.com/ceph/ceph/pull/41490 to catch obvious corruption before the pool root is written out to disk. So the bug still exists in master; the assert is different because the corruption gets detected earlier.
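To illustrate, here is a simplified sketch of the kind of invariants such a validity check can enforce before the root is persisted; the structure and field names below are assumptions for illustration, not the exact Ceph types:

// Hypothetical sketch of a pool-root validity check.
#include <cassert>
#include <cstdint>

constexpr uint64_t MIN_WRITE_ALLOC_SSD_SIZE = 4096;
constexpr uint64_t DATA_RING_BUFFER_OFFSET = 8192;

struct PoolRoot {
  uint64_t pool_size;          // total size of the cache device/file
  uint64_t first_valid_entry;  // oldest live entry in the data ring
  uint64_t first_free_entry;   // next append position in the data ring
};

// Both ring pointers must lie inside [DATA_RING_BUFFER_OFFSET, pool_size)
// and be aligned to the allocation unit; anything else means the in-memory
// bookkeeping is already corrupt, so fail fast instead of writing a bad
// root to disk.
bool is_valid_pool_root(const PoolRoot& root) {
  return root.pool_size % MIN_WRITE_ALLOC_SSD_SIZE == 0 &&
         root.first_valid_entry >= DATA_RING_BUFFER_OFFSET &&
         root.first_valid_entry < root.pool_size &&
         root.first_valid_entry % MIN_WRITE_ALLOC_SSD_SIZE == 0 &&
         root.first_free_entry >= DATA_RING_BUFFER_OFFSET &&
         root.first_free_entry < root.pool_size &&
         root.first_free_entry % MIN_WRITE_ALLOC_SSD_SIZE == 0;
}

int main() {
  PoolRoot ok{1ull << 30, 8192, 16384};
  assert(is_valid_pool_root(ok));
  PoolRoot bad{1ull << 30, 8192, (1ull << 30) + 4096};  // free pointer past the pool end
  assert(!is_valid_pool_root(bad));
  return 0;
}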
How big is the cache (rbd_persistent_cache_size) in your setup?
Updated by Ilya Dryomov over 2 years ago
Ah, you hit https://tracker.ceph.com/issues/50675. Unfortunately anything larger than 4GB is currently broken...
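One way to picture how a cache larger than 4G corrupts itself is 32-bit truncation of byte offsets: any ring-buffer position past 2^32 bytes silently wraps back to the start. The mechanism shown below is an assumption for illustration; see the linked tracker for the actual analysis and fix:

// Illustrative sketch of 32-bit offset truncation in a > 4 GiB ring buffer.
#include <cstdint>
#include <cstdio>

int main() {
  // A valid offset just past the 4 GiB boundary in a 5 GiB cache.
  const uint64_t pos64 = (4ull << 30) + 8192;

  // Narrowing to 32 bits drops the high bits: 4294975488 becomes 8192.
  uint32_t pos32 = static_cast<uint32_t>(pos64);
  std::printf("64-bit offset: %llu\n", (unsigned long long)pos64);
  std::printf("32-bit offset: %u\n", pos32);

  // A wrapped offset points back into live data near the start of the ring,
  // so subsequent writes clobber valid entries and asserts like the ones in
  // this ticket eventually fire.
  return 0;
}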
Updated by Ilya Dryomov over 2 years ago
- Related to deleted (Bug #49819: [pwl ssd] assert in retire_entries() during QEMU xfstest workload)
Updated by Ilya Dryomov over 2 years ago
- Is duplicate of Bug #50675: [pwl ssd] cache larger than 4G will corrupt itself added
Updated by Ilya Dryomov over 2 years ago
- Status changed from New to Duplicate
- Assignee set to Ilya Dryomov