Bug #52081: rbd persistent SSD cache crash at retire_entries - rbd - Ceph

Bug #52081

closed

rbd persistent SSD cache crash at retire_entries

Added by chunsong feng over 2 years ago. Updated over 2 years ago.

Status: Duplicate
Priority: Normal
Assignee: Ilya Dryomov
Target version:
% Done:
Source: Community (dev)
Tags: persistent cache
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/root/rpmbuild/BUILD/ceph-16.2.5-8-gb1f52008f42/src/librbd/cache/pwl/ssd/WriteLog.cc: In function 'bool librbd::cache::pwl::ssd::WriteLog<ImageCtxT>::retire_entries(long unsigned int) [with ImageCtxT = librbd::ImageCtx]' thread 7f3655ffb700 time 2021-08-06T15:15:00.113374+0800
/root/rpmbuild/BUILD/ceph-16.2.5-8-gb1f52008f42/src/librbd/cache/pwl/ssd/WriteLog.cc: 611: FAILED ceph_assert((it)->log_entry_index == (control_block_pos + data_length + MIN_WRITE_ALLOC_SSD_SIZE) % this->m_log_pool_size + DATA_RING_BUFFER_OFFSET)
ceph version 16.2.5-8-gb1f52008f42 (b1f52008f422bdda8ea80cf01d9ebdb659eee803) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f36cd38b49c]
2: /usr/lib64/ceph/libceph-common.so.2(+0x2776b6) [0x7f36cd38b6b6]
3: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::retire_entries(unsigned long)+0x140f) [0x7f366c47838f]
4: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::process_work()+0x298) [0x7f366c471a98]
5: (LambdaContext<librbd::cache::pwl::AbstractWriteLog<librbd::ImageCtx>::wake_up()::{lambda(int)#3}>::finish(int)+0x12) [0x7f366c4298d2]
6: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0x148) [0x7f366c42a398]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xe9a) [0x7f36cd48a0ca]
8: (ThreadPool::WorkThread::entry()+0x15) [0x7f36cd48a935]
9: /lib64/libpthread.so.0(+0x82de) [0x7f36d78f52de]
10: clone()
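
For readers following the failed assert at WriteLog.cc:611, here is a minimal sketch of the arithmetic on its right-hand side. It only transcribes the expression quoted in the log. The constant values (4096 and 8192) and the 1 GiB pool size are assumptions for illustration, not values taken from this report, and the function names are hypothetical.

```python
# Assumed values for illustration only; check src/librbd/cache/pwl for the real ones.
MIN_WRITE_ALLOC_SSD_SIZE = 4096
DATA_RING_BUFFER_OFFSET = 8192


def expected_log_entry_index(control_block_pos, data_length, log_pool_size):
    """Evaluate the right-hand side of the assert quoted in the crash log."""
    return ((control_block_pos + data_length + MIN_WRITE_ALLOC_SSD_SIZE)
            % log_pool_size + DATA_RING_BUFFER_OFFSET)


def assert_would_pass(log_entry_index, control_block_pos, data_length,
                      log_pool_size):
    """Return True when the quoted condition holds for the given entry."""
    return log_entry_index == expected_log_entry_index(
        control_block_pos, data_length, log_pool_size)


if __name__ == "__main__":
    pool = 1 << 30  # hypothetical 1 GiB log pool
    # No wrap-around: the sum stays below the pool size.
    print(expected_log_entry_index(0, 4096, pool))
    # Wrap-around: the sum equals the pool size, so the modulo yields 0.
    print(expected_log_entry_index(pool - 8192, 4096, pool))
```

Under these assumed values, the assert fires whenever the index stored in the next log entry differs from the computed position, which is the condition the crash above reports.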

Files

ssd_crash.txt (18.7 KB), chunsong feng, 08/06/2021 07:25 AM
ceph-client.admin.3014381.log (9.22 KB), chunsong feng, 08/10/2021 02:08 AM

Related issues 1 (0 open — 1 closed)

Updated by Ilya Dryomov over 2 years ago

This seems very similar to https://tracker.ceph.com/issues/49819. What is your workload? Is the crash reproducible?

The ssd mode is being actively worked on in master. Since you seem to have compiled the pacific branch yourself, could you please try master?

Updated by Ilya Dryomov over 2 years ago

Related to Bug #49819: [pwl ssd] assert in retire_entries() during QEMU xfstest workload added

Updated by chunsong feng over 2 years ago

The problem is reproducible in a matter of minutes.
The FIO configuration file is as follows:
[global]
ioengine=rbd
clientname=admin
pool=rbdtest
size=20G
direct=1
bs=4K
ba=4K
numjobs=1
runtime=180
ramp_time=10
log_avg_msec=500
rbdname=image21
thread
time_based

[4K-randwrite]
rw=randwrite
iodepth=1
write_bw_log=4K-randwrite
stonewall
group_reporting

Updated by chunsong feng over 2 years ago

The tested version is based on Ceph 16.2.5, with src/librbd/cache replaced by the version from the master branch.
OK, I'll upgrade to the master branch and test it.

Updated by Loïc Dachary over 2 years ago

Target version deleted (~~v16.2.6~~)

Updated by chunsong feng over 2 years ago

File ceph-client.admin.3014381.log added

I tested SSD mode using the master branch version; it crashes at:
/root/rpmbuild/BUILD/ceph-17.0.0-6764-g567a4e6b961/src/librbd/cache/pwl/ssd/WriteLog.cc: In function 'void librbd::cache::pwl::ssd::WriteLog<ImageCtxT>::schedule_update_root(std::shared_ptr<librbd::cache::pwl::WriteLogPoolRoot>, Context*) [with ImageCtxT = librbd::ImageCtx]' thread 7fead17fa700 time 2021-08-10T09:45:31.791743+0800
/root/rpmbuild/BUILD/ceph-17.0.0-6764-g567a4e6b961/src/librbd/cache/pwl/ssd/WriteLog.cc: 857: FAILED ceph_assert(is_valid_pool_root(root))
ceph version 17.0.0-6764-g567a4e6b961 (567a4e6b961d581837a2f022d5e8f9ada72f4842) quincy (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x139) [0x7feb3dfb36b2]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7feb3dfb3902]
3: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::schedule_update_root(std::shared_ptr<librbd::cache::pwl::WriteLogPoolRoot>, Context*)+0x2c7) [0x7fead81e25a5]
4: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#3}::operator()(int) const+0x3ad) [0x7fead81e3fab]
5: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#3}>::finish(int)+0x11) [0x7fead81e4071]
6: (Context::complete(int)+0xd) [0x7feb3df8d5a9]
7: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::aio_cache_cb(void*, void*)+0x1c) [0x7fead81da4a6]
8: (KernelDevice::_aio_thread()+0x845) [0x7fead81fdf6b]
9: (KernelDevice::AioCompletionThread::entry()+0x11) [0x7fead820e79d]
10: (Thread::entry_wrapper()+0x43) [0x7feb3df8561f]
11: (Thread::_entry_func(void*)+0xd) [0x7feb3df8563b]
12: /lib64/libpthread.so.0(+0x814a) [0x7feb4021714a]

Updated by Ilya Dryomov over 2 years ago

This assert was added in https://github.com/ceph/ceph/pull/41490 to catch obvious corruption before the pool root is written out to disk. So the bug still exists in master; the assert is different because the corruption gets detected earlier.

How big is the cache (rbd_persistent_cache_size) in your setup?
