Bug #52081

closed

rbd persistent SSD cache crash at retire_entries

Added by chunsong feng over 2 years ago. Updated over 2 years ago.

Status: Duplicate
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source: Community (dev)
Tags: persistent cache
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/root/rpmbuild/BUILD/ceph-16.2.5-8-gb1f52008f42/src/librbd/cache/pwl/ssd/WriteLog.cc: In function 'bool librbd::cache::pwl::ssd::WriteLog<ImageCtxT>::retire_entries(long unsigned int) [with ImageCtxT = librbd::ImageCtx]' thread 7f3655ffb700 time 2021-08-06T15:15:00.113374+0800
/root/rpmbuild/BUILD/ceph-16.2.5-8-gb1f52008f42/src/librbd/cache/pwl/ssd/WriteLog.cc: 611: FAILED ceph_assert((it)->log_entry_index == (control_block_pos + data_length + MIN_WRITE_ALLOC_SSD_SIZE) % this->m_log_pool_size + DATA_RING_BUFFER_OFFSET)
ceph version 16.2.5-8-gb1f52008f42 (b1f52008f422bdda8ea80cf01d9ebdb659eee803) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f36cd38b49c]
2: /usr/lib64/ceph/libceph-common.so.2(+0x2776b6) [0x7f36cd38b6b6]
3: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::retire_entries(unsigned long)+0x140f) [0x7f366c47838f]
4: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::process_work()+0x298) [0x7f366c471a98]
5: (LambdaContext<librbd::cache::pwl::AbstractWriteLog<librbd::ImageCtx>::wake_up()::{lambda(int)#3}>::finish(int)+0x12) [0x7f366c4298d2]
6: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0x148) [0x7f366c42a398]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xe9a) [0x7f36cd48a0ca]
8: (ThreadPool::WorkThread::entry()+0x15) [0x7f36cd48a935]
9: /lib64/libpthread.so.0(+0x82de) [0x7f36d78f52de]
10: clone()
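
To make the failed check easier to read, here is a minimal Python sketch of the right-hand side of the assert, transcribed term by term from the message above. The constant values and the helper name `expected_log_entry_index` are assumptions for illustration only; the real values are defined in the Ceph source and are not stated in this report.

```python
# Hypothetical values, chosen only so the arithmetic can be followed by hand.
MIN_WRITE_ALLOC_SSD_SIZE = 4096
DATA_RING_BUFFER_OFFSET = 8192


def expected_log_entry_index(control_block_pos, data_length, log_pool_size):
    """Right-hand side of the failed ceph_assert.

    The position advances past the entry and its data, wraps around at the
    pool size, and is then shifted by the ring buffer offset.
    """
    advanced = control_block_pos + data_length + MIN_WRITE_ALLOC_SSD_SIZE
    return advanced % log_pool_size + DATA_RING_BUFFER_OFFSET


pool_size = 1024**3  # hypothetical 1 GiB pool

# An entry near the end of the pool wraps around to the start of the ring.
print(expected_log_entry_index(pool_size - 4096, 4096, pool_size))  # 12288
```

The assert fires when the `log_entry_index` stored in the entry being retired differs from this computed value, which suggests that either the stored index or one of the inputs has been corrupted.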


Files

ssd_crash.txt (18.7 KB) chunsong feng, 08/06/2021 07:25 AM
ceph-client.admin.3014381.log (9.22 KB) chunsong feng, 08/10/2021 02:08 AM

Related issues 1 (0 open, 1 closed)

Is duplicate of rbd - Bug #50675: [pwl ssd] cache larger than 4G will corrupt itself (Resolved, CONGMIN YIN)

Actions #1

Updated by Ilya Dryomov over 2 years ago

This seems very similar to https://tracker.ceph.com/issues/49819. What is your workload? Is the crash reproducible?

The ssd mode is being actively worked on in master. Since you seem to have compiled pacific branch on your own, could you please try master?

Actions #2

Updated by Ilya Dryomov over 2 years ago

  • Related to Bug #49819: [pwl ssd] assert in retire_entries() during QEMU xfstest workload added
Actions #3

Updated by chunsong feng over 2 years ago

The problem is reproducible in a matter of minutes.
The FIO configuration file is as follows:
[global]
ioengine=rbd
clientname=admin
pool=rbdtest
size=20G
direct=1
bs=4K
ba=4K
numjobs=1
runtime=180
ramp_time=10
log_avg_msec=500
rbdname=image21
thread
time_based

[4K-randwrite]
rw=randwrite
iodepth=1
write_bw_log=4K-randwrite
stonewall
group_reporting

Actions #4

Updated by chunsong feng over 2 years ago

The tested version is based on Ceph 16.2.5, with src/librbd/cache replaced by the copy from the master branch.
OK, I'll upgrade to the master branch and test it.

Actions #5

Updated by Loïc Dachary over 2 years ago

  • Target version deleted (v16.2.6)
Actions #6

Updated by chunsong feng over 2 years ago

I tested SSD mode using the master branch version. It crashes at:
/root/rpmbuild/BUILD/ceph-17.0.0-6764-g567a4e6b961/src/librbd/cache/pwl/ssd/WriteLog.cc: In function 'void librbd::cache::pwl::ssd::WriteLog<ImageCtxT>::schedule_update_root(std::shared_ptr<librbd::cache::pwl::WriteLogPoolRoot>, Context*) [with ImageCtxT = librbd::ImageCtx]' thread 7fead17fa700 time 2021-08-10T09:45:31.791743+0800
/root/rpmbuild/BUILD/ceph-17.0.0-6764-g567a4e6b961/src/librbd/cache/pwl/ssd/WriteLog.cc: 857: FAILED ceph_assert(is_valid_pool_root(root))
ceph version 17.0.0-6764-g567a4e6b961 (567a4e6b961d581837a2f022d5e8f9ada72f4842) quincy (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x139) [0x7feb3dfb36b2]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7feb3dfb3902]
3: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::schedule_update_root(std::shared_ptr<librbd::cache::pwl::WriteLogPoolRoot>, Context*)+0x2c7) [0x7fead81e25a5]
4: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#3}::operator()(int) const+0x3ad) [0x7fead81e3fab]
5: (LambdaContext<librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::append_op_log_entries(std::__cxx11::list<std::shared_ptr<librbd::cache::pwl::GenericLogOperation>, std::allocator<std::shared_ptr<librbd::cache::pwl::GenericLogOperation> > >&)::{lambda(int)#3}>::finish(int)+0x11) [0x7fead81e4071]
6: (Context::complete(int)+0xd) [0x7feb3df8d5a9]
7: (librbd::cache::pwl::ssd::WriteLog<librbd::ImageCtx>::aio_cache_cb(void*, void*)+0x1c) [0x7fead81da4a6]
8: (KernelDevice::_aio_thread()+0x845) [0x7fead81fdf6b]
9: (KernelDevice::AioCompletionThread::entry()+0x11) [0x7fead820e79d]
10: (Thread::entry_wrapper()+0x43) [0x7feb3df8561f]
11: (Thread::_entry_func(void*)+0xd) [0x7feb3df8563b]
12: /lib64/libpthread.so.0(+0x814a) [0x7feb4021714a]

Actions #7

Updated by Ilya Dryomov over 2 years ago

This assert was added in https://github.com/ceph/ceph/pull/41490 to catch obvious corruption before the pool root is written out to disk. So the bug still exists in master; the assert is different because the corruption gets detected earlier.

How big is the cache (rbd_persistent_cache_size) in your setup?

Actions #8

Updated by chunsong feng over 2 years ago

20GB

Actions #9

Updated by Ilya Dryomov over 2 years ago

Ah, you hit https://tracker.ceph.com/issues/50675. Unfortunately anything larger than 4GB is currently broken...
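
As a stopgap until the linked bug is fixed, a configured cache size can be checked against the 4G limit before use. The sketch below is only an illustration: `parse_size`, `exceeds_4g_limit`, and `truncate_u32` are hypothetical helpers, and binary units are assumed. The truncation part models one possible failure mode (an offset kept in a 32-bit field). That is an assumption suggested by 4 GiB being 2^32 bytes, not a root cause stated in this thread.

```python
FOUR_GIB = 4 * 1024**3  # 2**32 bytes


def parse_size(text):
    """Parse sizes such as '20GB', '4G' or '1073741824' into bytes."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    text = text.strip().upper().rstrip("B")
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def exceeds_4g_limit(cache_size):
    """True if the cache size falls in the range reported as broken."""
    return parse_size(cache_size) > FOUR_GIB


def truncate_u32(offset):
    """Model an offset stored in a 32-bit field (assumed failure mode)."""
    return offset & 0xFFFFFFFF


print(exceeds_4g_limit("20GB"))  # True: the size reported in this thread
print(exceeds_4g_limit("4G"))    # False

# Offsets below 4 GiB survive truncation; offsets above it do not.
below, above = 1024**3, 5 * 1024**3
print(truncate_u32(below) == below)  # True
print(truncate_u32(above) == above)  # False
```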

Actions #10

Updated by Ilya Dryomov over 2 years ago

  • Related to deleted (Bug #49819: [pwl ssd] assert in retire_entries() during QEMU xfstest workload)
Actions #11

Updated by Ilya Dryomov over 2 years ago

  • Is duplicate of Bug #50675: [pwl ssd] cache larger than 4G will corrupt itself added
Actions #12

Updated by Ilya Dryomov over 2 years ago

  • Status changed from New to Duplicate
  • Assignee set to Ilya Dryomov