Bug #55534
closed
Persistent write-back cache - Error message for a corrupted cache needs improvement: show an appropriate message instead of "No space left on device"
Description
Description of problem: Persistent write-back cache - the error message for a corrupted cache needs improvement. An appropriate message should be shown instead of "No space left on device".
Version-Release number of selected component (if applicable):
ceph version 16.2.7-106.el8cp (83a8e200569d52a42ad69374c2d4cfd39921b24d) pacific (stable)
How reproducible:
Pre-req
1. Working ceph cluster
2. Client node with pmem
3. # ceph config set client rbd_persistent_cache_mode rwl
4. # ceph config set client rbd_plugins pwl_cache
List the namespaces with ndctl (the output must include the pmem device, as below)
[root@intel-purley-02 tmp]# ndctl list
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":12681478144,
"uuid":"c5dbfb44-fe3a-42ac-8331-8df3187e7d74",
"sector_size":512,
"align":2097152,
"Blockdev":"pmem0"
}
mkfs.ext4 /dev/pmem0
mount -o dax=always /dev/pmem0 <mountpoint>
And then set rbd_persistent_cache_path to the mountpoint
- rbd config global set global rbd_persistent_cache_path path
After mounting, make sure that DAX is indeed enabled
Check for something like "EXT4-fs (pmem0): DAX enabled ..." in dmesg
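The DAX check above can be scripted. This is a minimal sketch, not part of the original report: the function name dax_enabled_in_log is hypothetical, and it assumes the kernel log line contains the "EXT4-fs (pmem0): DAX enabled" text quoted above.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and succeeds only if
# it contains the ext4 "DAX enabled" message for the given block device.
dax_enabled_in_log() {
    dev="$1"    # block device name as it appears in dmesg, e.g. pmem0
    # -F matches the string literally, so the parentheses need no escaping.
    grep -qF "EXT4-fs (${dev}): DAX enabled"
}

# Usage on the client node (dmesg usually needs root):
#   dmesg | dax_enabled_in_log pmem0 && echo "DAX is enabled on pmem0"
```

If the function fails, the filesystem was probably mounted without DAX, and the cache path should not be pointed at it yet.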
Steps to Reproduce:
1) Write data to pmem/image using RBD bench and abort after a few minutes. The cache file is still present in the cache path and has not been flushed to the OSDs.
2) Start an FIO write against a different pool/image name, e.g. pmem1/image, and observe the errors.
output snippet:
^Coot@intel-purley-lr-02 pmem]# Jobs: 1 (f=0): [/(1),X(1)][-.-%][eta 09m:56s]
fio: io_u error on file test-1.0.0: No space left on device: write offset=4096, buflen=4096
fio: pid=96033, err=28/file:io_u.c:1803, func=io_u error, error=No space left on device
Jobs: 1 (f=1): [f(1),X(1)][-.-%][eta 00m:00s]
test-1: (groupid=0, jobs=2): err=28 (file:io_u.c:1803, func=io_u error, error=No space left on device): pid=96033: Fri Apr 29 06:57:59 2022
cpu : usr=0.00%, sys=0.00%, ctx=10, majf=0, minf=30
IO depths : 1=12.5%, 2=25.0%, 4=50.0%, 8=12.5%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,16,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
We are seeing an I/O error reported as "No space left on device".
Recovering from this requires a manual cache flush or invalidate command.
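As a rough sketch of the manual cleanup, the stale cache of the image from step 1 could be discarded as shown here. The helper name invalidate_pwl_cache and the RBD override variable are hypothetical. The image spec pmem/image is taken from the reproduction steps. The rbd persistent-cache invalidate subcommand is documented for the RBD persistent write-back cache, but its behaviour on this exact build has not been verified here. Invalidating discards any dirty data that was never written back to the OSDs.

```shell
#!/bin/sh
# RBD can be overridden (e.g. RBD=echo) to print the commands instead of
# running them against a cluster.
RBD="${RBD:-rbd}"

# Hypothetical helper: inspect an image and then discard its persistent
# write-back cache. Dirty data still in the cache is lost.
invalidate_pwl_cache() {
    image_spec="$1"    # pool/image, e.g. pmem/image from step 1
    # Shows watchers (and cache state on releases that report it).
    "$RBD" status "$image_spec"
    "$RBD" persistent-cache invalidate "$image_spec"
}

# Usage on the client node:
#   invalidate_pwl_cache pmem/image
```

Newer Ceph releases also document an rbd persistent-cache flush subcommand, which writes the cached data back instead of discarding it. Whether it is available in 16.2.7 is an assumption to check before relying on it.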
Expected Results:
An error is expected here: if the corrupted cache is not cleared, I/O will fail. However, the error message should be more helpful. Reporting "No space left on device" to the user is incorrect and misleading.