Bug #55534
Updated by Deepika Upadhyay almost 2 years ago
Description of problem:
Persistent write-back cache: the error reported for a corrupted cache needs improvement. It should carry an appropriate message instead of "No space left on device".

Version-Release number of selected component (if applicable):
ceph version 16.2.7-106.el8cp (83a8e200569d52a42ad69374c2d4cfd39921b24d) pacific (stable)

How reproducible:

Pre-req
1. Working ceph cluster
2. Client node with pmem
3. # ceph config set client rbd_persistent_cache_mode rwl
4. # ceph config set client rbd_plugins pwl_cache

Steps to enable DAX
List the ndctl namespaces (the output must include the pmem device, as below):

[root@intel-purley-02 tmp]# ndctl list
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":12681478144,
  "uuid":"c5dbfb44-fe3a-42ac-8331-8df3187e7d74",
  "sector_size":512,
  "align":2097152,
  "Blockdev":"pmem0"
}

mkfs.ext4 /dev/pmem0
mount -o dax=always /dev/pmem0 <mountpoint>

Then set rbd_persistent_cache_path to the mountpoint:
# rbd config global set global rbd_persistent_cache_path path

After mounting, make sure that DAX is indeed enabled. Check for something like "EXT4-fs (pmem0): DAX enabled ..." in dmesg.

Steps to Reproduce:
1) Write data to pmem/image using RBD bench and abort after a few minutes. The cache file is present in the cache path and has not been flushed to the OSDs.
2) Start an FIO write with a different pool/image name, i.e. pmem1/image, and observe the errors.

Output snippet:

^Coot@intel-purley-lr-02 pmem]#
Jobs: 1 (f=0): [/(1),X(1)][-.-%][eta 09m:56s]
fio: io_u error on file test-1.0.0: No space left on device: write offset=4096, buflen=4096
fio: pid=96033, err=28/file:io_u.c:1803, func=io_u error, error=No space left on device
Jobs: 1 (f=1): [f(1),X(1)][-.-%][eta 00m:00s]
test-1: (groupid=0, jobs=2): err=28 (file:io_u.c:1803, func=io_u error, error=No space left on device): pid=96033: Fri Apr 29 06:57:59 2022
  cpu : usr=0.00%, sys=0.00%, ctx=10, majf=0, minf=30
  IO depths : 1=12.5%, 2=25.0%, 4=50.0%, 8=12.5%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued rwts: total=0,16,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency : target=0, window=0, percentile=100.00%, depth=8

We see an I/O error with "No space left on device". Recovering requires a manual cache flush or invalidate command.

Expected Results:
An error is expected here: if the corrupted cache is not cleared, I/O will fail. However, the error message should be more helpful. Showing the user "No space left on device" is incorrect.

Additional info:
cluster details: magna021
pmem client details: root@intel-purley-lr-02.7a2m.lab.eng.bos.redhat.com
password - QwAo2U6GRxyNPKiZaOCx
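
Triage sketch: when fio reports err=28 ("No space left on device") as above, it may help to check whether the cache mountpoint is really full and whether a cache file from an aborted run is still sitting there. The helpers below are a minimal sketch, not part of Ceph; the function names `list_cache_files` and `free_kib` are hypothetical, and the cache directory argument is assumed to be whatever rbd_persistent_cache_path points at.

```shell
#!/bin/sh
# Hypothetical helper: list regular files sitting directly in the
# persistent cache directory (e.g. a cache file left by an aborted run).
list_cache_files() {
    cache_dir=$1
    find "$cache_dir" -maxdepth 1 -type f
}

# Hypothetical helper: print the free space, in KiB, of the filesystem
# that backs the given path. `df -P` gives the portable one-line-per-fs
# layout, where column 4 is the available space.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example usage (the path is an assumption; substitute your mountpoint):
#   list_cache_files /mnt/pmem
#   free_kib /mnt/pmem
```

If `free_kib` shows plenty of room while `list_cache_files` shows a cache file from the earlier aborted run, the leftover cache rather than a full device is the more likely cause, which is the case this report wants the error message to distinguish.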