Bug #53613


[pwl] Failed to start IOs when SSD mode persistent write back cache is enabled in ceph version 16.2.7-3.el8cp

Added by Preethi Nataraj over 2 years ago. Updated over 2 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Ilya Dryomov
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
12/15/2021
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We upgraded the cluster to the latest build and saw that I/Os failed to start (I/Os were triggered from both rbd bench and fio).
[root@magna031 ubuntu]# ceph version
ceph version 16.2.7-3.el8cp (54410e69e153d229a04fb6acc388f7e4afdd05e7) pacific (stable)

RBD bench output for reference -
[root@plena007 ubuntu]# rbd bench-write image1 --pool=test --io-threads=1
rbd: bench-write is deprecated, use rbd bench --io-type write ...
2021-12-14T07:25:30.666+0000 7fc3327fc700 -1 librbd::exclusive_lock::PostAcquireRequest: 0x7fc32c037000 handle_process_plugin_acquire_lock: failed to process plugins: (2) No such file or directory
rbd: failed to flush: 2021-12-14T07:25:30.669+0000 7fc3327fc700 -1 librbd::exclusive_lock::ImageDispatch: 0x7fc314002b60 handle_acquire_lock: failed to acquire exclusive lock: (2) No such file or directory
2021-12-14T07:25:30.669+0000 7fc3327fc700 -1 librbd::io::AioCompletion: 0x559cca568320 fail: (2) No such file or directory
(2) No such file or directory
bench failed: (2) No such file or directory

FIO output -
[root@plena007 ubuntu]# fio --name=test-1 --ioengine=rbd --pool=test1 --rbdname=image2 --numjobs=1 --rw=write --bs=4k --iodepth=1 --fsync=32 --runtime=480 --time_based --group_reporting --ramp_time=120
test-1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
fio-3.19
Starting 1 process
fio: io_u error on file test-1.0.0: No such file or directory: write offset=0, buflen=4096
fio: pid=1197333, err=2/file:io_u.c:1803, func=io_u error, error=No such file or directory

test-1: (groupid=0, jobs=1): err= 2 (file:io_u.c:1803, func=io_u error, error=No such file or directory): pid=1197333: Tue Dec 14 07:26:47 2021
cpu : usr=0.00%, sys=0.00%, ctx=2, majf=0, minf=5
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):

Disk stats (read/write):
sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
[root@plena007 ubuntu]#
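
Both failures surface the same "(2) No such file or directory" from the pwl_cache plugin while acquiring the exclusive lock, which usually points at the configured cache path rather than the image itself. A minimal diagnostic sketch, reusing the /mnt/nvme path and the test/image1 pool/image from this report (these checks are an assumption about the likely cause, not a confirmed root cause):

# Confirm the configured cache directory exists and is writable by the client (root here).
ls -ld /mnt/nvme/
touch /mnt/nvme/.pwl_write_test && rm -f /mnt/nvme/.pwl_write_test

# Confirm there is room for rbd_persistent_cache_size (1 GiB in this setup).
df -h /mnt/nvme/

# Re-run a small write benchmark against the same image to see whether the plugin now initializes.
rbd bench --io-type write --io-threads 1 --io-size 4K --io-total 16M test/image1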


Configuration and steps

1) After updating the conf file to SSD mode as below (tried from both the CLI and the conf file):

[root@plena007 log]# cat /etc/ceph/ceph.conf
# minimal ceph.conf for d6e5c458-0f10-11ec-9663-002590fc25a4
[global]
fsid = d6e5c458-0f10-11ec-9663-002590fc25a4
mon_host = [v2:10.8.128.31:3300/0,v1:10.8.128.31:6789/0]
[client]
rbd_cache = false
rbd_persistent_cache_mode = ssd
rbd_plugins = pwl_cache
rbd_persistent_cache_size = 1073741824
rbd_persistent_cache_path = /mnt/nvme/

Started I/Os using rbd bench and fio, and saw the above error.

Steps performed to set up the cache (a verification sketch follows this list):
1. Working Ceph cluster
2. Client node with an NVMe SSD
3. # ceph config set client rbd_persistent_cache_mode ssd
4. # ceph config set client rbd_plugins pwl_cache
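
A minimal verification sketch for the settings above, reusing the pool/image names from this report (the exact fields reported by rbd status are version dependent):

# Effective client-level options pushed via `ceph config set client ...`.
ceph config get client rbd_persistent_cache_mode
ceph config get client rbd_plugins
ceph config get client rbd_persistent_cache_path

# With pwl_cache loaded, rbd status should report an image cache state once a writer holds the exclusive lock.
rbd status test/image1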

Steps to enable DAX (a sketch of the full sequence follows these steps):

mount -o dax=always /dev/pmem0 <mountpoint>
Then set rbd_persistent_cache_path to the mountpoint:
# rbd config global set global rbd_persistent_cache_path <path>
After mounting, make sure that DAX is indeed enabled:
check dmesg for something like "EXT4-fs (pmem0): DAX enabled ...".
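
For the pmem/DAX variant above, a minimal sketch of the full sequence; the device name /dev/pmem0 and mountpoint /mnt/pmem are illustrative, and DAX applies to the pmem-backed (rwl) cache rather than the SSD mode exercised elsewhere in this report:

# Create a filesystem on the pmem device and mount it with DAX.
mkfs.ext4 /dev/pmem0
mount -o dax=always /dev/pmem0 /mnt/pmem

# Verify DAX actually took effect.
dmesg | grep -i dax          # expect something like "EXT4-fs (pmem0): DAX enabled"
mount | grep /mnt/pmem       # mount options should include dax

# Point the persistent cache at the DAX mountpoint.
rbd config global set global rbd_persistent_cache_path /mnt/pmem
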
Actions #1

Updated by Deepika Upadhyay over 2 years ago

@CONGMIN YIN are you seeing this issue as well? Otherwise I will verify it today.

Actions #2

Updated by jianpeng ma over 2 years ago

@Deepika Upadhyay, https://github.com/ceph/ceph/pull/44199 looks good. Does 16.2.7-3.el8cp contain this?

Actions #3

Updated by CONGMIN YIN over 2 years ago

Deepika Upadhyay wrote:

@CONGMIN YIN are you seeing this issue as well? Otherwise I will verify it today.

No, I haven't hit this issue. Could you please reproduce it and show its stack in GDB? Or tell me how I can download 16.2.7-3.el8cp so that I can try it myself.

By the way, I want to explain that for a long time I did not receive emails for @CONGMIN YIN mentions; the @xxx mention function does not seem to work properly. I changed the default email notification option, and after testing I now receive all rbd notifications, though it is not as smart as GitHub. I'm sorry I didn't receive the notifications before and didn't reply in time.
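
For reference, a minimal sketch of capturing such a backtrace with gdb, assuming librbd debuginfo is installed and reusing the pool/image from the original report; the breakpoint symbol is taken from the error message in the description:

gdb --args rbd bench --io-type write --io-threads 1 test/image1
(gdb) start                                          # stop at main so librbd symbols are loaded
(gdb) rbreak handle_process_plugin_acquire_lock
(gdb) continue
# When the breakpoint is hit, dump the current thread and then all threads.
(gdb) bt
(gdb) thread apply all bt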

Actions #4

Updated by CONGMIN YIN over 2 years ago

  • Subject changed from Failed to start IOs when SSD mode persistent write back cache is enabled in ceph version 16.2.7-3.el8cp to [pwl] Failed to start IOs when SSD mode persistent write back cache is enabled in ceph version 16.2.7-3.el8cp
Actions #5

Updated by Deepika Upadhyay over 2 years ago

jianpeng ma wrote:

@Deepika Upadhyay, https://github.com/ceph/ceph/pull/44199 looks good. Does 16.2.7-3.el8cp contain this?

@jianpeng ma yes

Actions #6

Updated by Preethi Nataraj over 2 years ago

After setting debug_rbd_pwl to 20, I captured the logs: http://pastebin.test.redhat.com/1017506. The issue is still seen in downstream ceph version 16.2.7-3.el8cp.

[root@plena007 nvme1]# ceph config get client
WHO MASK LEVEL OPTION VALUE RO
global basic container_image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3a98e9409fe47c4b96e15e49fed66872c7376ba3bac30cb79367c1355c9f2bf7 *
client advanced debug_rbd 30/30
client advanced debug_rbd_pwl 20/20
global basic log_to_file true
global basic rbd_compression_hint compressible
global dev rbd_config_pool_override_update_timestamp 1636975961
client advanced rbd_persistent_cache_mode ssd *
global advanced rbd_persistent_cache_path /mnt/pmem *
client advanced rbd_plugins pwl_cache *
[root@plena007 nvme1]# cat /etc/ceph/ceph.conf
# minimal ceph.conf for d6e5c458-0f10-11ec-9663-002590fc25a4
[global]
fsid = d6e5c458-0f10-11ec-9663-002590fc25a4
mon_host = [v2:10.8.128.31:3300/0,v1:10.8.128.31:6789/0]
[client]
rbd_cache = false
rbd_persistent_cache_mode = ssd
rbd_plugins = pwl_cache
rbd_persistent_cache_size = 1073741824
rbd_persistent_cache_path = /mnt/nvme1/

NOTE: updating ceph.conf as above does not show the cache type as SSD; instead we see RWL. We need to set it via the CLI to get the cache mode/type as SSD. However, the issue is still seen when I/O is started.
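
One way to see which configuration source wins for a given option is to query each level separately. A minimal sketch, reusing the pool/image names from this report (levels with no stored override will simply report that the option is not set there):

# Value the mon config store would hand to a client.
ceph config get client rbd_persistent_cache_mode

# Overrides stored at the RBD global / pool / image levels, if any.
rbd config global get global rbd_persistent_cache_mode
rbd config pool get test rbd_persistent_cache_mode
rbd config image get test/image1 rbd_persistent_cache_mode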

Actions #7

Updated by Preethi Nataraj over 2 years ago

Preethi Nataraj wrote:

After setting debug_rbd_pwl to 20, I captured the logs: http://pastebin.test.redhat.com/1017506. The issue is still seen in downstream ceph version 16.2.7-3.el8cp.

[root@plena007 nvme1]# ceph config get client
WHO MASK LEVEL OPTION VALUE RO
global basic container_image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3a98e9409fe47c4b96e15e49fed66872c7376ba3bac30cb79367c1355c9f2bf7 *
client advanced debug_rbd 30/30
client advanced debug_rbd_pwl 20/20
global basic log_to_file true
global basic rbd_compression_hint compressible
global dev rbd_config_pool_override_update_timestamp 1636975961
client advanced rbd_persistent_cache_mode ssd *
global advanced rbd_persistent_cache_path /mnt/nvme1 *
client advanced rbd_plugins pwl_cache *
[root@plena007 nvme1]# cat /etc/ceph/ceph.conf
# minimal ceph.conf for d6e5c458-0f10-11ec-9663-002590fc25a4
[global]
fsid = d6e5c458-0f10-11ec-9663-002590fc25a4
mon_host = [v2:10.8.128.31:3300/0,v1:10.8.128.31:6789/0]
[client]
rbd_cache = false
rbd_persistent_cache_mode = ssd
rbd_plugins = pwl_cache
rbd_persistent_cache_size = 1073741824
rbd_persistent_cache_path = /mnt/nvme1/

NOTE: updating ceph.conf as above does not show the cache type as SSD; instead we see RWL. We need to set it via the CLI to get the cache mode/type as SSD. However, the issue is still seen when I/O is started.

http://pastebin.test.redhat.com/1017510

Actions #8

Updated by Ilya Dryomov over 2 years ago

  • Status changed from New to Rejected
  • Assignee set to Ilya Dryomov

This is an issue with a particular downstream build, closing.

Actions #9

Updated by Preethi Nataraj over 2 years ago

The issue is still seen with the latest ceph version, 16.2.7-11.el8cp.
WHO MASK LEVEL OPTION VALUE RO
global basic container_image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:898b39a8ce0e88868ba897c2e1617520840293c6ba281250ef2f90cdd09cf0bb *
client advanced debug_rbd 30/30
client advanced debug_rbd_pwl 20/20
global basic log_to_file true
global basic rbd_compression_hint compressible
global dev rbd_config_pool_override_update_timestamp 1636975961
client advanced rbd_persistent_cache_mode ssd *
global advanced rbd_persistent_cache_path /mnt/nvme1 *
client advanced rbd_plugins pwl_cache

Output is copied here ---> http://pastebin.test.redhat.com/1017689
