Support #64021 (open)

CephFS first read after write

Added by Nishit Khosla 4 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee:
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:
Description

We are working with Kubernetes and have implemented CephFS. We have an application that scales out with replica pods and mounts the same PVC in different replica pods running on different worker nodes. As per our observation, if there is only one pod reading/writing to CephFS, the first read after a write is served from the client cache. However, when the pods are scaled up (i.e. replicas > 1), then after writing a file to CephFS from a k8s pod (different pods writing to different files), the first read from the same pod is not served from the client cache. This degrades read performance when scaling out. What setting can be used so that first reads are served from the client cache, as long as the data is still in the cache, when the file was written from the same pod?

We were looking at client capabilities but could not find a way to modify the capabilities granted by the MDS to the Ceph client so that reads are served from the client cache.

Actions #1

Updated by Venky Shankar 3 months ago

  • Assignee set to Venky Shankar
Actions #2

Updated by Venky Shankar 3 months ago

Nishit Khosla wrote:

We are working with Kubernetes and have implemented CephFS. We have an application that scales out with replica pods and mounts the same PVC in different replica pods running on different worker nodes. As per our observation, if there is only one pod reading/writing to CephFS, the first read after a write is served from the client cache. However, when the pods are scaled up (i.e. replicas > 1), then after writing a file to CephFS from a k8s pod (different pods writing to different files), the first read from the same pod is not served from the client cache.

Is the other client pod writing/updating the same file for which the first client pod has the data cached? If yes, this behaviour of bypassing the cache is expected, since the cached data is stale. Internally, the MDS revokes the "Frc" caps from the first client before allowing the second client to write data. This way, the next time the first client tries to read the same file, it has to read the (just written) data from the OSDs.
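For reference, the caps a client session currently holds can be checked from the MDS side, and for the kernel client also via debugfs on the client node. A rough sketch (the MDS name placeholder and a mounted debugfs are assumptions):

ceph tell mds.<mds-name> session ls        # lists client sessions, including how many caps each one holds
cat /sys/kernel/debug/ceph/*/caps          # kernel client only: per-inode caps held (needs root and debugfs mounted)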

Actions #3

Updated by Nishit Khosla 3 months ago

Is the other client pod writing/updating the same file for which the first client pod has the data cached?

No, the client pods are writing data to different files.

Actions #4

Updated by Venky Shankar 3 months ago

Nishit Khosla wrote:

Is the other client pod writing/updating the same file for which the first client pod has the data cached?

No, the client pods are writing data to different files.

Are the "different" files under the same directory? That would explain the drop in IOPS since the parent directories mtime would need to be updated. Also, what's the cluster setup? (single mds, kclient, fuse, etc..)?

Actions #5

Updated by Nishit Khosla 3 months ago

Hello,

We observed with fio that if I run a test with a block size of 64k, the cache is not utilized; however, if I change the block size to 1M, the fio results show that the cache is used. So I am confused about when Ceph uses the cache and when it cannot. Any help is appreciated.

We are using 2 mds deployed using rook-ceph.

The directory where I am running the tests has the below attribute set:

~:/cephtest # getfattr -n ceph.dir.layout testing3
# file: testing3
ceph.dir.layout="stripe_unit=1048576 stripe_count=2 object_size=16777216 pool=ceph-filesystem-data0"
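For context, a directory layout like this is normally applied with CephFS's virtual xattrs before files are created in the directory (it only affects files created afterwards). A sketch of the equivalent commands, with values copied from the layout above:

setfattr -n ceph.dir.layout.stripe_unit -v 1048576 testing3
setfattr -n ceph.dir.layout.stripe_count -v 2 testing3
setfattr -n ceph.dir.layout.object_size -v 16777216 testing3
setfattr -n ceph.dir.layout.pool -v ceph-filesystem-data0 testing3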

Below is the command used for fio with bs=64k:

sample-deployment1-d779c89bb-kxstg:/mnt/pvc-1/testing3 # fio --filename=/mnt/pvc-1/testing3/manish --size=5G --direct=0 --rw=randrw --bs=64k --ioengine=libaio --iodepth=4 --runtime=60 --rwmixread=70 --time_based --group_reporting --name=iops-test-job --eta-newline=1

iops-test-job: (g=0): rw=randrw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=4
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [m(1)][5.0%][r=48.7MiB/s,w=21.8MiB/s][r=778,w=348 IOPS][eta 00m:57s]
Jobs: 1 (f=1): [m(1)][8.3%][r=47.4MiB/s,w=21.4MiB/s][r=758,w=343 IOPS][eta 00m:55s]
Jobs: 1 (f=1): [m(1)][11.7%][r=49.4MiB/s,w=21.6MiB/s][r=790,w=345 IOPS][eta 00m:53s]
Jobs: 1 (f=1): [m(1)][15.0%][r=50.2MiB/s,w=19.6MiB/s][r=803,w=314 IOPS][eta 00m:51s]
Jobs: 1 (f=1): [m(1)][18.3%][r=34.1MiB/s,w=15.1MiB/s][r=545,w=242 IOPS][eta 00m:49s]
Jobs: 1 (f=1): [m(1)][21.7%][r=47.0MiB/s,w=20.0MiB/s][r=767,w=320 IOPS][eta 00m:47s]
Jobs: 1 (f=1): [m(1)][25.0%][r=46.9MiB/s,w=19.6MiB/s][r=750,w=314 IOPS][eta 00m:45s]
Jobs: 1 (f=1): [m(1)][28.3%][r=52.6MiB/s,w=20.8MiB/s][r=842,w=333 IOPS][eta 00m:43s]
Jobs: 1 (f=1): [m(1)][31.7%][r=49.3MiB/s,w=23.3MiB/s][r=789,w=373 IOPS][eta 00m:41s]
Jobs: 1 (f=1): [m(1)][35.0%][r=50.5MiB/s,w=20.9MiB/s][r=808,w=335 IOPS][eta 00m:39s]
Jobs: 1 (f=1): [m(1)][38.3%][r=48.8MiB/s,w=20.7MiB/s][r=780,w=331 IOPS][eta 00m:37s]
Jobs: 1 (f=1): [m(1)][41.7%][r=47.1MiB/s,w=20.3MiB/s][r=753,w=325 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [m(1)][45.0%][r=50.9MiB/s,w=19.3MiB/s][r=813,w=309 IOPS][eta 00m:33s]
Jobs: 1 (f=1): [m(1)][48.3%][r=54.1MiB/s,w=22.3MiB/s][r=865,w=356 IOPS][eta 00m:31s]
Jobs: 1 (f=1): [m(1)][51.7%][r=52.4MiB/s,w=22.5MiB/s][r=838,w=359 IOPS][eta 00m:29s]
Jobs: 1 (f=1): [m(1)][55.0%][r=51.1MiB/s,w=21.0MiB/s][r=816,w=351 IOPS][eta 00m:27s]
Jobs: 1 (f=1): [m(1)][58.3%][r=50.4MiB/s,w=23.5MiB/s][r=805,w=375 IOPS][eta 00m:25s]
Jobs: 1 (f=1): [m(1)][61.7%][r=15.1MiB/s,w=6214KiB/s][r=242,w=97 IOPS][eta 00m:23s]
Jobs: 1 (f=1): [m(1)][65.0%][r=38.8MiB/s,w=14.5MiB/s][r=620,w=232 IOPS][eta 00m:21s]
Jobs: 1 (f=1): [m(1)][68.3%][r=45.9MiB/s,w=21.1MiB/s][r=733,w=337 IOPS][eta 00m:19s]
Jobs: 1 (f=1): [m(1)][71.7%][r=49.8MiB/s,w=21.2MiB/s][r=797,w=339 IOPS][eta 00m:17s]
Jobs: 1 (f=1): [m(1)][75.0%][r=47.2MiB/s,w=20.4MiB/s][r=755,w=327 IOPS][eta 00m:15s]
Jobs: 1 (f=1): [m(1)][78.3%][r=48.4MiB/s,w=20.3MiB/s][r=774,w=325 IOPS][eta 00m:13s]
Jobs: 1 (f=1): [m(1)][81.7%][r=44.4MiB/s,w=20.2MiB/s][r=710,w=324 IOPS][eta 00m:11s]
Jobs: 1 (f=1): [m(1)][85.0%][r=49.8MiB/s,w=20.4MiB/s][r=796,w=327 IOPS][eta 00m:09s]
Jobs: 1 (f=1): [m(1)][88.3%][r=48.0MiB/s,w=21.0MiB/s][r=768,w=336 IOPS][eta 00m:07s]
Jobs: 1 (f=1): [m(1)][91.7%][r=46.7MiB/s,w=18.5MiB/s][r=746,w=296 IOPS][eta 00m:05s]
Jobs: 1 (f=1): [m(1)][95.0%][r=49.2MiB/s,w=19.4MiB/s][r=787,w=310 IOPS][eta 00m:03s]
Jobs: 1 (f=1): [m(1)][98.3%][r=47.6MiB/s,w=22.5MiB/s][r=761,w=359 IOPS][eta 00m:01s]
Jobs: 1 (f=1): [m(1)][100.0%][r=50.8MiB/s,w=23.2MiB/s][r=812,w=371 IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=1): err= 0: pid=1634: Tue Jan 30 10:55:37 2024
...Omitted output...

With bs=1M (the read/write throughput reaches roughly 4 GB/s and hence is coming from cache; also confirmed by the iostat output, not attached):

sample-deployment1-d779c89bb-kxstg:/mnt/pvc-1 # fio --filename=/mnt/pvc-1/testing3/manish --size=5G --direct=0 --rw=randrw --bs=1M --ioengine=libaio --iodepth=4 --runtime=60 --rwmixread=70 --time_based --group_reporting --name=iops-test-job --eta-newline=1
iops-test-job: (g=0): rw=randrw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=4
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [m(1)][5.0%][r=182MiB/s,w=86.1MiB/s][r=182,w=86 IOPS][eta 00m:57s]
Jobs: 1 (f=1): [m(1)][8.3%][r=271MiB/s,w=111MiB/s][r=271,w=111 IOPS][eta 00m:55s]
Jobs: 1 (f=1): [m(1)][11.7%][r=242MiB/s,w=109MiB/s][r=242,w=109 IOPS][eta 00m:53s]
Jobs: 1 (f=1): [m(1)][15.0%][r=267MiB/s,w=120MiB/s][r=267,w=120 IOPS][eta 00m:51s]
Jobs: 1 (f=1): [m(1)][18.3%][r=259MiB/s,w=138MiB/s][r=259,w=138 IOPS][eta 00m:49s]
Jobs: 1 (f=1): [m(1)][22.0%][r=273MiB/s,w=123MiB/s][r=273,w=123 IOPS][eta 00m:46s]
Jobs: 1 (f=1): [m(1)][25.0%][r=996MiB/s,w=445MiB/s][r=996,w=445 IOPS][eta 00m:45s]
Jobs: 1 (f=1): [m(1)][28.3%][r=3012MiB/s,w=1287MiB/s][r=3012,w=1287 IOPS][eta 00m:43s]
Jobs: 1 (f=1): [m(1)][31.7%][r=3053MiB/s,w=1317MiB/s][r=3053,w=1317 IOPS][eta 00m:41s]
Jobs: 1 (f=1): [m(1)][35.0%][r=3070MiB/s,w=1334MiB/s][r=3070,w=1334 IOPS][eta 00m:39s]
Jobs: 1 (f=1): [m(1)][38.3%][r=3141MiB/s,w=1363MiB/s][r=3141,w=1363 IOPS][eta 00m:37s]
Jobs: 1 (f=1): [m(1)][41.7%][r=3129MiB/s,w=1392MiB/s][r=3129,w=1392 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [m(1)][45.0%][r=3181MiB/s,w=1346MiB/s][r=3181,w=1346 IOPS][eta 00m:33s]
Jobs: 1 (f=1): [m(1)][48.3%][r=1358MiB/s,w=575MiB/s][r=1358,w=574 IOPS][eta 00m:31s]
Jobs: 1 (f=1): [m(1)][52.5%][r=3089MiB/s,w=1364MiB/s][r=3089,w=1364 IOPS][eta 00m:28s]
Jobs: 1 (f=1): [m(1)][55.0%][r=3284MiB/s,w=1291MiB/s][r=3284,w=1291 IOPS][eta 00m:27s]
Jobs: 1 (f=1): [m(1)][58.3%][r=1024KiB/s,w=1024KiB/s][r=1,w=1 IOPS][eta 00m:25s]
Jobs: 1 (f=1): [m(1)][62.7%][r=9216KiB/s,w=4096KiB/s][r=9,w=4 IOPS][eta 00m:22s]
Jobs: 1 (f=1): [m(1)][66.1%][r=2048KiB/s,w=4096KiB/s][r=2,w=4 IOPS][eta 00m:20s]
Jobs: 1 (f=1): [m(1)][68.3%][r=2988MiB/s,w=1263MiB/s][r=2987,w=1263 IOPS][eta 00m:19s]
Jobs: 1 (f=1): [m(1)][71.7%][r=3116MiB/s,w=1356MiB/s][r=3116,w=1356 IOPS][eta 00m:17s]
Jobs: 1 (f=1): [m(1)][75.0%][r=3219MiB/s,w=1377MiB/s][r=3219,w=1377 IOPS][eta 00m:15s]
Jobs: 1 (f=1): [m(1)][78.3%][r=3076MiB/s,w=1393MiB/s][r=3076,w=1393 IOPS][eta 00m:13s]
Jobs: 1 (f=1): [m(1)][81.7%][r=3133MiB/s,w=1372MiB/s][r=3133,w=1372 IOPS][eta 00m:11s]
Jobs: 1 (f=1): [m(1)][85.0%][r=3196MiB/s,w=1360MiB/s][r=3196,w=1360 IOPS][eta 00m:09s]
Jobs: 1 (f=1): [m(1)][88.3%][r=3221MiB/s,w=1354MiB/s][r=3221,w=1354 IOPS][eta 00m:07s]
Jobs: 1 (f=1): [m(1)][91.7%][r=3159MiB/s,w=1479MiB/s][r=3159,w=1479 IOPS][eta 00m:05s]
Jobs: 1 (f=1): [m(1)][95.0%][r=3186MiB/s,w=1353MiB/s][r=3186,w=1353 IOPS][eta 00m:03s]
Jobs: 1 (f=1): [m(1)][98.3%][r=3169MiB/s,w=1376MiB/s][r=3169,w=1376 IOPS][eta 00m:01s]
Jobs: 1 (f=1): [m(1)][100.0%][r=3161MiB/s,w=1430MiB/s][r=3161,w=1430 IOPS][eta 00m:00s]

iops-test-job: (groupid=0, jobs=1): err= 0: pid=1380: Tue Jan 30 07:30:31 2024
...Omitted output...
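One way to make the cached vs. uncached comparison more deterministic is to flush the client page cache between fio runs (with --direct=0 the I/O goes through the page cache). A minimal sketch, assuming root on the client node:

sync                                     # flush dirty pages first
echo 3 > /proc/sys/vm/drop_caches        # drop page cache, dentries and inodes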

Actions #6

Updated by Venky Shankar 3 months ago

Are you using ceph-fuse (or libcephfs binding)?

Actions #7

Updated by Nishit Khosla 3 months ago

This is the kernel client (ceph/libceph kernel modules):

seliicbl00315:/home/nishit # lsmod |grep -i ceph
ceph 528384 3
libceph 401408 2 ceph,rbd
fscache 393216 3 ceph,nfsv4,nfs
libcrc32c 16384 6 nf_conntrack,nf_nat,bnx2x,xfs,libceph,ip_vs
seliicbl00315:/home/nishit # modinfo libceph
filename: /lib/modules/5.3.18-57-default/kernel/net/ceph/libceph.ko.xz
license: GPL
description: Ceph core library
author: Patience Warnick <>
author: Yehuda Sadeh <>
author: Sage Weil <>
suserelease: SLE15-SP3
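The mount type can also be confirmed directly from the mount table (FSTYPE "ceph" indicates the kernel client, while ceph-fuse mounts typically show up as "fuse.ceph-fuse"), for example:

findmnt -o TARGET,FSTYPE,SOURCE /mnt/pvc-1    # shows the filesystem type backing the PVC mount
mount -t ceph                                 # lists all kernel CephFS mounts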

Actions #8

Updated by Venky Shankar 3 months ago

Nishit Khosla wrote:

Hello,

We observed with fio that if I run a test with a block size of 64k, the cache is not utilized; however, if I change the block size to 1M, the fio results show that the cache is used. So I am confused about when Ceph uses the cache and when it cannot. Any help is appreciated.

We are using 2 mds deployed using rook-ceph.

I assume this is 1 active and 1 standby, yes?

The directory where I am running the tests has the below attribute set:

~:/cephtest # getfattr -n ceph.dir.layout testing3
# file: testing3
ceph.dir.layout="stripe_unit=1048576 stripe_count=2 object_size=16777216 pool=ceph-filesystem-data0"

That's a non-default object size and striping strategy. What is the reason to tune that? Have you run tests at RADOS level to assess if this striping strategy gives you optimal performance?
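A RADOS-level baseline could be gathered with rados bench against the data pool, for example (pool name taken from the layout above; duration, block size and concurrency here are just illustrative):

rados bench -p ceph-filesystem-data0 60 write -b 65536 -t 4 --no-cleanup   # 60s of 64 KiB writes, keep objects for the read test
rados bench -p ceph-filesystem-data0 60 rand -t 4                          # random reads against the objects written above
rados -p ceph-filesystem-data0 cleanup                                     # remove the benchmark objects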

Actions #9

Updated by Nishit Khosla 3 months ago

Yes, MDS is 1 active and 1 standby.

This is not the default; the default was a 4 MB stripe_unit. We saw a recommendation from IBM to tune the stripe_unit and object_size for better performance.

We are working to see the performance benefits of striping, hence these settings.

We have not yet run any tests at the RADOS level, since RADOS does not give us any RWX capability.
