Bug #57523

CephFS performance degradation in mountpoint

Added by Robert Sander 3 months ago. Updated 6 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source: Community (user)
Regression: No
Severity: 3 - minor

Description

Hi,

we have a cluster with 7 nodes each with 10 SSD OSDs providing CephFS to
a CloudStack system as primary storage.

When copying a large file into the mountpoint of the CephFS the
bandwidth drops from 500MB/s to 50MB/s after around 30 seconds. We see
some MDS activity in the output of "ceph fs status" at the same time.

It only affects the first file that gets written to.
Additional files can be written to with full speed at
the same time if started a little bit later.

When copying the same file to a subdirectory of the mountpoint the
performance stays at 500MB/s for the whole time. MDS activity does not
seem to influence the performance here.

There are approximately 270 other files in the mountpoint directory. CloudStack stores
VM images in qcow2 format there. About a dozen clients have mounted the filesystem.

There is no quota set in the filesystem.
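
For anyone trying to reproduce this, the comparison above (writing into the mountpoint itself versus into a subdirectory) could be measured with something like the following. This is only a rough sketch, not the method used for the numbers in this report: the function name, chunk size and sampling interval are arbitrary choices, and the writes go through the page cache just like a normal copy.

```python
import os
import time


def measure_write_throughput(path, total_bytes,
                             chunk_bytes=4 * 1024 * 1024, interval_s=1.0):
    """Write total_bytes to path; return a list of (elapsed_s, MB_per_s) samples.

    One sample is recorded roughly every interval_s seconds, so a drop
    after ~30 seconds would show up as a step in the second column.
    """
    chunk = os.urandom(chunk_bytes)  # incompressible payload
    samples = []
    written = 0
    window_bytes = 0
    start = time.monotonic()
    window_start = start
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
            window_bytes += n
            now = time.monotonic()
            if now - window_start >= interval_s:
                # Guard against a zero-length window on coarse clocks.
                window_s = max(now - window_start, 1e-9)
                samples.append((now - start, window_bytes / window_s / 1e6))
                window_bytes = 0
                window_start = now
        f.flush()
        os.fsync(f.fileno())  # make sure the data has left the client
    if window_bytes:
        now = time.monotonic()
        window_s = max(now - window_start, 1e-9)
        samples.append((now - start, window_bytes / window_s / 1e6))
    return samples
```

Running it once against a file directly in the mountpoint and once against a file in a subdirectory (for example the hypothetical paths /mnt/cephfs/testfile and /mnt/cephfs/subdir/testfile) with a total size of a few tens of GB should be enough to cover the 30 second mark.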

History

#1 Updated by Venky Shankar 2 months ago

Robert Sander wrote:

Hi,

we have a cluster with 7 nodes each with 10 SSD OSDs providing CephFS to
a CloudStack system as primary storage.

When copying a large file into the mountpoint of the CephFS the
bandwidth drops from 500MB/s to 50MB/s after around 30 seconds. We see
some MDS activity in the output of "ceph fs status" at the same time.

It only affects the first file that gets written to.
Additional files can be written to with full speed at
the same time if started a little bit later.

This is really weird. I do not have an explanation atm. Is the MDS running with default configuration?

#2 Updated by Vincent Hermes 2 months ago

Hi,

Yes, the MDS is running with the default configuration. We only tested whether two active MDS would help, but it didn't change anything, so we reverted to 1 active and some standby MDS.

Some additional information:
  • At the start, about 10 KVM clients mounted the root directory. 270 qcow2 images were present there (e.g. "cephfs:/myfile"). The problem was happening.
  • I created a simple folder inside the root file system and moved a qcow2 image there. Mounted one KVM client onto the directory (e.g. "cephfs:/thefolder/myfile"). The problem was not happening.
  • I thought the problem was the root directory, so I took everything down for maintenance and moved all files and mountpoints to the folder.
  • Got everything running again with mountpoint "cephfs:/thefolder/". The problem is back inside the folder and is actually reversed! The root directory is now the one which does not have the issue, and if I create another folder and use that one, it does not happen there either.

It basically only happens when everything is happening in one directory, but that's exactly what CephFS is used for in this case: shared storage for KVM.

Greetings
Vincent

#3 Updated by Vincent Hermes 6 days ago

Guys, this can't be specific to our setup. Every time a connection puts more than a few GB into CephFS, the performance goes from 500MB/s to 30MB/s and stays there. This is not acceptable for our use case.

I would be very happy if SOMEONE took an interest in this issue and checked whether I have a bug here or what can be done. I can assure you that system or network performance is NOT the problem.

Greetings
Vincent
