Bug #17231

CephFS : IOs get terminated while reading/writing on larger files

Added by Parikshith B over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
High
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

CephFS environment:
1 MDS node
1 Ceph kernel filesystem client, Ubuntu 14.04 LTS, kernel version 4.4.0-36-generic

Test Steps:

1. Wrote to the Ceph filesystem: 6 files of 1 TB each (100% write, 64 threads) using the VDbench IO tool.

2. When I do read/write operations on the same files, the IOs get terminated after 10-30 seconds. {code}
17:22:44.785 input argument scanned: '-ffs_profiles/4k_r_s'
17:22:44.785 input argument scanned: '-oax'
17:22:44.822 anchor=/home/ems/cephfs_mnt: there will be 1 directories and a maximum of 6 files under this anchor.
17:22:44.823 Estimated maximum size for this anchor: 6t
17:22:44.823
17:22:44.850 Starting slave: /home/ems/vdbench_tests/vdbench/vdbench SlaveJvm -m localhost -n localhost-10-160826-17.22.44.765 -l localhost-0 -p 5570
17:22:45.161 All slaves are now connected
17:22:46.000 Starting RD=rd1; elapsed=120 warmup=10; fwdrate=max. For loops: None

Aug 26, 2016 .Interval. .ReqstdOps.. ...cpu%... read ....read.... ...write.... ..mb/sec... mb/sec .xfer.. ...mkdir... ...rmdir... ..create... ...open.... ...close... ..delete... ..getattr.. ..setattr..
rate resp total sys pct rate resp rate resp read write total size rate resp rate resp rate resp rate resp rate resp rate resp rate resp rate resp
17:22:52.052 6 585.0 56.12 � � 100.0 585.0 56.12 0.0 0.00 2.29 0.00 2.29 4096 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00
17:22:53.054 7 1178.0 5.08 � � 100.0 1178.0 5.08 0.0 0.00 4.60 0.00 4.60 4096 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00
17:22:54.050 8 1138.0 5.10 � � 100.0 1138.0 5.10 0.0 0.00 4.45 0.00 4.45 4096 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00
17:22:55.048 9 1060.0 5.83 � � 100.0 1060.0 5.83 0.0 0.00 4.14 0.00 4.14 4096 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00
17:22:56.047 10 1086.0 5.50 � � 100.0 1086.0 5.50 0.0 0.00 4.24 0.00 4.24 4096 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00
17:22:57.046 11 1131.0 5.20 � � 100.0 1131.0 5.20 0.0 0.00 4.42 0.00 4.42 4096 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00
17:22:57.410
17:22:57.411 Message from slave localhost-0:
17:22:57.411 file=/home/ems/cephfs_mnt/vdb.1_1.dir/vdb_f0001.file,busy=true
17:22:57.411 file=/home/ems/cephfs_mnt/vdb.1_1.dir/vdb_f0002.file,busy=true
17:22:57.411 file=/home/ems/cephfs_mnt/vdb.1_1.dir/vdb_f0003.file,busy=true
17:22:57.411 file=/home/ems/cephfs_mnt/vdb.1_1.dir/vdb_f0004.file,busy=true
17:22:57.411 file=/home/ems/cephfs_mnt/vdb.1_1.dir/vdb_f0005.file,busy=true
17:22:57.411 file=/home/ems/cephfs_mnt/vdb.1_1.dir/vdb_f0006.file,busy=true
17:22:57.411 Thread: FwgThread read /home/ems/cephfs_mnt rd=rd1 For loops: None
17:22:57.411
17:22:57.411 last_ok_request: Fri Aug 26 17:22:45 IST 2016
17:22:57.411 Duration: 11.84 seconds
17:22:57.411 consecutive_blocks: 10001
17:22:57.411 last_block: FILE_BUSY File busy
17:22:57.411 operation: read
17:22:57.411
17:22:57.411 Do you maybe have more threads running than that you have
17:22:57.411 files and therefore some threads ultimately give up after 10000 tries?
17:22:57.412
17:22:57.412 **************************************************
17:22:57.412 Slave localhost-0 aborting: Too many thread blocks
17:22:57.412 **************************************************
17:22:57.412 {code}

3. Tried reducing the thread count to 8; the IOs still get terminated. (A reconstruction of the run parameters, based on the log above, is sketched below.)
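The exact parameter file was not attached; the following is a hedged reconstruction of the failing run based only on the log and description above (6 x 1 TB files, 4k transfers, a read phase with elapsed=120/warmup=10, and the originally reported 64 threads). The fsd/fwd names are assumptions.
{code}
# Reconstructed vdbench parameter file -- names are assumptions, values come from the log/description.
# fsd: 6 files of 1 TB under the CephFS mount point
fsd=fsd1,anchor=/home/ems/cephfs_mnt,depth=1,width=1,files=6,size=1t

# fwd: 4k random reads with 64 threads -- more threads than files, which is what
# later turns out to trigger the "Too many thread blocks" abort
fwd=fwd1,fsd=fsd*,operation=read,xfersize=4k,fileio=random,threads=64

# rd: matches "Starting RD=rd1; elapsed=120 warmup=10; fwdrate=max" from the log
rd=rd1,fwd=fwd*,elapsed=120,warmup=10,interval=1,fwdrate=max
{code}
Run with something like ./vdbench -f ffs_profiles/4k_r_s -o output_dir, matching the invocation visible in the log.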

logfile_vdbench.html (5.66 KB) Parikshith B, 09/07/2016 10:53 AM

ceph-mds.log.log (9.11 KB) Parikshith B, 09/07/2016 10:53 AM

History

#1 Updated by John Spray over 7 years ago

I see several clients in your MDS log. How are the operations in your test split between clients?

It is possible you're seeing this because you wrote a large file from one node, then you are reading it from another, which requires the MDS to block until the first node has flushed all the data in the file.
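A minimal shell sketch of that scenario (paths and sizes are illustrative, not taken from this report): write a large file on one client, then read it from a second client before the first has finished flushing.
{code}
# Client A: write a large file through its CephFS mount (illustrative size/path).
dd if=/dev/zero of=/home/ems/cephfs_mnt/bigfile bs=1M count=102400

# Client B: read the same file right away. The open/read can stall while the MDS
# revokes client A's write capabilities and waits for A to flush its dirty data --
# the blocking described above.
time dd if=/home/ems/cephfs_mnt/bigfile of=/dev/null bs=1M count=1024
{code}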

#2 Updated by Parikshith B over 7 years ago

Yes, I ran it on multiple clients, but each client was mounted on a different sub-directory (each containing 6 files of 1 TB), and those sub-directories were created from the root mount.

client1: $ mount
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
tracefs on /var/lib/ureadahead/debugfs/tracing type tracefs (rw,relatime)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
10.242.43.21:/ on /home/ceph/mnt/cephfs type ceph (0) #root
10.242.43.21:/client-1 on /home/ceph/cephfs_mnt type ceph (0)

client-2: $ mount
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
tracefs on /var/lib/ureadahead/debugfs/tracing type tracefs (rw,relatime)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
10.242.43.21:/client-2 on /home/ems/cephfs_mnt type ceph (0)

client-3: $ mount
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
tracefs on /var/lib/ureadahead/debugfs/tracing type tracefs (rw,relatime)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
10.242.43.21:/client-3 on /home/ems/cephfs_mnt type ceph (0)

All clients were on the 4.4 kernel.
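For reference, mounts like the listings above would typically be created with the kernel client along these lines (a sketch; the monitor address matches the listings, but the authentication options are assumptions):
{code}
# Mount the filesystem root (as on client1), then create the per-client sub-directory
sudo mount -t ceph 10.242.43.21:/ /home/ceph/mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
sudo mkdir -p /home/ceph/mnt/cephfs/client-1

# Mount only that subtree on the client that runs the workload
sudo mount -t ceph 10.242.43.21:/client-1 /home/ceph/cephfs_mnt -o name=admin,secretfile=/etc/ceph/admin.secret
{code}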

#3 Updated by Greg Farnum over 7 years ago

I don't have any idea what this test does, but the output you pasted includes:

17:22:57.411 Do you maybe have more threads running than that you have
17:22:57.411 files and therefore some threads ultimately give up after 10000 tries?

Is there some reason you think that recommendation is wrong?

#4 Updated by Zheng Yan over 7 years ago

The slow requests seem to be caused by "dirty page writeback".
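A hedged way to check that on the writing client (generic Linux commands, not specific to this report): watch the dirty/writeback counters during the write phase and flush before starting the read phase.
{code}
# Watch how much dirty data is pending writeback while vdbench is writing
watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"

# Flush the page cache before the read phase so reads are not
# queued behind a large writeback backlog
sync
{code}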

#5 Updated by Vishal Kanaujia over 7 years ago

  • Assignee set to Vishal Kanaujia

The problem lay with the usage of VDbench: if you have 256 threads but only 10 files, VDBench may fail.

From the VDBench user guide:

"
‘threads=’: how many concurrent operations for this workload Specifies how many concurrent threads to run for this workload. It should be clear that this does not mean that ‘n’ threads are running against each file, but instead it means that there will be ‘n’ concurrent files running this same workload. Unless overridden using the fileio=(random,shared) parameter All file operations for a specific directory or file are single threaded. See MultiThreading and file system testing.
Make sure you always have at least one file for each thread. If not, one or more threads continue trying to find an available file, but Vdbench gives up after 10,000 consecutive failed attempts
"

The following run works on my CephFS cluster: {code}
fsd=fsd1,anchor=/mnt/cephfs_perf,depth=1,width=1,files=10,size=5G,openflags=o_sync,totalsize=50000000000
fwd=fwd1,fsd=fsd*,operation=write,xfersize=(4k),fileio=random,threads=10,fileselect=seq
rd=rd1,fwd=fwd*,interval=1,fwdrate=max,format=yes
{code}

You should use one of the following fileselect types:
a. random (number of threads ≤ number of files)
b. seq (number of threads ≤ number of files)
c. (random,shared)
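Applied to the reported setup (6 files), that means either capping the thread count at 6 or letting threads share files; for example, fwd lines along these lines (a sketch reusing the assumed fsd from the reconstruction in the description):
{code}
# Options a/b: no more threads than files (6 files -> at most 6 threads)
fwd=fwd1,fsd=fsd*,operation=read,xfersize=4k,fileio=random,threads=6

# Option c: let multiple threads share the same files
fwd=fwd1,fsd=fsd*,operation=read,xfersize=4k,fileio=(random,shared),threads=64
{code}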

Closing the bug.

#6 Updated by Vishal Kanaujia over 7 years ago

  • Status changed from New to Resolved

It was a tool usage problem, unrelated to CephFS. Closing the bug.
