
Bug #11635

hung fsx run

Added by Greg Farnum over 4 years ago. Updated about 3 years ago.

Status:
Rejected
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
05/14/2015
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
kceph
Labels (FS):
Pull request ID:

Description

http://typica002.front.sepia.ceph.com/teuthology-2015-05-11_23:08:01-kcephfs-master-testing-basic-typica/15179/

2015-05-14T04:17:59.868 INFO:tasks.workunit.client.0.typica090.stderr:+ ./fsx -z 10MB -N 50000 -p 10000 -l 10485760
2015-05-14T04:18:03.167 INFO:tasks.workunit.client.0.typica090.stderr:fsx: main: filesystem does not support fallocate mode 0x8, disabling!
2015-05-14T04:18:03.167 INFO:tasks.workunit.client.0.typica090.stderr:: Operation not supported
2015-05-14T04:18:03.168 INFO:tasks.workunit.client.0.typica090.stderr:fsx: main: filesystem does not support fallocate mode 0x20, disabling!
2015-05-14T04:18:03.168 INFO:tasks.workunit.client.0.typica090.stderr:: Operation not supported
2015-05-14T04:18:03.169 INFO:tasks.workunit.client.0.typica090.stdout:skipping zero size read
2015-05-14T04:18:03.169 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x2e1e34
2015-05-14T04:18:05.658 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x38761c
2015-05-14T04:18:07.145 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x48106f
2015-05-14T04:18:08.528 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x8fc158
2015-05-14T04:19:03.066 INFO:tasks.workunit.client.0.typica090.stdout:fallocating to largest ever: 0x943e97
2015-05-14T04:19:49.184 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x99f165
2015-05-14T04:20:40.269 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x9c8eb9
2015-05-14T04:20:56.617 INFO:tasks.workunit.client.0.typica090.stdout:fallocating to largest ever: 0x9f340c
2015-05-14T04:23:32.880 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x9f9a2e
2015-05-14T04:28:30.805 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x9fe587
2015-05-14T04:29:55.515 INFO:tasks.workunit.client.0.typica090.stdout:truncating to largest ever: 0x9fe996
2015-05-14T04:37:17.877 INFO:tasks.workunit.client.0.typica090.stdout:fallocating to largest ever: 0xa00000
2015-05-14T04:48:15.607 INFO:tasks.workunit.client.0.typica090.stdout:10000 write       0x8100c7 thru   0x81f664        (0xf59e bytes)
2015-05-14T05:51:50.796 INFO:tasks.workunit.client.0.typica090.stdout:signal 15
2015-05-14T05:51:50.797 INFO:tasks.workunit.client.0.typica090.stdout:testcalls = 25999

History

#1 Updated by Zheng Yan over 4 years ago

  • Status changed from New to Need More Info

I checked the OSD log. The test was not hung, just very slow: in this run, fsx took about 20 minutes to reach the "10000 write ..." line. I ran the same command on my local test machine, and fsx took only half a minute to reach the same point.

#2 Updated by Greg Farnum over 4 years ago

Was it apparent why things were so slow? Just the OSDs being sluggish, or something else? :/

#3 Updated by Zheng Yan over 4 years ago

  • Status changed from Need More Info to Verified

I compared osd.log with a successful run's log. It seems the OSD handled [write,startsync] slowly: each op took several hundred microseconds, while the successful run took only a few microseconds.

2015-05-14 05:51:29.547351 7f70666a4700 15 osd.3 pg_epoch: 9 pg[2.0( v 9'2210 (0'0,9'2210] local-les=9 n=201 ec=8 les/c 9/9 8/8/8) [3,1] r=0 lpr=8 crt=9'2208 lcod 9'2209 mlcod 9'2208 active+clean] log_op_stats osd_op(client.14111.1:34747 100000007cb.00000002 [write 610304~40960 [1806@0],startsync 0~0] 2.a2267b20 snapc 1=[] ondisk+write e9) v4 inb 40960 outb 0 rlat 0.009291 lat 0.257431

2015-05-14 05:51:29.757313 7f7062e9d700 15 osd.3 pg_epoch: 9 pg[2.2( v 9'3596 (9'500,9'3596] local-les=9 n=189 ec=8 les/c 9/9 8/8/8) [3,1] r=0 lpr=8 crt=9'3594 lcod 9'3595 mlcod 9'3594 active+clean] log_op_stats osd_op(client.14111.1:34749 100000007cb.00000001 [write 3710976~40960 [1806@0],startsync 0~0] 2.526004aa snapc 1=[] ondisk+write e9) v4 inb 40960 outb 0 rlat 0.013148 lat 0.206764

2015-05-14 05:51:43.995137 7f70656a2700 15 osd.3 pg_epoch: 9 pg[2.2( v 9'3601 (9'600,9'3601] local-les=9 n=189 ec=8 les/c 9/9 8/8/8) [3,1] r=0 lpr=8 crt=9'3599 lcod 9'3600 mlcod 9'3599 active+clean] log_op_stats osd_op(client.14111.1:34779 100000007cb.00000001 [write 3297280~61835 [1809@0],startsync 0~0] 2.526004aa snapc 1=[] ondisk+write e9) v4 inb 61835 outb 0 rlat 0.013707 lat 0.566197
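For reference, the `lat` field in these log_op_stats lines is in seconds, so 0.257431 is roughly 257 ms per op. A minimal sketch (hypothetical helper, not part of the QA tooling) for pulling those values out of a log to compare runs:

```python
import re

# Match the op latency field ("lat <seconds>") in log_op_stats lines;
# the word boundary keeps this from also matching the "rlat" field.
LAT_RE = re.compile(r"\blat (\d+\.\d+)")

def op_latencies(lines):
    """Yield per-op latency in seconds from log_op_stats log lines."""
    for line in lines:
        m = LAT_RE.search(line)
        if m:
            yield float(m.group(1))

# Trimmed copies of the three osd.3 lines quoted above.
sample = [
    "... ondisk+write e9) v4 inb 40960 outb 0 rlat 0.009291 lat 0.257431",
    "... ondisk+write e9) v4 inb 40960 outb 0 rlat 0.013148 lat 0.206764",
    "... ondisk+write e9) v4 inb 61835 outb 0 rlat 0.013707 lat 0.566197",
]
lats = list(op_latencies(sample))
print([f"{l * 1000:.0f} ms" for l in lats])  # hundreds of ms per write
```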

#4 Updated by Greg Farnum over 4 years ago

  • Status changed from Verified to Rejected

s/microseconds/milliseconds :)

Rejecting this, since it's an infrastructure issue we can't do much about for now, and there are other efforts underway to establish the performance of these nodes.

#5 Updated by Greg Farnum about 3 years ago

  • Component(FS) kceph added
