Bug #6366

closed

long TAT - due to too long residence time in Ceph code

Added by Dieter Kasper over 10 years ago. Updated almost 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In a default environment (repli=2), a 4k IO takes approx. 2000 microseconds TAT (turn-around time) from the start of the IO to its completion (e.g. using /dev/rbdX).
The time needed to transfer the 4k block once over the network is ~100 microseconds (10GbE) or ~20 microseconds (56 Gb IPoIB).
The time to write a 4k block to disk is ~60 microseconds (PCIe SSD).

So, approx. 1600 microseconds of a single IO are spent in the Ceph code, which is way too much!

test (e.g.) with
fio --filename=/dev/rbd1 --direct=1 --rw=write --bs=4k --size=10G --numjobs=1 --runtime=60 --group_reporting --name=file1 --output=fio_write_4k_1.numjob=1.debug_1_10_10
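
For reference, a rough restatement of the budget above as a small C++ sketch. The split of the non-Ceph time into two network transfers plus two disk writes (one per replica for repli=2) is an assumption made here for illustration; the ticket itself only gives the per-hop numbers and the ~2000 us total, and the result (~1680 us) is consistent with the ~1600 us quoted above.

// Rough latency budget for one replicated 4k write, using the numbers from
// the description.  The "replicas * (network + disk)" split is an assumption
// for illustration only, not something stated in the ticket.
#include <cstdio>

int main() {
    const double total_tat_us  = 2000.0; // measured 4k write TAT via /dev/rbdX
    const double net_10gbe_us  = 100.0;  // one 4k transfer over 10GbE
    const double disk_write_us = 60.0;   // one 4k write on a PCIe SSD
    const int    replicas      = 2;      // default pool size (repli=2)

    double wire_and_media = replicas * (net_10gbe_us + disk_write_us);
    std::printf("network + media time: ~%.0f us\n", wire_and_media);
    std::printf("left for Ceph code:   ~%.0f us\n", total_tat_us - wire_and_media);
    return 0;
}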


Files

TO-20130916-4k-TAT-in-Ceph-code.pdf (71.7 KB) - Dieter Kasper, 09/23/2013 01:23 AM
Actions #1

Updated by Sage Weil over 10 years ago

  • Assignee changed from Mark Nelson to Sage Weil
Actions #2

Updated by Sage Weil over 10 years ago

I picked a random write, client.4406.1:23924:

flab:6366 11:39 AM $ grep -h client.4406.1:23924 * | sort
2013-09-20 23:42:13.971342 7ff16a86f700  1 -- 10.10.37.14:6800/5901 <== client.4406 10.10.37.12:0/3018924743 6098 ==== osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20) ==== 139+0+4096 (843583466 0 2850370365) 0x5d5de80 con 0x4a88b00

~90us to get through the dispatch queue and handle_op. It would be interesting to drill in here by adding prints in OSD::_ms_dispatch and handle_op, and possibly setting debug tp = 20, to see if the threadpool context switching is partly to blame here.
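
For what it's worth, a generic sketch of the kind of per-stage timestamp print being suggested; it deliberately uses std::chrono and fprintf rather than Ceph's internal dout/utime_t machinery, so the names here (StageTimer, stage()) are illustrative only, not the actual OSD instrumentation.

// Generic per-op stage timer: print microseconds since the previous stage
// and since the op was first seen, at each point of interest (dispatch,
// handle_op, obc load, sub-op send).  Names are illustrative, not Ceph API.
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct StageTimer {
    const char *op_name;                     // e.g. "client.4406.1:23924"
    Clock::time_point start = Clock::now();
    Clock::time_point last  = start;

    void stage(const char *stage_name) {
        auto now = Clock::now();
        auto us = [](Clock::duration d) {
            return (long long)std::chrono::duration_cast<std::chrono::microseconds>(d).count();
        };
        std::fprintf(stderr, "%s %s +%lldus (total %lldus)\n",
                     op_name, stage_name, us(now - last), us(now - start));
        last = now;
    }
};

int main() {
    // Usage sketch: one timer per op, stage() calls at the interesting points.
    StageTimer t{"client.4406.1:23924"};
    t.stage("dequeue_op");
    t.stage("do_op_post_obc_load");
    t.stage("sub_op_sent");
    return 0;
}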

2013-09-20 23:42:13.971435 7ff165865700 10 osd.2 20 dequeue_op 0x4e78780 prio 127 cost 4096 latency 0.000155 osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20) pg pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean]
2013-09-20 23:42:13.971462 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] do_op osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20) may_write

Another 100us just for the stat and object_context load... we could drill into that by instrumenting the filestore code.

2013-09-20 23:42:13.971687 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] append_log log(0'0,20'381] [20'382 (20'381) modify   88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 by client.4406.1:23924 2013-09-20 23:42:13.970716]
2013-09-20 23:42:13.971705 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] add_log_entry 20'382 (20'381) modify   88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 by client.4406.1:23924 2013-09-20 23:42:13.970716
2013-09-20 23:42:13.971744 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] new_repop rep_tid 5992 on osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)
2013-09-20 23:42:13.971795 7ff165865700  1 -- 10.10.39.14:6800/5901 --> osd.1 10.10.39.13:6803/20097 -- osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4708 0x50e8a00

Hard to say where the rest of that time went... there is 10-20us between each print; it could just be the debug overhead. It might be interesting to limit the prints to specific ones (e.g., dequeue_op, and do_op post obc load) and see how it looks with just debug ms = 1.

I'll stop there for now... I suspect just understanding the turnaround between the osd_op receive and the osd_sub_op send is a good start.

2013-09-20 23:42:13.971811 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] eval_repop repgather(0x498bde0 20'382 rep_tid=5992 wfack=1,2 wfdisk=1,2 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) wants=d
2013-09-20 23:42:13.971826 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] apply_repop  applying update on repgather(0x498bde0 20'382 rep_tid=5992 wfack=1,2 wfdisk=1,2 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20))
2013-09-20 23:42:13.971976 7f77e7ce8700  1 -- 10.10.39.13:6803/20097 <== osd.2 10.10.39.14:6800/5901 5523 ==== osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7 ==== 1139+0+4708 (1919974487 0 219093564) 0x6240000 con 0x551bde0
2013-09-20 23:42:13.972048 7f77e7ce8700 10 osd.1 20 handle_sub_op osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7 epoch 20
2013-09-20 23:42:13.972106 7f77e3ce0700 10 osd.1 20 dequeue_op 0x6d44560 prio 127 cost 4708 latency 0.000230 osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7 pg pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'380 active]
2013-09-20 23:42:13.972169 7f77e3ce0700 10 osd.1 pg_epoch: 20 pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'380 active] append_log log(0'0,20'381] [20'382 (20'381) modify   88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 by client.4406.1:23924 2013-09-20 23:42:13.970716]
2013-09-20 23:42:13.972191 7f77e3ce0700 10 osd.1 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'380 active] add_log_entry 20'382 (20'381) modify   88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 by client.4406.1:23924 2013-09-20 23:42:13.970716
2013-09-20 23:42:13.972194 7ff16e076700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] op_commit repgather(0x498bde0 applying 20'382 rep_tid=5992 wfack=1,2 wfdisk=1,2 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20))
2013-09-20 23:42:13.972244 7ff16e076700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] eval_repop repgather(0x498bde0 applying 20'382 rep_tid=5992 wfack=1 wfdisk=1 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) wants=d
2013-09-20 23:42:13.972289 7ff16c072700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] op_applied repgather(0x498bde0 applying 20'382 rep_tid=5992 wfack=1 wfdisk=1 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20))
2013-09-20 23:42:13.972354 7ff16c072700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] eval_repop repgather(0x498bde0 applied 20'382 rep_tid=5992 wfack=1 wfdisk=1 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) wants=d
2013-09-20 23:42:13.972584 7f77ec4f1700 10 osd.1 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'380 active] sub_op_modify_commit on op osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7, sending commit to osd.2
2013-09-20 23:42:13.972606 7f77ec4f1700  1 -- 10.10.39.13:6803/20097 --> osd.2 10.10.39.14:6800/5901 -- osd_sub_op_reply(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] ondisk, result = 0) v1 -- ?+0 0x5cbeb80
2013-09-20 23:42:13.972719 7f77ea4ed700 10 osd.1 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'381 active] sub_op_modify_applied on 0x6ad23a0 op osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7
2013-09-20 23:42:13.972837 7ff16986d700  1 -- 10.10.39.14:6800/5901 <== osd.1 10.10.39.13:6803/20097 5536 ==== osd_sub_op_reply(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2133842383 0 0) 0x5d3d000 con 0x2bdbf40
2013-09-20 23:42:13.972956 7ff166066700 10 osd.2 20 dequeue_op 0x50f3840 prio 196 cost 0 latency 0.000199 osd_sub_op_reply(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] ondisk, result = 0) v1 pg pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean]
2013-09-20 23:42:13.972998 7ff166066700  7 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] repop_ack rep_tid 5992 op osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20) result 0 ack_type 4 from osd.1
2013-09-20 23:42:13.973020 7ff166066700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] eval_repop repgather(0x498bde0 applied 20'382 rep_tid=5992 wfack= wfdisk= op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) wants=d
2013-09-20 23:42:13.973038 7ff166066700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean]  sending commit on repgather(0x498bde0 applied 20'382 rep_tid=5992 wfack= wfdisk= op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) 0x5e02600
2013-09-20 23:42:13.973071 7ff166066700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'381 active+clean]  removing repgather(0x498bde0 applied 20'382 rep_tid=5992 wfack= wfdisk= op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20))
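
As a rough aid (not part of the original comment) for measuring that receive-to-send turnaround, below is a small filter one could pipe the "grep -h client.4406.1:23924 * | sort" output through to print the microsecond gap between consecutive log lines. It assumes all lines fall on the same day and that the clocks on the two OSD hosts are close enough for cross-host gaps to be meaningful.

// line_deltas.cc: read sorted OSD log lines on stdin and print the gap (in
// microseconds) between each line and the previous one, plus a prefix of the
// line itself.  Only the HH:MM:SS.ssssss field is parsed; the date is skipped.
#include <cstdio>
#include <iostream>
#include <string>

int main() {
    std::string line;
    double prev = -1.0;
    while (std::getline(std::cin, line)) {
        int h = 0, m = 0;
        double s = 0.0;
        // Lines look like "2013-09-20 23:42:13.971342 7ff16a86f700 ...".
        if (std::sscanf(line.c_str(), "%*s %d:%d:%lf", &h, &m, &s) != 3)
            continue;
        double t = h * 3600.0 + m * 60.0 + s;
        long delta_us = (prev < 0) ? 0 : (long)((t - prev) * 1e6 + 0.5);
        std::printf("%8ld us  %.80s\n", delta_us, line.c_str());
        prev = t;
    }
    return 0;
}

Usage: grep -h client.4406.1:23924 * | sort | ./line_deltas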

Actions #3

Updated by Sage Weil over 10 years ago

  • Priority changed from High to Normal
Actions #4

Updated by Sage Weil about 10 years ago

  • Assignee deleted (Sage Weil)
Actions #5

Updated by Loïc Dachary almost 9 years ago

  • Status changed from New to Rejected
  • Assignee set to Mark Nelson
  • Regression set to No

So much has changed performance-wise in the past year that this probably no longer applies. Assigning to Mark so that he can re-open if I'm mistaken.

Actions #6

Updated by Loïc Dachary almost 9 years ago

  • Status changed from Rejected to New

In 2013 the TAT for a single 4k IO was approx. 1600 microseconds (client & OSDs).
In 2014, with our private timestamps, we measured ~1400 microseconds for a single 4k IO within the OSDs.
In 2015, with LTTng and v0.91, we are still around 1400 microseconds on average.

Actions #7

Updated by Greg Farnum almost 9 years ago

  • Status changed from New to Closed

"Latency is too high" isn't a very helpful bug. There are various things we can do about this; some of them are being done and some of them are huge architectural changes we eventually need to go through. Those have or will get their own tickets.

I guess somebody could set up an overall performance improvement tracking bug if they want, but I'm not sure it would add much value.
