Bug #6366

closed

long TAT - due to too long residence time in Ceph code

Added by Dieter Kasper over 10 years ago. Updated almost 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In a default environment (repli=2), a 4k IO takes approx. 2000 microseconds TAT (turn-around time) from the start of the IO to its completion (e.g. using /dev/rbdX).
The time needed to transfer the 4k block once over the network is ~100 microseconds (10GbE) or ~20 microseconds (56 Gb IPoIB).
The time to write a 4k block to disk is ~60 microseconds (PCIe SSD).

So, approx. 1600 microseconds of a single IO are spent in the Ceph code, which is way too much!

test (e.g.) with
fio --filename=/dev/rbd1 --direct=1 --rw=write --bs=4k --size=10G --numjobs=1 --runtime=60 --group_reporting --name=file1 --output=fio_write_4k_1.numjob=1.debug_1_10_10
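
For reference, a rough restatement of the budget above as a small C++ sketch. The split of the non-Ceph time into two network transfers plus two disk writes (one per replica for repli=2) is an assumption made here for illustration; the ticket itself only gives the per-hop numbers and the ~2000 us total, and the result (~1680 us) is consistent with the ~1600 us quoted above.

// Rough latency budget for one replicated 4k write, using the numbers from
// the description.  The "replicas * (network + disk)" split is an assumption
// for illustration only, not something stated in the ticket.
#include <cstdio>

int main() {
    const double total_tat_us  = 2000.0; // measured 4k write TAT via /dev/rbdX
    const double net_10gbe_us  = 100.0;  // one 4k transfer over 10GbE
    const double disk_write_us = 60.0;   // one 4k write on a PCIe SSD
    const int    replicas      = 2;      // default pool size (repli=2)

    double wire_and_media = replicas * (net_10gbe_us + disk_write_us);
    std::printf("network + media time: ~%.0f us\n", wire_and_media);
    std::printf("left for Ceph code:   ~%.0f us\n", total_tat_us - wire_and_media);
    return 0;
}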


Files

TO-20130916-4k-TAT-in-Ceph-code.pdf (71.7 KB) - Dieter Kasper, 09/23/2013 01:23 AM
Actions #1

Updated by Sage Weil over 10 years ago

  • Assignee changed from Mark Nelson to Sage Weil
Actions #2

Updated by Sage Weil over 10 years ago

I picked a random write, client.4406.1:23924:

flab:6366 11:39 AM $ grep -h client.4406.1:23924 * | sort
2013-09-20 23:42:13.971342 7ff16a86f700  1 -- 10.10.37.14:6800/5901 <== client.4406 10.10.37.12:0/3018924743 6098 ==== osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20) ==== 139+0+4096 (843583466 0 2850370365) 0x5d5de80 con 0x4a88b00

~90us to get through the dispatch queue and handle_op. It would be interesting to drill in here by adding prints in OSD::_ms_dispatch and handle_op, and possibly setting debug tp = 20, to see if the threadpool context switching is partly to blame here.
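
For what it's worth, a generic sketch of the kind of per-stage timestamp print being suggested; it deliberately uses std::chrono and fprintf rather than Ceph's internal dout/utime_t machinery, so the names here (StageTimer, stage()) are illustrative only, not the actual OSD instrumentation.

// Generic per-op stage timer: print microseconds since the previous stage
// and since the op was first seen, at each point of interest (dispatch,
// handle_op, obc load, sub-op send).  Names are illustrative, not Ceph API.
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct StageTimer {
    const char *op_name;                     // e.g. "client.4406.1:23924"
    Clock::time_point start = Clock::now();
    Clock::time_point last  = start;

    void stage(const char *stage_name) {
        auto now = Clock::now();
        auto us = [](Clock::duration d) {
            return (long long)std::chrono::duration_cast<std::chrono::microseconds>(d).count();
        };
        std::fprintf(stderr, "%s %s +%lldus (total %lldus)\n",
                     op_name, stage_name, us(now - last), us(now - start));
        last = now;
    }
};

int main() {
    // Usage sketch: one timer per op, stage() calls at the interesting points.
    StageTimer t{"client.4406.1:23924"};
    t.stage("dequeue_op");
    t.stage("do_op_post_obc_load");
    t.stage("sub_op_sent");
    return 0;
}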

2013-09-20 23:42:13.971435 7ff165865700 10 osd.2 20 dequeue_op 0x4e78780 prio 127 cost 4096 latency 0.000155 osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20) pg pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean]
2013-09-20 23:42:13.971462 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] do_op osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20) may_write

Another 100us just for the stat and object_context load... we could drill into that by instrumenting the filestore code.

2013-09-20 23:42:13.971687 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] append_log log(0'0,20'381] [20'382 (20'381) modify   88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 by client.4406.1:23924 2013-09-20 23:42:13.970716]
2013-09-20 23:42:13.971705 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] add_log_entry 20'382 (20'381) modify   88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 by client.4406.1:23924 2013-09-20 23:42:13.970716
2013-09-20 23:42:13.971744 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] new_repop rep_tid 5992 on osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)
2013-09-20 23:42:13.971795 7ff165865700  1 -- 10.10.39.14:6800/5901 --> osd.1 10.10.39.13:6803/20097 -- osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4708 0x50e8a00

Hard to say where the rest of that time went... there is 10-20us between each print; it could just be the debug overhead. It might be interesting to limit the prints to specific ones (e.g., dequeue_op, and do_op post obc load) and see how it looks with just debug ms = 1.

I'll stop there for now... I suspect just understanding the turnaround between the osd_op receive and the osd_sub_op send is a good start.

2013-09-20 23:42:13.971811 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] eval_repop repgather(0x498bde0 20'382 rep_tid=5992 wfack=1,2 wfdisk=1,2 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) wants=d
2013-09-20 23:42:13.971826 7ff165865700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] apply_repop  applying update on repgather(0x498bde0 20'382 rep_tid=5992 wfack=1,2 wfdisk=1,2 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20))
2013-09-20 23:42:13.971976 7f77e7ce8700  1 -- 10.10.39.13:6803/20097 <== osd.2 10.10.39.14:6800/5901 5523 ==== osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7 ==== 1139+0+4708 (1919974487 0 219093564) 0x6240000 con 0x551bde0
2013-09-20 23:42:13.972048 7f77e7ce8700 10 osd.1 20 handle_sub_op osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7 epoch 20
2013-09-20 23:42:13.972106 7f77e3ce0700 10 osd.1 20 dequeue_op 0x6d44560 prio 127 cost 4708 latency 0.000230 osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7 pg pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'380 active]
2013-09-20 23:42:13.972169 7f77e3ce0700 10 osd.1 pg_epoch: 20 pg[2.e5( v 20'381 (0'0,20'381] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'380 active] append_log log(0'0,20'381] [20'382 (20'381) modify   88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 by client.4406.1:23924 2013-09-20 23:42:13.970716]
2013-09-20 23:42:13.972191 7f77e3ce0700 10 osd.1 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'380 active] add_log_entry 20'382 (20'381) modify   88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 by client.4406.1:23924 2013-09-20 23:42:13.970716
2013-09-20 23:42:13.972194 7ff16e076700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 luod=20'381 lcod 20'381 mlcod 20'380 active+clean] op_commit repgather(0x498bde0 applying 20'382 rep_tid=5992 wfack=1,2 wfdisk=1,2 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20))
2013-09-20 23:42:13.972244 7ff16e076700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] eval_repop repgather(0x498bde0 applying 20'382 rep_tid=5992 wfack=1 wfdisk=1 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) wants=d
2013-09-20 23:42:13.972289 7ff16c072700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] op_applied repgather(0x498bde0 applying 20'382 rep_tid=5992 wfack=1 wfdisk=1 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20))
2013-09-20 23:42:13.972354 7ff16c072700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] eval_repop repgather(0x498bde0 applied 20'382 rep_tid=5992 wfack=1 wfdisk=1 op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) wants=d
2013-09-20 23:42:13.972584 7f77ec4f1700 10 osd.1 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'380 active] sub_op_modify_commit on op osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7, sending commit to osd.2
2013-09-20 23:42:13.972606 7f77ec4f1700  1 -- 10.10.39.13:6803/20097 --> osd.2 10.10.39.14:6800/5901 -- osd_sub_op_reply(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] ondisk, result = 0) v1 -- ?+0 0x5cbeb80
2013-09-20 23:42:13.972719 7f77ea4ed700 10 osd.1 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=1 lpr=17 pi=3-16/5 luod=0'0 lcod 20'381 active] sub_op_modify_applied on 0x6ad23a0 op osd_sub_op(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] v 20'382 snapset=0=[]:[] snapc=0=[]) v7
2013-09-20 23:42:13.972837 7ff16986d700  1 -- 10.10.39.14:6800/5901 <== osd.1 10.10.39.13:6803/20097 5536 ==== osd_sub_op_reply(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2133842383 0 0) 0x5d3d000 con 0x2bdbf40
2013-09-20 23:42:13.972956 7ff166066700 10 osd.2 20 dequeue_op 0x50f3840 prio 196 cost 0 latency 0.000199 osd_sub_op_reply(client.4406.1:23924 2.e5 88965e5/rb.0.100c.2ae8944a.0000000005c1/head//2 [] ondisk, result = 0) v1 pg pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean]
2013-09-20 23:42:13.972998 7ff166066700  7 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] repop_ack rep_tid 5992 op osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20) result 0 ack_type 4 from osd.1
2013-09-20 23:42:13.973020 7ff166066700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean] eval_repop repgather(0x498bde0 applied 20'382 rep_tid=5992 wfack= wfdisk= op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) wants=d
2013-09-20 23:42:13.973038 7ff166066700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'380 active+clean]  sending commit on repgather(0x498bde0 applied 20'382 rep_tid=5992 wfack= wfdisk= op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20)) 0x5e02600
2013-09-20 23:42:13.973071 7ff166066700 10 osd.2 pg_epoch: 20 pg[2.e5( v 20'382 (0'0,20'382] local-les=18 n=9 ec=1 les/c 18/18 17/17/17) [2,1] r=0 lpr=17 mlcod 20'381 active+clean]  removing repgather(0x498bde0 applied 20'382 rep_tid=5992 wfack= wfdisk= op=osd_op(client.4406.1:23924 rb.0.100c.2ae8944a.0000000005c1 [write 53248~4096] 2.88965e5 e20))
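
As a rough aid (not part of the original comment) for measuring that receive-to-send turnaround, below is a small filter one could pipe the "grep -h client.4406.1:23924 * | sort" output through to print the microsecond gap between consecutive log lines. It assumes all lines fall on the same day and that the clocks on the two OSD hosts are close enough for cross-host gaps to be meaningful.

// line_deltas.cc: read sorted OSD log lines on stdin and print the gap (in
// microseconds) between each line and the previous one, plus a prefix of the
// line itself.  Only the HH:MM:SS.ssssss field is parsed; the date is skipped.
#include <cstdio>
#include <iostream>
#include <string>

int main() {
    std::string line;
    double prev = -1.0;
    while (std::getline(std::cin, line)) {
        int h = 0, m = 0;
        double s = 0.0;
        // Lines look like "2013-09-20 23:42:13.971342 7ff16a86f700 ...".
        if (std::sscanf(line.c_str(), "%*s %d:%d:%lf", &h, &m, &s) != 3)
            continue;
        double t = h * 3600.0 + m * 60.0 + s;
        long delta_us = (prev < 0) ? 0 : (long)((t - prev) * 1e6 + 0.5);
        std::printf("%8ld us  %.80s\n", delta_us, line.c_str());
        prev = t;
    }
    return 0;
}

Usage: grep -h client.4406.1:23924 * | sort | ./line_deltas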

Actions #3

Updated by Sage Weil over 10 years ago

  • Priority changed from High to Normal
Actions #4

Updated by Sage Weil about 10 years ago

  • Assignee deleted (Sage Weil)
Actions #5

Updated by Loïc Dachary almost 9 years ago

  • Status changed from New to Rejected
  • Assignee set to Mark Nelson
  • Regression set to No

So much has changed performance-wise in the past year that this probably no longer applies. Assigning to Mark so that he can re-open if I'm mistaken.

Actions #6

Updated by Loïc Dachary almost 9 years ago

  • Status changed from Rejected to New

In 2013 the TAT for a single 4k IO was approx. 1600 microseconds (client & OSDs).
In 2014, with our private timestamps, we measured ~1400 microseconds for a single 4k IO within the OSDs.
In 2015, with LTTng and v0.91, we are still around 1400 microseconds on average.

Actions #7

Updated by Greg Farnum almost 9 years ago

  • Status changed from New to Closed

"Latency is too high" isn't a very helpful bug. There are various things we can do about this; some of them are being done and some of them are huge architectural changes we eventually need to go through. Those have or will get their own tickets.

I guess somebody could set up an overall performance improvement tracking bug if they want, but I'm not sure it would add much value.
