Bug #37673
closedlatency between "initiated" and "queued_for_pg"
0%
Description
- 6 osd cluster
- separated cluster network on eth0, and public network on eth1.
- a rados client accessing from public network
- inject 2 seconds delay using tc to eth0. like
tc qdisc add dev eth0 root netem delay 2000ms
- put a 128 MB file to into 3 replica pool
slow requests are observed. quite a few of them had over 20 seconds' delay between "initiated" and "queued_for_pg"
{ "time": "2018-12-17 05:36:11.977528", "event": "initiated" }, { "time": "2018-12-17 05:36:47.991316", "event": "queued_for_pg" },
found on luminous. need to reproduce this on master.
Updated by Kefu Chai over 5 years ago
- Status changed from New to Won't Fix
after testing locally, i think this is expected behavior.
test settings:
- docker container osd-host.alpha, 172.17.0.3: hosting osd.0, osd.1, osd.2
- docker container osd-host.beta, 172.17.0.4: hosting osd.3, osd.4, osd.5
- docker container mon-host, 172.17.0.2: mon.a
test steps:
- on mon-host: ceph osd pool create test-pool 16 16
- on mon-host: dd if=/dev/zero of=/tmp/4MB count=1 bs=4MB
- on osd-host.alpha: tc qdisc add dev eth0 root netem delay 2000ms
- on mon-host: time bin/rados put -p test-pool test-obj3 /tmp/4MB
real 1m8.079s user 0m0.048s sys 0m0.022s
it took 68 seconds to copy a 4MB object
$ ceph osd map test-pool test-obj3 osdmap e55 pool 'test-pool' (1) object 'test-obj3' -> pg 1.6958e5c3 (1.3) -> up ([1,5,3], p1) acting ([1,5,3], p1)
on osd-host.alpha:
$ ceph daemon osd.1 dump_historic_slow_ops ... "description": "osd_op(client.14194.0:1 1.3 1:c3a71a96:::test-obj3:head [writefull 0~4000000] snapc 0=[] ondisk+write+known_if_redirected e55)", "initiated_at": "2018-12-19 17:12:21.041033", "age": 232.282686, "duration": 58.024624, ... "type_data": { "flag_point": "commit sent; apply or cleanup", "client_info": { "client": "client.14194", "client_addr": "172.17.0.2:36030/1920072373", "tid": 1 }, "events": [ { "time": "2018-12-19 17:12:21.041033", "event": "initiated" }, { "time": "2018-12-19 17:12:21.041033", "event": "header_read" }, { "time": "2018-12-19 17:12:21.041038", "event": "throttled" }, { "time": "2018-12-19 17:12:41.996248", "event": "all_read" }, { "time": "2018-12-19 17:12:41.996308", "event": "dispatched" }, { "time": "2018-12-19 17:12:41.996317", "event": "queued_for_pg" }, { "time": "2018-12-19 17:12:41.996392", "event": "reached_pg" }, { "time": "2018-12-19 17:12:41.996502", "event": "started" }, { "time": "2018-12-19 17:12:41.996617", "event": "waiting for subops from 3,5" }, { "time": "2018-12-19 17:12:42.277594", "event": "op_commit" }, { "time": "2018-12-19 17:13:13.621318", "event": "sub_op_commit_rec" }, { "time": "2018-12-19 17:13:19.065579", "event": "sub_op_commit_rec" }, { "time": "2018-12-19 17:13:19.065635", "event": "commit_sent" }, { "time": "2018-12-19 17:13:19.065656", "event": "done" }
on osd-host.alpha, iperf shows
time iperf -c 172.17.0.4 -n 4M -f KBytes ------------------------------------------------------------ Client connecting to 172.17.0.4, TCP port 5001 TCP window size: 128 KByte (default) ------------------------------------------------------------ [ 3] local 172.17.0.3 port 47388 connected with 172.17.0.4 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-16.8 sec 4096 KBytes 243 KBytes/sec real 0m18.878s user 0m0.014s sys 0m0.024s
on mon-host:
$ time iperf -c 172.17.0.3 -n 4M -f KBytes ------------------------------------------------------------ Client connecting to 172.17.0.3, TCP port 5001 TCP window size: 128 KByte (default) ------------------------------------------------------------ [ 3] local 172.17.0.2 port 57882 connected with 172.17.0.3 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-21.3 sec 4096 KBytes 192 KBytes/sec real 0m23.310s user 0m0.007s sys 0m0.033s
i am wondering somehow both ingress and egress traffic are delayed on osd-host.alpha. and the 4M request was segmented at IP layer into smaller packets, and the packets were queued in the queue in qdisc. that's why it took the primary osd 58 seconds to handle the write request. and from the client's point of view more than 2 * 2 seconds delay were added because of the first packet of the write request from it and ack message from the primary osd.