after testing locally, i think this is expected behavior.
test settings:
- docker container osd-host.alpha, 172.17.0.3: hosting osd.0, osd.1, osd.2
- docker container osd-host.beta, 172.17.0.4: hosting osd.3, osd.4, osd.5
- docker container mon-host, 172.17.0.2: mon.a
test steps:
- on mon-host: ceph osd pool create test-pool 16 16
- on mon-host: dd if=/dev/zero of=/tmp/4MB count=1 bs=4MB
- on osd-host.alpha: tc qdisc add dev eth0 root netem delay 2000ms
- on mon-host: time bin/rados put -p test-pool test-obj3 /tmp/4MB
real 1m8.079s
user 0m0.048s
sys 0m0.022s
it took 68 seconds to copy a 4MB object
$ ceph osd map test-pool test-obj3
osdmap e55 pool 'test-pool' (1) object 'test-obj3' -> pg 1.6958e5c3 (1.3) -> up ([1,5,3], p1) acting ([1,5,3], p1)
on osd-host.alpha:
$ ceph daemon osd.1 dump_historic_slow_ops
...
"description": "osd_op(client.14194.0:1 1.3 1:c3a71a96:::test-obj3:head [writefull 0~4000000] snapc 0=[] ondisk+write+known_if_redirected e55)",
"initiated_at": "2018-12-19 17:12:21.041033",
"age": 232.282686,
"duration": 58.024624,
...
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.14194",
"client_addr": "172.17.0.2:36030/1920072373",
"tid": 1
},
"events": [
{
"time": "2018-12-19 17:12:21.041033",
"event": "initiated"
},
{
"time": "2018-12-19 17:12:21.041033",
"event": "header_read"
},
{
"time": "2018-12-19 17:12:21.041038",
"event": "throttled"
},
{
"time": "2018-12-19 17:12:41.996248",
"event": "all_read"
},
{
"time": "2018-12-19 17:12:41.996308",
"event": "dispatched"
},
{
"time": "2018-12-19 17:12:41.996317",
"event": "queued_for_pg"
},
{
"time": "2018-12-19 17:12:41.996392",
"event": "reached_pg"
},
{
"time": "2018-12-19 17:12:41.996502",
"event": "started"
},
{
"time": "2018-12-19 17:12:41.996617",
"event": "waiting for subops from 3,5"
},
{
"time": "2018-12-19 17:12:42.277594",
"event": "op_commit"
},
{
"time": "2018-12-19 17:13:13.621318",
"event": "sub_op_commit_rec"
},
{
"time": "2018-12-19 17:13:19.065579",
"event": "sub_op_commit_rec"
},
{
"time": "2018-12-19 17:13:19.065635",
"event": "commit_sent"
},
{
"time": "2018-12-19 17:13:19.065656",
"event": "done"
}
on osd-host.alpha, iperf shows
time iperf -c 172.17.0.4 -n 4M -f KBytes
------------------------------------------------------------
Client connecting to 172.17.0.4, TCP port 5001
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 3] local 172.17.0.3 port 47388 connected with 172.17.0.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-16.8 sec 4096 KBytes 243 KBytes/sec
real 0m18.878s
user 0m0.014s
sys 0m0.024s
on mon-host:
$ time iperf -c 172.17.0.3 -n 4M -f KBytes
------------------------------------------------------------
Client connecting to 172.17.0.3, TCP port 5001
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 3] local 172.17.0.2 port 57882 connected with 172.17.0.3 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-21.3 sec 4096 KBytes 192 KBytes/sec
real 0m23.310s
user 0m0.007s
sys 0m0.033s
i am wondering somehow both ingress and egress traffic are delayed on osd-host.alpha. and the 4M request was segmented at IP layer into smaller packets, and the packets were queued in the queue in qdisc. that's why it took the primary osd 58 seconds to handle the write request. and from the client's point of view more than 2 * 2 seconds delay were added because of the first packet of the write request from it and ack message from the primary osd.