Actions
Bug #9910
closedceph_test_rados: out of order, probably due to message delay logic
% Done:
0%
Source:
Q/A
Tags:
Backport:
giant
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
/a/samuelj-2014-10-24_23:51:24-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/571220
Updated by Sage Weil over 9 years ago
- Status changed from New to 12
- Source changed from other to Q/A
another one: ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-10-27_02:32:02-rados-giant-distro-basic-multi/573501
2014-10-27T15:36:01.096 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: write initial oid 2964 2014-10-27T15:36:01.096 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: seq_num 2963 ranges {647379=706029,2041302=423587,2999079=704193,3847378=1} 2014-10-27T15:36:01.123 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: writing burnupi1748115-2964 from 647379 to 1353408 tid 1 2014-10-27T15:36:01.142 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: writing burnupi1748115-2964 from 2041302 to 2464889 tid 2 2014-10-27T15:36:01.167 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: writing burnupi1748115-2964 from 2999079 to 3703272 tid 3 2014-10-27T15:36:01.169 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: writing burnupi1748115-2964 from 3847378 to 3847379 tid 4 2014-10-27T15:36:02.824 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: finishing write tid 1 to burnupi1748115-2964 2014-10-27T15:36:02.825 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: finishing write tid 2 to burnupi1748115-2964 2014-10-27T15:36:02.825 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: finishing write tid 3 to burnupi1748115-2964 2014-10-27T15:36:02.825 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: finishing write tid 6 to burnupi1748115-2964 2014-10-27T15:36:02.828 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: finishing write tid 4 to burnupi1748115-2964
and msgr delays are enabled
Updated by Sage Weil over 9 years ago
- Status changed from 12 to In Progress
- Assignee set to Sage Weil
- Priority changed from Urgent to Immediate
reproducing with client ms logs
Updated by Sage Weil over 9 years ago
yeah, almost certain this is a bug with delayed messages. testing a fix.
ubuntu@teuthology:/a/sage-bug-9910-a/576723
Updated by Sage Weil over 9 years ago
2014-10-28 17:05:43.166634 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).reader got message 10 0x7f3c58016bb0 osd_op_reply(11420 plana3310955-2283 [write 1684227~621560] v40'2788 uv2788 ondisk = 0) v6 2014-10-28 17:05:43.166651 7f3c945f6700 1 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).queue_received will delay until 2014-10-28 17:05:43.821954 on 0x7f3c58016bb0 osd_op_reply(11420 plana3310955-2283 [write 1684227~621560] v40'2788 uv ... 2014-10-28 17:05:43.819902 7f3c945f6700 2 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).fault (0) Success 2014-10-28 17:05:43.819919 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).fault on lossy channel, failing 2014-10-28 17:05:43.819926 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).stop 2014-10-28 17:05:43.819936 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720). sleep for 0.002 2014-10-28 17:05:43.819990 7f3c943f4700 20 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).writer finishing 2014-10-28 17:05:43.820011 7f3c943f4700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).writer done 2014-10-28 17:05:43.822051 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58016bb0 for delivery, past 2014-10-28 17:05:43.821954 2014-10-28 17:05:43.822076 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58016bb0 prio 127 2014-10-28 17:05:43.822079 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).unregister_pipe 2014-10-28 17:05:43.822092 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c5802f8d0 for delivery, past 0.000000 2014-10-28 17:05:43.822102 7f3c945f6700 10 -- 10.214.131.7:0/1010955 dispatch_throttle_release 184 to dispatch throttler 9626/104857600 2014-10-28 17:05:43.822118 7f3c945f6700 20 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c945f5848DelayedDelivery::discard 2014-10-28 17:05:43.822127 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c5802f8d0 prio 127 2014-10-28 17:05:43.822138 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58030910 for delivery, past 0.000000 2014-10-28 17:05:43.822147 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58030910 prio 127 2014-10-28 17:05:43.822152 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58012720 for delivery, past 0.000000 2014-10-28 17:05:43.822159 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58012720 prio 127 2014-10-28 17:05:43.822164 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c580323c0 for delivery, past 0.000000 2014-10-28 17:05:43.822171 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c580323c0 prio 127 2014-10-28 17:05:43.822176 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58013400 for delivery, past 0.000000 2014-10-28 17:05:43.822183 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58013400 prio 127 2014-10-28 17:05:43.822188 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58014680 for delivery, past 0.000000 2014-10-28 17:05:43.822195 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58014680 prio 127 2014-10-28 17:05:43.822199 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58015780 for delivery, past 0.000000 2014-10-28 17:05:43.822182 7f3c97fff700 1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 11 ==== osd_op_reply(11437 plana3310955-2286 [write 3154508~631673] v40'2811 uv2811 ack = 0) v6 ==== 184+0+0 (423115415 0 0) 0x7f3c5802f8d0 con 0x7f3c84069720 2014-10-28 17:05:43.822206 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58015780 prio 127 2014-10-28 17:05:43.822211 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58051120 for delivery, past 0.000000 2014-10-28 17:05:43.822218 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58051120 prio 127 2014-10-28 17:05:43.822220 7f3c97fff700 10 -- 10.214.131.7:0/1010955 dispatch_throttle_release 184 to dispatch throttler 9442/104857600 2014-10-28 17:05:43.822223 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58037570 for delivery, past 0.000000 2014-10-28 17:05:43.822228 7f3c97fff700 20 -- 10.214.131.7:0/1010955 done calling dispatch on 0x7f3c5802f8d0 2014-10-28 17:05:43.822230 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58037570 prio 127 2014-10-28 17:05:43.822240 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58038720 for delivery, past 0.000000 2014-10-28 17:05:43.822248 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58038720 prio 127 2014-10-28 17:05:43.822253 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58034a10 for delivery, past 0.000000 2014-10-28 17:05:43.822248 7f3c97fff700 1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 12 ==== osd_op_reply(11421 plana3310955-2283 [write 2669911~1] v40'2789 uv2789 ack = 0) v6 ==== 184+0+0 (2864957661 0 0) 0x7f3c58030910 con 0x7f3c84069720 ... 2014-10-28 17:05:43.822923 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).discard_queue 2014-10-28 17:05:43.822955 7f3c945f6700 10 -- 10.214.131.7:0/1010955 queue_reap 0x7f3c84039a80 2014-10-28 17:05:43.822976 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).reader done
note that that message 0x7f3c58016bb0 is the one we miss above because it slipped in before the queue discard
2014-10-28 17:05:43.168645 7f3c97fff700 1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 5 ==== osd_op_reply(11419 plana3310955-2283 [write 669912~434503] v40'2787 uv2787 ack = 0) v6 ==== 184+0+0 (2623050850 0 0) 0x7f3c58029390 con 0x7f3c84069720 2014-10-28 17:05:43.168674 7f3c97fff700 1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 6 ==== osd_op_reply(11419 plana3310955-2283 [write 669912~434503] v40'2787 uv2787 ondisk = 0) v6 ==== 184+0+0 (297340449 0 0) 0x7f3c58027150 con 0x7f3c84069720 2014-10-28 17:05:43.168820 7f3c97fff700 1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 8 ==== osd_op_reply(11420 plana3310955-2283 [write 1684227~621560] v40'2788 uv2788 ack = 0) v6 ==== 184+0+0 (2961534089 0 0) 0x7f3c58044210 con 0x7f3c84069720 ..here.. 2014-10-28 17:05:43.822248 7f3c97fff700 1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 12 ==== osd_op_reply(11421 plana3310955-2283 [write 2669911~1] v40'2789 uv2789 ack = 0) v6 ==== 184+0+0 (2864957661 0 0) 0x7f3c58030910 con 0x7f3c84069720 2014-10-28 17:05:43.822286 7f3c97fff700 1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 13 ==== osd_op_reply(11421 plana3310955-2283 [write 2669911~1] v40'2789 uv2789 ondisk = 0) v6 ==== 184+0+0 (656560798 0 0) 0x7f3c58012720 con 0x7f3c84069720
Updated by Sage Weil over 9 years ago
- Status changed from 7 to Fix Under Review
- Priority changed from Immediate to Urgent
- Backport set to giant
Updated by Sage Weil over 9 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Loïc Dachary about 9 years ago
- giant backport https://github.com/ceph/ceph/pull/4029
Updated by Sage Weil about 9 years ago
- Status changed from Pending Backport to Resolved
Actions