Project

General

Profile

Actions

Bug #9910

closed

ceph_test_rados: out of order, probably due to message delay logic

Added by Samuel Just over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
giant
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/samuelj-2014-10-24_23:51:24-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/571220

Actions #1

Updated by Sage Weil over 9 years ago

  • Status changed from New to 12
  • Source changed from other to Q/A

another one: ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-10-27_02:32:02-rados-giant-distro-basic-multi/573501

2014-10-27T15:36:01.096 INFO:tasks.rados.rados.0.burnupi17.stdout:2964: write initial oid 2964
2014-10-27T15:36:01.096 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  seq_num 2963 ranges {647379=706029,2041302=423587,2999079=704193,3847378=1}
2014-10-27T15:36:01.123 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  writing burnupi1748115-2964 from 647379 to 1353408 tid 1
2014-10-27T15:36:01.142 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  writing burnupi1748115-2964 from 2041302 to 2464889 tid 2
2014-10-27T15:36:01.167 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  writing burnupi1748115-2964 from 2999079 to 3703272 tid 3
2014-10-27T15:36:01.169 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  writing burnupi1748115-2964 from 3847378 to 3847379 tid 4
2014-10-27T15:36:02.824 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  finishing write tid 1 to burnupi1748115-2964
2014-10-27T15:36:02.825 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  finishing write tid 2 to burnupi1748115-2964
2014-10-27T15:36:02.825 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  finishing write tid 3 to burnupi1748115-2964
2014-10-27T15:36:02.825 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  finishing write tid 6 to burnupi1748115-2964
2014-10-27T15:36:02.828 INFO:tasks.rados.rados.0.burnupi17.stdout:2964:  finishing write tid 4 to burnupi1748115-2964

and msgr delays are enabled

Actions #2

Updated by Sage Weil over 9 years ago

  • Status changed from 12 to In Progress
  • Assignee set to Sage Weil
  • Priority changed from Urgent to Immediate

reproducing with client ms logs

Actions #3

Updated by Sage Weil over 9 years ago

yeah, almost certain this is a bug with delayed messages. testing a fix.

ubuntu@teuthology:/a/sage-bug-9910-a/576723

Actions #4

Updated by Sage Weil over 9 years ago

  • Status changed from In Progress to 7

wip-9910

Actions #5

Updated by Sage Weil over 9 years ago

2014-10-28 17:05:43.166634 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).reader got message 10 0x7f3c58016bb0 osd_op_reply(11420 plana3310955-2283 [write 1684227~621560] v40'2788 uv2788 ondisk = 0) v6
2014-10-28 17:05:43.166651 7f3c945f6700  1 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).queue_received will delay until 2014-10-28 17:05:43.821954 on 0x7f3c58016bb0 osd_op_reply(11420 plana3310955-2283 [write 1684227~621560] v40'2788 uv
...
2014-10-28 17:05:43.819902 7f3c945f6700  2 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).fault (0) Success
2014-10-28 17:05:43.819919 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).fault on lossy channel, failing
2014-10-28 17:05:43.819926 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=2 pgs=65 cs=1 l=1 c=0x7f3c84069720).stop
2014-10-28 17:05:43.819936 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720). sleep for 0.002
2014-10-28 17:05:43.819990 7f3c943f4700 20 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).writer finishing
2014-10-28 17:05:43.820011 7f3c943f4700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).writer done
2014-10-28 17:05:43.822051 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58016bb0 for delivery, past 2014-10-28 17:05:43.821954
2014-10-28 17:05:43.822076 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58016bb0 prio 127
2014-10-28 17:05:43.822079 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).unregister_pipe
2014-10-28 17:05:43.822092 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c5802f8d0 for delivery, past 0.000000
2014-10-28 17:05:43.822102 7f3c945f6700 10 -- 10.214.131.7:0/1010955 dispatch_throttle_release 184 to dispatch throttler 9626/104857600
2014-10-28 17:05:43.822118 7f3c945f6700 20 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c945f5848DelayedDelivery::discard
2014-10-28 17:05:43.822127 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c5802f8d0 prio 127
2014-10-28 17:05:43.822138 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58030910 for delivery, past 0.000000
2014-10-28 17:05:43.822147 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58030910 prio 127
2014-10-28 17:05:43.822152 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58012720 for delivery, past 0.000000
2014-10-28 17:05:43.822159 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58012720 prio 127
2014-10-28 17:05:43.822164 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c580323c0 for delivery, past 0.000000
2014-10-28 17:05:43.822171 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c580323c0 prio 127
2014-10-28 17:05:43.822176 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58013400 for delivery, past 0.000000
2014-10-28 17:05:43.822183 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58013400 prio 127
2014-10-28 17:05:43.822188 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58014680 for delivery, past 0.000000
2014-10-28 17:05:43.822195 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58014680 prio 127
2014-10-28 17:05:43.822199 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58015780 for delivery, past 0.000000
2014-10-28 17:05:43.822182 7f3c97fff700  1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 11 ==== osd_op_reply(11437 plana3310955-2286 [write 3154508~631673] v40'2811 uv2811 ack = 0) v6 ==== 184+0+0 (423115415 0 0) 0x7f3c5802f8d0 con 0x7f3c84069720
2014-10-28 17:05:43.822206 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58015780 prio 127
2014-10-28 17:05:43.822211 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58051120 for delivery, past 0.000000
2014-10-28 17:05:43.822218 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58051120 prio 127
2014-10-28 17:05:43.822220 7f3c97fff700 10 -- 10.214.131.7:0/1010955 dispatch_throttle_release 184 to dispatch throttler 9442/104857600
2014-10-28 17:05:43.822223 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58037570 for delivery, past 0.000000
2014-10-28 17:05:43.822228 7f3c97fff700 20 -- 10.214.131.7:0/1010955 done calling dispatch on 0x7f3c5802f8d0
2014-10-28 17:05:43.822230 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58037570 prio 127
2014-10-28 17:05:43.822240 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58038720 for delivery, past 0.000000
2014-10-28 17:05:43.822248 7f3c7ffff700 20 -- 10.214.131.7:0/1010955 queue 0x7f3c58038720 prio 127
2014-10-28 17:05:43.822253 7f3c7ffff700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).0x7f3c7fffebf8DelayedDelivery::entry dequeuing message 0x7f3c58034a10 for delivery, past 0.000000
2014-10-28 17:05:43.822248 7f3c97fff700  1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 12 ==== osd_op_reply(11421 plana3310955-2283 [write 2669911~1] v40'2789 uv2789 ack = 0) v6 ==== 184+0+0 (2864957661 0 0) 0x7f3c58030910 con 0x7f3c84069720
...
2014-10-28 17:05:43.822923 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).discard_queue
2014-10-28 17:05:43.822955 7f3c945f6700 10 -- 10.214.131.7:0/1010955 queue_reap 0x7f3c84039a80
2014-10-28 17:05:43.822976 7f3c945f6700 10 -- 10.214.131.7:0/1010955 >> 10.214.136.134:6800/14388 pipe(0x7f3c84039a80 sd=9 :53209 s=4 pgs=65 cs=1 l=1 c=0x7f3c84069720).reader done

note that that message 0x7f3c58016bb0 is the one we miss above because it slipped in before the queue discard

2014-10-28 17:05:43.168645 7f3c97fff700  1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 5 ==== osd_op_reply(11419 plana3310955-2283 [write 669912~434503] v40'2787 uv2787 ack = 0) v6 ==== 184+0+0 (2623050850 0 0) 0x7f3c58029390 con 0x7f3c84069720
2014-10-28 17:05:43.168674 7f3c97fff700  1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 6 ==== osd_op_reply(11419 plana3310955-2283 [write 669912~434503] v40'2787 uv2787 ondisk = 0) v6 ==== 184+0+0 (297340449 0 0) 0x7f3c58027150 con 0x7f3c84069720
2014-10-28 17:05:43.168820 7f3c97fff700  1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 8 ==== osd_op_reply(11420 plana3310955-2283 [write 1684227~621560] v40'2788 uv2788 ack = 0) v6 ==== 184+0+0 (2961534089 0 0) 0x7f3c58044210 con 0x7f3c84069720
..here..
2014-10-28 17:05:43.822248 7f3c97fff700  1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 12 ==== osd_op_reply(11421 plana3310955-2283 [write 2669911~1] v40'2789 uv2789 ack = 0) v6 ==== 184+0+0 (2864957661 0 0) 0x7f3c58030910 con 0x7f3c84069720
2014-10-28 17:05:43.822286 7f3c97fff700  1 -- 10.214.131.7:0/1010955 <== osd.5 10.214.136.134:6800/14388 13 ==== osd_op_reply(11421 plana3310955-2283 [write 2669911~1] v40'2789 uv2789 ondisk = 0) v6 ==== 184+0+0 (656560798 0 0) 0x7f3c58012720 con 0x7f3c84069720

Actions #6

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Fix Under Review
  • Priority changed from Immediate to Urgent
  • Backport set to giant
Actions #7

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by Loïc Dachary about 9 years ago

  • Description updated (diff)
Actions #10

Updated by Sage Weil about 9 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF