Project

General

Profile

Bug #18739

Updated by Nathan Cutler over 7 years ago

Various test failures in jewel, all of which have this assert in common, and the assert takes place under high IO with lots of network disconnects. 

 Example: 

 <pre> 
 2017-01-30T21:09:27.873 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:+ rbd --cluster cluster2 -p mirror bench-write test_clone --io-size 4096 --io-threads 1 --io-total 409600 --io-pattern rand 
 2017-01-30T21:09:27.916 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stdout:bench-write io_size 4096 io_threads 1 bytes 409600 pattern random 
 2017-01-30T21:09:27.938 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stdout:    SEC OPS     OPS/SEC     BYTES/SEC 
 2017-01-30T21:09:27.938 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:27.940951 7f9988260700    0 -- 172.21.15.46:0/1437589183 >> 172.21.15.46:6800/30315 pipe(0x7f995c004e20 sd=9 :45073 s=2 pgs=143 cs=1 l=1 c=0x7f995c0060e0).injecting socket failure 
 2017-01-30T21:09:27.940 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:27.943152 7f9988260700    0 -- 172.21.15.46:0/1437589183 >> 172.21.15.46:6800/30315 pipe(0x7f9978009e40 sd=9 :45075 s=2 pgs=144 cs=1 l=1 c=0x7f997800b150).injecting socket failure 
 2017-01-30T21:09:27.952 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:27.955667 7f995bfff700    0 -- 172.21.15.46:0/1437589183 submit_message osd_op(client.4161.0:103 2.e812df66 journal_data.2.10af39ee015c.0 [call journal.guard_append,append 0~4159 [fadvise_dontneed]] snapc 0=[] ack+ondisk+write+known_if_redirected e18) v7 remote, 172.21.15.46:6800/30315, failed lossy con, dropping message 0x7f995008f790 
 2017-01-30T21:09:28.031 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.034939 7f99805f6700    0 -- 172.21.15.46:0/1437589183 >> 172.21.15.46:6800/30315 pipe(0x7f9978009e40 sd=9 :45079 s=2 pgs=146 cs=1 l=1 c=0x7f997800e940).injecting socket failure 
 2017-01-30T21:09:28.037 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.039628 7f9973fff700    0 -- 172.21.15.46:0/1437589183 submit_message osd_op(client.4161.0:203 2.47a3f1c8 rbd_data.10af39ee015c.000000000000001f [call rbd.copyup,set-alloc-hint object_size 4194304 write_size 4194304,write 778240~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e18) v7 remote, 172.21.15.46:6800/30315, failed lossy con, dropping message 0x7f996c01cd20 
 2017-01-30T21:09:28.037 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.039769 7f9973fff700    0 -- 172.21.15.46:0/1437589183 submit_message osd_op(client.4161.0:205 2.47a3f1c8 rbd_data.10af39ee015c.000000000000001f [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1110016~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e18) v7 remote, 172.21.15.46:6800/30315, failed lossy con, dropping message 0x7f996c07c880 
 2017-01-30T21:09:28.037 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.039908 7f9973fff700    0 -- 172.21.15.46:0/1437589183 submit_message osd_op(client.4161.0:207 2.a5573408 rbd_data.10af39ee015c.000000000000000c [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3215360~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e18) v7 remote, 172.21.15.46:6800/30315, failed lossy con, dropping message 0x7f996c044590 
 2017-01-30T21:09:28.087 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.090139 7f9988361700    0 -- 172.21.15.46:0/1437589183 >> 172.21.15.46:6800/30315 pipe(0x7f997802cde0 sd=11 :45082 s=2 pgs=148 cs=1 l=1 c=0x7f997800e940).injecting socket failure 
 2017-01-30T21:09:28.141 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr:2017-01-30 21:09:28.144429 7fcef0ead700 -1 osd.0 pg_epoch: 18 pg[2.8( v 18'16 (0'0,18'16] local-les=7 n=3 ec=6 les/c/f 7/7/0 6/6/6) [0,2] r=0 lpr=6 luod=18'15 lua=18'15 crt=13'13 lcod 18'14 mlcod 18'14 active+clean] bad op order, already applied 214 > this 207 
 2017-01-30T21:09:28.142 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr:osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7fcef0ead700 time 2017-01-30 21:09:28.144437 
 2017-01-30T21:09:28.143 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr:osd/ReplicatedPG.cc: 3030: FAILED assert(0 == "out of order op") 
 2017-01-30T21:09:28.156 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: ceph version 10.2.5-6034-gb671230 (b671230f7f70b620905eb02c6dbd93d051b53fb7) 
 2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fcf1190024b] 
 2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1a25) [0x7fcf11422eb5] 
 2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 3: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x2709) [0x7fcf11425b09] 
 2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 4: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x70c) [0x7fcf113e2dcc] 
 2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x7fcf112a0a75] 
 2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 6: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x7fcf112a0c9d] 
 2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x7fcf112a5659] 
 2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885) [0x7fcf118f05b5] 
 2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fcf118f24f0] 
 2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 10: (()+0x8184) [0x7fcf0fdc7184] 
 2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 11: (clone()+0x6d) [0x7fcf0def337d] 
 </pre> 

 Example: http://pulpito.front.sepia.ceph.com:80/smithfarm-2017-01-30_19:52:39-rbd-wip-jewel-backports-distro-basic-smithi/ 

 This has been fixed in kraken, but deemed quite complicated to fix in jewel.

Back