Bug #18739 (closed)

FAILED assert(0 == "out of order op") in jewel integration testing

Added by Nathan Cutler about 7 years ago. Updated almost 7 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Various test failures in jewel, all of which have this assert in common. The assert fires under high IO with many network disconnects.

Example:

2017-01-30T21:09:27.873 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:+ rbd --cluster cluster2 -p mirror bench-write test_clone --io-size 4096 --io-threads 1 --io-total 409600 --io-pattern rand
2017-01-30T21:09:27.916 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stdout:bench-write io_size 4096 io_threads 1 bytes 409600 pattern random
2017-01-30T21:09:27.938 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stdout:  SEC OPS   OPS/SEC   BYTES/SEC
2017-01-30T21:09:27.938 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:27.940951 7f9988260700  0 -- 172.21.15.46:0/1437589183 >> 172.21.15.46:6800/30315 pipe(0x7f995c004e20 sd=9 :45073 s=2 pgs=143 cs=1 l=1 c=0x7f995c0060e0).injecting socket failure
2017-01-30T21:09:27.940 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:27.943152 7f9988260700  0 -- 172.21.15.46:0/1437589183 >> 172.21.15.46:6800/30315 pipe(0x7f9978009e40 sd=9 :45075 s=2 pgs=144 cs=1 l=1 c=0x7f997800b150).injecting socket failure
2017-01-30T21:09:27.952 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:27.955667 7f995bfff700  0 -- 172.21.15.46:0/1437589183 submit_message osd_op(client.4161.0:103 2.e812df66 journal_data.2.10af39ee015c.0 [call journal.guard_append,append 0~4159 [fadvise_dontneed]] snapc 0=[] ack+ondisk+write+known_if_redirected e18) v7 remote, 172.21.15.46:6800/30315, failed lossy con, dropping message 0x7f995008f790
2017-01-30T21:09:28.031 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.034939 7f99805f6700  0 -- 172.21.15.46:0/1437589183 >> 172.21.15.46:6800/30315 pipe(0x7f9978009e40 sd=9 :45079 s=2 pgs=146 cs=1 l=1 c=0x7f997800e940).injecting socket failure
2017-01-30T21:09:28.037 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.039628 7f9973fff700  0 -- 172.21.15.46:0/1437589183 submit_message osd_op(client.4161.0:203 2.47a3f1c8 rbd_data.10af39ee015c.000000000000001f [call rbd.copyup,set-alloc-hint object_size 4194304 write_size 4194304,write 778240~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e18) v7 remote, 172.21.15.46:6800/30315, failed lossy con, dropping message 0x7f996c01cd20
2017-01-30T21:09:28.037 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.039769 7f9973fff700  0 -- 172.21.15.46:0/1437589183 submit_message osd_op(client.4161.0:205 2.47a3f1c8 rbd_data.10af39ee015c.000000000000001f [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1110016~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e18) v7 remote, 172.21.15.46:6800/30315, failed lossy con, dropping message 0x7f996c07c880
2017-01-30T21:09:28.037 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.039908 7f9973fff700  0 -- 172.21.15.46:0/1437589183 submit_message osd_op(client.4161.0:207 2.a5573408 rbd_data.10af39ee015c.000000000000000c [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3215360~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e18) v7 remote, 172.21.15.46:6800/30315, failed lossy con, dropping message 0x7f996c044590
2017-01-30T21:09:28.087 INFO:tasks.workunit.cluster1.client.mirror.smithi046.stderr:2017-01-30 21:09:28.090139 7f9988361700  0 -- 172.21.15.46:0/1437589183 >> 172.21.15.46:6800/30315 pipe(0x7f997802cde0 sd=11 :45082 s=2 pgs=148 cs=1 l=1 c=0x7f997800e940).injecting socket failure
2017-01-30T21:09:28.141 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr:2017-01-30 21:09:28.144429 7fcef0ead700 -1 osd.0 pg_epoch: 18 pg[2.8( v 18'16 (0'0,18'16] local-les=7 n=3 ec=6 les/c/f 7/7/0 6/6/6) [0,2] r=0 lpr=6 luod=18'15 lua=18'15 crt=13'13 lcod 18'14 mlcod 18'14 active+clean] bad op order, already applied 214 > this 207
2017-01-30T21:09:28.142 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr:osd/ReplicatedPG.cc: In function 'void ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)' thread 7fcef0ead700 time 2017-01-30 21:09:28.144437
2017-01-30T21:09:28.143 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr:osd/ReplicatedPG.cc: 3030: FAILED assert(0 == "out of order op")
2017-01-30T21:09:28.156 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: ceph version 10.2.5-6034-gb671230 (b671230f7f70b620905eb02c6dbd93d051b53fb7)
2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fcf1190024b]
2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 2: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1a25) [0x7fcf11422eb5]
2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 3: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x2709) [0x7fcf11425b09]
2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 4: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x70c) [0x7fcf113e2dcc]
2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x7fcf112a0a75]
2017-01-30T21:09:28.157 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 6: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x7fcf112a0c9d]
2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x7fcf112a5659]
2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885) [0x7fcf118f05b5]
2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fcf118f24f0]
2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 10: (()+0x8184) [0x7fcf0fdc7184]
2017-01-30T21:09:28.158 INFO:tasks.ceph.cluster2.osd.0.smithi046.stderr: 11: (clone()+0x6d) [0x7fcf0def337d]

Example: http://pulpito.front.sepia.ceph.com:80/smithfarm-2017-01-30_19:52:39-rbd-wip-jewel-backports-distro-basic-smithi/

This has been fixed in kraken (by recording failed ops in the PG log), but it was deemed too complicated to fix in jewel.

#1

Updated by Nathan Cutler about 7 years ago

  • Description updated (diff)
#2

Updated by Nathan Cutler about 7 years ago

  • Description updated (diff)
#3

Updated by Nathan Cutler almost 7 years ago

  • Description updated (diff)
#4

Updated by Greg Farnum almost 7 years ago

  • Status changed from 12 to Won't Fix
  • Priority changed from Low to Normal

Apparently this was a result of not recording failed ops in the PGLog. We are unlikely to fix that for Jewel — it's hard and I don't think it's ever been seen outside our testing.
