Bug #8977
osd: didn't discard sub_op_reply from previous interval?
Status: Closed
Description
/a/teuthology-2014-07-29_02:30:02-rados-firefly-distro-basic-plana/384397
An op gets stuck in limbo because we are a stray:
2014-07-30 10:24:32.449532 7f1877677700 1 -- 10.214.131.24:6811/12238 <== osd.4 10.214.131.24:6806/12231 5945 ==== osd_sub_op_reply(client.4167.0:8943 4.0 b2ec8328/plana1527198-2131/head//4 [] ondisk, result = 0) v2 ==== 155+0+0 (1552322249 0 0) 0x420dc80 con 0x2cdac60
...
2014-07-30 10:24:32.709313 7f187166b700 20 osd.5 pg_epoch: 60 pg[4.0( v 59'2048 (0'0,59'2048] local-les=31 n=490 ec=11 les/c 31/47 60/60/60) [3,4] r=-1 lpr=60 pi=30-59/1 crt=28'602 lcod 59'2042 inactive] remove_repop repgather(0x2ed3780 59'2044 rep_tid=4725 committed?=0 applied?=0 lock=0 op=osd_op(client.4167.0:8943 plana1527198-2131 [write 1554223~1] 4.b2ec8328 ack+ondisk+write e59) v4)
...
2014-07-30 10:24:32.712716 7f187166b700 10 osd.5 60 dequeue_op 0x47c9f00 prio 196 cost 0 latency 0.263264 osd_sub_op_reply(client.4167.0:8943 4.0 b2ec8328/plana1527198-2131/head//4 [] ondisk, result = 0) v2 pg pg[4.0( v 59'2048 (0'0,59'2048] local-les=31 n=490 ec=11 les/c 31/47 60/60/60) [3,4] r=-1 lpr=60 pi=30-59/1 crt=28'602 lcod 59'2042 inactive NOTIFY]
Note that we are a stray; we should have discarded this op since it is prior to our lpr=60. Or is it?
from osd.4:
2014-07-30 10:24:32.449141 7fc52a3f0700 10 osd.4 pg_epoch: 59 pg[4.0( v 59'2048 (0'0,59'2048] local-les=31 n=490 ec=11 les/c 31/47 30/30/30) [5,4] r=1 lpr=30 pi=11-29/1 luod=0'0 crt=0'0 lcod 59'2042 active] sub_op_modify_commit on op osd_sub_op(client.4167.0:8943 4.0 b2ec8328/plana1527198-2131/head//4 [] v 59'2044 snapset=0=[]:[] snapc=0=[]) v10, sending commit to osd.5
2014-07-30 10:24:32.449171 7fc52a3f0700 20 osd.4 60 _share_map_outgoing 0x1fb06e0 already has epoch 60
2014-07-30 10:24:32.449175 7fc52a3f0700 1 -- 10.214.131.24:6806/12231 --> osd.5 10.214.131.24:6811/12238 -- osd_sub_op_reply(client.4167.0:8943 4.0 b2ec8328/plana1527198-2131/head//4 [] ondisk, result = 0) v2 -- ?+0 0x5430500
The pg has 59, the osd has 60. But the pg's epoch is what we should be labeling the op with... right?
I was looking at the master branch code; maybe this has been changed/fixed since firefly...
Updated by Samuel Just over 9 years ago
2014-07-30 10:40:58.317063 7fc2164da700 0 log [WRN] : slow request 960.196157 seconds old, received at 2014-07-30 10:24:58.120716: MOSDPGPush(4.0 70 [PushOp(d94b8988/plana1527198-17/head//4, version: 18'8, data_included: [0~178143], data_size: 178143, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(d94b8988/plana1527198-17/head//4@18'8, copy_subset: [0~178143], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:178143, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 currently no flag points reached
I think that message got stuck in the queue. Still looking.
Updated by Samuel Just over 9 years ago
Added some debugging to dump the OpWQ queue information if there are stale ops, running in loop.
Updated by Samuel Just over 9 years ago
- Status changed from New to Can't reproduce