Bug #8977
Status: Closed
osd: didn't discard sub_op_reply from previous interval?
Description
/a/teuthology-2014-07-29_02:30:02-rados-firefly-distro-basic-plana/384397
An op gets stuck in limbo because we are a stray:
2014-07-30 10:24:32.449532 7f1877677700 1 -- 10.214.131.24:6811/12238 <== osd.4 10.214.131.24:6806/12231 5945 ==== osd_sub_op_reply(client.4167.0:8943 4.0 b2ec8328/plana1527198-2131/head//4 [] ondisk, result = 0) v2 ==== 155+0+0 (1552322249 0 0) 0x420dc80 con 0x2cdac60
...
2014-07-30 10:24:32.709313 7f187166b700 20 osd.5 pg_epoch: 60 pg[4.0( v 59'2048 (0'0,59'2048] local-les=31 n=490 ec=11 les/c 31/47 60/60/60) [3,4] r=-1 lpr=60 pi=30-59/1 crt=28'602 lcod 59'2042 inactive] remove_repop repgather(0x2ed3780 59'2044 rep_tid=4725 committed?=0 applied?=0 lock=0 op=osd_op(client.4167.0:8943 plana1527198-2131 [write 1554223~1] 4.b2ec8328 ack+ondisk+write e59) v4)
...
2014-07-30 10:24:32.712716 7f187166b700 10 osd.5 60 dequeue_op 0x47c9f00 prio 196 cost 0 latency 0.263264 osd_sub_op_reply(client.4167.0:8943 4.0 b2ec8328/plana1527198-2131/head//4 [] ondisk, result = 0) v2 pg pg[4.0( v 59'2048 (0'0,59'2048] local-les=31 n=490 ec=11 les/c 31/47 60/60/60) [3,4] r=-1 lpr=60 pi=30-59/1 crt=28'602 lcod 59'2042 inactive NOTIFY]
Note that we are a stray; we should have discarded this op since it is prior to our lpr=60. Or should we have?
from osd.4:
2014-07-30 10:24:32.449141 7fc52a3f0700 10 osd.4 pg_epoch: 59 pg[4.0( v 59'2048 (0'0,59'2048] local-les=31 n=490 ec=11 les/c 31/47 30/30/30) [5,4] r=1 lpr=30 pi=11-29/1 luod=0'0 crt=0'0 lcod 59'2042 active] sub_op_modify_commit on op osd_sub_op(client.4167.0:8943 4.0 b2ec8328/plana1527198-2131/head//4 [] v 59'2044 snapset=0=[]:[] snapc=0=[]) v10, sending commit to osd.5
2014-07-30 10:24:32.449171 7fc52a3f0700 20 osd.4 60 _share_map_outgoing 0x1fb06e0 already has epoch 60
2014-07-30 10:24:32.449175 7fc52a3f0700 1 -- 10.214.131.24:6806/12231 --> osd.5 10.214.131.24:6811/12238 -- osd_sub_op_reply(client.4167.0:8943 4.0 b2ec8328/plana1527198-2131/head//4 [] ondisk, result = 0) v2 -- ?+0 0x5430500
The PG has epoch 59, the OSD has epoch 60. But the PG's epoch is what we should be labeling the op with... right?
I was looking at the master branch code; maybe this has been changed/fixed since firefly...
Updated by Samuel Just almost 10 years ago
2014-07-30 10:40:58.317063 7fc2164da700 0 log [WRN] : slow request 960.196157 seconds old, received at 2014-07-30 10:24:58.120716: MOSDPGPush(4.0 70 [PushOp(d94b8988/plana1527198-17/head//4, version: 18'8, data_included: [0~178143], data_size: 178143, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(d94b8988/plana1527198-17/head//4@18'8, copy_subset: [0~178143], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:178143, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 currently no flag points reached
I think that message got stuck in the queue. Still looking.
Updated by Samuel Just almost 10 years ago
Added some debugging to dump the OpWQ queue information if there are stale ops; running it in a loop.
Updated by Samuel Just over 9 years ago
- Status changed from New to Can't reproduce