Bug #21555
opensrc/osd/PGLog.h: 1455: FAILED assert(miter != missing.get_items().end())
0%
Description
pg 2.3s0 up/acting is [7,0,2]/[6,0,2]
in backfill_toofull state, osd.6 got write op, bc object > last_backfill, and only send the pglog to osd.7,
then in backfilling state, osd.6 got delete op, bc object > last_backfill, and only send the pglog to osd.7,
after backfill finished, up/acting changed to [7,0,2],
osd.7 receive delete (retry=2) op, bc osd.7 dont have the object in local storage, so the delete(retry=2) op
failed, it recorded in pglog as 'error 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 0.000000 -2'.
after reboot osd.7, with osd_debug_verify_missing_on_start=true, it checks the pglog,
it will skip the 'error' type log entry, but when it checks the first modify log entry,
the getattr() of the object returns ENOENT, and the object not in missing set,
so osd crashed in src/osd/PGLog.h: 1455: FAILED assert(miter != missing.get_items().end())
Updated by huang jun over 6 years ago
osd7:
91'473 (0'0) modify
151'793 (0'0) error
osd.6
91'473 (0'0) modify
149'793 (91'473) delete
Updated by huang jun over 6 years ago
@Josh Jones do you have time to look at it?
Updated by Josh Durgin over 6 years ago
Is this on master?
Shouldn't osd.7 have the 149'793 log entry for the delete, and thus detect the retry as a duplicate op, respond immediately, and not add an error entry? Can you attach the osd debug logs?
Updated by huang jun over 6 years ago
2017-09-24 17:25:51.299203 7f62e7dc0700 10 osd.7 pg_epoch: 149 pg[2.3s0( v 149'793 (22'1,149'793] lb 2:c369923f:::benchmark_data_mira035.xsky.com_17216_object 8996:head (bitwise) local-lis/les=86/87 n=19 ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/86/86) [7,0,2]/[6,0,2] r=-1 lpr=86 pi=[79,86)/1 luod=0'0 lua=149'792 crt= 149'793 lcod 149'792 active+remapped] add_log_entry 149'793 (91'473) delete 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.421 5.0:17822 2017-09-24 17:25:51.166477 0 2017-09-24 17:25:51.828386 7f62ec5c9700 10 merge_log log((149'792,149'792], crt=149'792) from osd.2(2) into log((22'1,149'794], crt=149'793) 2017-09-24 17:25:51.828393 7f62ec5c9700 10 rewind_divergent_log truncate divergent future 149'792 2017-09-24 17:25:51.829283 7f62ec5c9700 10 rewind_divergent_log future divergent 149'793 (91'473) delete 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 2017-09-24 17:25:51.166477 0 2017-09-24 17:25:51.829309 7f62ec5c9700 10 rewind_divergent_log future divergent 149'794 (91'474) delete 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head by client.4215.0:17826 2017-09-24 17:25:51.184513 0 2017-09-24 17:25:51.829321 7f62ec5c9700 20 _merge_object_divergent_entries: merging hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head entries: 149'794 (91'474) delete 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head by client.4215.0:17826 2017-09-24 17:25:51.184513 0 2017-09-24 17:25:51.829332 7f62ec5c9700 20 _merge_object_divergent_entries: keeping 149'794 (91'474) delete 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head by client.4215.0:17826 2017-09-24 17:25:51.184513 0 2017-09-24 17:25:51.829342 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head prior_version: 91'474 first_divergent_update: 149'794 last_divergent_update: 149'794 2017-09-24 17:25:51.829349 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head has no more recent entries in log 2017-09-24 17:25:51.829359 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head must be rolled back or recovered, attempting to rollback 2017-09-24 17:25:51.829364 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head rolling back 149'794 (91'474) delete 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head by client.4215.0:17826 2017-09-24 17:25:51.184513 0 2017-09-24 17:25:51.829416 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head rolled back 2017-09-24 17:25:51.829422 7f62ec5c9700 20 _merge_object_divergent_entries: merging hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head entries: 149'793 (91'473) delete 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 2017-09-24 17:25:51.166477 0 2017-09-24 17:25:51.829447 7f62ec5c9700 20 _merge_object_divergent_entries: keeping 149'793 (91'473) delete 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 2017-09-24 17:25:51.166477 0 2017-09-24 17:25:51.829457 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head prior_version: 91'473 first_divergent_update: 149'793 last_divergent_update: 149'793 2017-09-24 17:25:51.829463 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head has no more recent entries in log 2017-09-24 17:25:51.829467 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head must be rolled back or recovered, attempting to rollback 2017-09-24 17:25:51.829471 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head rolling back 149'793 (91'473) delete 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 2017-09-24 17:25:51.166477 0 2017-09-24 17:25:51.829492 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head rolled back 2017-09-24 17:25:51.829522 7f62ec5c9700 10 merge_log result log((22'1,149'792], crt=149'792) missing(0 may_include_deletes = 1) changed=1 { "op_num": 2, "op_name": "remove", "collection": "2.3s0_head", "oid": "0#2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head#" }, { "op_num": 3, "op_name": "op_coll_move_rename", "old_collection": "2.3s0_head", "old_oid": "0#2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head#319", "new_collection": "2.3s0_head", "new_oid": "0#2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head#" },
Updated by huang jun over 6 years ago
osd2 receive delete op delayed, so osd2's pglog don't have "149'793 (91'473) delete" 2017-09-24 17:25:51.798267 7f9951a8e700 10 osd.2 150 dequeue_op 0x7f99dd82bc20 prio 127 cost 0 latency 0.001306 MOSDECSubOpWrite(2.3s2 149/86 ECSubWrite(tid=2937, reqid=client.4215.0:17822, at_version=149'793, trim_to=22'1, roll_forward_to=149'792)) v2 pg pg[2.3s2( v 149'792 (22'1,149'792] local-lis/les=86/87 n=32 ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/150/150) [7,0,2] r=2 lpr=150 pi=[79,150)/1 crt=149'792 unknown NOTIFY] 2017-09-24 17:25:51.798354 7f9951a8e700 10 osd.2 pg_epoch: 150 pg[2.3s2( v 149'792 (22'1,149'792] local-lis/les=86/87 n=32 ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/150/150) [7,0,2] r=2 lpr=150 pi=[79,150)/1 crt=149'792 unknown NOTIFY] old_peering_msg reply_epoch 149 query_epoch 149 last_peering_reset 150 2017-09-24 17:25:51.798385 7f9951a8e700 10 osd.2 pg_epoch: 150 pg[2.3s2( v 149'792 (22'1,149'792] local-lis/les=86/87 n=32 ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/150/150) [7,0,2] r=2 lpr=150 pi=[79,150)/1 crt=149'792 unknown NOTIFY] can_discard_replica_op pg changed ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/150/150 after 149, dropping
Updated by huang jun over 6 years ago
osd.6 remove object "0#2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head#" from backfillinfo.objects in update_range(). It is ok if all the shards complete delete op, but in this case, osd2
doesn't complete delete op.
can we don't remove the object if it's have un-committed repop replys? so the backfill will not finish until
delete op committed on all shards.