Project

General

Profile

Actions

Bug #21555

open

src/osd/PGLog.h: 1455: FAILED assert(miter != missing.get_items().end())

Added by huang jun over 6 years ago. Updated over 6 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

pg 2.3s0 up/acting is [7,0,2]/[6,0,2]

in backfill_toofull state, osd.6 got write op, bc object > last_backfill, and only send the pglog to osd.7,
then in backfilling state, osd.6 got delete op, bc object > last_backfill, and only send the pglog to osd.7,
after backfill finished, up/acting changed to [7,0,2],
osd.7 receive delete (retry=2) op, bc osd.7 dont have the object in local storage, so the delete(retry=2) op
failed, it recorded in pglog as 'error 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 0.000000 -2'.

after reboot osd.7, with osd_debug_verify_missing_on_start=true, it checks the pglog,
it will skip the 'error' type log entry, but when it checks the first modify log entry,
the getattr() of the object returns ENOENT, and the object not in missing set,
so osd crashed in src/osd/PGLog.h: 1455: FAILED assert(miter != missing.get_items().end())

Actions #1

Updated by huang jun over 6 years ago

osd7:
91'473 (0'0) modify
151'793 (0'0) error

osd.6
91'473 (0'0) modify
149'793 (91'473) delete

Actions #2

Updated by huang jun over 6 years ago

@Josh Jones do you have time to look at it?

Actions #3

Updated by Josh Durgin over 6 years ago

Is this on master?

Shouldn't osd.7 have the 149'793 log entry for the delete, and thus detect the retry as a duplicate op, respond immediately, and not add an error entry? Can you attach the osd debug logs?

Actions #4

Updated by huang jun over 6 years ago

2017-09-24 17:25:51.299203 7f62e7dc0700 10 osd.7 pg_epoch: 149 pg[2.3s0( v 149'793 (22'1,149'793] lb 2:c369923f:::benchmark_data_mira035.xsky.com_17216_object
8996:head (bitwise) local-lis/les=86/87 n=19 ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/86/86) [7,0,2]/[6,0,2] r=-1 lpr=86 pi=[79,86)/1 luod=0'0 lua=149'792 crt=
149'793 lcod 149'792 active+remapped] add_log_entry 149'793 (91'473) delete   2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.421
5.0:17822 2017-09-24 17:25:51.166477 0

2017-09-24 17:25:51.828386 7f62ec5c9700 10 merge_log log((149'792,149'792], crt=149'792) from osd.2(2) into log((22'1,149'794], crt=149'793)
2017-09-24 17:25:51.828393 7f62ec5c9700 10 rewind_divergent_log truncate divergent future 149'792
2017-09-24 17:25:51.829283 7f62ec5c9700 10 rewind_divergent_log future divergent 149'793 (91'473) delete   2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 2017-09-24 17:25:51.166477 0
2017-09-24 17:25:51.829309 7f62ec5c9700 10 rewind_divergent_log future divergent 149'794 (91'474) delete   2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head by client.4215.0:17826 2017-09-24 17:25:51.184513 0
2017-09-24 17:25:51.829321 7f62ec5c9700 20 _merge_object_divergent_entries: merging hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head entries: 149'794 (91'474) delete   2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head by client.4215.0:17826 2017-09-24 17:25:51.184513 0
2017-09-24 17:25:51.829332 7f62ec5c9700 20 _merge_object_divergent_entries: keeping 149'794 (91'474) delete   2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head by client.4215.0:17826 2017-09-24 17:25:51.184513 0
2017-09-24 17:25:51.829342 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head prior_version: 91'474 first_divergent_update: 149'794 last_divergent_update: 149'794
2017-09-24 17:25:51.829349 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head has no more recent entries in log
2017-09-24 17:25:51.829359 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head must be rolled back or recovered, attempting to rollback
2017-09-24 17:25:51.829364 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head rolling back 149'794 (91'474) delete   2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head by client.4215.0:17826 2017-09-24 17:25:51.184513 0
2017-09-24 17:25:51.829416 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c11080bd:::benchmark_data_mira035.xsky.com_17216_object7872:head rolled back
2017-09-24 17:25:51.829422 7f62ec5c9700 20 _merge_object_divergent_entries: merging hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head entries: 149'793 (91'473) delete   2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 2017-09-24 17:25:51.166477 0
2017-09-24 17:25:51.829447 7f62ec5c9700 20 _merge_object_divergent_entries: keeping 149'793 (91'473) delete   2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 2017-09-24 17:25:51.166477 0
2017-09-24 17:25:51.829457 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head prior_version: 91'473 first_divergent_update: 149'793 last_divergent_update: 149'793
2017-09-24 17:25:51.829463 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head has no more recent entries in log
2017-09-24 17:25:51.829467 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head must be rolled back or recovered, attempting to rollback
2017-09-24 17:25:51.829471 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head rolling back 149'793 (91'473) delete   2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head by client.4215.0:17822 2017-09-24 17:25:51.166477 0
2017-09-24 17:25:51.829492 7f62ec5c9700 10 _merge_object_divergent_entries: hoid 2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head rolled back
2017-09-24 17:25:51.829522 7f62ec5c9700 10 merge_log result log((22'1,149'792], crt=149'792) missing(0 may_include_deletes = 1) changed=1

        {
            "op_num": 2,
            "op_name": "remove",
            "collection": "2.3s0_head",
            "oid": "0#2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head#" 
        },
        {
            "op_num": 3,
            "op_name": "op_coll_move_rename",
            "old_collection": "2.3s0_head",
            "old_oid": "0#2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head#319",
            "new_collection": "2.3s0_head",
            "new_oid": "0#2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head#" 
        },
Actions #5

Updated by huang jun over 6 years ago

osd2 receive delete op delayed, so osd2's pglog don't have "149'793 (91'473) delete" 
2017-09-24 17:25:51.798267 7f9951a8e700 10 osd.2 150 dequeue_op 0x7f99dd82bc20 prio 127 cost 0 latency 0.001306 MOSDECSubOpWrite(2.3s2 149/86 ECSubWrite(tid=2937, reqid=client.4215.0:17822, at_version=149'793, trim_to=22'1, roll_forward_to=149'792)) v2 pg pg[2.3s2( v 149'792 (22'1,149'792] local-lis/les=86/87 n=32 ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/150/150) [7,0,2] r=2 lpr=150 pi=[79,150)/1 crt=149'792 unknown NOTIFY]
2017-09-24 17:25:51.798354 7f9951a8e700 10 osd.2 pg_epoch: 150 pg[2.3s2( v 149'792 (22'1,149'792] local-lis/les=86/87 n=32 ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/150/150) [7,0,2] r=2 lpr=150 pi=[79,150)/1 crt=149'792 unknown NOTIFY] old_peering_msg reply_epoch 149 query_epoch 149 last_peering_reset 150
2017-09-24 17:25:51.798385 7f9951a8e700 10 osd.2 pg_epoch: 150 pg[2.3s2( v 149'792 (22'1,149'792] local-lis/les=86/87 n=32 ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/150/150) [7,0,2] r=2 lpr=150 pi=[79,150)/1 crt=149'792 unknown NOTIFY] can_discard_replica_op pg changed ec=20/20 lis/c 86/79 les/c/f 87/80/0 85/150/150 after 149, dropping
Actions #6

Updated by huang jun over 6 years ago

osd.6 remove object "0#2:c4b0339b:::benchmark_data_mira035.xsky.com_17216_object7868:head#" from backfillinfo.objects in update_range(). It is ok if all the shards complete delete op, but in this case, osd2
doesn't complete delete op.
can we don't remove the object if it's have un-committed repop replys? so the backfill will not finish until
delete op committed on all shards.

Actions

Also available in: Atom PDF