Actions
Bug #17789
closedosd: update_log_missing does not order correctly with osd_ops
% Done:
0%
Source:
Q/A
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
A normal op will often wait for just commit. The update_log_missing only waits for ack. On completion, eval_repop says it's complete and it tries to pop it off hte repop queue, and we hit
-32> 2016-11-01 19:58:40.318552 7fcff72c9700 5 -- op tracker -- seq: 11837, time: 2016-11-01 19:58:40.318552, event: reached_pg, op: pg_update_log_missing_reply(1.2 epoch 630 rep_tid 3656) -31> 2016-11-01 19:58:40.318555 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] handle_message: 0x7fd05643d3c0 -30> 2016-11-01 19:58:40.318561 7fcff72c9700 20 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] do_update_log_missing_reply got reply from 3 -29> 2016-11-01 19:58:40.318567 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] repop_all_applied: repop tid 3656 all applied -28> 2016-11-01 19:58:40.318572 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] eval_repop repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=0 applied?=1) wants= -27> 2016-11-01 19:58:40.318578 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] applied: repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=0 applied?=1) -26> 2016-11-01 19:58:40.318583 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] repop_all_committed: repop tid 3656 all committed -25> 2016-11-01 19:58:40.318587 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] eval_repop repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1) wants= -24> 2016-11-01 19:58:40.318592 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] commit: repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1) -23> 2016-11-01 19:58:40.318596 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] applied: repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1) -22> 2016-11-01 19:58:40.318640 7fcff72c9700 15 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] publish_stats_to_osd 630:1381 -21> 2016-11-01 19:58:40.318647 7fcff72c9700 20 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] finished operator() r=-2 -20> 2016-11-01 19:58:40.318651 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] sending commit on osd_op(osd.1.371:181 1.e95139a6 smithi02628481-7 ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo [delete] snapc 160=[] write+ordersnap+ignore_overlay+enforce_snapc+known_if_redirected e630) v7 0x7fd02dc0b340 -19> 2016-11-01 19:58:40.318660 7fcff72c9700 1 -- 172.21.15.26:6808/23999 --> 172.21.15.26:0/22235 -- osd_op_reply(181 smithi02628481-7 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo [delete] v630'2473 uv3381 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fd02dc0b340 con 0x7fd03ba01f00 -18> 2016-11-01 19:58:40.318671 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] removing repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1) -17> 2016-11-01 19:58:40.318677 7fcff72c9700 20 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] q front is repgather(0x7fd05643d600 630'2472 rep_tid=3655 committed?=0 applied?=0) -16> 2016-11-01 19:58:40.318682 7fcff72c9700 0 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] removing repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1) -15> 2016-11-01 19:58:40.318687 7fcff72c9700 0 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] q front is repgather(0x7fd05643d600 630'2472 rep_tid=3655 committed?=0 applied?=0) 0> 2016-11-01 19:58:40.345433 7fcff72c9700 -1 /srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.2/src/osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7fcff72c9700 time 2016-11-01 19:58:40.326765 /srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.2/src/osd/ReplicatedPG.cc: 8683: FAILED assert(repop_queue.front() == repop)
I suspect this bit of code:
/* Hack to work around the fact that ReplicatedBackend sends * ack+commit if commit happens first */ if (pool.info.ec_pool()) { t.register_on_complete(complete); } else { t.register_on_commit(complete); } t.register_on_applied( new C_OSD_OnApplied{this, get_osdmap()->get_epoch(), info.last_update});
in do_update_log_missing
/a/sage-2016-11-01_16:30:29-rados:thrash-master---basic-smithi/509438
Actions