Project

General

Profile

Actions

Bug #17789

closed

osd: update_log_missing does not order correctly with osd_ops

Added by Sage Weil over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A normal op will often wait for just commit. The update_log_missing only waits for ack. On completion, eval_repop says it's complete and it tries to pop it off hte repop queue, and we hit

   -32> 2016-11-01 19:58:40.318552 7fcff72c9700  5 -- op tracker -- seq: 11837, time: 2016-11-01 19:58:40.318552, event: reached_pg, op: pg_update_log_missing_reply(1.2 epoch 630 rep_tid 3656)
   -31> 2016-11-01 19:58:40.318555 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] handle_message: 0x7fd05643d3c0
   -30> 2016-11-01 19:58:40.318561 7fcff72c9700 20 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] do_update_log_missing_reply got reply from 3
   -29> 2016-11-01 19:58:40.318567 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] repop_all_applied: repop tid 3656 all applied 
   -28> 2016-11-01 19:58:40.318572 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] eval_repop repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=0 applied?=1) wants=
   -27> 2016-11-01 19:58:40.318578 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]]  applied: repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=0 applied?=1) 
   -26> 2016-11-01 19:58:40.318583 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] repop_all_committed: repop tid 3656 all committed 
   -25> 2016-11-01 19:58:40.318587 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] eval_repop repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1) wants=
   -24> 2016-11-01 19:58:40.318592 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]]  commit: repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1)
   -23> 2016-11-01 19:58:40.318596 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]]  applied: repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1) 
   -22> 2016-11-01 19:58:40.318640 7fcff72c9700 15 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] publish_stats_to_osd 630:1381
   -21> 2016-11-01 19:58:40.318647 7fcff72c9700 20 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]] finished operator() r=-2
   -20> 2016-11-01 19:58:40.318651 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]]  sending commit on osd_op(osd.1.371:181 1.e95139a6 smithi02628481-7 ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo [delete] snapc 160=[] write+ordersnap+ignore_overlay+enforce_snapc+known_if_redirected e630) v7 0x7fd02dc0b340
   -19> 2016-11-01 19:58:40.318660 7fcff72c9700  1 -- 172.21.15.26:6808/23999 --> 172.21.15.26:0/22235 -- osd_op_reply(181 smithi02628481-7 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo [delete] v630'2473 uv3381 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fd02dc0b340 con 0x7fd03ba01f00
   -18> 2016-11-01 19:58:40.318671 7fcff72c9700 10 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]]  removing repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1)
   -17> 2016-11-01 19:58:40.318677 7fcff72c9700 20 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]]    q front is repgather(0x7fd05643d600 630'2472 rep_tid=3655 committed?=0 applied?=0)
   -16> 2016-11-01 19:58:40.318682 7fcff72c9700  0 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]]  removing repgather(0x7fd05643b560 0'0 rep_tid=3656 committed?=1 applied?=1)
   -15> 2016-11-01 19:58:40.318687 7fcff72c9700  0 osd.0 pg_epoch: 630 pg[1.2( v 630'2473 (0'0,630'2473] local-les=616 n=119 ec=7 les/c/f 616/598/0 614/615/470) [0,3] r=0 lpr=615 pi=583-614/2 luod=630'2471 rops=3 crt=613'2441 lcod 620'2470 mlcod 505'2184 active+recovering+degraded snaptrimq=[179~1,184~1,187~1,194~1,197~1,19b~6,1a2~1]]    q front is repgather(0x7fd05643d600 630'2472 rep_tid=3655 committed?=0 applied?=0)
     0> 2016-11-01 19:58:40.345433 7fcff72c9700 -1 /srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.2/src/osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7fcff72c9700 time 2016-11-01 19:58:40.326765
/srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.2/src/osd/ReplicatedPG.cc: 8683: FAILED assert(repop_queue.front() == repop)

I suspect this bit of code:

  /* Hack to work around the fact that ReplicatedBackend sends
   * ack+commit if commit happens first */
  if (pool.info.ec_pool()) {
    t.register_on_complete(complete);
  } else {
    t.register_on_commit(complete);
  }
  t.register_on_applied(
    new C_OSD_OnApplied{this, get_osdmap()->get_epoch(), info.last_update});

in do_update_log_missing

/a/sage-2016-11-01_16:30:29-rados:thrash-master---basic-smithi/509438


Related issues 1 (0 open1 closed)

Copied to Ceph - Backport #17876: jewel: osd: update_log_missing does not order correctly with osd_opsResolvedAbhishek VarshneyActions
Actions #1

Updated by Samuel Just over 7 years ago

  • Assignee set to Samuel Just
Actions #2

Updated by Samuel Just over 7 years ago

By any chance, is it possible for bluestore to send an apply before a commit?

Actions #3

Updated by Samuel Just over 7 years ago

Bah, stupid bug, forgot to take the lock and check the interval in the callback.

Actions #4

Updated by Samuel Just over 7 years ago

  • Status changed from 12 to 7
Actions #5

Updated by Samuel Just over 7 years ago

  • Backport set to jewel
Actions #6

Updated by Sage Weil over 7 years ago

  • Status changed from 7 to Pending Backport
Actions #7

Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #17876: jewel: osd: update_log_missing does not order correctly with osd_ops added
Actions #8

Updated by Nathan Cutler over 7 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF