Project

General

Profile

Actions

Bug #9053

closed

mon/Paxos.cc: 628: FAILED assert(begin->last_committed == last_committed)

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
firefly
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-08-06_02:30:01-rados-next-testing-basic-plana/402965
description: rados/monthrash/{ceph/ceph.yaml clusters/9-mons.yaml fs/xfs.yaml msgr-failures/few.yaml thrashers/sync-many.yaml workloads/snaps-few-objects.yaml}

2014-08-07T18:37:45.127 INFO:teuthology.task.ceph.mon.h.plana39.stderr:mon/Paxos.cc: In function 'void Paxos::handle_begin(MMonPaxos*)' thread 7f655bb0a700 time 2014-08-07 18:37:45.127077
2014-08-07T18:37:45.127 INFO:teuthology.task.ceph.mon.h.plana39.stderr:mon/Paxos.cc: 628: FAILED assert(begin->last_committed == last_committed)
2014-08-07T18:37:45.127 INFO:teuthology.task.ceph.mon.h.plana39.stderr: ceph version 0.83-412-gdceab8d (dceab8dc49d4f458e3bc9fd40eb8b487f1e35948)
2014-08-07T18:37:45.127 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 1: (Paxos::handle_begin(MMonPaxos*)+0x8a0) [0x5a4700]
2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 2: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x5a981b]
2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 3: (Monitor::dispatch(MonSession*, Message*, bool)+0x507) [0x5833d7]
2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 4: (Monitor::_ms_dispatch(Message*)+0x2d7) [0x581337]
2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 5: (Monitor::ms_dispatch(Message*)+0x32) [0x59ae52]
2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 6: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x84a167]
2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 7: (DispatchQueue::entry()+0x4b2) [0x846f02]
2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x75253d]
2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 9: (()+0x7e9a) [0x7f6560ffde9a]
2014-08-07T18:37:45.129 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 10: (clone()+0x6d) [0x7f655f9ae3fd]
2014-08-07T18:37:45.129 INFO:teuthology.task.ceph.mon.h.plana39.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-08-07T18:37:45.129 INFO:teuthology.task.ceph.mon.h.plana39.stderr:2014-08-07 18:37:45.128620 7f655bb0a700 -1 mon/Paxos.cc: In function 'void Paxos::handle_begin(MMonPaxos*)' thread 7f655bb0a700 time 2014-08-07 18:37:45.127077
2014-08-07T18:37:45.129 INFO:teuthology.task.ceph.mon.h.plana39.stderr:mon/Paxos.cc: 628: FAILED assert(begin->last_committed == last_committed)

Actions #1

Updated by Sage Weil over 9 years ago

  • Status changed from New to 7
Actions #2

Updated by Sage Weil over 9 years ago

Paxos::handle_last() bug.

the peon:

   -63> 2014-08-07 18:37:44.847440 7f655bb0a700  1 -- 10.214.132.39:6791/0 <== mon.0 10.214.131.16:6789/0 277680973 ==== paxos(collect lc 3375 fc 3356 pn 6100 opn 0) v3 ==== 80+0+0 (2683124772 0 0) 0x1a2c780 con 0x1813b00
   -56> 2014-08-07 18:37:44.847504 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) handle_collect paxos(collect lc 3375 fc 3356 pn 6100 opn 0) v3
   -55> 2014-08-07 18:37:44.847509 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) accepting pn 6100 from 0
   -54> 2014-08-07 18:37:44.867164 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375)  sharing our accepted but uncommitted value for 3376 (12982 bytes)
   -53> 2014-08-07 18:37:44.867217 7f655bb0a700  1 -- 10.214.132.39:6791/0 --> mon.0 10.214.131.16:6789/0 -- paxos(last lc 3375 fc 3356 pn 6100 opn 5900) v3 -- ?+0 0x19ce000
   -44> 2014-08-07 18:37:45.019888 7f655bb0a700  1 -- 10.214.132.39:6791/0 <== mon.0 10.214.131.16:6789/0 277680975 ==== paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 ==== 82475+0+0 (1187477362 0 0) 0x19ce000 con 0x1813b00
   -37> 2014-08-07 18:37:45.019958 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) handle_commit on 3392
   -36> 2014-08-07 18:37:45.019961 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) store_state ignoring all values, they start at 3378 > last_committed+1
   -35> 2014-08-07 18:37:45.019964 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) store_state nothing to commit

the commit is not contiguous! why? see the leader:

2014-08-07 18:37:44.866128 7f5ce07e6700  1 -- 10.214.131.16:6789/0 <== mon.5 10.214.132.39:6791/0 127 ==== paxos(last lc 3375 fc 3356 pn 6100 opn 5900) v3 ==== 13074+0+0 (2602906985 0 0) 0x30baf80 con 0x2b28c00
2014-08-07 18:37:44.866194 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) handle_last paxos(last lc 3375 fc 3356 pn 6100 opn 5900) v3
2014-08-07 18:37:44.866202 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) store_state nothing to commit
2014-08-07 18:37:44.866236 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375)  they accepted our pn, we now have 2 peons
2014-08-07 18:37:44.866245 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) we learned an uncommitted value for 3376 pn 5900 12982 bytes

so far so good. but then
2014-08-07 18:37:44.869570 7f5ce07e6700  1 -- 10.214.131.16:6789/0 <== mon.3 10.214.132.39:6790/0 779 ==== paxos(last lc 3391 fc 3367 pn 6100 opn 6002) v3 ==== 110200+0+0 (3866067640 0 0) 0x30b8500 con 0x2b27ce0
2014-08-07 18:37:44.869628 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) handle_last paxos(last lc 3391 fc 3367 pn 6100 opn 6002) v3
2014-08-07 18:37:44.869636 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3391) store_state [3376..3391]
2014-08-07 18:37:44.869955 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3391)  forgetting obsolete uncommitted value 3376 pn 5900
2014-08-07 18:37:44.914213 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3391)  they accepted our pn, we now have 3 peons
2014-08-07 18:37:44.914236 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3391) we learned an uncommitted value for 3392 pn 6002 449 bytes

first_committed is moving forward, but still overlapping. then

2014-08-07 18:37:44.964477 7f5ce07e6700  1 -- 10.214.131.16:6789/0 <== mon.2 10.214.131.16:6790/0 745 ==== paxos(last lc 3392 fc 3378 pn 6100 opn 0) v3 ==== 82475+0+0 (3684089814 0 0) 0x34eaa80 con 0x2b27fa0
2014-08-07 18:37:44.964498 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3391) handle_last paxos(last lc 3392 fc 3378 pn 6100 opn 0) v3
2014-08-07 18:37:44.964502 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3392) store_state [3392..3392]
2014-08-07 18:37:44.964520 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3392)  forgetting obsolete uncommitted value 3392 pn 6002

no longer contiguous! but we don't notice, and

2014-08-07 18:37:45.014603 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  they accepted our pn, we now have 7 peons
2014-08-07 18:37:45.014623 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sending commit to mon.3
2014-08-07 18:37:45.014627 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3367 lc 3391
2014-08-07 18:37:45.014665 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3392 (449 bytes)
2014-08-07 18:37:45.014669 7f5ce07e6700  1 -- 10.214.131.16:6789/0 --> mon.3 10.214.132.39:6790/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x349f700
2014-08-07 18:37:45.014688 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sending commit to mon.4
2014-08-07 18:37:45.014692 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3367 lc 3391
2014-08-07 18:37:45.014724 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3392 (449 bytes)
2014-08-07 18:37:45.014728 7f5ce07e6700  1 -- 10.214.131.16:6789/0 --> mon.4 10.214.131.16:6791/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x349cf00
2014-08-07 18:37:45.014744 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sending commit to mon.5
2014-08-07 18:37:45.014749 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3356 lc 3375
2014-08-07 18:37:45.014877 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3378 (485 bytes)
2014-08-07 18:37:45.014928 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3379 (15211 bytes)
2014-08-07 18:37:45.014957 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3380 (449 bytes)
2014-08-07 18:37:45.014986 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3381 (408 bytes)
2014-08-07 18:37:45.015025 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3382 (485 bytes)
2014-08-07 18:37:45.015077 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3383 (13232 bytes)
2014-08-07 18:37:45.015105 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3384 (2906 bytes)
2014-08-07 18:37:45.015135 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3385 (12991 bytes)
2014-08-07 18:37:45.015164 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3386 (9739 bytes)
2014-08-07 18:37:45.015191 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3387 (312 bytes)
2014-08-07 18:37:45.015220 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3388 (10902 bytes)
2014-08-07 18:37:45.015249 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3389 (808 bytes)
2014-08-07 18:37:45.015277 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3390 (786 bytes)
2014-08-07 18:37:45.015308 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3391 (13052 bytes)
2014-08-07 18:37:45.015338 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3392 (449 bytes)
2014-08-07 18:37:45.015341 7f5ce07e6700  1 -- 10.214.131.16:6789/0 --> mon.5 10.214.132.39:6791/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x30ba800
2014-08-07 18:37:45.015363 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sending commit to mon.6
2014-08-07 18:37:45.015367 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3367 lc 3391
2014-08-07 18:37:45.015401 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3392 (449 bytes)
2014-08-07 18:37:45.015404 7f5ce07e6700  1 -- 10.214.131.16:6789/0 --> mon.6 10.214.131.16:6792/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x30b8500
2014-08-07 18:37:45.015419 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sending commit to mon.8
2014-08-07 18:37:45.015423 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3367 lc 3391
2014-08-07 18:37:45.015455 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392)  sharing 3392 (449 bytes)
2014-08-07 18:37:45.015458 7f5ce07e6700  1 -- 10.214.131.16:6789/0 --> mon.8 10.214.131.16:6793/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x3032a00
2014-08-07 18:37:45.015474 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) that's everyone.  active!

in the end the peon goes to 3375, but we start ad 3378, and we are not contiguous. the peon blows up a bit later when they get a begin for 3393.

Actions #3

Updated by Sage Weil over 9 years ago

  • Assignee set to Sage Weil
Actions #4

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Pending Backport
Actions #5

Updated by Sage Weil over 9 years ago

  • Priority changed from Urgent to High
Actions #6

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to Resolved
  • Backport set to firefly

Merged to firefly.

Actions

Also available in: Atom PDF