Actions
Bug #9053
closedmon/Paxos.cc: 628: FAILED assert(begin->last_committed == last_committed)
% Done:
0%
Source:
Q/A
Tags:
Backport:
firefly
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ubuntu@teuthology:/a/teuthology-2014-08-06_02:30:01-rados-next-testing-basic-plana/402965
description: rados/monthrash/{ceph/ceph.yaml clusters/9-mons.yaml fs/xfs.yaml msgr-failures/few.yaml thrashers/sync-many.yaml workloads/snaps-few-objects.yaml}
2014-08-07T18:37:45.127 INFO:teuthology.task.ceph.mon.h.plana39.stderr:mon/Paxos.cc: In function 'void Paxos::handle_begin(MMonPaxos*)' thread 7f655bb0a700 time 2014-08-07 18:37:45.127077 2014-08-07T18:37:45.127 INFO:teuthology.task.ceph.mon.h.plana39.stderr:mon/Paxos.cc: 628: FAILED assert(begin->last_committed == last_committed) 2014-08-07T18:37:45.127 INFO:teuthology.task.ceph.mon.h.plana39.stderr: ceph version 0.83-412-gdceab8d (dceab8dc49d4f458e3bc9fd40eb8b487f1e35948) 2014-08-07T18:37:45.127 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 1: (Paxos::handle_begin(MMonPaxos*)+0x8a0) [0x5a4700] 2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 2: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x5a981b] 2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 3: (Monitor::dispatch(MonSession*, Message*, bool)+0x507) [0x5833d7] 2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 4: (Monitor::_ms_dispatch(Message*)+0x2d7) [0x581337] 2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 5: (Monitor::ms_dispatch(Message*)+0x32) [0x59ae52] 2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 6: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x84a167] 2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 7: (DispatchQueue::entry()+0x4b2) [0x846f02] 2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x75253d] 2014-08-07T18:37:45.128 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 9: (()+0x7e9a) [0x7f6560ffde9a] 2014-08-07T18:37:45.129 INFO:teuthology.task.ceph.mon.h.plana39.stderr: 10: (clone()+0x6d) [0x7f655f9ae3fd] 2014-08-07T18:37:45.129 INFO:teuthology.task.ceph.mon.h.plana39.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2014-08-07T18:37:45.129 INFO:teuthology.task.ceph.mon.h.plana39.stderr:2014-08-07 18:37:45.128620 7f655bb0a700 -1 mon/Paxos.cc: In function 'void Paxos::handle_begin(MMonPaxos*)' thread 7f655bb0a700 time 2014-08-07 18:37:45.127077 2014-08-07T18:37:45.129 INFO:teuthology.task.ceph.mon.h.plana39.stderr:mon/Paxos.cc: 628: FAILED assert(begin->last_committed == last_committed)
Updated by Sage Weil over 9 years ago
Paxos::handle_last() bug.
the peon:
-63> 2014-08-07 18:37:44.847440 7f655bb0a700 1 -- 10.214.132.39:6791/0 <== mon.0 10.214.131.16:6789/0 277680973 ==== paxos(collect lc 3375 fc 3356 pn 6100 opn 0) v3 ==== 80+0+0 (2683124772 0 0) 0x1a2c780 con 0x1813b00 -56> 2014-08-07 18:37:44.847504 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) handle_collect paxos(collect lc 3375 fc 3356 pn 6100 opn 0) v3 -55> 2014-08-07 18:37:44.847509 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) accepting pn 6100 from 0 -54> 2014-08-07 18:37:44.867164 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) sharing our accepted but uncommitted value for 3376 (12982 bytes) -53> 2014-08-07 18:37:44.867217 7f655bb0a700 1 -- 10.214.132.39:6791/0 --> mon.0 10.214.131.16:6789/0 -- paxos(last lc 3375 fc 3356 pn 6100 opn 5900) v3 -- ?+0 0x19ce000 -44> 2014-08-07 18:37:45.019888 7f655bb0a700 1 -- 10.214.132.39:6791/0 <== mon.0 10.214.131.16:6789/0 277680975 ==== paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 ==== 82475+0+0 (1187477362 0 0) 0x19ce000 con 0x1813b00 -37> 2014-08-07 18:37:45.019958 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) handle_commit on 3392 -36> 2014-08-07 18:37:45.019961 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) store_state ignoring all values, they start at 3378 > last_committed+1 -35> 2014-08-07 18:37:45.019964 7f655bb0a700 10 mon.h@5(peon).paxos(paxos recovering c 3356..3375) store_state nothing to commit
the commit is not contiguous! why? see the leader:
2014-08-07 18:37:44.866128 7f5ce07e6700 1 -- 10.214.131.16:6789/0 <== mon.5 10.214.132.39:6791/0 127 ==== paxos(last lc 3375 fc 3356 pn 6100 opn 5900) v3 ==== 13074+0+0 (2602906985 0 0) 0x30baf80 con 0x2b28c00 2014-08-07 18:37:44.866194 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) handle_last paxos(last lc 3375 fc 3356 pn 6100 opn 5900) v3 2014-08-07 18:37:44.866202 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) store_state nothing to commit 2014-08-07 18:37:44.866236 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) they accepted our pn, we now have 2 peons 2014-08-07 18:37:44.866245 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) we learned an uncommitted value for 3376 pn 5900 12982 bytes
so far so good. but then
2014-08-07 18:37:44.869570 7f5ce07e6700 1 -- 10.214.131.16:6789/0 <== mon.3 10.214.132.39:6790/0 779 ==== paxos(last lc 3391 fc 3367 pn 6100 opn 6002) v3 ==== 110200+0+0 (3866067640 0 0) 0x30b8500 con 0x2b27ce0 2014-08-07 18:37:44.869628 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3375) handle_last paxos(last lc 3391 fc 3367 pn 6100 opn 6002) v3 2014-08-07 18:37:44.869636 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3391) store_state [3376..3391] 2014-08-07 18:37:44.869955 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3356..3391) forgetting obsolete uncommitted value 3376 pn 5900 2014-08-07 18:37:44.914213 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3391) they accepted our pn, we now have 3 peons 2014-08-07 18:37:44.914236 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3391) we learned an uncommitted value for 3392 pn 6002 449 bytes
first_committed is moving forward, but still overlapping. then
2014-08-07 18:37:44.964477 7f5ce07e6700 1 -- 10.214.131.16:6789/0 <== mon.2 10.214.131.16:6790/0 745 ==== paxos(last lc 3392 fc 3378 pn 6100 opn 0) v3 ==== 82475+0+0 (3684089814 0 0) 0x34eaa80 con 0x2b27fa0 2014-08-07 18:37:44.964498 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3391) handle_last paxos(last lc 3392 fc 3378 pn 6100 opn 0) v3 2014-08-07 18:37:44.964502 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3392) store_state [3392..3392] 2014-08-07 18:37:44.964520 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3367..3392) forgetting obsolete uncommitted value 3392 pn 6002
no longer contiguous! but we don't notice, and
2014-08-07 18:37:45.014603 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) they accepted our pn, we now have 7 peons 2014-08-07 18:37:45.014623 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sending commit to mon.3 2014-08-07 18:37:45.014627 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3367 lc 3391 2014-08-07 18:37:45.014665 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3392 (449 bytes) 2014-08-07 18:37:45.014669 7f5ce07e6700 1 -- 10.214.131.16:6789/0 --> mon.3 10.214.132.39:6790/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x349f700 2014-08-07 18:37:45.014688 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sending commit to mon.4 2014-08-07 18:37:45.014692 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3367 lc 3391 2014-08-07 18:37:45.014724 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3392 (449 bytes) 2014-08-07 18:37:45.014728 7f5ce07e6700 1 -- 10.214.131.16:6789/0 --> mon.4 10.214.131.16:6791/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x349cf00 2014-08-07 18:37:45.014744 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sending commit to mon.5 2014-08-07 18:37:45.014749 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3356 lc 3375 2014-08-07 18:37:45.014877 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3378 (485 bytes) 2014-08-07 18:37:45.014928 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3379 (15211 bytes) 2014-08-07 18:37:45.014957 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3380 (449 bytes) 2014-08-07 18:37:45.014986 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3381 (408 bytes) 2014-08-07 18:37:45.015025 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3382 (485 bytes) 2014-08-07 18:37:45.015077 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3383 (13232 bytes) 2014-08-07 18:37:45.015105 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3384 (2906 bytes) 2014-08-07 18:37:45.015135 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3385 (12991 bytes) 2014-08-07 18:37:45.015164 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3386 (9739 bytes) 2014-08-07 18:37:45.015191 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3387 (312 bytes) 2014-08-07 18:37:45.015220 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3388 (10902 bytes) 2014-08-07 18:37:45.015249 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3389 (808 bytes) 2014-08-07 18:37:45.015277 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3390 (786 bytes) 2014-08-07 18:37:45.015308 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3391 (13052 bytes) 2014-08-07 18:37:45.015338 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3392 (449 bytes) 2014-08-07 18:37:45.015341 7f5ce07e6700 1 -- 10.214.131.16:6789/0 --> mon.5 10.214.132.39:6791/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x30ba800 2014-08-07 18:37:45.015363 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sending commit to mon.6 2014-08-07 18:37:45.015367 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3367 lc 3391 2014-08-07 18:37:45.015401 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3392 (449 bytes) 2014-08-07 18:37:45.015404 7f5ce07e6700 1 -- 10.214.131.16:6789/0 --> mon.6 10.214.131.16:6792/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x30b8500 2014-08-07 18:37:45.015419 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sending commit to mon.8 2014-08-07 18:37:45.015423 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) share_state peer has fc 3367 lc 3391 2014-08-07 18:37:45.015455 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) sharing 3392 (449 bytes) 2014-08-07 18:37:45.015458 7f5ce07e6700 1 -- 10.214.131.16:6789/0 --> mon.8 10.214.131.16:6793/0 -- paxos(commit lc 3392 fc 0 pn 0 opn 0) v3 -- ?+0 0x3032a00 2014-08-07 18:37:45.015474 7f5ce07e6700 10 mon.a@0(leader).paxos(paxos recovering c 3378..3392) that's everyone. active!
in the end the peon goes to 3375, but we start ad 3378, and we are not contiguous. the peon blows up a bit later when they get a begin for 3393.
Updated by Samuel Just over 9 years ago
- Status changed from Pending Backport to Resolved
- Backport set to firefly
Merged to firefly.
Actions