Bug #42977
mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch())
Status: Closed
Description
2019-11-22T03:44:44.746 INFO:teuthology.orchestra.run.smithi026:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status
2019-11-22T03:44:44.781 INFO:tasks.ceph.mon.b.smithi061.stderr:/build/ceph-15.0.0-7461-ge226ea9/src/mon/Elector.cc: In function 'void Elector::handle_ack(MonOpRequestRef)' thread 7f1123f73700 time 2019-11-22T03:44:44.784417+0000
2019-11-22T03:44:44.781 INFO:tasks.ceph.mon.b.smithi061.stderr:/build/ceph-15.0.0-7461-ge226ea9/src/mon/Elector.cc: 260: FAILED ceph_assert(m->epoch == get_epoch())
2019-11-22T03:44:44.786 INFO:tasks.ceph.mon.b.smithi061.stderr: ceph version 15.0.0-7461-ge226ea9 (e226ea94f1181f8578df72a07e64fb8173aefd04) octopus (dev)
2019-11-22T03:44:44.786 INFO:tasks.ceph.mon.b.smithi061.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f113076aa4c]
2019-11-22T03:44:44.786 INFO:tasks.ceph.mon.b.smithi061.stderr: 2: (()+0x284c24) [0x7f113076ac24]
2019-11-22T03:44:44.786 INFO:tasks.ceph.mon.b.smithi061.stderr: 3: (Elector::handle_ack(boost::intrusive_ptr<MonOpRequest>)+0x673) [0x55af5d960103]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 4: (Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xc12) [0x55af5d962b52]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x12b1) [0x55af5d8e9951]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x4b2) [0x55af5d8ea002]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x58) [0x55af5d9187b8]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 8: (DispatchQueue::entry()+0x11f2) [0x7f1130961182]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f11309fa70d]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 10: (()+0x76db) [0x7f112f5b86db]
2019-11-22T03:44:44.788 INFO:tasks.ceph.mon.b.smithi061.stderr: 11: (clone()+0x3f) [0x7f112e79e88f]
2019-11-22T03:44:44.788 INFO:tasks.ceph.mon.b.smithi061.stderr:*** Caught signal (Aborted) **
/a/nojha-2019-11-22_01:55:39-rados-fix-merge-vs-backoff-with-revert-distro-basic-smithi/4531396/
Updated by Sage Weil over 4 years ago
- Status changed from New to Triaged
- Priority changed from Normal to Urgent
/a/sage-2020-01-24_01:55:08-rados-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/4697995
I bet this has a similar root cause to #42328, namely https://github.com/ceph/ceph/pull/30223
Updated by Sage Weil over 4 years ago
- Related to Bug #42328: osd/PrimaryLogPG.cc: 3962: ceph_abort_msg("out of order op") added
Updated by Neha Ojha over 4 years ago
Let's see what happened in /a/sage-2020-01-24_01:55:08-rados-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/4697995
The relevant logs are ceph-mon.e.log.gz and ceph-mon.g.log.gz.
From ceph-mon.g's log
2020-01-24T09:51:10.686+0000 7fcb20b80700 1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] send_to--> mon [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- ?+0 0x55ea14e57500
2020-01-24T09:51:10.686+0000 7fcb20b80700 1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] --> [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- 0x55ea14e57500 con 0x55ea13797680

and then

2020-01-24T09:51:10.687+0000 7fcb20b80700 1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] <== mon.8 v2:172.21.15.2:3304/0 7 ==== election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 ==== 1870+0+0 (secure 0 0 0) 0x55ea14e56300 con 0x55ea13797680
From ceph-mon.e's log
2020-01-24T09:51:10.686+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 -- ?+0 0x5653aa93c300
2020-01-24T09:51:10.686+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 -- 0x5653aa93c300 con 0x5653aad48900

and then this

2020-01-24T09:51:10.688+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] <== mon.3 v2:172.21.15.177:3301/0 10 ==== election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 ==== 1058+0+0 (unknown 0 0 0) 0x5653aa87c000 con 0x5653aad48900
ceph-mon.g asserts because it was at epoch 55 when it received an ack for epoch 67. Each of them seems to be sending and receiving on the same connection. I have yet to find a correlation with https://tracker.ceph.com/issues/42328.
Updated by Sage Weil over 4 years ago
On mon.g (rank 3), the epoch is 55 (or it looks that way; it just sent these):
2020-01-24T09:51:10.685+0000 7fcb20b80700 1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] send_to--> mon [v2:172.21.15.2:3300/0,v1:172.21.15.2:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- ?+0 0x55ea14e56c00
2020-01-24T09:51:10.685+0000 7fcb20b80700 1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] --> [v2:172.21.15.2:3300/0,v1:172.21.15.2:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- 0x55ea14e56c00 con 0x55ea12bb7600
2020-01-24T09:51:10.685+0000 7fcb20b80700 1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] send_to--> mon [v2:172.21.15.177:3300/0,v1:172.21.15.177:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- ?+0 0x55ea14e56900
2020-01-24T09:51:10.685+0000 7fcb20b80700 1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] --> [v2:172.21.15.177:3300/0,v1:172.21.15.177:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- 0x55ea14e56900 con 0x55ea12bb7a80
...
when it receives an ack from mon.e (rank 8):
2020-01-24T09:51:10.687+0000 7fcb20b80700 1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] <== mon.8 v2:172.21.15.2:3304/0 7 ==== election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 ==== 1870+0+0 (secure 0 0 0) 0x55ea14e56300 con 0x55ea13797680
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 mon.g@3(electing) e1 _ms_dispatch existing session 0x55ea13a2c240 for mon.8
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 mon.g@3(electing) e1 entity caps allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 is_capable service=mon command= read addr v2:172.21.15.2:3304/0 on cap allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 allow so far , doing grant allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 allow all
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 is_capable service=mon command= exec addr v2:172.21.15.2:3304/0 on cap allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 allow so far , doing grant allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 allow all
2020-01-24T09:51:10.687+0000 7fcb20b80700 5 mon.g@3(electing).elector(55) handle_ack from mon.8
2020-01-24T09:51:10.688+0000 7fcb20b80700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.0.0-9704-gb1192ba/rpm/el8/BUILD/ceph-15.0.0-9704-gb1192ba/src/mon/Elector.cc: In function 'void Elector::handle_ack(MonOpRequestRef)' thread 7fcb20b80700 time 2020-01-24T09:51:10.687684+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.0.0-9704-gb1192ba/rpm/el8/BUILD/ceph-15.0.0-9704-gb1192ba/src/mon/Elector.cc: 260: FAILED ceph_assert(m->epoch == get_epoch())
 ceph version 15.0.0-9704-gb1192ba (b1192ba8ca76fda4688de6ceec97751d65ca697f) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7fcb2e20d1e4]
 2: (()+0x27b3fe) [0x7fcb2e20d3fe]
 3: (Elector::handle_ack(boost::intrusive_ptr<MonOpRequest>)+0x66a) [0x55ea1107f8ca]
 4: (Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xc23) [0x55ea11082d53]
 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xf64) [0x55ea11003424]
 6: (Monitor::_ms_dispatch(Message*)+0xa33) [0x55ea11004453]
 7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55ea1103339c]
 8: (DispatchQueue::entry()+0x126a) [0x7fcb2e426a0a]
 9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fcb2e4cae11]
 10: (()+0x82de) [0x7fcb2bf4d2de]
 11: (clone()+0x43) [0x7fcb2acf74b3]
handle_ack() has this assert:

ceph_assert(m->epoch == get_epoch());

and the caller Elector::dispatch() has:

if (em->epoch < get_epoch()) {
  dout(5) << "old epoch, dropping" << dendl;
  break;
}

but in this case the message epoch (67) is greater than our current epoch (55), so it passes the dispatch() guard and trips the assert.
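In other words, the guard in dispatch() only filters epochs that are strictly older; a strictly newer epoch sails straight through to handle_ack(). A minimal standalone sketch of that gap (hypothetical simplification with made-up names, not the actual Ceph code):

#include <cassert>
#include <cstdint>
#include <iostream>

// Hypothetical simplification of the two epoch checks quoted above.
struct Elector {
  uint32_t epoch = 55;  // mon.g's election epoch at the time of the crash

  uint32_t get_epoch() const { return epoch; }

  void dispatch(uint32_t msg_epoch) {
    // Mirrors the Elector::dispatch() guard: only *older* epochs are dropped...
    if (msg_epoch < get_epoch()) {
      std::cout << "old epoch, dropping\n";
      return;
    }
    handle_ack(msg_epoch);
  }

  void handle_ack(uint32_t msg_epoch) {
    // ...but handle_ack() insists on *equality*, so an ack from a newer
    // epoch (like the e67 ack from mon.e) gets past dispatch() and aborts here.
    assert(msg_epoch == get_epoch());
  }
};

int main() {
  Elector e;
  e.dispatch(42);  // older epoch: dropped, harmless
  e.dispatch(67);  // newer epoch: passes dispatch(), trips the assert
}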
On the sender, I see:
2020-01-24T09:51:10.674+0000 7fb7394f8700 5 paxos.8).electionLogic(66) start -- can i be leader?
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 paxos.8).electionLogic(66) init, last seen epoch 66
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 paxos.8).electionLogic(66) bump_epoch66 to 67
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 join_election
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 _reset
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).auth v7 _set_mon_num_rank num 0 rank 0
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 cancel_probe_timeout (none scheduled)
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 timecheck_finish
2020-01-24T09:51:10.674+0000 7fb7394f8700 15 mon.e@8(electing) e1 health_tick_stop
2020-01-24T09:51:10.674+0000 7fb7394f8700 15 mon.e@8(electing) e1 health_interval_stop
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 scrub_event_cancel
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 scrub_reset
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxos(paxos recovering c 1..341) restart -- canceling timeouts
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(mdsmap 1..1) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(osdmap 1..128) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(logm 1..124) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(monmap 1..1) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(auth 1..7) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(mgr 1..4) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(mgrstat 63..74) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(health 1..38) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(config 1..1) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.2:3300/0,v1:172.21.15.2:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816000
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.2:3300/0,v1:172.21.15.2:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816000 con 0x5653a99af600
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3300/0,v1:172.21.15.177:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816300
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3300/0,v1:172.21.15.177:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816300 con 0x5653a99afa80
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.2:3301/0,v1:172.21.15.2:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816600
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.2:3301/0,v1:172.21.15.2:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816600 con 0x5653aad48480
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816900
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816900 con 0x5653aad48900
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.2:3302/0,v1:172.21.15.2:6791/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816c00
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.2:3302/0,v1:172.21.15.2:6791/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816c00 con 0x5653aad48d80
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3302/0,v1:172.21.15.177:6791/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816f00
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3302/0,v1:172.21.15.177:6791/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816f00 con 0x5653aad49200
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.2:3303/0,v1:172.21.15.2:6792/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa817200
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.2:3303/0,v1:172.21.15.2:6792/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa817200 con 0x5653aad49680
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3303/0,v1:172.21.15.177:6792/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa93c000
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3303/0,v1:172.21.15.177:6792/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa93c000 con 0x5653aad49b00
2020-01-24T09:51:10.686+0000 7fb7394f8700 5 paxos.8).electionLogic(67) defer to 3
2020-01-24T09:51:10.686+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 -- ?+0 0x5653aa93c300
2020-01-24T09:51:10.686+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 -- 0x5653aa93c300 con 0x5653aad48900
but I can't figure out who called defer() and why...
Updated by Neha Ojha over 4 years ago
I think defer() is called by mon.e in receive_propose(), because of the following:
2020-01-24T09:51:10.673+0000 7fb7394f8700 10 mon.e@8(peon) e1 ms_handle_accept con 0x5653aad48900 no session
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(peon) e1 _ms_dispatch new session 0x5653aa88a000 MonSession(mon.3 [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] is open , features 0x3f01cfb8ffadffff (luminous)) features 0x3f01cfb8ffadffff
2020-01-24T09:51:10.674+0000 7fb7394f8700 1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] <== mon.3 v2:172.21.15.177:3301/0 3 ==== election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e33) v8 ==== 1058+0+0 (unknown 0 0 0) 0x5653aa817500 con 0x5653aad48900
2020-01-24T09:51:10.674+0000 7fb7394f8700 5 paxos.8).electionLogic(66) got propose from old epoch, 3 must have just started
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(peon) e1 start_election --> https://github.com/ceph/ceph/blob/master/src/mon/ElectionLogic.cc#L147
2020-01-24T09:51:10.686+0000 7fb7394f8700 5 paxos.8).electionLogic(67) defer to 3 --> https://github.com/ceph/ceph/blob/master/src/mon/ElectionLogic.cc#L170
Updated by Greg Farnum over 4 years ago
- Status changed from Triaged to In Progress
- Assignee set to Greg Farnum
Yep, looks like something went horribly wrong in refactoring — we correctly call a new election on receiving an old propose, but we don't return from the function call, so we later fall through and run the "should I defer" logic. Yikes!
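For illustration, a minimal self-contained sketch of that fall-through (a hypothetical simplification; the real logic lives in ElectionLogic::receive_propose() and is more involved):

#include <cstdint>
#include <iostream>

// Hypothetical simplification of the fall-through described above; names
// loosely follow src/mon/ElectionLogic.cc but this is not the real code.
struct ElectionLogic {
  int whoami;
  uint32_t epoch;

  uint32_t get_epoch() const { return epoch; }
  void bump_epoch(uint32_t e) { epoch = e; }

  void start() {
    bump_epoch(get_epoch() + 1);
    std::cout << "starting new election at e" << epoch << "\n";
  }

  void defer(int to) {
    std::cout << "defer to " << to << ", ack stamped e" << epoch << "\n";
  }

  void receive_propose(int from, uint32_t mepoch) {
    if (mepoch > get_epoch()) {
      bump_epoch(mepoch);
    } else if (mepoch < get_epoch()) {
      // "got propose from old epoch, <from> must have just started"
      start();
      // BUG: no `return` here, so we fall through and evaluate the
      // "should I defer?" logic below against the freshly bumped epoch.
    }
    if (from < whoami) {
      // Lower rank wins: mon.e (rank 8) defers to mon.3 and sends an ack
      // at e67 to a proposer that is still at e55 -- the crash input.
      defer(from);
    }
  }
};

int main() {
  ElectionLogic e{/*whoami=*/8, /*epoch=*/66};   // mon.e just before the bug
  e.receive_propose(/*from=*/3, /*mepoch=*/33);  // stale propose from mon.g
}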
Thanks for finding all the key log lines. :D
Updated by Greg Farnum over 4 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 32981
Hmm, this not-returning issue seems to date from 2008 (commit 3859475bbfafb8754841af41044cb41124e87fc7); I'm not sure why it's only being exposed now. Looking at the nautilus code, it is susceptible to this too.
Anyway, PR added: https://github.com/ceph/ceph/pull/32981
I guess we'll want to backport it?
Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Resolved
Updated by Nathan Cutler over 4 years ago
- Status changed from Resolved to Pending Backport
- Backport set to nautilus
Adding nautilus backport per Greg's comment "looking at the nautilus code it is susceptible to this too."
Updated by Nathan Cutler over 4 years ago
But if the issue was introduced in 2008, then we'd need to backport further than nautilus...
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43928: nautilus: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) added
Updated by Neha Ojha about 4 years ago
- Related to deleted (Bug #42328: osd/PrimaryLogPG.cc: 3962: ceph_abort_msg("out of order op"))
Updated by Greg Farnum about 4 years ago
- Priority changed from Urgent to Normal
- Backport changed from nautilus to nautilus, mimic, luminous
Nathan Cutler wrote:
But if the issue was introduced in 2008, then we'd need to backport further than nautilus...
Yeah it seems it’s present in all releases.
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #44087: luminous: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) added
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #44088: mimic: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) added
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".