Project

General

Profile

Bug #42977

mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch())

Added by Neha Ojha 3 months ago. Updated 7 days ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature:

Description

2019-11-22T03:44:44.746 INFO:teuthology.orchestra.run.smithi026:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph quorum_status
2019-11-22T03:44:44.781 INFO:tasks.ceph.mon.b.smithi061.stderr:/build/ceph-15.0.0-7461-ge226ea9/src/mon/Elector.cc: In function 'void Elector::handle_ack(MonOpRequestRef)' thread 7f1123f73700 time 2019-11-22T03:44:44.784417+0000
2019-11-22T03:44:44.781 INFO:tasks.ceph.mon.b.smithi061.stderr:/build/ceph-15.0.0-7461-ge226ea9/src/mon/Elector.cc: 260: FAILED ceph_assert(m->epoch == get_epoch())
2019-11-22T03:44:44.786 INFO:tasks.ceph.mon.b.smithi061.stderr: ceph version 15.0.0-7461-ge226ea9 (e226ea94f1181f8578df72a07e64fb8173aefd04) octopus (dev)
2019-11-22T03:44:44.786 INFO:tasks.ceph.mon.b.smithi061.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f113076aa4c]
2019-11-22T03:44:44.786 INFO:tasks.ceph.mon.b.smithi061.stderr: 2: (()+0x284c24) [0x7f113076ac24]
2019-11-22T03:44:44.786 INFO:tasks.ceph.mon.b.smithi061.stderr: 3: (Elector::handle_ack(boost::intrusive_ptr<MonOpRequest>)+0x673) [0x55af5d960103]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 4: (Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xc12) [0x55af5d962b52]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x12b1) [0x55af5d8e9951]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x4b2) [0x55af5d8ea002]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x58) [0x55af5d9187b8]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 8: (DispatchQueue::entry()+0x11f2) [0x7f1130961182]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f11309fa70d]
2019-11-22T03:44:44.787 INFO:tasks.ceph.mon.b.smithi061.stderr: 10: (()+0x76db) [0x7f112f5b86db]
2019-11-22T03:44:44.788 INFO:tasks.ceph.mon.b.smithi061.stderr: 11: (clone()+0x3f) [0x7f112e79e88f]
2019-11-22T03:44:44.788 INFO:tasks.ceph.mon.b.smithi061.stderr:*** Caught signal (Aborted) **

/a/nojha-2019-11-22_01:55:39-rados-fix-merge-vs-backoff-with-revert-distro-basic-smithi/4531396/


Related issues

Copied to RADOS - Backport #43928: nautilus: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) Resolved
Copied to RADOS - Backport #44087: luminous: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) New
Copied to RADOS - Backport #44088: mimic: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) New

History

#1 Updated by Sage Weil 24 days ago

  • Status changed from New to Triaged
  • Priority changed from Normal to Urgent

/a/sage-2020-01-24_01:55:08-rados-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/4697995

I bet this has a similar root cause to #42328, namely https://github.com/ceph/ceph/pull/30223

#2 Updated by Sage Weil 24 days ago

  • Related to Bug #42328: osd/PrimaryLogPG.cc: 3962: ceph_abort_msg("out of order op") added

#3 Updated by Sage Weil 21 days ago

  • Target version set to v15.0.0

#4 Updated by Neha Ojha 19 days ago

Let's see what happened in /a/sage-2020-01-24_01:55:08-rados-wip-sage4-testing-2020-01-23-1347-distro-basic-smithi/4697995

relevant logs ceph-mon.e.log.gz and ceph-mon.g.log.gz

From ceph-mon.g's log

2020-01-24T09:51:10.686+0000 7fcb20b80700  1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] send_to--> mon [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- ?+0 0x55ea14e57500
2020-01-24T09:51:10.686+0000 7fcb20b80700  1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] --> [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- 0x55ea14e57500 con 0x55ea13797680

and then 

2020-01-24T09:51:10.687+0000 7fcb20b80700  1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] <== mon.8 v2:172.21.15.2:3304/0 7 ==== election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 ==== 1870+0+0 (secure 0 0 0) 0x55ea14e56300 con 0x55ea13797680

From ceph-mon.e's log


2020-01-24T09:51:10.686+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 -- ?+0 0x5653aa93c300
2020-01-24T09:51:10.686+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 -- 0x5653aa93c300 con 0x5653aad48900

and then this

2020-01-24T09:51:10.688+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] <== mon.3 v2:172.21.15.177:3301/0 10 ==== election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 ==== 1058+0+0 (unknown 0 0 0) 0x5653aa87c000 con 0x5653aad48900

ceph-mon.g asserts because it was at epoch 55 when it received 67. Each of them seem to be sending and receiving on the same connection. Yet to find a correlation with https://tracker.ceph.com/issues/42328.

#5 Updated by Sage Weil 18 days ago

on mon.g (3), the epoch is 55 (or looks that way, it just sent these):

2020-01-24T09:51:10.685+0000 7fcb20b80700  1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] send_to--> mon [v2:172.21.15.2:3300/0,v1:172.21.15.2:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- ?+0 0x55ea14e56c00
2020-01-24T09:51:10.685+0000 7fcb20b80700  1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] --> [v2:172.21.15.2:3300/0,v1:172.21.15.2:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- 0x55ea14e56c00 con 0x55ea12bb7600
2020-01-24T09:51:10.685+0000 7fcb20b80700  1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] send_to--> mon [v2:172.21.15.177:3300/0,v1:172.21.15.177:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- ?+0 0x55ea14e56900
2020-01-24T09:51:10.685+0000 7fcb20b80700  1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] --> [v2:172.21.15.177:3300/0,v1:172.21.15.177:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e55) v8 -- 0x55ea14e56900 con 0x55ea12bb7a80
...

when it receives an ack from mon.e (8):

2020-01-24T09:51:10.687+0000 7fcb20b80700  1 -- [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] <== mon.8 v2:172.21.15.2:3304/0 7 ==== election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 ==== 1870+0+0 (secure 0 0 0) 0x55ea14e56300 con 0x55ea13797680
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 mon.g@3(electing) e1 _ms_dispatch existing session 0x55ea13a2c240 for mon.8
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 mon.g@3(electing) e1  entity  caps allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 is_capable service=mon command= read addr v2:172.21.15.2:3304/0 on cap allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20  allow so far , doing grant allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20  allow all
2020-01-24T09:51:10.687+0000 7fcb20b80700 20 is_capable service=mon command= exec addr v2:172.21.15.2:3304/0 on cap allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20  allow so far , doing grant allow *
2020-01-24T09:51:10.687+0000 7fcb20b80700 20  allow all
2020-01-24T09:51:10.687+0000 7fcb20b80700  5 mon.g@3(electing).elector(55) handle_ack from mon.8
2020-01-24T09:51:10.688+0000 7fcb20b80700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.0.0-9704-gb1192ba/rpm/el8/BUILD/ceph-15.0.0-9704-gb1192ba/src/mon/Elector.cc: In function 'void Elector::handle_ack(MonOpRequestRef)' thread 7fcb20b80700 time 2020-01-24T09:51:10.687684+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.0.0-9704-gb1192ba/rpm/el8/BUILD/ceph-15.0.0-9704-gb1192ba/src/mon/Elector.cc: 260: FAILED ceph_assert(m->epoch == get_epoch())

 ceph version 15.0.0-9704-gb1192ba (b1192ba8ca76fda4688de6ceec97751d65ca697f) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x7fcb2e20d1e4]
 2: (()+0x27b3fe) [0x7fcb2e20d3fe]
 3: (Elector::handle_ack(boost::intrusive_ptr<MonOpRequest>)+0x66a) [0x55ea1107f8ca]
 4: (Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xc23) [0x55ea11082d53]
 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xf64) [0x55ea11003424]
 6: (Monitor::_ms_dispatch(Message*)+0xa33) [0x55ea11004453]
 7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55ea1103339c]
 8: (DispatchQueue::entry()+0x126a) [0x7fcb2e426a0a]
 9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fcb2e4cae11]
 10: (()+0x82de) [0x7fcb2bf4d2de]
 11: (clone()+0x43) [0x7fcb2acf74b3]

handle_ack has this assert,
  ceph_assert(m->epoch == get_epoch());

and the caller Elector::dispatch() has
      if (em->epoch < get_epoch()) {
    dout(5) << "old epoch, dropping" << dendl;
    break;
      }

but in this case the message epoch is greater than our current epoch.

On the sender, i see

2020-01-24T09:51:10.674+0000 7fb7394f8700  5 paxos.8).electionLogic(66) start -- can i be leader?
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 paxos.8).electionLogic(66) init, last seen epoch 66
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 paxos.8).electionLogic(66) bump_epoch66 to 67
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 join_election
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 _reset
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).auth v7 _set_mon_num_rank num 0 rank 0
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 cancel_probe_timeout (none scheduled)
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 timecheck_finish
2020-01-24T09:51:10.674+0000 7fb7394f8700 15 mon.e@8(electing) e1 health_tick_stop
2020-01-24T09:51:10.674+0000 7fb7394f8700 15 mon.e@8(electing) e1 health_interval_stop
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 scrub_event_cancel
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing) e1 scrub_reset
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxos(paxos recovering c 1..341) restart -- canceling timeouts
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(mdsmap 1..1) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(osdmap 1..128) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(logm 1..124) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(monmap 1..1) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(auth 1..7) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(mgr 1..4) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(mgrstat 63..74) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(health 1..38) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(electing).paxosservice(config 1..1) restart
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.2:3300/0,v1:172.21.15.2:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816000
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.2:3300/0,v1:172.21.15.2:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816000 con 0x5653a99af600
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3300/0,v1:172.21.15.177:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816300
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3300/0,v1:172.21.15.177:6789/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816300 con 0x5653a99afa80
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.2:3301/0,v1:172.21.15.2:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816600
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.2:3301/0,v1:172.21.15.2:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816600 con 0x5653aad48480
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816900
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816900 con 0x5653aad48900
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.2:3302/0,v1:172.21.15.2:6791/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816c00
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.2:3302/0,v1:172.21.15.2:6791/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816c00 con 0x5653aad48d80
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3302/0,v1:172.21.15.177:6791/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa816f00
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3302/0,v1:172.21.15.177:6791/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa816f00 con 0x5653aad49200
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.2:3303/0,v1:172.21.15.2:6792/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa817200
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.2:3303/0,v1:172.21.15.2:6792/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa817200 con 0x5653aad49680
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3303/0,v1:172.21.15.177:6792/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- ?+0 0x5653aa93c000
2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3303/0,v1:172.21.15.177:6792/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e67) v8 -- 0x5653aa93c000 con 0x5653aad49b00
2020-01-24T09:51:10.686+0000 7fb7394f8700  5 paxos.8).electionLogic(67) defer to 3
2020-01-24T09:51:10.686+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] send_to--> mon [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 -- ?+0 0x5653aa93c300
2020-01-24T09:51:10.686+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] --> [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] -- election(1ea613b2-22ff-498c-9cd8-3aa905879791 ack rel 15 e67) v8 -- 0x5653aa93c300 con 0x5653aad48900

but I can't figure out who called defer() and why...

#6 Updated by Neha Ojha 18 days ago

I think defer() is called by mon.e in receive_propose() because of the following

2020-01-24T09:51:10.673+0000 7fb7394f8700 10 mon.e@8(peon) e1 ms_handle_accept con 0x5653aad48900 no session
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(peon) e1 _ms_dispatch new session 0x5653aa88a000 MonSession(mon.3 [v2:172.21.15.177:3301/0,v1:172.21.15.177:6790/0] is open , features 0x3f01cfb8ffadffff (luminous)) features 0x3f01cfb8ffadffff

2020-01-24T09:51:10.674+0000 7fb7394f8700  1 -- [v2:172.21.15.2:3304/0,v1:172.21.15.2:6793/0] <== mon.3 v2:172.21.15.177:3301/0 3 ==== election(1ea613b2-22ff-498c-9cd8-3aa905879791 propose rel 15 e33) v8 ==== 1058+0+0 (unknown 0 0 0) 0x5653aa817500 con 0x5653aad48900

2020-01-24T09:51:10.674+0000 7fb7394f8700  5 paxos.8).electionLogic(66)  got propose from old epoch, 3 must have just started
2020-01-24T09:51:10.674+0000 7fb7394f8700 10 mon.e@8(peon) e1 start_election --> https://github.com/ceph/ceph/blob/master/src/mon/ElectionLogic.cc#L147

2020-01-24T09:51:10.686+0000 7fb7394f8700  5 paxos.8).electionLogic(67) defer to 3 --> https://github.com/ceph/ceph/blob/master/src/mon/ElectionLogic.cc#L170

#7 Updated by Greg Farnum 18 days ago

  • Status changed from Triaged to In Progress
  • Assignee set to Greg Farnum

Yep looks like something went horribly wrong in refactoring — we correctly call the new election on receiving an old propose, but we don't return from the function call so we later fall through and run the "should I defer" logic. Yikes!

Thanks for finding all the key log lines. :D

#8 Updated by Greg Farnum 18 days ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 32981

Hmm this not-returning issue seems to date from 2008 (3859475bbfafb8754841af41044cb41124e87fc7); I'm not sure why it's only being exposed now; looking at the nautilus code it is susceptible to this too.

Anyway, PR added: https://github.com/ceph/ceph/pull/32981

I guess we'll want to backport it?

#9 Updated by Sage Weil 17 days ago

  • Status changed from Fix Under Review to Resolved

#10 Updated by Nathan Cutler 17 days ago

  • Status changed from Resolved to Pending Backport
  • Backport set to nautilus

Adding nautilus backport per Greg's comment "looking at the nautilus code it is susceptible to this too."

#11 Updated by Nathan Cutler 17 days ago

But if the issue was introduced in 2008, then we'd need to backport further than nautilus...

#12 Updated by Nathan Cutler 17 days ago

  • Copied to Backport #43928: nautilus: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) added

#13 Updated by Neha Ojha 11 days ago

  • Related to deleted (Bug #42328: osd/PrimaryLogPG.cc: 3962: ceph_abort_msg("out of order op"))

#14 Updated by Greg Farnum 7 days ago

  • Priority changed from Urgent to Normal
  • Backport changed from nautilus to nautilus, mimic, luminous

Nathan Cutler wrote:

But if the issue was introduced in 2008, then we'd need to backport further than nautilus...

Yeah it seems it’s present in all releases.

#15 Updated by Nathan Cutler 5 days ago

  • Copied to Backport #44087: luminous: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) added

#16 Updated by Nathan Cutler 5 days ago

  • Copied to Backport #44088: mimic: mon/Elector.cc: FAILED ceph_assert(m->epoch == get_epoch()) added

Also available in: Atom PDF