Project

General

Profile

Actions

Bug #48386

closed

Paxos::restart() and Paxos::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus.

Added by Brad Hubbard over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
octopus, nautilus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -40> 2020-11-01 10:02:17.190 7f1b60500700  1 mon.a@0(leader) e3 shutdown <========================**NOTE**
   -39> 2020-11-01 10:02:17.215 7f1b584f0700  4 set_mon_vals no callback set
   -38> 2020-11-01 10:02:17.216 7f1b584f0700  4 mgrc handle_mgr_map Got map version 522
   -37> 2020-11-01 10:02:17.216 7f1b584f0700  4 mgrc handle_mgr_map Active mgr is now [v2:XXX.XXX.XXX.XXX:6800/1,v1:XXX.XXX.XXX.XXX:6801/1]
   -36> 2020-11-01 10:02:17.219 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd36a60f0 0x559bd66d4000 log(last 21629202) v1
   -35> 2020-11-01 10:02:17.219 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd5924960 0x559bd049af00 log(last 22) v1
   -34> 2020-11-01 10:02:17.219 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd3782960 0x559bd66d41e0 log(last 23) v1
   -33> 2020-11-01 10:02:17.219 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd3782780 0x559bd66d5e00 log(last 2464) v1
   -32> 2020-11-01 10:02:17.219 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd37822d0 0x559bd66d5c20 log(last 24) v1
   -31> 2020-11-01 10:02:17.219 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd373e870 0x559bd66d45a0 log(last 4568) v1
   -30> 2020-11-01 10:02:17.219 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd373e4b0 0x559bd66d4960 log(last 25) v1
   -29> 2020-11-01 10:02:17.221 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd373e1e0 0x559bd66d43c0 log(last 4569) v1
   -28> 2020-11-01 10:02:17.221 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd372fe00 0x559bd66d4780 log(last 111639138) v1
   -27> 2020-11-01 10:02:17.221 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd42ebb30 0x559bd66d52c0 log(last 111640138) v1
   -26> 2020-11-01 10:02:17.221 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd428db30 0x559bd66d50e0 log(last 21630202) v1
   -25> 2020-11-01 10:02:17.221 7f1b584f0700  2 mon.a@0(leader) e3 send_reply 0x559bd4773590 0x559bd66d4f00 log(last 21631202) v1
   -24> 2020-11-01 10:02:17.222 7f1b584f0700  5 mon.a@0(leader).paxos(paxos active c 1332309..1332941) is_readable = 1 - now=2020-11-01 10:02:17.223960 lease_expire=2020-11-01 10:02:22.218641 has v0 lc 1332941
   -23> 2020-11-01 10:02:17.222 7f1b584f0700  5 mon.a@0(leader).paxos(paxos active c 1332309..1332941) is_readable = 1 - now=2020-11-01 10:02:17.224018 lease_expire=2020-11-01 10:02:22.218641 has v0 lc 1332941
   -22> 2020-11-01 10:02:17.222 7f1b584f0700  5 mon.a@0(leader).paxos(paxos active c 1332309..1332941) is_readable = 1 - now=2020-11-01 10:02:17.224084 lease_expire=2020-11-01 10:02:22.218641 has v0 lc 1332941
   -21> 2020-11-01 10:02:17.222 7f1b584f0700  5 mon.a@0(leader).paxos(paxos active c 1332309..1332941) is_readable = 1 - now=2020-11-01 10:02:17.224113 lease_expire=2020-11-01 10:02:22.218641 has v0 lc 1332941
   -20> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands add_bootstrap_peer_hint
   -19> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands add_bootstrap_peer_hintv
   -18> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands dump_historic_ops
   -17> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands dump_historic_ops_by_duration
   -16> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands dump_historic_slow_ops
   -15> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands mon_status
   -14> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands ops
   -13> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands quorum enter
   -12> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands quorum exit
   -11> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands quorum_status
   -10> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands sessions
    -9> 2020-11-01 10:02:17.223 7f1b60500700  5 asok(0x559bd04fa000) unregister_commands sync_force
    -8> 2020-11-01 10:02:17.224 7f1b60500700  2 mon.a@0(shutdown) e3 send_reply 0x559bd57013b0 0x559bd544c000 log(last 26) v1
    -7> 2020-11-01 10:02:17.224 7f1b60500700  2 mon.a@0(shutdown) e3 send_reply 0x559bd3658a50 0x559bd544c1e0 log(last 27) v1
    -6> 2020-11-01 10:02:17.224 7f1b60500700  2 mon.a@0(shutdown) e3 send_reply 0x559bd4318f00 0x559bd544c3c0 log(last 28) v1
    -5> 2020-11-01 10:02:17.224 7f1b60500700  2 mon.a@0(shutdown) e3 send_reply 0x559bd4318690 0x559bd544c5a0 log(last 29) v1
    -4> 2020-11-01 10:02:17.224 7f1b5b4f6700  1 Monitor::cpu_tp worker finish
    -3> 2020-11-01 10:02:17.236 7f1b5bcf7700  1 Monitor::cpu_tp worker finish
    -2> 2020-11-01 10:02:17.236 7f1b5acf5700  1 Monitor::cpu_tp worker finish
    -1> 2020-11-01 10:02:17.236 7f1b5a4f4700  1 Monitor::cpu_tp worker finish
     0> 2020-11-01 10:02:17.238 7f1b59cf3700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f1b59cf3700 thread_name:ms_dispatch

 ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)
 1: (()+0x12dd0) [0x7f1b664dfdd0]
 2: (PerfCounters::inc(int, unsigned long)+0x7) [0x7f1b699eb807]
 3: (Paxos::restart()+0x2b9) [0x559bcdec7a89]
 4: (Monitor::_reset()+0x268) [0x559bcddcacd8]
 5: (Monitor::join_election()+0xdb) [0x559bcddcae6b]
 6: (Elector::bump_epoch(unsigned int)+0x158) [0x559bcde63e68]
 7: (Elector::handle_propose(boost::intrusive_ptr<MonOpRequest>)+0x988) [0x559bcde64fc8]
 8: (Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x96e) [0x559bcde66fbe]
 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x11c6) [0x559bcdde60d6]
 10: (Monitor::_ms_dispatch(Message*)+0xa23) [0x559bcdde6e53]
   1383 void Paxos::restart()
   1384 {
   1385   dout(10) << "restart -- canceling timeouts" << dendl;
   1386   cancel_events();
   1387   new_value.clear();
   1388 
   1389   if (is_writing() || is_writing_previous()) {
   1390     dout(10) << __func__ << " flushing" << dendl;
   1391     mon->lock.Unlock();
   1392     mon->store->flush();
   1393     mon->lock.Lock();
   1394     dout(10) << __func__ << " flushed" << dendl;
   1395   }
   1396   state = STATE_RECOVERING;
   1397 
   1398   // discard pending transaction
   1399   pending_proposal.reset();
   1400 
   1401   reset_pending_committing_finishers();
   1402   finish_contexts(g_ceph_context, waiting_for_active, -EAGAIN);
   1403 
   1404   logger->inc(l_paxos_restart);
   1405 }
    972 void Monitor::shutdown()
    973 {
...
    1058   // clean up
    1059   paxos->shutdown();
   1313 void Paxos::shutdown()
   1314 {
...
   1333   if (logger)
   1334     g_ceph_context->get_perfcounters_collection()->remove(logger);
   1335   delete logger;
   1336 }

So we've already deleted 'logger' but we have raced because the epoch has changed
and we are resetting the monitor (and restarting Paxos) at the same time.


Related issues 2 (0 open2 closed)

Copied to RADOS - Backport #48495: nautilus: Paxos::restart() and Paxos::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus.ResolvedNathan CutlerActions
Copied to RADOS - Backport #48496: octopus: Paxos::restart() and Paxos::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus.ResolvedNathan CutlerActions
Actions #1

Updated by Brad Hubbard over 3 years ago

  • Subject changed from Monitor::join_election() and Monitor::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus. to Monitor::join_election() and Paxos::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus.
Actions #2

Updated by Brad Hubbard over 3 years ago

  • Subject changed from Monitor::join_election() and Paxos::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus. to Paxos::restart() and Paxos::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus.
Actions #3

Updated by Brad Hubbard over 3 years ago

  • Description updated (diff)
Actions #4

Updated by Brad Hubbard over 3 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 38314
Actions #5

Updated by Brad Hubbard over 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #6

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #48495: nautilus: Paxos::restart() and Paxos::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus. added
Actions #7

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #48496: octopus: Paxos::restart() and Paxos::shutdown() can race leading to use-after-free on 'logger' object. Seen in Nautilus. added
Actions #8

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Also available in: Atom PDF