Bug #44185

Monitors crash in cascade as they become the leader (possibly a repeat of bug #41025)

Added by Robert Burrowes about 1 month ago. Updated about 1 month ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

While upgrading our 5 monitors from luminous to nautilus, the first monitor we upgraded crashed. We reverted this one to luminous and tried another, which was fine. We upgraded the rest, and they all worked.

Then we upgraded the first one again, and after it became the leader, it died. The second one then became the leader and died, followed by the third, leaving mons 4 and 5 unable to form a quorum.

We tried creating a single-monitor cluster by editing the monmap of mon05, and it died in the same way, just without the Paxos negotiation first.

We are trying to revert to luminous monitors.

One oddity in our deployment is that there was a test MDS instance, and it was running mimic. I shut it down, since the monitor trace has an MDS call in it, but the nautilus monitors still die in the same way.

    "mds": {
        "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
    },

All the crashes produced the same trace.

...
   -11> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804429 lease_expire=0.000000 has v0 lc 85449502
   -10> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804446 lease_expire=0.000000 has v0 lc 85449502
    -9> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804460 lease_expire=0.000000 has v0 lc 85449502
    -8> 2020-02-18 09:50:00.800 7fd164a1a700  4 set_mon_vals no callback set
    -7> 2020-02-18 09:50:00.800 7fd164a1a700  4 mgrc handle_mgr_map Got map version 2301191
    -6> 2020-02-18 09:50:00.804 7fd164a1a700  4 mgrc handle_mgr_map Active mgr is now v1:10.31.88.17:6801/2924412
    -5> 2020-02-18 09:50:00.804 7fd164a1a700  0 log_channel(cluster) log [DBG] : monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
    -4> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client _send_to_mon log to self
    -3> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client  log_queue is 3 last_log 3 sent 2 num 3 unsent 1 sending 1
    -2> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client  will send 2020-02-18 09:50:00.806845 mon.ntr-mon02 (mon.1) 3 : cluster [DBG] monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
    -1> 2020-02-18 09:50:00.804 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos active c 85448935..85449502) is_readable = 1 - now=2020-02-18 09:50:00.806920 lease_expire=2020-02-18 09:50:05.804479 has v0 lc 85449502
     0> 2020-02-18 09:50:00.812 7fd164a1a700 -1 *** Caught signal (Aborted) **
 in thread 7fd164a1a700 thread_name:ms_dispatch

 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
 1: (()+0x11390) [0x7fd171e98390]
 2: (gsignal()+0x38) [0x7fd1715e5428]
 3: (abort()+0x16a) [0x7fd1715e702a]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fd173673bf5]
 5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fd173667bd6]
 6: (()+0x8b6c21) [0x7fd173667c21]
 7: (()+0x8c2e34) [0x7fd173673e34]
 8: (std::__throw_out_of_range(char const*)+0x3f) [0x7fd17367f55f]
 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x79ae00]
 10: (MDSMonitor::tick()+0xc9) [0x79c669]
 11: (MDSMonitor::on_active()+0x28) [0x785e88]
 12: (PaxosService::_active()+0xdd) [0x6d4b2d]
 13: (Context::complete(int)+0x9) [0x600789]
 14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6299a8]
 15: (Paxos::finish_round()+0x76) [0x6cb276]
 16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x6cc47f]
 17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x6ccf2b]
 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x5fa6f5]
 19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x5fad42]
 20: (Monitor::ms_dispatch(Message*)+0x26) [0x62b046]
 21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6270b6]
 22: (DispatchQueue::entry()+0x1219) [0x7fd1732b7e59]
 23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fd17336836d]
 24: (()+0x76ba) [0x7fd171e8e6ba]
 25: (clone()+0x6d) [0x7fd1716b741d]
...
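The `std::__throw_out_of_range` frame directly beneath `MDSMonitor::maybe_resize_cluster` is the signature of a bounds-checked lookup (e.g. `std::map::at` or `std::vector::at`) on a key or index that is not present, with the resulting exception escaping uncaught and aborting the daemon. A minimal sketch of that failure mode, using hypothetical names rather than the actual Ceph/FSMap code:

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical simplification: per-rank MDS info keyed by rank number.
// If a rank the caller expects is absent from the map (e.g. state left
// behind by a daemon on a different release), an unchecked .at() lookup
// throws std::out_of_range; uncaught, that terminates the process, as
// in the trace above.
std::string lookup_rank(const std::map<int, std::string>& mds_info, int rank) {
    return mds_info.at(rank);  // throws std::out_of_range if rank is missing
}

// A defensive variant: check for the entry first and degrade gracefully
// instead of aborting.
std::string lookup_rank_safe(const std::map<int, std::string>& mds_info, int rank) {
    auto it = mds_info.find(rank);
    return it != mds_info.end() ? it->second : "unknown";
}
```

This does not claim to be where the bug actually lives (as noted later in the ticket, the log alone does not pinpoint the assertion), only an illustration of how an out-of-range access in a tick-path handler can take the whole monitor down rather than isolating the bad MDS entry.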

mon02-crash-dump.txt View (118 KB) Robert Burrowes, 02/18/2020 07:44 PM

History

#1 Updated by Robert Burrowes about 1 month ago

cat of crash log attached, from mon02.

Versions, just before the last crash (having upgraded 4 monitors). Another oddity is that we had accidentally restarted two managers by this point (apt update being helpful).

root@ntr-mon01:~# ceph versions
{
    "mon": {
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
        "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 4
    },
    "mgr": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 1,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
        "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 2
    },
    "osd": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 32,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 16
    },
    "mds": {
        "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
    },
    "rgw": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
    },
    "overall": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 35,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 18,
        "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1,
        "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 6
    }
}

#2 Updated by Robert Burrowes about 1 month ago

Croit helped us rebuild luminous mons from the OSDs, and they think our crash was due to the test MDS being at mimic while the rest of the system was on luminous. They excised the test MDS from our system. They think another upgrade will now work.

Still, crashing and leaving the cluster with no mons was not a graceful way of handling the error. It would have been nicer to isolate the MDS and leave the rest of the service running.

#3 Updated by Patrick Donnelly about 1 month ago

  • Status changed from New to Closed

Robert Burrowes wrote:

Croit helped us rebuild luminous mons from the OSDs, and they think our crash was due to the test MDS being at mimic while the rest of the system was on luminous. They excised the test MDS from our system. They think another upgrade will now work.

Still, crashing and leaving the cluster with no mons was not a graceful way of handling the error. It would have been nicer to isolate the MDS and leave the rest of the service running.

Ceph generally does not tolerate mixed versions of MDSs. This is slated to improve with rolling upgrade support.

With that said, the debug log does not show where the assertion occurred. It's difficult to say what happened without more information.

I'm closing this as there's nothing actionable for this ticket that isn't already planned in other work.
