Bug #50587

mon election storm following osd recreation: huge tcmalloc and ceph::msgr::v2::FrameAssembler backtraces

Added by Dan van der Ster about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Messenger, Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We recreated an osd and seconds later our mons started using 100% CPU and going into an election storm which lasted nearly an hour.

Here is what happened:

  • At around 17:40, osd.321 was zapped then re-created. (osd.538 was also newly created around the same time).
  • At 17:41, a peon crashed with a backtrace in the msgr and triggered an election. We see huge tcmalloc allocations, then a crash in the msgr:
Apr 29 17:41:42 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 1073741824 bytes == 0x55d179308000 @  0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
Apr 29 17:41:52 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 2147483648 bytes == 0x55d1b9308000 @  0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
...
Apr 29 17:42:50 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 8589934592 bytes == 0x55d339b08000 @  0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
Apr 29 17:42:54 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 4294967296 bytes == 0x55d239b08000 @  0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x55d0d6c09aae
Apr 29 17:42:55 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 8589934592 bytes == 0x55d53a308000 @  0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x55d0d6c09aae
Apr 29 17:42:58 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 4294967296 bytes == 0x55d239b08000 @  0x7f37244594ef 0x7f372447a010 0x7f3725c9f1e2 0x7f3725ca0022 0x7f3725ca011e 0x55d0d6c46d9a
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: terminate called after throwing an instance of 'std::runtime_error'
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: what():  EVP_EncryptUpdate failed
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: *** Caught signal (Aborted) **
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: in thread 7f371a722700 thread_name:msgr-worker-2
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 1: (()+0xf630) [0x7f3722723630]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 2: (gsignal()+0x37) [0x7f37215163d7]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 3: (abort()+0x148) [0x7f3721517ac8]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f3721e26a95]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 5: (()+0x5ea06) [0x7f3721e24a06]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 6: (()+0x5ea33) [0x7f3721e24a33]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 7: (()+0x5ec53) [0x7f3721e24c53]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 8: (()+0x289f5a) [0x7f3725966f5a]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 9: (ceph::msgr::v2::FrameAssembler::asm_secure_rev1(ceph::msgr::v2::preamble_block_t const&, ceph::buffer::v14_2_0::list*) const+0x204) [0x7f3725c6bb2
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 10: (ceph::msgr::v2::FrameAssembler::assemble_frame(ceph::msgr::v2::Tag, ceph::buffer::v14_2_0::list*, unsigned short const*, unsigned long)+0x183) [0
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 11: (bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x95) [0x7f3725c59555]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 12: (ProtocolV2::write_message(Message*, bool)+0x4c8) [0x7f3725c3e2b8]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 13: (ProtocolV2::write_event()+0x38d) [0x7f3725c513bd]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 14: (AsyncConnection::handle_write()+0x40) [0x7f3725c14720]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 15: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1397) [0x7f3725c62767]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 16: (()+0x58b747) [0x7f3725c68747]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 17: (()+0x82a47f) [0x7f3725f0747f]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 18: (()+0x7ea5) [0x7f372271bea5]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 19: (clone()+0x6d) [0x7f37215de9fd]
  • Subsequently, most of the mons started crashing in this same way: huge tcmalloc allocations, then a crash in the msgr.

Here's the pattern of tcmalloc allocations on the previous leader; it appears to be exactly 1GB -> 2GB -> 4GB -> 8GB:

Apr 29 17:45:12 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 1073741824 bytes == 0x55e813c40000 @  0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:45:24 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 2147483648 bytes == 0x55e853c40000 @  0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:45:47 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 4294967296 bytes == 0x55e8d4440000 @  0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:46:33 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 8589934592 bytes == (nil) @  0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd

  • We have 5 mons: three are physical boxes with 64GB of RAM and two are VMs with 15GB of RAM. All five show similar large tcmallocs, but only the 3 physical boxes crashed with the backtrace.
  • At 17:49 we stopped the previous mon leader, cephbeesly-mon-2a00f134e5.
  • At 17:53 we decided to stop osd.321 and osd.538. The mons were still thrashing or probing.
  • At 18:23 we restarted cephbeesly-mon-2a00f134e5; it finally became leader and CPU usage etc. went back to normal.

We have collected mon logs and posted them via ceph-post-file: e1b5db79-c6e1-4c56-a2ac-fe9caf15628f

We also have a coredump and could print anything interesting as needed to understand why these mons went bananas:

(gdb) bt
#0  0x00007f37227234fb in raise () from /lib64/libpthread.so.0
#1  0x000055d0d6e31e12 in reraise_fatal (signum=6) at /usr/src/debug/ceph-14.2.19/src/global/signal_handler.cc:326
#2  handle_fatal_signal(int) () at /usr/src/debug/ceph-14.2.19/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x00007f37215163d7 in raise () from /lib64/libc.so.6
#5  0x00007f3721517ac8 in abort () from /lib64/libc.so.6
#6  0x00007f3721e26a95 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#7  0x00007f3721e24a06 in ?? () from /lib64/libstdc++.so.6
#8  0x00007f3721e24a33 in std::terminate() () from /lib64/libstdc++.so.6
#9  0x00007f3721e24c53 in __cxa_throw () from /lib64/libstdc++.so.6
#10 0x00007f3725966f5a in ceph::crypto::onwire::AES128GCM_OnWireTxHandler::authenticated_encrypt_update (this=0x55d0e4bba510, plaintext=...) at /usr/src/debug/ceph-14.2.19/src/include/buffer.h:649
#11 0x00007f3725c6bb24 in ceph::msgr::v2::FrameAssembler::asm_secure_rev1(ceph::msgr::v2::preamble_block_t const&, ceph::buffer::v14_2_0::list*) const ()
    at /usr/src/debug/ceph-14.2.19/src/msg/async/frames_v2.cc:235
#12 0x00007f3725c6bf13 in ceph::msgr::v2::FrameAssembler::assemble_frame (this=this@entry=0x55d10a803600, tag=tag@entry=ceph::msgr::v2::MESSAGE, segment_bls=segment_bls@entry=0x7f371a71ec60,
    segment_aligns=segment_aligns@entry=0x7f3725f57a78 <ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::alignments>,
    segment_count=segment_count@entry=4) at /usr/src/debug/ceph-14.2.19/src/msg/async/frames_v2.cc:270
#13 0x00007f3725c59555 in get_buffer (tx_frame_asm=..., this=0x7f371a71ec60) at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:577
#14 ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame> (this=this@entry=0x55d10a803500, frame=...) at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:580
#15 0x00007f3725c3e2b8 in ProtocolV2::write_message(Message*, bool) () at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:543
#16 0x00007f3725c513bd in ProtocolV2::write_event() () at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:658
#17 0x00007f3725c14720 in AsyncConnection::handle_write() () at /usr/src/debug/ceph-14.2.19/src/msg/async/AsyncConnection.cc:692
#18 0x00007f3725c62767 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () at /usr/src/debug/ceph-14.2.19/src/msg/async/Event.cc:441
#19 0x00007f3725c68747 in operator() (__closure=0x55d0da257aa8, __closure=0x55d0da257aa8) at /usr/src/debug/ceph-14.2.19/src/msg/async/Stack.cc:53
#20 std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:297
#21 0x00007f3725f0747f in execute_native_thread_routine () from /usr/lib64/ceph/libceph-common.so.0
#22 0x00007f372271bea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f37215de9fd in clone () from /lib64/libc.so.6


Related issues (1 open, 1 closed)

Related to RADOS - Bug #48732: Marking OSDs out causes mon daemons to crash following tcmalloc: large alloc (Need More Info)
Copied to mgr - Bug #50591: mgr/progress: progress can be negative (Resolved, Dan van der Ster)
Actions #1

Updated by Dan van der Ster about 3 years ago

  • Related to Bug #48732: Marking OSDs out causes mon daemons to crash following tcmalloc: large alloc added
Actions #2

Updated by Dan van der Ster about 3 years ago

Here are some notes on the timelines between various actors at the start of the incident:

mon.cephbeesly-mon-2a00f134e5 (leader):
  • 17:41:41.366: calling election
  • 17:41:46.389: Health check failed: 1/5 mons down
  • 17:41:46.397: prepare_failure osd.538
  • 17:41:56.411: calling election
  • 17:42:01.431: Health check update: 2/5 mons down
  • 17:45:12: tcmalloc: large alloc 1073741824

mon.p05517715y01595 (first mon leaving quorum)

  • 17:41:32.777: client.admin 'status'
  • 17:41:42: tcmalloc: large alloc 1073741824
  • 17:43:00.200: abort

osd.321 timeline:

  • 17:40:48.763: mkfs
  • 17:40:55.380: first start
  • 17:41:08.768: state: booting -> active
  • 17:53:21.688: shutdown by us

osd.538 timeline:

  • 17:41:01.415: mkfs
  • 17:41:08.421: first start
  • 17:41:22.554: state: booting -> active
  • 17:41:46.394: restarted osd.538 so that it gets put into crush map
  • 17:41:48.660: auth: could not find secret_id=67341
  • 17:43:24.672: done with init
  • 17:43:25.830: booting -> active
  • 17:53:18.004: shutdown by us
Actions #3

Updated by Dan van der Ster about 3 years ago

Something else weird in the mgr log: can negative progress events break the mon?

2021-04-29 17:39:42.340 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577965..3577966
2021-04-29 17:40:42.423 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577966..3577967
2021-04-29 17:40:42.589 7fbfe2b47700  1 mgr[progress] osd.321 marked out
2021-04-29 17:40:43.072 7fbfe2b47700  1 mgr[progress] 0 PGs affected by osd.321 being marked out
2021-04-29 17:40:43.521 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577967..3577968
2021-04-29 17:40:44.669 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577968..3577969
2021-04-29 17:40:45.769 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577969..3577970
2021-04-29 17:40:46.879 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577970..3577971
2021-04-29 17:40:48.010 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577971..3577972
2021-04-29 17:40:56.083 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577972..3577973
2021-04-29 17:41:12.081 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577973..3577974
2021-04-29 17:41:12.489 7fbfe2b47700  1 mgr[progress] osd.321 marked in
2021-04-29 17:41:12.974 7fbfe2b47700  1 mgr[progress] 3 PGs affected by osd.321 being marked in
2021-04-29 17:41:18.097 7fbfe2b47700  4 mgr[progress] Updated progress to 0.0 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:18.684 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577974..3577976
2021-04-29 17:41:24.592 7fbfe2b47700  4 mgr[progress] Updated progress to 0.0 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:30.448 7fbfe2b47700  4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:30.722 7fbfe2b47700  4 mgr[progress] Processing OSDMap change 3577976..3577977
2021-04-29 17:41:31.222 7fbfe2b47700  1 mgr[progress] osd.538 marked in
2021-04-29 17:41:31.721 7fbfe2b47700  1 mgr[progress] 0 PGs affected by osd.538 being marked in
2021-04-29 17:41:36.768 7fbfe2b47700  4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:42.117 7fbfe2b47700  4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:47.358 7fbfe2b47700  4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:52.587 7fbfe2b47700  4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)

This loop in Monitor.cc, for example: is that going to print millions of `=`?

    auto& pem = mgrstatmon()->get_progress_events();
    if (!pem.empty()) {
      ss << "\n \n  progress:\n";
      for (auto& i : pem) {
        ss << "    " << i.second.message << "\n";
        ss << "      [";
        unsigned j;
        for (j = 0; j < (unsigned)(i.second.progress * 30.0); ++j) {
          ss << '=';
        }
        for (; j < 30; ++j) {
          ss << '.';
        }
        ss << "]\n";
      }
    }
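
For reference, a minimal standalone sketch (not Ceph code) of what that cast does with the negative progress value from the mgr log above. Converting a negative double to unsigned is formally undefined behaviour, but on these x86-64 builds it wraps to a value near UINT_MAX, so the '=' loop would run roughly 4.29 billion times and the status string would grow by gigabytes; that is consistent with the 1GB -> 2GB -> 4GB -> 8GB tcmalloc doublings as the buffer is repeatedly resized:

    // Standalone sketch, not Ceph code: show what the loop bound becomes when
    // the progress value from the mgr log is cast to unsigned.
    #include <iostream>

    int main() {
      double progress = -0.333333333333;             // value reported by mgr[progress] above
      unsigned bound = (unsigned)(progress * 30.0);  // formally UB; wraps to roughly 4294967287 here
      std::cout << "loop bound: " << bound << "\n";  // ~4.29e9 iterations -> ~4 GiB of '='
      return 0;
    }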

Actions #4

Updated by Dan van der Ster about 3 years ago

  • Pull request ID set to 41094

commit 5f95ec4457059889bc4dbc2ad25cdc0537255f69 removed that loop in Monitor.cc but wasn't backported to nautilus.

But perhaps there are other places where a negative progress value can cause trouble. https://github.com/ceph/ceph/pull/41094

Actions #5

Updated by Josh Durgin about 3 years ago

Ah good catch Dan, that loop does appear to be generating millions of '=' in nautilus. Sounds like we need to fix that loop.
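
For illustration, here is a sketch of one way to keep that loop bounded (just a sketch, not necessarily what the eventual fix does): clamp the progress value into [0, 1] before computing the bar width.

    // Sketch only, not the actual fix: clamp progress into [0, 1] so a
    // negative or out-of-range value can never inflate the loop bound.
    #include <algorithm>
    #include <iostream>
    #include <sstream>
    #include <string>

    static std::string render_bar(float progress) {
      double p = std::max(0.0, std::min(1.0, (double)progress));
      unsigned filled = (unsigned)(p * 30.0);   // always in [0, 30]
      std::ostringstream ss;
      ss << '[';
      unsigned j;
      for (j = 0; j < filled; ++j) ss << '=';
      for (; j < 30; ++j) ss << '.';
      ss << ']';
      return ss.str();
    }

    int main() {
      std::cout << render_bar(0.5f) << "\n";               // half-filled bar
      std::cout << render_bar(-0.333333333333f) << "\n";   // stays an empty bar instead of exploding
      return 0;
    }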

Actions #6

Updated by Konstantin Shalygin about 3 years ago

  • % Done changed from 0 to 40
Actions #7

Updated by Dan van der Ster about 3 years ago

  • Copied to Bug #50591: mgr/progress: progress can be negative added
Actions #8

Updated by Dan van der Ster about 3 years ago

  • Pull request ID deleted (41094)

Leaving this open to address the msgr2 abort. Presumably this is caused by the >4GB message generated to respond to `ceph status`?

Actions #9

Updated by Dan van der Ster about 3 years ago

Josh Durgin wrote:

Ah good catch Dan, that loop does appear to be generating millions of '=' in nautilus. Sounds like we need to fix that loop.

https://github.com/ceph/ceph/pull/41098

Actions #10

Updated by Josh Durgin almost 3 years ago

  • % Done changed from 40 to 0

Dan van der Ster wrote:

Leaving this open to address the msgr2 abort. Presumably this is caused by the >4GB message generated to respond to `ceph status`?

Yes, I'd guess this is a side effect of being low on memory in this case. We're not recording what EVP_EncryptUpdate() failed with; however, the tcmalloc large alloc warnings show several multi-GB allocations.
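
As a side note, one plausible but unconfirmed failure mode for a multi-gigabyte payload (again, the actual error wasn't recorded) is OpenSSL's int-sized length parameter: EVP_EncryptUpdate() takes its input length as a plain int, so a single update over a segment larger than INT_MAX bytes cannot be expressed, and a narrowed length would go negative and make the call fail; AES128GCM_OnWireTxHandler then turns that failure into the "EVP_EncryptUpdate failed" exception seen in the backtrace. A tiny sketch of the narrowing (the 3 GiB figure is illustrative, not taken from the logs):

    // Sketch of the int narrowing; the 3 GiB figure is hypothetical.
    // OpenSSL declares: int EVP_EncryptUpdate(EVP_CIPHER_CTX*, unsigned char* out,
    //                                         int* outl, const unsigned char* in, int inl);
    #include <climits>
    #include <cstdio>

    int main() {
      unsigned long long segment = 3ULL * 1024 * 1024 * 1024;  // a hypothetical 3 GiB payload
      int inl = static_cast<int>(segment);                     // what passing it as `int inl` would produce
      std::printf("%llu bytes narrowed to int: %d (INT_MAX = %d)\n", segment, inl, INT_MAX);
      return 0;
    }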

Actions #12

Updated by Rob Haverkamp almost 3 years ago

We hit this bug yesterday in a nautilus 14.2.18 cluster.
All monitors went down and started crashing on restart.

I built a custom monitor based on the 14.2.18 release (git hash: befbc92f3c11eedd8626487211d200c0b44786d9) with the patch provided by Dan.

That solved the issue almost immediately for us.

Will this patch be released in 14.2.22?

And @Dan, thanks for the patch!

Actions #13

Updated by Dan van der Ster almost 3 years ago

Will this patch be released in 14.2.22?

Yes, the PR has been merged to the nautilus branch, so it will be in the next release.

Actions #14

Updated by Sage Weil over 2 years ago

  • Status changed from New to Resolved