Bug #50587
mon election storm following osd recreation: huge tcmalloc and ceph::msgr::v2::FrameAssembler backtraces
Description
We recreated an osd, and seconds later our mons started using 100% CPU and went into an election storm which lasted nearly an hour.
Here is what happened:
- At around 17:40, osd.321 was zapped then re-created. (osd.538 was also newly created around the same time).
- At 17:41, a peon crashed with a backtrace in the msgr and triggered an election. We see huge tcmalloc allocations and then a crash in the msgr:
Apr 29 17:41:42 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 1073741824 bytes == 0x55d179308000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
Apr 29 17:41:52 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 2147483648 bytes == 0x55d1b9308000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
...
Apr 29 17:42:50 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 8589934592 bytes == 0x55d339b08000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
Apr 29 17:42:54 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 4294967296 bytes == 0x55d239b08000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x55d0d6c09aae
Apr 29 17:42:55 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 8589934592 bytes == 0x55d53a308000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x55d0d6c09aae
Apr 29 17:42:58 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 4294967296 bytes == 0x55d239b08000 @ 0x7f37244594ef 0x7f372447a010 0x7f3725c9f1e2 0x7f3725ca0022 0x7f3725ca011e 0x55d0d6c46d9a
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: terminate called after throwing an instance of 'std::runtime_error'
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: what(): EVP_EncryptUpdate failed
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: *** Caught signal (Aborted) **
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: in thread 7f371a722700 thread_name:msgr-worker-2
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 1: (()+0xf630) [0x7f3722723630]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 2: (gsignal()+0x37) [0x7f37215163d7]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 3: (abort()+0x148) [0x7f3721517ac8]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f3721e26a95]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 5: (()+0x5ea06) [0x7f3721e24a06]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 6: (()+0x5ea33) [0x7f3721e24a33]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 7: (()+0x5ec53) [0x7f3721e24c53]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 8: (()+0x289f5a) [0x7f3725966f5a]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 9: (ceph::msgr::v2::FrameAssembler::asm_secure_rev1(ceph::msgr::v2::preamble_block_t const&, ceph::buffer::v14_2_0::list*) const+0x204) [0x7f3725c6bb2
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 10: (ceph::msgr::v2::FrameAssembler::assemble_frame(ceph::msgr::v2::Tag, ceph::buffer::v14_2_0::list*, unsigned short const*, unsigned long)+0x183) [0
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 11: (bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x95) [0x7f3725c59555]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 12: (ProtocolV2::write_message(Message*, bool)+0x4c8) [0x7f3725c3e2b8]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 13: (ProtocolV2::write_event()+0x38d) [0x7f3725c513bd]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 14: (AsyncConnection::handle_write()+0x40) [0x7f3725c14720]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 15: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1397) [0x7f3725c62767]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 16: (()+0x58b747) [0x7f3725c68747]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 17: (()+0x82a47f) [0x7f3725f0747f]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 18: (()+0x7ea5) [0x7f372271bea5]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 19: (clone()+0x6d) [0x7f37215de9fd]
- Subsequently most of the mons started crashing in this same way: huge tcmalloc allocations and then a crash in the msgr.
Here's the pattern of tcmalloc allocations on the previous leader; it seems to have been exactly 1 GB -> 2 GB -> 4 GB -> 8 GB (see the sketch after this list):
Apr 29 17:45:12 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 1073741824 bytes == 0x55e813c40000 @ 0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:45:24 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 2147483648 bytes == 0x55e853c40000 @ 0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:45:47 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 4294967296 bytes == 0x55e8d4440000 @ 0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:46:33 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 8589934592 bytes == (nil) @ 0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
- We have 5 mons: three are physical boxes with 64 GB of RAM and two are VMs with 15 GB of RAM. All five show similar large tcmalloc allocations, but only the three physical boxes crashed with the backtrace.
- At 17:49 we stopped the previous mon leader, cephbeesly-mon-2a00f134e5.
- At 17:53 we decided to stop osd.321 and osd.538. The mons were still thrashing or probing.
- At 18:23 we restarted cephbeesly-mon-2a00f134e5; it finally became leader and CPU usage etc. went back to normal.
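The exact doubling pattern is what you get when a buffer grows geometrically while a very large message is built up piece by piece. A minimal sketch (plain C++, not Ceph code; whether the mon's internal buffer actually grows this way is an assumption):

// Small-scale demo: appending one byte at a time to a std::string makes its
// capacity roughly double on each reallocation. Once such a buffer reaches
// the GB range, the reallocations show up as 1 GB -> 2 GB -> 4 GB -> 8 GB
// "large alloc" warnings like the ones above.
#include <cstdio>
#include <string>

int main() {
  std::string s;
  std::size_t last_capacity = 0;
  for (long i = 0; i < 100000000; ++i) {     // keep the demo small; the mon kept going
    s.push_back('=');
    if (s.capacity() != last_capacity) {     // report each reallocation step
      last_capacity = s.capacity();
      std::printf("capacity grew to %zu bytes\n", last_capacity);
    }
  }
  return 0;
}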
We have collected the mon logs and posted them via ceph-post-file: e1b5db79-c6e1-4c56-a2ac-fe9caf15628f
We also have a coredump and can print anything interesting as needed to understand why these mons went bananas:
(gdb) bt
#0 0x00007f37227234fb in raise () from /lib64/libpthread.so.0
#1 0x000055d0d6e31e12 in reraise_fatal (signum=6) at /usr/src/debug/ceph-14.2.19/src/global/signal_handler.cc:326
#2 handle_fatal_signal(int) () at /usr/src/debug/ceph-14.2.19/src/global/signal_handler.cc:326
#3 <signal handler called>
#4 0x00007f37215163d7 in raise () from /lib64/libc.so.6
#5 0x00007f3721517ac8 in abort () from /lib64/libc.so.6
#6 0x00007f3721e26a95 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#7 0x00007f3721e24a06 in ?? () from /lib64/libstdc++.so.6
#8 0x00007f3721e24a33 in std::terminate() () from /lib64/libstdc++.so.6
#9 0x00007f3721e24c53 in __cxa_throw () from /lib64/libstdc++.so.6
#10 0x00007f3725966f5a in ceph::crypto::onwire::AES128GCM_OnWireTxHandler::authenticated_encrypt_update (this=0x55d0e4bba510, plaintext=...) at /usr/src/debug/ceph-14.2.19/src/include/buffer.h:649
#11 0x00007f3725c6bb24 in ceph::msgr::v2::FrameAssembler::asm_secure_rev1(ceph::msgr::v2::preamble_block_t const&, ceph::buffer::v14_2_0::list*) const () at /usr/src/debug/ceph-14.2.19/src/msg/async/frames_v2.cc:235
#12 0x00007f3725c6bf13 in ceph::msgr::v2::FrameAssembler::assemble_frame (this=this@entry=0x55d10a803600, tag=tag@entry=ceph::msgr::v2::MESSAGE, segment_bls=segment_bls@entry=0x7f371a71ec60, segment_aligns=segment_aligns@entry=0x7f3725f57a78 <ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::alignments>, segment_count=segment_count@entry=4) at /usr/src/debug/ceph-14.2.19/src/msg/async/frames_v2.cc:270
#13 0x00007f3725c59555 in get_buffer (tx_frame_asm=..., this=0x7f371a71ec60) at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:577
#14 ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame> (this=this@entry=0x55d10a803500, frame=...) at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:580
#15 0x00007f3725c3e2b8 in ProtocolV2::write_message(Message*, bool) () at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:543
#16 0x00007f3725c513bd in ProtocolV2::write_event() () at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:658
#17 0x00007f3725c14720 in AsyncConnection::handle_write() () at /usr/src/debug/ceph-14.2.19/src/msg/async/AsyncConnection.cc:692
#18 0x00007f3725c62767 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () at /usr/src/debug/ceph-14.2.19/src/msg/async/Event.cc:441
#19 0x00007f3725c68747 in operator() (__closure=0x55d0da257aa8, __closure=0x55d0da257aa8) at /usr/src/debug/ceph-14.2.19/src/msg/async/Stack.cc:53
#20 std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:297
#21 0x00007f3725f0747f in execute_native_thread_routine () from /usr/lib64/ceph/libceph-common.so.0
#22 0x00007f372271bea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f37215de9fd in clone () from /lib64/libc.so.6
History
#1 Updated by Dan van der Ster almost 2 years ago
- Related to Bug #48732: Marking OSDs out causes mon daemons to crash following tcmalloc: large alloc added
#2 Updated by Dan van der Ster almost 2 years ago
Here are some notes on the timelines of the various actors at the start of the incident:
mon.cephbeesly-mon-2a00f134e5 (leader):
- 17:41:41.366: calling election
- 17:41:46.389: Health check failed: 1/5 mons down
- 17:41:46.397: prepare_failure osd.538
- 17:41:56.411: calling election
- 17:42:01.431: Health check update: 2/5 mons down
- 17:45:12: tcmalloc: large alloc 1073741824
mon.p05517715y01595 (first mon leaving quorum):
- 17:41:32.777: client.admin 'status'
- 17:41:42: tcmalloc: large alloc 1073741824
- 17:43:00.200: abort
osd.321 timeline:
- 17:40:48.763: mkfs
- 17:40:55.380: first start
- 17:41:08.768: state: booting -> active
- 17:53:21.688: shutdown by us
osd.538 timeline:
- 17:41:01.415: mkfs
- 17:41:08.421: first start
- 17:41:22.554: state: booting -> active
- 17:41:46.394: restarted osd.538 so that it gets put into crush map
- 17:41:48.660: auth: could not find secret_id=67341
- 17:43:24.672: done with init
- 17:43:25.830: booting -> active
- 17:53:18.004: shutdown by us
#3 Updated by Dan van der Ster almost 2 years ago
Something else weird in the mgr log: can negative progress events break the mon?
2021-04-29 17:39:42.340 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577965..3577966
2021-04-29 17:40:42.423 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577966..3577967
2021-04-29 17:40:42.589 7fbfe2b47700 1 mgr[progress] osd.321 marked out
2021-04-29 17:40:43.072 7fbfe2b47700 1 mgr[progress] 0 PGs affected by osd.321 being marked out
2021-04-29 17:40:43.521 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577967..3577968
2021-04-29 17:40:44.669 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577968..3577969
2021-04-29 17:40:45.769 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577969..3577970
2021-04-29 17:40:46.879 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577970..3577971
2021-04-29 17:40:48.010 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577971..3577972
2021-04-29 17:40:56.083 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577972..3577973
2021-04-29 17:41:12.081 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577973..3577974
2021-04-29 17:41:12.489 7fbfe2b47700 1 mgr[progress] osd.321 marked in
2021-04-29 17:41:12.974 7fbfe2b47700 1 mgr[progress] 3 PGs affected by osd.321 being marked in
2021-04-29 17:41:18.097 7fbfe2b47700 4 mgr[progress] Updated progress to 0.0 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:18.684 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577974..3577976
2021-04-29 17:41:24.592 7fbfe2b47700 4 mgr[progress] Updated progress to 0.0 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:30.448 7fbfe2b47700 4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:30.722 7fbfe2b47700 4 mgr[progress] Processing OSDMap change 3577976..3577977
2021-04-29 17:41:31.222 7fbfe2b47700 1 mgr[progress] osd.538 marked in
2021-04-29 17:41:31.721 7fbfe2b47700 1 mgr[progress] 0 PGs affected by osd.538 being marked in
2021-04-29 17:41:36.768 7fbfe2b47700 4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:42.117 7fbfe2b47700 4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:47.358 7fbfe2b47700 4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
2021-04-29 17:41:52.587 7fbfe2b47700 4 mgr[progress] Updated progress to -0.333333333333 (Rebalancing after osd.321 marked in)
This loop in Monitor.cc, for example: is it going to print millions of `=`?
auto& pem = mgrstatmon()->get_progress_events();
if (!pem.empty()) {
  ss << "\n \n progress:\n";
  for (auto& i : pem) {
    ss << " " << i.second.message << "\n";
    ss << " [";
    unsigned j;
    for (j = 0; j < (unsigned)(i.second.progress * 30.0); ++j) {
      ss << '=';
    }
    for (; j < 30; ++j) {
      ss << '.';
    }
    ss << "]\n";
  }
}
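For reference, with progress = -0.333333333333 the loop bound "(unsigned)(i.second.progress * 30.0)" wraps to a number just under 2^32, so the inner loop appends roughly 4.29 billion '=' characters. A minimal sketch of the arithmetic (the direct negative-double-to-unsigned conversion in the loop is formally undefined behaviour; the sketch models what x86-64 builds typically do via a well-defined two-step conversion):

#include <cstdio>

int main() {
  double progress = -0.333333333333;                    // value seen in the mgr log above
  // What the loop bound works out to on x86-64: the double is first truncated
  // toward zero, then wrapped modulo 2^32 by the signed -> unsigned conversion.
  long long truncated = (long long)(progress * 30.0);   // -9
  unsigned bound = (unsigned)truncated;                  // wraps to ~4.29 billion
  std::printf("loop bound = %u\n", bound);
  // So "for (j = 0; j < bound; ++j) ss << '=';" appends billions of '='
  // characters to the status output, matching the multi-GB allocations above.
  return 0;
}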
#4 Updated by Dan van der Ster almost 2 years ago
- Pull request ID set to 41094
Commit 5f95ec4457059889bc4dbc2ad25cdc0537255f69 removed that loop in Monitor.cc, but it wasn't backported to nautilus.
But perhaps there are other places where a negative progress value can cause trouble: https://github.com/ceph/ceph/pull/41094
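For illustration only, a hedged sketch of the kind of clamp that prevents a bad progress value from exploding the rendered bar (render_bar is a hypothetical helper, not the actual change in the PR):

#include <algorithm>
#include <sstream>

void render_bar(std::ostringstream& ss, float progress) {
  // Clamp to [0, 1] so a negative or overshooting value cannot blow up the output.
  float p = std::min(std::max(progress, 0.0f), 1.0f);
  unsigned filled = (unsigned)(p * 30.0);
  unsigned j;
  ss << " [";
  for (j = 0; j < filled; ++j) ss << '=';
  for (; j < 30; ++j) ss << '.';
  ss << "]\n";
}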
#5 Updated by Josh Durgin almost 2 years ago
Ah good catch Dan, that loop does appear to be generating millions of '=' in nautilus. Sounds like we need to fix that loop.
#6 Updated by Konstantin Shalygin almost 2 years ago
- % Done changed from 0 to 40
#7 Updated by Dan van der Ster almost 2 years ago
- Copied to Bug #50591: mgr/progress: progress can be negative added
#8 Updated by Dan van der Ster almost 2 years ago
- Pull request ID deleted (41094)
Leaving this open to address the msgr2 abort. Presumably this is caused by the >4GB message generated to respond to `ceph status`?
#9 Updated by Dan van der Ster almost 2 years ago
Josh Durgin wrote:
Ah good catch Dan, that loop does appear to be generating millions of '=' in nautilus. Sounds like we need to fix that loop.
#10 Updated by Josh Durgin almost 2 years ago
- % Done changed from 40 to 0
Dan van der Ster wrote:
Leaving this open to address the msgr2 abort. Presumably this is caused by the >4GB message generated to respond to `ceph status`?
Yes, I'd guess this is a side effect of being low on memory in this case. We're not recording what EVP_EncryptUpdate() failed with; however, the tcmalloc large alloc warnings show several multi-GB allocations.
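One plausible contributing mechanism (an assumption, not something confirmed from the core): OpenSSL's EVP_EncryptUpdate() takes its input length as a plain int, so a plaintext segment of 2 GiB or more cannot be passed in a single call and the update can fail even when the allocation itself succeeded. A minimal sketch of guarding such a call (encrypt_chunk is a hypothetical helper, not Ceph code):

#include <climits>
#include <cstddef>
#include <stdexcept>
#include <openssl/evp.h>

void encrypt_chunk(EVP_CIPHER_CTX* ctx, unsigned char* out, int* outl,
                   const unsigned char* in, size_t inlen) {
  if (inlen > (size_t)INT_MAX) {
    // Passing (int)inlen to EVP_EncryptUpdate would hand it a negative or
    // truncated length; fail loudly (or split into <2 GiB chunks) instead.
    throw std::length_error("plaintext too large for a single EVP_EncryptUpdate call");
  }
  if (EVP_EncryptUpdate(ctx, out, outl, in, (int)inlen) != 1) {
    throw std::runtime_error("EVP_EncryptUpdate failed");
  }
}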
#11 Updated by Yuri Weinstein almost 2 years ago
#12 Updated by Rob Haverkamp almost 2 years ago
We hit this bug yesterday in a nautilus 14.2.18 cluster.
All monitors went down and started crashing on restart.
I built a custom monitor based on the 14.2.18 release (git hash: befbc92f3c11eedd8626487211d200c0b44786d9) with the patch provided by Dan.
That solved the issue almost immediately for us.
Will this patch be released in 14.2.22?
And @Dan, thanks for the patch!
#13 Updated by Dan van der Ster almost 2 years ago
Will this patch be released in 14.2.22?
Yes, the PR has been merged to the nautilus branch, so it will be in the next release.
#14 Updated by Sage Weil over 1 year ago
- Status changed from New to Resolved