Bug #50587
mon election storm following osd recreation: huge tcmalloc and ceph::msgr::v2::FrameAssembler backtraces
Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Messenger, Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We recreated an osd, and seconds later our mons started using 100% CPU and went into an election storm that lasted nearly an hour.
Here is what happened:
- At around 17:40, osd.321 was zapped then re-created. (osd.538 was also newly created around the same time).
- At 17:41, a peon crashed with a backtrace in the msgr and triggered an election. We see huge tcmalloc allocations, then a crash in the msgr:
Apr 29 17:41:42 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 1073741824 bytes == 0x55d179308000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
Apr 29 17:41:52 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 2147483648 bytes == 0x55d1b9308000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
...
Apr 29 17:42:50 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 8589934592 bytes == 0x55d339b08000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x7f3721e616f6
Apr 29 17:42:54 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 4294967296 bytes == 0x55d239b08000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x55d0d6c09aae
Apr 29 17:42:55 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 8589934592 bytes == 0x55d53a308000 @ 0x7f37244594ef 0x7f3724478bd6 0x55d0d6c090f9 0x55d0d6c0913b 0x55d0d6c099f0 0x55d0d6c09aae
Apr 29 17:42:58 p05517715y01595.cern.ch ceph-mon[1552187]: tcmalloc: large alloc 4294967296 bytes == 0x55d239b08000 @ 0x7f37244594ef 0x7f372447a010 0x7f3725c9f1e2 0x7f3725ca0022 0x7f3725ca011e 0x55d0d6c46d9a
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: terminate called after throwing an instance of 'std::runtime_error'
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: what(): EVP_EncryptUpdate failed
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: *** Caught signal (Aborted) **
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: in thread 7f371a722700 thread_name:msgr-worker-2
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 1: (()+0xf630) [0x7f3722723630]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 2: (gsignal()+0x37) [0x7f37215163d7]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 3: (abort()+0x148) [0x7f3721517ac8]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f3721e26a95]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 5: (()+0x5ea06) [0x7f3721e24a06]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 6: (()+0x5ea33) [0x7f3721e24a33]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 7: (()+0x5ec53) [0x7f3721e24c53]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 8: (()+0x289f5a) [0x7f3725966f5a]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 9: (ceph::msgr::v2::FrameAssembler::asm_secure_rev1(ceph::msgr::v2::preamble_block_t const&, ceph::buffer::v14_2_0::list*) const+0x204) [0x7f3725c6bb2
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 10: (ceph::msgr::v2::FrameAssembler::assemble_frame(ceph::msgr::v2::Tag, ceph::buffer::v14_2_0::list*, unsigned short const*, unsigned long)+0x183) [0
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 11: (bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x95) [0x7f3725c59555]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 12: (ProtocolV2::write_message(Message*, bool)+0x4c8) [0x7f3725c3e2b8]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 13: (ProtocolV2::write_event()+0x38d) [0x7f3725c513bd]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 14: (AsyncConnection::handle_write()+0x40) [0x7f3725c14720]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 15: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1397) [0x7f3725c62767]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 16: (()+0x58b747) [0x7f3725c68747]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 17: (()+0x82a47f) [0x7f3725f0747f]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 18: (()+0x7ea5) [0x7f372271bea5]
Apr 29 17:43:00 p05517715y01595.cern.ch ceph-mon[1552187]: 19: (clone()+0x6d) [0x7f37215de9fd]
- Subsequently, most of the mons started crashing the same way: huge tcmalloc allocations, then a crash in the msgr.
Here is the pattern of tcmalloc allocations on the previous leader; the sizes doubled each time, exactly 1 GB -> 2 GB -> 4 GB -> 8 GB:
Apr 29 17:45:12 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 1073741824 bytes == 0x55e813c40000 @ 0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:45:24 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 2147483648 bytes == 0x55e853c40000 @ 0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:45:47 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 4294967296 bytes == 0x55e8d4440000 @ 0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
Apr 29 17:46:33 cephbeesly-mon-2a00f134e5.cern.ch ceph-mon[2676444]: tcmalloc: large alloc 8589934592 bytes == (nil) @ 0x7fb247cd84ef 0x7fb247cf7bd6 0x55e76e35a0f9 0x55e76e35a13b 0x55e76e35a9f0 0x7fb2456e06f6 0x7fb2456e4e66 0x7fb2456dbbb5 0x55e76e36898d 0x55e76e397d4b 0x55e76e39be55 0x55e76e39d5a0 0x55e76e3cafa6 0x55e76e3c7986 0x7fb2493dd079 0x7fb24948adfd 0x7fb245f9aea5 0x7fb244e5d9fd
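For what it's worth, the logged sizes form a clean doubling sequence; each allocation is an exact power of two and exactly double its predecessor. A minimal check over the numbers from the log (plain Python, not Ceph code):

```python
# Allocation sizes (bytes) copied from the tcmalloc log lines above.
sizes = [1073741824, 2147483648, 4294967296, 8589934592]

# Each entry is exactly double its predecessor -- the signature of a
# buffer that keeps doubling as data is appended to it.
for prev, cur in zip(sizes, sizes[1:]):
    assert cur == 2 * prev, "not a doubling sequence"

# Sizes expressed in GiB: 1, 2, 4, 8.
print([s // 2**30 for s in sizes])  # → [1, 2, 4, 8]
```

If a single buffer was being doubled repeatedly while one outgoing frame was assembled, that would explain why the allocations grow geometrically rather than appearing as many independent large messages (an inference from the log pattern, not verified against the code). Note the final 8 GB allocation returned (nil), i.e. it failed outright.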
- We have 5 mons: three are physical boxes with 64 GB of RAM and two are VMs with 15 GB of RAM. All five logged similarly large tcmalloc allocations, but only the three physical boxes crashed with the backtrace.
- At 17:49 we stopped the previous mon leader, cephbeesly-mon-2a00f134e5.
- At 17:53 we decided to stop osd.321 and osd.538. The mons were still thrashing or probing.
- At 18:23 we restarted cephbeesly-mon-2a00f134e5; it finally became the leader again, and CPU usage etc. went back to normal.
We have collected the mon logs and posted them with ceph-post-file: e1b5db79-c6e1-4c56-a2ac-fe9caf15628f
We also have a coredump and can print anything of interest as needed to understand why these mons went bananas:
(gdb) bt
#0  0x00007f37227234fb in raise () from /lib64/libpthread.so.0
#1  0x000055d0d6e31e12 in reraise_fatal (signum=6) at /usr/src/debug/ceph-14.2.19/src/global/signal_handler.cc:326
#2  handle_fatal_signal(int) () at /usr/src/debug/ceph-14.2.19/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x00007f37215163d7 in raise () from /lib64/libc.so.6
#5  0x00007f3721517ac8 in abort () from /lib64/libc.so.6
#6  0x00007f3721e26a95 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#7  0x00007f3721e24a06 in ?? () from /lib64/libstdc++.so.6
#8  0x00007f3721e24a33 in std::terminate() () from /lib64/libstdc++.so.6
#9  0x00007f3721e24c53 in __cxa_throw () from /lib64/libstdc++.so.6
#10 0x00007f3725966f5a in ceph::crypto::onwire::AES128GCM_OnWireTxHandler::authenticated_encrypt_update (this=0x55d0e4bba510, plaintext=...) at /usr/src/debug/ceph-14.2.19/src/include/buffer.h:649
#11 0x00007f3725c6bb24 in ceph::msgr::v2::FrameAssembler::asm_secure_rev1(ceph::msgr::v2::preamble_block_t const&, ceph::buffer::v14_2_0::list*) const () at /usr/src/debug/ceph-14.2.19/src/msg/async/frames_v2.cc:235
#12 0x00007f3725c6bf13 in ceph::msgr::v2::FrameAssembler::assemble_frame (this=this@entry=0x55d10a803600, tag=tag@entry=ceph::msgr::v2::MESSAGE, segment_bls=segment_bls@entry=0x7f371a71ec60, segment_aligns=segment_aligns@entry=0x7f3725f57a78 <ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::alignments>, segment_count=segment_count@entry=4) at /usr/src/debug/ceph-14.2.19/src/msg/async/frames_v2.cc:270
#13 0x00007f3725c59555 in get_buffer (tx_frame_asm=..., this=0x7f371a71ec60) at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:577
#14 ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame> (this=this@entry=0x55d10a803500, frame=...) at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:580
#15 0x00007f3725c3e2b8 in ProtocolV2::write_message(Message*, bool) () at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:543
#16 0x00007f3725c513bd in ProtocolV2::write_event() () at /usr/src/debug/ceph-14.2.19/src/msg/async/ProtocolV2.cc:658
#17 0x00007f3725c14720 in AsyncConnection::handle_write() () at /usr/src/debug/ceph-14.2.19/src/msg/async/AsyncConnection.cc:692
#18 0x00007f3725c62767 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () at /usr/src/debug/ceph-14.2.19/src/msg/async/Event.cc:441
#19 0x00007f3725c68747 in operator() (__closure=0x55d0da257aa8, __closure=0x55d0da257aa8) at /usr/src/debug/ceph-14.2.19/src/msg/async/Stack.cc:53
#20 std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:297
#21 0x00007f3725f0747f in execute_native_thread_routine () from /usr/lib64/ceph/libceph-common.so.0
#22 0x00007f372271bea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f37215de9fd in clone () from /lib64/libc.so.6
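The throw originates in AES128GCM_OnWireTxHandler::authenticated_encrypt_update, so the "EVP_EncryptUpdate failed" error comes from OpenSSL. One observation about the numbers (a possible contributing factor, not a confirmed root cause): OpenSSL's EVP_EncryptUpdate() takes its input length as a C int, and most of the logged allocation sizes are too large to pass to it in a single call:

```python
# Compare the logged allocation sizes against the 32-bit signed limit.
# OpenSSL's EVP_EncryptUpdate(ctx, out, &outl, in, inl) declares inl as a
# C `int`, capping a single update call at INT32_MAX bytes of input.
INT32_MAX = 2**31 - 1

logged = [1073741824, 2147483648, 4294967296, 8589934592]  # bytes, from the log
for size in logged:
    print(f"{size:>11} bytes ({size // 2**30} GiB)  exceeds INT32_MAX: {size > INT32_MAX}")
```

All but the first logged size exceed that limit, which would be consistent with the frame payload outgrowing what a single encrypt-update call can accept (how Ceph chunks data into EVP_EncryptUpdate calls would need checking against frames_v2.cc to confirm).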