Bug #49696

open

All mons crash suddenly and can't restart unless cephx is disabled

Added by wencong wan about 3 years ago. Updated about 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

crash info {
"os_version_id": "7",
"utsname_release": "4.14.0jsdx_kernel",
"os_name": "CentOS Linux",
"entity_name": "mon.cc-xxx-ceph-3",
"timestamp": "2021-02-24 08:18:02.542715Z",
"process_name": "ceph-mon",
"utsname_machine": "aarch64",
"utsname_sysname": "Linux",
"os_version": "7 (AltArch)",
"os_id": "centos",
"utsname_version": "#2 SMP Thu Apr 16 13:25:43 CST 2020",
"backtrace": [
"[0xffffa473c66c]",
"(aes_v8_ctr32_encrypt_blocks()+0x2c) [0xffff9b34b38c]"
],
"utsname_hostname": "cc-xxx-ceph-3",
"crash_id": "2021-02-24_08:18:02.542715Z_8d4b4472-b900-4d65-88c8-4fdfeabf9e9c",
"archived": "2021-02-25 02:51:52.807970",
"ceph_version": "14.2.8"
}
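
The crash metadata above is JSON once the `crash info` prefix is dropped, so the symbolized frame can be extracted programmatically when comparing reports from several monitors. The following is a minimal sketch, not part of any Ceph tooling: the function name `top_symbol`, the regular expression, and the trimmed `sample` string are assumptions made for illustration.

```python
import json
import re

# Matches frames of the form "(symbol()+0xOFFSET) [0xADDRESS]" as they appear in the report.
FRAME_RE = re.compile(
    r"\((?P<symbol>[^()]+)\(\)\+(?P<offset>0x[0-9a-f]+)\)\s+\[(?P<addr>0x[0-9a-f]+)\]"
)

def top_symbol(crash_json):
    """Return (symbol, offset) of the first symbolized backtrace frame, or None."""
    meta = json.loads(crash_json)
    for frame in meta.get("backtrace", []):
        match = FRAME_RE.search(frame)
        if match:
            return match.group("symbol"), int(match.group("offset"), 16)
    return None

# Trimmed copy of the crash metadata from this report (hypothetical input for the sketch).
sample = """{
  "entity_name": "mon.cc-xxx-ceph-3",
  "ceph_version": "14.2.8",
  "backtrace": [
    "[0xffffa473c66c]",
    "(aes_v8_ctr32_encrypt_blocks()+0x2c) [0xffff9b34b38c]"
  ]
}"""

print(top_symbol(sample))
```

Frames that contain only a raw address, such as the first one here, are skipped because they carry no symbol name.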

Because the log file permissions were set incorrectly, no mon log was written at the time of the crash. We can see that all mons entered the electing state and then hit a segmentation fault.

ceph-mon: * Caught signal (Segmentation fault) *
ceph-mon: in thread ffff879a08d0 thread_name:msgr-worker-0
ceph-mon: ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
ceph-mon: 1: [0xffff9a97566c]
ceph-mon: 2: (aes_v8_ctr32_encrypt_blocks()+0x2c) [0xffff9158438c]
ceph-mon: 2021-02-24 16:18:16.280 ffff879a08d0 -1 * Caught signal (Segmentation fault) *
ceph-mon: in thread ffff879a08d0 thread_name:msgr-worker-0
ceph-mon: ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
ceph-mon: 1: [0xffff9a97566c]
ceph-mon: 2: (aes_v8_ctr32_encrypt_blocks()+0x2c) [0xffff9158438c]
ceph-mon: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this
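
To check whether every crash lands in the same kind of thread, the `in thread ... thread_name:...` lines can be tallied. This is a rough sketch under the assumption that the traces are available as plain text lines. The helper name `crashing_threads` is hypothetical, and the two sample lines are copied from traces in this report.

```python
import re
from collections import Counter

# Matches the "in thread <hex id> thread_name:<name>" line of a Ceph signal trace.
THREAD_RE = re.compile(r"in thread (?P<tid>[0-9a-f]+) thread_name:(?P<name>\S+)")

def crashing_threads(lines):
    """Count how often each thread name appears in 'in thread ...' lines."""
    counts = Counter()
    for line in lines:
        match = THREAD_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts

sample_lines = [
    "ceph-mon: in thread ffff879a08d0 thread_name:msgr-worker-0",
    "in thread ffffa75698d0 thread_name:msgr-worker-1",
    "ceph-mon: 1: [0xffff9a97566c]",
]

for name, count in sorted(crashing_threads(sample_lines).items()):
    print(name, count)
```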

After setting debug_mon to 20/20 and trying to restart ceph-mon, we found that the mon receives many auth requests while in the electing state, then starts trimming sessions because it has been out of quorum too long.

2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab103d5b00 client.498460981 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab10093180 client.498460981 10.43.128.186:0/2903429616 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab0f6c3200 client.498452769 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab0f982c40 client.498452769 10.43.129.15:0/1456949729 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab0ee22000 client.498494072 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab0fc368c0 client.498494072 10.43.128.10:0/272050050 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab0ecbc480 client.109834960 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab0ffad6c0 client.109834960 10.43.128.98:0/197448536 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab0f5a7f80 client.498453429 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab10873500 client.498453429 10.43.128.152:0/2807504925 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab0f88f680 client.498462793 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa75698d0 -1 ** Caught signal (Segmentation fault) *
in thread ffffa75698d0 thread_name:msgr-worker-1

ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
1: [0xffffb4d2366c]
2: (aes_v8_ctr32_encrypt_blocks()+0x2c) [0xffffab93238c]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab0f13fa40 client.498462793 10.43.129.42:0/1215103146 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab0f786900 client.498494168 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab0ef41180 client.498494168 10.43.129.3:0/2682246823 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab0f4ecd00 client.884656032 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab0fcac380 client.884656032 10.43.128.157:0/30950754 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab0f5e3a80 client.808768598 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab0fdad6c0 client.808768598 10.43.128.146:0/1443842962 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab11870d00 client.498459763 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab10873dc0 client.498459763 10.43.129.36:0/3469754800 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab107c7b00 client.819058790 because we've been out of quorum too long
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab0ffada40 client.819058790 10.43.128.63:0/1571617782 features 0x3ffddff8ffacffff
2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab1069c880 client.498464821 because we've been out of quorum too long
--- begin dump of recent events ---
-9999> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab116dba40 for client.956912816
-9998> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps profile rbd
-9997> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
-9996> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab101e4000 for client.818832510
-9995> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps profile rbd
-9994> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
-9993> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab13c62e00 for client.801397547
-9992> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps profile rbd
-9991> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
-9990> 2021-02-24 19:19:00.262 ffffa1d4e8d0 10 mon.xxx-ceph-2@1(electing) e2 ms_handle_authentication session 0xaaab0eda9500 con 0xaaab1025db00 addr - MonSession(unknown.0 is open , features 0x0 (unknown))
-9989> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab12107c00 for client.846351021
-9988> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps profile rbd
-9987> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
-9986> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab11ad6a80 for client.984091960
-9985> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps profile rbd
-9984> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
-9983> 2021-02-24 19:19:00.262 ffffa75698d0 10 mon.xxx-ceph-2@1(electing) e2 handle_auth_request con 0xaaab13638d80 (more) method 2 payload 36
-9982> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab14331c00 for client.819025256
-9981> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps profile rbd
-9980> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
-9979> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab0fd18540 for client.871134848
-9978> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps profile rbd
-9977> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
-9976> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab13c62c40 for client.833118133
-9975> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps profile rbd
-9974> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
-9973> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 _ms_dispatch existing session 0xaaab11ad6e00 for client.48609621
-9972> 2021-02-24 19:19:00.262 ffffa2d608d0 20 mon.xxx-ceph-2@1(electing) e2 caps allow *
-9971> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map
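
The pattern described above (auth requests arriving during election, followed by session trimming) can be quantified by counting events per monitor state in a debug_mon 20/20 log. The sketch below assumes the line layout seen in this report. The regular expression and the `tally` helper are illustrative, and the "event" is simply the first word after the epoch field.

```python
import re
from collections import Counter

# Captures the monitor state in parentheses and the first word after the epoch (e.g. "e2").
LINE_RE = re.compile(r"mon\.\S+@\d+\((?P<state>\w+)\) e\d+ (?P<event>\w+)")

def tally(lines):
    """Count (state, event) pairs across monitor debug log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("state"), match.group("event"))] += 1
    return counts

# Lines copied from the log in this report.
sample_lines = [
    "2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 trimming session 0xaaab103d5b00 client.498460981 because we've been out of quorum too long",
    "2021-02-24 19:19:09.243 ffffa55658d0 10 mon.xxx-ceph-2@1(electing) e2 remove_session 0xaaab10093180 client.498460981 10.43.128.186:0/2903429616 features 0x3ffddff8ffacffff",
    "-9983> 2021-02-24 19:19:00.262 ffffa75698d0 10 mon.xxx-ceph-2@1(electing) e2 handle_auth_request con 0xaaab13638d80 (more) method 2 payload 36",
    "-9997> 2021-02-24 19:19:00.262 ffffa2d608d0 10 mon.xxx-ceph-2@1(electing) e2 handle_mon_get_map",
]

for (state, event), count in sorted(tally(sample_lines).items()):
    print(state, event, count)
```

Running this over the full log rather than the short sample would show how many `handle_auth_request` and `trimming` events occur while the monitor is still `electing`.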

Actions #1

Updated by Neha Ojha about 3 years ago

  • Status changed from New to Need More Info

can you share a coredump from the monitor, if the issue is still reproducible?

Actions #2

Updated by wencong wan about 3 years ago

Neha Ojha wrote:

can you share a coredump from the monitor, if the issue is still reproducible?

I'm afraid not. This happened in a production environment. Because of the server settings, no coredump file was generated. Trying to reproduce the problem would cause a business interruption.
