Bug #34525
Open — MDS daemon msgr-worker-2 thread crash
Description
I found the following log:
2018-08-17 19:07:03.523167 7f7418023700 0 -- 192.168.212.28:6801/3119423490 >> 192.168.213.61:0/1349706434 conn(0x560b3c1f6800 :6801 s=STATE_OPEN pgs=23126 cs=17 l=0).process bad tag 102
2018-08-17 19:07:03.524336 7f7418023700 0 -- 192.168.212.28:6801/3119423490 >> 192.168.213.61:0/1349706434 conn(0x560b399fb800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 18 vs existing csq=17 existing_state=STATE_STANDBY
2018-08-17 19:07:03.558748 7f7418023700 -1 *** Caught signal (Segmentation fault) **
in thread 7f7418023700 thread_name:msgr-worker-2
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
1: (()+0x5bdfa4) [0x560b2befbfa4]
2: (()+0x11390) [0x7f741bb7f390]
3: (ceph::buffer::ptr::c_str()+0x23) [0x560b2befe333]
4: (AsyncConnection::_process_connection()+0x141b) [0x560b2c2c81ab]
5: (AsyncConnection::process()+0x1ae8) [0x560b2c2cdb98]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa08) [0x560b2bfe3128]
7: (()+0x6a90b8) [0x560b2bfe70b8]
8: (()+0xb8c80) [0x7f741b47bc80]
9: (()+0x76ba) [0x7f741bb756ba]
10: (clone()+0x6d) [0x7f741abe141d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Files
Updated by John Spray over 5 years ago
- Project changed from Ceph to CephFS
- Category deleted (msgr)
On its own, this probably isn't going to be enough to diagnose an issue -- the crash may be caused by something bad that another thread did.
Has this happened again since?
Updated by Patrick Donnelly over 5 years ago
- Related to Bug #25027: mon: src/msg/async/AsyncConnection.cc: 1710: FAILED assert(can_write == WriteStatus::NOWRITE) added
Updated by Patrick Donnelly over 5 years ago
- Description updated (diff)
- Target version set to v14.0.0
- Tags deleted (Luminous 12.2.7)
- Backport set to mimic,luminous
- ceph-qa-suite deleted (fs)
- Component(FS) MDS added
- Labels (FS) crash added
May be related to #25027.
Updated by Michael Yang over 5 years ago
John Spray wrote:
On its own, this probably isn't going to be enough to diagnose an issue -- the crash may be caused by something bad that another thread did.
Has this happened again since?
No, it only happened once, while the CephFS metadata pool was rebalancing after I added more OSDs.
Updated by Michael Yang over 5 years ago
- File ceph-mds.mds-jq7.log.gz ceph-mds.mds-jq7.log.gz added
I uploaded the log for the crashed MDS; see the attachment.
Updated by Zhi Zhang over 5 years ago
The same crash happened on our cluster, but it was on OSD this time. The ceph version is 12.2.8.
Updated by Patrick Donnelly about 5 years ago
- Target version changed from v14.0.0 to v15.0.0
Updated by geng jichao almost 4 years ago
The same crash happened on our cluster. The ceph version is 12.2.4:
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0x59c8c1) [0x7f4a575bc8c1]
2: (()+0xf5e0) [0x7f4a551075e0]
3: (ceph::buffer::ptr::c_str()+0x23) [0x7f4a575bea83]
4: (AsyncConnection::_process_connection()+0x1779) [0x7f4a5791f5a9]
5: (AsyncConnection::process()+0x768) [0x7f4a57923538]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x7f4a57693ef9]
7: (()+0x676abe) [0x7f4a57696abe]
8: (()+0xb52b0) [0x7f4a54a7a2b0]
9: (()+0x7e25) [0x7f4a550ffe25]
10: (clone()+0x6d) [0x7f4a541e234d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by 相洋 于 over 3 years ago
This issue happened on our cluster too. The ceph version is 12.2.13.
#0 0x00007fc0057bf49b in raise () from /lib64/libpthread.so.0
#1 0x000055ea97116fb6 in reraise_fatal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 0x000055ea971190a3 in ceph::buffer::ptr::c_str (this=0x55eae746c4d0) at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:995
#5 0x000055ea9741a039 in AsyncConnection::_process_connection (this=this@entry=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1322
#6 0x000055ea9741e288 in AsyncConnection::process (this=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:844
#7 0x000055ea971f2aa9 in EventCenter::process_events (this=this@entry=0x55eaa0f84c80, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000,
working_dur=working_dur@entry=0x7fc001ed2730) at /usr/src/debug/ceph-12.2.12/src/msg/async/Event.cc:411
#8 0x000055ea971f566e in NetworkStack::__lambda4::operator() (__closure=0x55eaa0cec200) at /usr/src/debug/ceph-12.2.12/src/msg/async/Stack.cc:51
#9 0x00007fc005144070 in ?? () from /lib64/libstdc++.so.6
#10 0x00007fc0057b7dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc0048a7ead in clone () from /lib64/libc.so.6
Updated by 相洋 于 over 3 years ago
相洋 于 wrote:
This issue happened on our cluster too. The ceph version is 12.2.13.
Correction: the version is 12.2.12.
Updated by 相洋 于 over 3 years ago
I also got a strange trace like this.
I think it may be caused by a bad osd_memory_target setting. Can anyone give me a clue?
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
1: (()+0xa64ee1) [0x556595635ee1]
2: (()+0xf5d0) [0x7fe83e0795d0]
3: (()+0x1876a) [0x7fe83fd0776a]
4: (posix_memalign()+0x40) [0x7fe83fd26010]
5: (ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int)+0x17a) [0x55659563ccba]
6: (AsyncConnection::process()+0x20b2) [0x55659593ebd2]
7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x556595711aa9]
8: (()+0xb4366e) [0x55659571466e]
9: (()+0xb5070) [0x7fe83d9fe070]
10: (()+0x7dd5) [0x7fe83e071dd5]
11: (clone()+0x6d) [0x7fe83d161ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by yu zhang about 3 years ago
- File ceph-osd.182.log.gz ceph-osd.182.log.gz added
I captured a debug-level-30 log of the same crash on our cluster. The ceph version is 12.2.13. Can anyone give me a clue?
2021-01-19 16:47:35.952917 7f82b2d67700 20 -- 172.12.1.106:6836/238868 >> 172.12.0.31:0/1961790763 conn(0x555d24333000 :6836 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
0> 2021-01-19 16:47:35.956660 7f82b2d67700 -1 *** Caught signal (Segmentation fault) **
in thread 7f82b2d67700 thread_name:msgr-worker-2
ceph version 12.2.13-1-585-g39b7a52 (39b7a52bd63aff44a139e94f90a6922216655fbd) luminous (stable)
1: (()+0xb34f61) [0x555c6529af61]
2: (()+0xf5f0) [0x7f82b59995f0]
3: (ceph::buffer::ptr::c_str()+0x23) [0x555c6529d533]
4: (AsyncConnection::_process_connection()+0x1b31) [0x555c655c7c81]
5: (AsyncConnection::process()+0x85e) [0x555c655cd11e]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x361) [0x555c65382661]
7: (()+0xc1f5ae) [0x555c653855ae]
8: (()+0xb5070) [0x7f82b531e070]
9: (()+0x7e65) [0x7f82b5991e65]
10: (clone()+0x6d) [0x7f82b4a818ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by hongsong wu almost 3 years ago
This issue happened on our cluster too (ceph version 12.2.13). Not only the MDS has this problem; the OSD may encounter it as well.
Updated by Patrick Donnelly almost 3 years ago
- Project changed from CephFS to Messengers
- Backport deleted (mimic,luminous)