Bug #34525

MDS Daemon msgr-worker-2 thread crash

Added by Michael Yang over 5 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I found such log as below:

2018-08-17 19:07:03.523167 7f7418023700  0 -- 192.168.212.28:6801/3119423490 >> 192.168.213.61:0/1349706434 conn(0x560b3c1f6800 :6801 s=STATE_OPEN pgs=23126 cs=17 l=0).process bad tag 102
2018-08-17 19:07:03.524336 7f7418023700  0 -- 192.168.212.28:6801/3119423490 >> 192.168.213.61:0/1349706434 conn(0x560b399fb800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 18 vs existing csq=17 existing_state=STATE_STANDBY
2018-08-17 19:07:03.558748 7f7418023700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f7418023700 thread_name:msgr-worker-2

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x5bdfa4) [0x560b2befbfa4]
 2: (()+0x11390) [0x7f741bb7f390]
 3: (ceph::buffer::ptr::c_str()+0x23) [0x560b2befe333]
 4: (AsyncConnection::_process_connection()+0x141b) [0x560b2c2c81ab]
 5: (AsyncConnection::process()+0x1ae8) [0x560b2c2cdb98]
 6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa08) [0x560b2bfe3128]
 7: (()+0x6a90b8) [0x560b2bfe70b8]
 8: (()+0xb8c80) [0x7f741b47bc80]
 9: (()+0x76ba) [0x7f741bb756ba]
 10: (clone()+0x6d) [0x7f741abe141d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
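
The trace faults inside ceph::buffer::ptr::c_str(), called from AsyncConnection::_process_connection() while the connection was renegotiating (note the "bad tag 102" and the connect_seq 18 vs csq=17 race against a connection sitting in STATE_STANDBY just before the fault). As a hedged illustration of the failure mode such a backtrace can indicate, here is a minimal C++ sketch of a use-after-free through a non-owning buffer handle; `raw` and `ptr` below are simplified, hypothetical stand-ins, not the real ceph::buffer classes:

// Minimal sketch, assuming the fault is a use-after-free on the buffer
// backing the connection. Not Ceph code; illustration only.
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <iostream>

struct raw {                                  // stand-in for ceph::buffer::raw
    char *data;
    explicit raw(std::size_t len) : data(static_cast<char *>(std::malloc(len))) {}
    ~raw() { std::free(data); }
};

struct ptr {                                  // stand-in for ceph::buffer::ptr
    raw *r = nullptr;                         // non-owning in this sketch
    std::size_t off = 0;
    char *c_str() const { return r->data + off; }  // faults if *r was freed
};

int main() {
    raw *backing = new raw(64);
    std::strcpy(backing->data, "connect reply payload");

    ptr p;
    p.r = backing;

    delete backing;                 // e.g. connection torn down on a reconnect race
    std::cout << p.c_str() << '\n'; // use-after-free: typically SIGSEGV, as above
    return 0;
}

If this is the failure mode, the root cause would sit in the connection lifecycle rather than in c_str() itself, which would be consistent with the related AsyncConnection assert tracked in #25027.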

Files

ceph-mds.mds-jq7.log.gz (856 KB) ceph-mds.mds-jq7.log.gz Michael Yang, 09/13/2018 11:25 AM
ceph-mds.TX-97-140-48.log (799 KB) ceph-mds.TX-97-140-48.log geng jichao, 06/01/2020 06:50 AM
ceph-osd.182.log.gz (250 KB) ceph-osd.182.log.gz error log yu zhang, 01/27/2021 03:38 AM

Related issues: 1 (0 open, 1 closed)

Related to Messengers - Bug #25027: mon: src/msg/async/AsyncConnection.cc: 1710: FAILED assert(can_write == WriteStatus::NOWRITE) (Duplicate; Sage Weil; 07/20/2018)

Actions #1

Updated by John Spray over 5 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (msgr)

On its own, this probably isn't going to be enough to diagnose an issue -- the crash may be caused by something bad that another thread did.

Has this happened again since?

Actions #2

Updated by Patrick Donnelly over 5 years ago

  • Related to Bug #25027: mon: src/msg/async/AsyncConnection.cc: 1710: FAILED assert(can_write == WriteStatus::NOWRITE) added
Actions #3

Updated by Patrick Donnelly over 5 years ago

  • Description updated (diff)
  • Target version set to v14.0.0
  • Tags deleted (Luminous 12.2.7)
  • Backport set to mimic,luminous
  • ceph-qa-suite deleted (fs)
  • Component(FS) MDS added
  • Labels (FS) crash added

May be related to #25027.

Actions #4

Updated by Michael Yang over 5 years ago

John Spray wrote:

On its own, this probably isn't going to be enough to diagnose an issue -- the crash may be caused by something bad that another thread did.

Has this happened again since?

No, it only happened once, while the CephFS metadata pool was rebalancing after I added more OSDs.

Actions #5

Updated by Michael Yang over 5 years ago

I uploaded the log from the crashed MDS; please find it in the attachments.

Actions #6

Updated by Zhi Zhang over 5 years ago

The same crash happened on our cluster, but on an OSD this time. The ceph version is 12.2.8.

Actions #7

Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0
Actions #8

Updated by Patrick Donnelly about 5 years ago

  • Target version deleted (v15.0.0)
Actions #9

Updated by geng jichao almost 4 years ago

The same crash happened on our cluster. The ceph version is 12.2.4.

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0x59c8c1) [0x7f4a575bc8c1]
2: (()+0xf5e0) [0x7f4a551075e0]
3: (ceph::buffer::ptr::c_str()+0x23) [0x7f4a575bea83]
4: (AsyncConnection::_process_connection()+0x1779) [0x7f4a5791f5a9]
5: (AsyncConnection::process()+0x768) [0x7f4a57923538]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x7f4a57693ef9]
7: (()+0x676abe) [0x7f4a57696abe]
8: (()+0xb52b0) [0x7f4a54a7a2b0]
9: (()+0x7e25) [0x7f4a550ffe25]
10: (clone()+0x6d) [0x7f4a541e234d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #10

Updated by 相洋 于 over 3 years ago

This issue happened on our cluster too. The ceph version is 12.2.13.

#0 0x00007fc0057bf49b in raise () from /lib64/libpthread.so.0
#1 0x000055ea97116fb6 in reraise_fatal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 0x000055ea971190a3 in ceph::buffer::ptr::c_str (this=0x55eae746c4d0) at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:995
#5 0x000055ea9741a039 in AsyncConnection::_process_connection (this=this@entry=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1322
#6 0x000055ea9741e288 in AsyncConnection::process (this=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:844
#7 0x000055ea971f2aa9 in EventCenter::process_events (this=this@entry=0x55eaa0f84c80, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000,
working_dur=working_dur@entry=0x7fc001ed2730) at /usr/src/debug/ceph-12.2.12/src/msg/async/Event.cc:411
#8 0x000055ea971f566e in NetworkStack::__lambda4::operator() (__closure=0x55eaa0cec200) at /usr/src/debug/ceph-12.2.12/src/msg/async/Stack.cc:51
#9 0x00007fc005144070 in ?? () from /lib64/libstdc++.so.6
#10 0x00007fc0057b7dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc0048a7ead in clone () from /lib64/libc.so.6

Actions #11

Updated by 相洋 于 over 3 years ago

相洋 于 wrote:

This issue happened on our cluster too. The ceph version is 12.2.13.

#0 0x00007fc0057bf49b in raise () from /lib64/libpthread.so.0
#1 0x000055ea97116fb6 in reraise_fatal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 0x000055ea971190a3 in ceph::buffer::ptr::c_str (this=0x55eae746c4d0) at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:995
#5 0x000055ea9741a039 in AsyncConnection::_process_connection (this=this@entry=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1322
#6 0x000055ea9741e288 in AsyncConnection::process (this=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:844
#7 0x000055ea971f2aa9 in EventCenter::process_events (this=this@entry=0x55eaa0f84c80, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000,
working_dur=working_dur@entry=0x7fc001ed2730) at /usr/src/debug/ceph-12.2.12/src/msg/async/Event.cc:411
#8 0x000055ea971f566e in NetworkStack::__lambda4::operator() (__closure=0x55eaa0cec200) at /usr/src/debug/ceph-12.2.12/src/msg/async/Stack.cc:51
#9 0x00007fc005144070 in ?? () from /lib64/libstdc++.so.6
#10 0x00007fc0057b7dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc0048a7ead in clone () from /lib64/libc.so.6

Correction: the ceph version is 12.2.12, not 12.2.13.

Actions #12

Updated by 相洋 于 over 3 years ago

I also got a strange trace like the one below.

I think it may be caused by a bad osd_memory_target setting. Can anyone give me a clue?

ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
1: (()+0xa64ee1) [0x556595635ee1]
2: (()+0xf5d0) [0x7fe83e0795d0]
3: (()+0x1876a) [0x7fe83fd0776a]
4: (posix_memalign()+0x40) [0x7fe83fd26010]
5: (ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int)+0x17a) [0x55659563ccba]
6: (AsyncConnection::process()+0x20b2) [0x55659593ebd2]
7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x556595711aa9]
8: (()+0xb4366e) [0x55659571466e]
9: (()+0xb5070) [0x7fe83d9fe070]
10: (()+0x7dd5) [0x7fe83e071dd5]
11: (clone()+0x6d) [0x7fe83d161ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
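
For what it's worth: a segfault inside posix_memalign() (frame 4 here) more often indicates earlier heap corruption than a problem with the allocation itself, and osd_memory_target being set too low would normally surface as a failed allocation (ENOMEM) rather than a crash inside the allocator. A hedged sketch of that pattern, with deliberate undefined behaviour that may or may not fault on any given run:

// Illustration only: shows how corruption from one allocation can make a
// later, unrelated posix_memalign() call crash far from the real bug.
#include <stdlib.h>   // posix_memalign
#include <string.h>

int main() {
    // 1. An earlier bug overflows one allocation, corrupting the heap
    //    metadata the allocator keeps adjacent to it (the real bug).
    char *a = (char *)malloc(32);
    memset(a, 0x41, 64);                 // heap buffer overflow

    // 2. A later, innocent aligned allocation walks the corrupted
    //    metadata and can fault here, mirroring frame 4 in the trace.
    void *b = NULL;
    if (posix_memalign(&b, 4096, 4096) == 0)
        free(b);

    free(a);
    return 0;
}

Running the daemon under AddressSanitizer or valgrind on a test node would confirm or rule this out.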

Actions #13

Updated by yu zhang about 3 years ago

I captured a level-30 log for the same crash on our cluster. The ceph version is 12.2.13. Can anyone give me a clue?

2021-01-19 16:47:35.952917 7f82b2d67700 20 -- 172.12.1.106:6836/238868 >> 172.12.0.31:0/1961790763 conn(0x555d24333000 :6836 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
0> 2021-01-19 16:47:35.956660 7f82b2d67700 -1 *** Caught signal (Segmentation fault) **
in thread 7f82b2d67700 thread_name:msgr-worker-2
ceph version 12.2.13-1-585-g39b7a52 (39b7a52bd63aff44a139e94f90a6922216655fbd) luminous (stable)
1: (()+0xb34f61) [0x555c6529af61]
2: (()+0xf5f0) [0x7f82b59995f0]
3: (ceph::buffer::ptr::c_str()+0x23) [0x555c6529d533]
4: (AsyncConnection::_process_connection()+0x1b31) [0x555c655c7c81]
5: (AsyncConnection::process()+0x85e) [0x555c655cd11e]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x361) [0x555c65382661]
7: (()+0xc1f5ae) [0x555c653855ae]
8: (()+0xb5070) [0x7f82b531e070]
9: (()+0x7e65) [0x7f82b5991e65]
10: (clone()+0x6d) [0x7f82b4a818ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #14

Updated by hongsong wu almost 3 years ago

This issue happened on our cluster too (ceph version 12.2.13). Not only the MDS hits this problem; OSDs may encounter it as well.

Actions #15

Updated by Patrick Donnelly almost 3 years ago

  • Project changed from CephFS to Messengers
  • Backport deleted (mimic,luminous)