Bug #34525

MDS Daemon msgr-worker-2 thread crash

Added by Michael Yang over 2 years ago. Updated 2 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I found the following log:

2018-08-17 19:07:03.523167 7f7418023700  0 -- 192.168.212.28:6801/3119423490 >> 192.168.213.61:0/1349706434 conn(0x560b3c1f6800 :6801 s=STATE_OPEN pgs=23126 cs=17 l=0).process bad tag 102
2018-08-17 19:07:03.524336 7f7418023700  0 -- 192.168.212.28:6801/3119423490 >> 192.168.213.61:0/1349706434 conn(0x560b399fb800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 18 vs existing csq=17 existing_state=STATE_STANDBY
2018-08-17 19:07:03.558748 7f7418023700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f7418023700 thread_name:msgr-worker-2

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x5bdfa4) [0x560b2befbfa4]
 2: (()+0x11390) [0x7f741bb7f390]
 3: (ceph::buffer::ptr::c_str()+0x23) [0x560b2befe333]
 4: (AsyncConnection::_process_connection()+0x141b) [0x560b2c2c81ab]
 5: (AsyncConnection::process()+0x1ae8) [0x560b2c2cdb98]
 6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa08) [0x560b2bfe3128]
 7: (()+0x6a90b8) [0x560b2bfe70b8]
 8: (()+0xb8c80) [0x7f741b47bc80]
 9: (()+0x76ba) [0x7f741bb756ba]
 10: (clone()+0x6d) [0x7f741abe141d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ceph-mds.mds-jq7.log.gz (856 KB) Michael Yang, 09/13/2018 11:25 AM

ceph-mds.TX-97-140-48.log (799 KB) geng jichao, 06/01/2020 06:50 AM

ceph-osd.182.log.gz - error log (250 KB) yu zhang, 01/27/2021 03:38 AM


Related issues

Related to Messengers - Bug #25027: mon: src/msg/async/AsyncConnection.cc: 1710: FAILED assert(can_write == WriteStatus::NOWRITE) Duplicate 07/20/2018

History

#1 Updated by John Spray over 2 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (msgr)

On its own, this probably isn't going to be enough to diagnose an issue -- the crash may be caused by something bad that another thread did.

Has this happened again since?

#2 Updated by Patrick Donnelly over 2 years ago

  • Related to Bug #25027: mon: src/msg/async/AsyncConnection.cc: 1710: FAILED assert(can_write == WriteStatus::NOWRITE) added

#3 Updated by Patrick Donnelly over 2 years ago

  • Description updated (diff)
  • Target version set to v14.0.0
  • Tags deleted (Luminous 12.2.7)
  • Backport set to mimic,luminous
  • ceph-qa-suite deleted (fs)
  • Component(FS) MDS added
  • Labels (FS) crash added

May be related to #25027.

#4 Updated by Michael Yang over 2 years ago

John Spray wrote:

On its own, this probably isn't going to be enough to diagnose an issue -- the crash may be caused by something bad that another thread did.

Has this happened again since?

No, it has only happened once, while the CephFS metadata pool was rebalancing after I added more OSDs.

#5 Updated by Michael Yang over 2 years ago

I uploaded the log for the crashed MDS; see the attachment.

#6 Updated by Zhi Zhang over 2 years ago

The same crash happened on our cluster, but this time on an OSD. The ceph version is 12.2.8.

#7 Updated by Patrick Donnelly about 2 years ago

  • Target version changed from v14.0.0 to v15.0.0

#8 Updated by Patrick Donnelly about 2 years ago

  • Target version deleted (v15.0.0)

#9 Updated by geng jichao 12 months ago

The same crash happened on our cluster. The ceph version is 12.2.4.

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0x59c8c1) [0x7f4a575bc8c1]
2: (()+0xf5e0) [0x7f4a551075e0]
3: (ceph::buffer::ptr::c_str()+0x23) [0x7f4a575bea83]
4: (AsyncConnection::_process_connection()+0x1779) [0x7f4a5791f5a9]
5: (AsyncConnection::process()+0x768) [0x7f4a57923538]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x7f4a57693ef9]
7: (()+0x676abe) [0x7f4a57696abe]
8: (()+0xb52b0) [0x7f4a54a7a2b0]
9: (()+0x7e25) [0x7f4a550ffe25]
10: (clone()+0x6d) [0x7f4a541e234d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#10 Updated by 相洋 于 6 months ago

This issue happened on our cluster too. The ceph version is 12.2.13.

#0 0x00007fc0057bf49b in raise () from /lib64/libpthread.so.0
#1 0x000055ea97116fb6 in reraise_fatal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 0x000055ea971190a3 in ceph::buffer::ptr::c_str (this=0x55eae746c4d0) at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:995
#5 0x000055ea9741a039 in AsyncConnection::_process_connection (this=this@entry=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1322
#6 0x000055ea9741e288 in AsyncConnection::process (this=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:844
#7 0x000055ea971f2aa9 in EventCenter::process_events (this=this@entry=0x55eaa0f84c80, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000,
working_dur=working_dur@entry=0x7fc001ed2730) at /usr/src/debug/ceph-12.2.12/src/msg/async/Event.cc:411
#8 0x000055ea971f566e in NetworkStack::__lambda4::operator() (__closure=0x55eaa0cec200) at /usr/src/debug/ceph-12.2.12/src/msg/async/Stack.cc:51
#9 0x00007fc005144070 in ?? () from /lib64/libstdc++.so.6
#10 0x00007fc0057b7dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc0048a7ead in clone () from /lib64/libc.so.6

#11 Updated by 相洋 于 6 months ago

相洋 于 wrote:

This issue happened on our cluster too. The ceph version is 12.2.13.

Correction: the version is 12.2.12.

#12 Updated by 相洋 于 6 months ago

I also got a strange trace like the one below.

I think it may be caused by a bad osd_memory_target setting. Can anyone give me a clue?

ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
1: (()+0xa64ee1) [0x556595635ee1]
2: (()+0xf5d0) [0x7fe83e0795d0]
3: (()+0x1876a) [0x7fe83fd0776a]
4: (posix_memalign()+0x40) [0x7fe83fd26010]
5: (ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int)+0x17a) [0x55659563ccba]
6: (AsyncConnection::process()+0x20b2) [0x55659593ebd2]
7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x556595711aa9]
8: (()+0xb4366e) [0x55659571466e]
9: (()+0xb5070) [0x7fe83d9fe070]
10: (()+0x7dd5) [0x7fe83e071dd5]
11: (clone()+0x6d) [0x7fe83d161ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#13 Updated by yu zhang 4 months ago

I captured a debug-level-30 log of the same crash on our cluster. The ceph version is 12.2.13. Can anyone give me a clue?

2021-01-19 16:47:35.952917 7f82b2d67700 20 -- 172.12.1.106:6836/238868 >> 172.12.0.31:0/1961790763 conn(0x555d24333000 :6836 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
0> 2021-01-19 16:47:35.956660 7f82b2d67700 -1 *** Caught signal (Segmentation fault) **
in thread 7f82b2d67700 thread_name:msgr-worker-2
ceph version 12.2.13-1-585-g39b7a52 (39b7a52bd63aff44a139e94f90a6922216655fbd) luminous (stable)
1: (()+0xb34f61) [0x555c6529af61]
2: (()+0xf5f0) [0x7f82b59995f0]
3: (ceph::buffer::ptr::c_str()+0x23) [0x555c6529d533]
4: (AsyncConnection::_process_connection()+0x1b31) [0x555c655c7c81]
5: (AsyncConnection::process()+0x85e) [0x555c655cd11e]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x361) [0x555c65382661]
7: (()+0xc1f5ae) [0x555c653855ae]
8: (()+0xb5070) [0x7f82b531e070]
9: (()+0x7e65) [0x7f82b5991e65]
10: (clone()+0x6d) [0x7f82b4a818ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#14 Updated by hongsong wu 5 days ago

This issue happened on our cluster too (ceph version 12.2.13). Not only the MDS hits this problem; an OSD may encounter it as well.

#15 Updated by Patrick Donnelly 2 days ago

  • Project changed from CephFS to Messengers
  • Backport deleted (mimic,luminous)
