Bug #34525
open
MDS daemon msgr-worker-2 thread crash
Added by Michael Yang over 5 years ago.
Updated almost 3 years ago.
Description
I found the following log:
2018-08-17 19:07:03.523167 7f7418023700 0 -- 192.168.212.28:6801/3119423490 >> 192.168.213.61:0/1349706434 conn(0x560b3c1f6800 :6801 s=STATE_OPEN pgs=23126 cs=17 l=0).process bad tag 102
2018-08-17 19:07:03.524336 7f7418023700 0 -- 192.168.212.28:6801/3119423490 >> 192.168.213.61:0/1349706434 conn(0x560b399fb800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 18 vs existing csq=17 existing_state=STATE_STANDBY
2018-08-17 19:07:03.558748 7f7418023700 -1 *** Caught signal (Segmentation fault) **
in thread 7f7418023700 thread_name:msgr-worker-2
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
1: (()+0x5bdfa4) [0x560b2befbfa4]
2: (()+0x11390) [0x7f741bb7f390]
3: (ceph::buffer::ptr::c_str()+0x23) [0x560b2befe333]
4: (AsyncConnection::_process_connection()+0x141b) [0x560b2c2c81ab]
5: (AsyncConnection::process()+0x1ae8) [0x560b2c2cdb98]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa08) [0x560b2bfe3128]
7: (()+0x6a90b8) [0x560b2bfe70b8]
8: (()+0xb8c80) [0x7f741b47bc80]
9: (()+0x76ba) [0x7f741bb756ba]
10: (clone()+0x6d) [0x7f741abe141d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
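The backtrace points at ceph::buffer::ptr::c_str() called from AsyncConnection::_process_connection(), which has the shape of a use-after-free: a buffer::ptr whose backing buffer::raw has already been released, plausibly via the accept race visible in the "connect_seq 18 vs existing csq=17" line logged just before the fault. Purely as an illustration of that failure mode (the types below are stand-ins, not the real Ceph classes), a minimal C++ analogue:

#include <cstddef>
#include <cstring>
#include <iostream>

// Stand-in for ceph::buffer::raw: owns the allocation.
struct raw {
  char *data;
  explicit raw(std::size_t len) : data(new char[len]) {}
  ~raw() { delete[] data; }
};

// Stand-in for ceph::buffer::ptr: a non-owning view into a raw buffer.
struct ptr {
  raw *r;
  std::size_t off;
  const char *c_str() const { return r->data + off; }  // faults if r is gone
};

int main() {
  raw *backing = new raw(64);
  std::strcpy(backing->data, "connect message payload");
  ptr p{backing, 0};

  // Racing teardown: e.g. the existing connection is replaced while
  // _process_connection() still holds a ptr into its buffers.
  delete backing;

  // Use-after-free: may print garbage, or segfault exactly like the
  // msgr-worker backtraces above.
  std::cout << p.c_str() << std::endl;
  return 0;
}

Compiled with g++ -std=c++14 -fsanitize=address, this reports a heap-use-after-free at the c_str() call; running a debug build of the daemon under AddressSanitizer is one way to confirm whether the crash is this class of bug.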
- Project changed from Ceph to CephFS
- Category deleted (msgr)
On its own, this probably isn't going to be enough to diagnose an issue -- the crash may be caused by something bad that another thread did.
Has this happened again since?
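If it does recur, raising the messenger debug level on the affected daemon beforehand would capture much more context around the bad tag. A possible invocation via the admin socket (mds.a below is a placeholder for the real daemon id):

ceph daemon mds.a config set debug_ms 20
ceph daemon mds.a config set debug_mds 20

Note that debug_ms 20 is very chatty, so it is best applied only while trying to reproduce.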
- Related to Bug #25027: mon: src/msg/async/AsyncConnection.cc: 1710: FAILED assert(can_write == WriteStatus::NOWRITE) added
- Description updated (diff)
- Target version set to v14.0.0
- Tags deleted (Luminous 12.2.7)
- Backport set to mimic,luminous
- ceph-qa-suite deleted (fs)
- Component(FS) MDS added
- Labels (FS) crash added
John Spray wrote:
On its own, this probably isn't going to be enough to diagnose an issue -- the crash may be caused by something bad that another thread did.
Has this happened again since?
No, it has only happened once, while the CephFS metadata pool was rebalancing after I added more OSDs.
I have uploaded the log from the crashed MDS; you can find it in the attachments.
The same crash happened on our cluster, but on an OSD this time. The ceph version is 12.2.8.
- Target version changed from v14.0.0 to v15.0.0
- Target version deleted (v15.0.0)
The same crash happened on our cluster. The ceph version is 12.2.4.
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0x59c8c1) [0x7f4a575bc8c1]
2: (()+0xf5e0) [0x7f4a551075e0]
3: (ceph::buffer::ptr::c_str()+0x23) [0x7f4a575bea83]
4: (AsyncConnection::_process_connection()+0x1779) [0x7f4a5791f5a9]
5: (AsyncConnection::process()+0x768) [0x7f4a57923538]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x7f4a57693ef9]
7: (()+0x676abe) [0x7f4a57696abe]
8: (()+0xb52b0) [0x7f4a54a7a2b0]
9: (()+0x7e25) [0x7f4a550ffe25]
10: (clone()+0x6d) [0x7f4a541e234d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
This issue happened on our cluster too. The ceph version is 12.2.13.
#0 0x00007fc0057bf49b in raise () from /lib64/libpthread.so.0
#1 0x000055ea97116fb6 in reraise_fatal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:74
#2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-12.2.12/src/global/signal_handler.cc:138
#3 <signal handler called>
#4 0x000055ea971190a3 in ceph::buffer::ptr::c_str (this=0x55eae746c4d0) at /usr/src/debug/ceph-12.2.12/src/common/buffer.cc:995
#5 0x000055ea9741a039 in AsyncConnection::_process_connection (this=this@entry=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:1322
#6 0x000055ea9741e288 in AsyncConnection::process (this=0x55eb36100800) at /usr/src/debug/ceph-12.2.12/src/msg/async/AsyncConnection.cc:844
#7 0x000055ea971f2aa9 in EventCenter::process_events (this=this@entry=0x55eaa0f84c80, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000,
working_dur=working_dur@entry=0x7fc001ed2730) at /usr/src/debug/ceph-12.2.12/src/msg/async/Event.cc:411
#8 0x000055ea971f566e in NetworkStack::__lambda4::operator() (__closure=0x55eaa0cec200) at /usr/src/debug/ceph-12.2.12/src/msg/async/Stack.cc:51
#9 0x00007fc005144070 in ?? () from /lib64/libstdc++.so.6
#10 0x00007fc0057b7dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc0048a7ead in clone () from /lib64/libc.so.6
相洋 于 wrote:
This issue happened on our cluster too. The ceph version is 12.2.13.
[...]
Version 12.2.12.
I also got a strange trace like this. I think it may be caused by a bad osd memory target setting; can anyone give me a clue? (See also the note after the trace below.)
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
1: (()+0xa64ee1) [0x556595635ee1]
2: (()+0xf5d0) [0x7fe83e0795d0]
3: (()+0x1876a) [0x7fe83fd0776a]
4: (posix_memalign()+0x40) [0x7fe83fd26010]
5: (ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int)+0x17a) [0x55659563ccba]
6: (AsyncConnection::process()+0x20b2) [0x55659593ebd2]
7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x359) [0x556595711aa9]
8: (()+0xb4366e) [0x55659571466e]
9: (()+0xb5070) [0x7fe83d9fe070]
10: (()+0x7dd5) [0x7fe83e071dd5]
11: (clone()+0x6d) [0x7fe83d161ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
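For what it's worth, this trace dies under posix_memalign() inside buffer::create_aligned_in_mempool(), which is more consistent with the heap having been corrupted earlier than with the memory target value itself. Still, the setting is cheap to check; one way to read the effective value from a running OSD (osd.0 is a placeholder id, and this assumes your luminous build includes the option):

ceph daemon osd.0 config get osd_memory_target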
I have a level-30 debug log for the same crash on our cluster. The ceph version is 12.2.13. Can anyone give me a clue?
2021-01-19 16:47:35.952917 7f82b2d67700 20 -- 172.12.1.106:6836/238868 >> 172.12.0.31:0/1961790763 conn(0x555d24333000 :6836 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).process prev state is STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
0> 2021-01-19 16:47:35.956660 7f82b2d67700 -1 *** Caught signal (Segmentation fault) **
in thread 7f82b2d67700 thread_name:msgr-worker-2
ceph version 12.2.13-1-585-g39b7a52 (39b7a52bd63aff44a139e94f90a6922216655fbd) luminous (stable)
1: (()+0xb34f61) [0x555c6529af61]
2: (()+0xf5f0) [0x7f82b59995f0]
3: (ceph::buffer::ptr::c_str()+0x23) [0x555c6529d533]
4: (AsyncConnection::_process_connection()+0x1b31) [0x555c655c7c81]
5: (AsyncConnection::process()+0x85e) [0x555c655cd11e]
6: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x361) [0x555c65382661]
7: (()+0xc1f5ae) [0x555c653855ae]
8: (()+0xb5070) [0x7f82b531e070]
9: (()+0x7e65) [0x7f82b5991e65]
10: (clone()+0x6d) [0x7f82b4a818ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
This issue happened on our cluster too (ceph version 12.2.13). Not only the MDS hits this problem; the OSD may encounter it as well.
- Project changed from CephFS to Messengers
- Backport deleted (mimic,luminous)