Bug #21304

closed

mds v12.2.0 crashing

Added by Andrej Filipcic over 6 years ago. Updated over 6 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The luminous MDS crashes a few times a day. Heavy activity (e.g. untarring a kernel tarball) crashes it within a few minutes. A single active MDS is configured.

2> 2017-09-07 13:32:45.428075 7faa0abc8700  5 - 194.249.156.133:6800/2972022716 >> 194.249.156.133:6896/2115332 conn(0x55e8fca25000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=30536 cs=1 l=1).
rx osd.157 seq 38 0x55e8fc149180 osd_op_reply(19717 10016b8e60f.00000004 [delete] v327054'788583 uv788583 ondisk = 0) v8
1> 2017-09-07 13:32:45.428086 7faa0abc8700 1 - 194.249.156.133:6800/2972022716 <== osd.157 194.249.156.133:6896/2115332 38 ==== osd_op_reply(19717 10016b8e60f.00000004 [delete] v327054'788583 uv788583
ondisk = 0) v8 ==== 164+0+0 (1980716125 0 0) 0x55e8fc149180 con 0x55e8fca25000
0> 2017-09-07 13:32:45.428062 7faa0b3c9700 -1 ** Caught signal (Segmentation fault) *
in thread 7faa0b3c9700 thread_name:msgr-worker-0
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
1: (()+0x552d74) [0x55e8eaf27d74]
2: (()+0x13be0) [0x7faa0cf0ebe0]
3: (()+0x1904d) [0x7faa0e41404d]
4: (tc_posix_memalign()+0x71) [0x7faa0e4327d1]
5: (ceph::buffer::list::append(char const*, unsigned int)+0x10b) [0x55e8eaf3184b]
6: (CryptoAESKeyHandler::encrypt(ceph::buffer::list const&, ceph::buffer::list&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const+0x1b5) [0x55e8eb2292e5]
7: (CephXTicketHandler::build_authorizer(unsigned long) const+0x3d0) [0x55e8eb21a8c0]
8: (CephXTicketManager::build_authorizer(unsigned int) const+0x71) [0x55e8eb21ad21]
9: (CephxClientHandler::build_authorizer(unsigned int) const+0x213) [0x55e8eb2130a3]
10: (MonClient::build_authorizer(int) const+0x50) [0x55e8eaf799b0]
11: (Objecter::ms_get_authorizer(int, AuthAuthorizer**, bool)+0x21) [0x55e8eb1c50c1]
12: (AsyncConnection::_process_connection()+0x81f) [0x55e8eb2f161f]
13: (AsyncConnection::process()+0x7f8) [0x55e8eb2f67e8]
14: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa08) [0x55e8eb012d38]
15: (()+0x641e38) [0x55e8eb016e38]
16: (()+0xded0e) [0x7faa0cbd9d0e]
17: (()+0x76a7) [0x7faa0cf026a7]
18: (clone()+0x3f) [0x7faa0c30c57f]

017-09-07 12:13:18.577939 caller_uid=0, caller_gid=0{}) v2
1> 2017-09-07 12:13:18.583452 7fc3155c4700 1 - 194.249.156.133:6800/1143634491 --> 194.249.156.61:0/902552033 -- client_reply(???:65892082 = 2 (2) No such file or directory) v1 - 0x560f482cee00 con 0
0> 2017-09-07 12:13:18.586719 7fc3175c8700 -1 ** Caught signal (Segmentation fault) *
in thread 7fc3175c8700 thread_name:msgr-worker-1

ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
1: (()+0x552d74) [0x560efe3dcd74]
2: (()+0x13be0) [0x7fc31990ebe0]
3: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int)+0xf3) [0x7fc31ae22e23]
4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned int)+0x35) [0x7fc31ae23215]
5: (ceph::buffer::ptr::release()+0x10d) [0x560efe3dee4d]
6: (std::__cxx11::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >::_M_clear()+0x1c) [0x560efe0c542c]
7: (Objecter::Op::~Op()+0xda) [0x560efe209bea]
8: (RefCountedObject::put() const+0x2a2) [0x560efe0c4a52]
9: (Objecter::_finish_op(Objecter::Op*, int)+0x10a) [0x560efe68681a]
10: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xdc1) [0x560efe6970f1]
11: (Objecter::ms_dispatch(Message*)+0x29b) [0x560efe6a3f9b]
12: (Objecter::ms_fast_dispatch(Message*)+0xa) [0x560efe6a79ca]
13: (DispatchQueue::fast_dispatch(Message*)+0x7e) [0x560efe76a2be]
14: (AsyncConnection::process()+0x3053) [0x560efe7ae043]
15: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa08) [0x560efe4c7d38]
16: (()+0x641e38) [0x560efe4cbe38]
17: (()+0xded0e) [0x7fc3195d9d0e]
18: (()+0x76a7) [0x7fc3199026a7]
19: (clone()+0x3f) [0x7fc318d0c57f]
1> 2017-09-08 05:32:45.971278 7fc1f13e9700  1 - 194.249.156.134:6800/315870378 <== osd.59 194.249.156.145:6820/1704092 229 ==== osd_op_reply(1382880 1001492e09c.00000001 [delete] v328702'789603 uv789603
ondisk = 0) v8 ==== 164+0+0 (2094939313 0 0) 0x55fe347e0000 con 0x55fe35e59000
0> 2017-09-08 05:32:45.973281 7fc1f0be8700 -1 ** Caught signal (Segmentation fault) *
in thread 7fc1f0be8700 thread_name:msgr-worker-1
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
1: (()+0x552d74) [0x55fe23a40d74]
2: (()+0x13be0) [0x7fc1f2f2ebe0]
3: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int)+0xf3) [0x7fc1f4442e23]
4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned int)+0x35) [0x7fc1f4443215]
5: (ceph::buffer::ptr::release()+0x10d) [0x55fe23a42e4d]
6: (std::__cxx11::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >::_M_clear()+0x1c) [0x55fe2372942c]
7: (Message::~Message()+0xec) [0x55fe2372953c]
8: (MClientReply::~MClientReply()+0x74) [0x55fe237c64e4]
9: (AsyncConnection::handle_ack(unsigned long)+0x7b0) [0x55fe23dfccb0]
10: (AsyncConnection::process()+0x1092) [0x55fe23e10082]
11: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa08) [0x55fe23b2bd38]
12: (()+0x641e38) [0x55fe23b2fe38]
13: (()+0xded0e) [0x7fc1f2bf9d0e]
14: (()+0x76a7) [0x7fc1f2f226a7]
15: (clone()+0x3f) [0x7fc1f232c57f]
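
To compare backtraces like the three above, it can help to pull the frames out of the log text and look at the first frame that carries a symbol name. The following is a minimal sketch, not part of the original report; the helper names `parse_frames` and `first_resolved` are hypothetical, and the regular expression assumes the `N: (symbol+0xOFFSET) [0xADDRESS]` frame layout seen in these logs.

```python
import re

# One backtrace frame, e.g. "4: (tc_posix_memalign()+0x71) [0x7faa0e4327d1]".
# Assumption: frames always follow this layout; other lines are ignored.
FRAME_RE = re.compile(r"^\s*(\d+): \((.*)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]\s*$")


def parse_frames(text):
    """Return a list of (index, symbol_or_None, offset, address) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if not m:
            continue
        idx, sym, off, addr = m.groups()
        # Frames without symbol information are printed as "()".
        symbol = None if sym == "()" else sym
        frames.append((int(idx), symbol, int(off, 16), int(addr, 16)))
    return frames


def first_resolved(frames):
    """Return the symbol of the first frame that has one, or None."""
    for _, symbol, _, _ in frames:
        if symbol is not None:
            return symbol
    return None
```

Applied to the traces above, `first_resolved` yields `tc_posix_memalign()` for the first crash and `tcmalloc::ThreadCache::ReleaseToCentralCache(...)` for the second and third.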
Actions #1

Updated by Patrick Donnelly over 6 years ago

  • Project changed from Ceph to CephFS
  • Category set to Correctness/Safety
  • Priority changed from Normal to High
  • Source set to Community (user)
  • Component(FS) MDS added
Actions #2

Updated by Zheng Yan over 6 years ago

I've been running luminous ceph-mds (head commit ba746cd14d) for a while and haven't reproduced the issue. Could you try the newest luminous ceph-mds?

Actions #3

Updated by Andrej Filipcic over 6 years ago

It works fine with that. To be precise, I built from today's luminous branch. No crashes for 8 hours under heavy load.

Actions #4

Updated by Zheng Yan over 6 years ago

  • Status changed from New to Can't reproduce
Actions #5

Updated by Andrej Filipcic over 6 years ago

The following crash still persists with v12.2.1:

2017-10-01 06:07:34.673356 7f1066040700 0 -- 194.249.156.134:6800/1473985024 >> 194.249.156.135:6876/807338 conn(0x557ff54eb000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=67619 cs=1 l=1).process Signature check failed
2017-10-01 06:07:34.677392 7f1066841700 -1 ** Caught signal (Segmentation fault) *
in thread 7f1066841700 thread_name:msgr-worker-0

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (()+0x556434) [0x557fd6752434]
2: (()+0x13be0) [0x7f1068386be0]
3: (()+0x1904d) [0x7f106988c04d]
4: (tc_posix_memalign()+0x71) [0x7f10698aa7d1]
5: (ceph::buffer::list::append(char const*, unsigned int)+0x10b) [0x557fd675bf0b]
6: (CryptoAESKeyHandler::encrypt(ceph::buffer::list const&, ceph::buffer::list&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const+0x1b5) [0x557fd6a4f2e5]
7: (CephxSessionHandler::_calc_signature(Message*, unsigned long*)+0x19d) [0x557fd6b2a65d]
8: (CephxSessionHandler::sign_message(Message*)+0x70) [0x557fd6b2ac10]
9: (AsyncConnection::write_message(Message*, ceph::buffer::list&, bool)+0x8d) [0x557fd6b07fbd]
10: (AsyncConnection::handle_write()+0x5e4) [0x557fd6b10a04]
11: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1125) [0x557fd683e2e5]
12: (()+0x645cc8) [0x557fd6841cc8]
13: (()+0xded0e) [0x7f1068051d0e]
14: (()+0x76a7) [0x7f106837a6a7]
15: (clone()+0x3f) [0x7f106778457f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

It is triggered by removing a large number of large files (~1 GB each, ~10 TB removed in total), but the MDS does not crash otherwise in "normal" operation.
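
The removal workload described above (on the order of ten thousand ~1 GB files) could be approximated with a script along these lines. This is a rough sketch under stated assumptions, not the reporter's actual procedure: the function name `create_and_remove`, the file naming, and the use of zero-filled data are all hypothetical, and whether it actually triggers the crash is untested.

```python
import os


def create_and_remove(directory, n_files, size_bytes, chunk=1 << 20):
    """Write n_files files of size_bytes each under directory, then unlink them.

    Returns the total number of bytes removed. `directory` would be a path on
    a CephFS mount (hypothetical; choose the mount point yourself).
    """
    block = b"\0" * chunk
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, f"big_{i:06d}.dat")
        remaining = size_bytes
        with open(path, "wb") as f:
            # Write the file in fixed-size chunks to keep memory use bounded.
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(block[:n])
                remaining -= n
        paths.append(path)

    removed = 0
    for path in paths:
        removed += os.path.getsize(path)
        os.unlink(path)  # the bulk removal is the step reported to trigger the crash
    return removed
```

For example, `create_and_remove("/mnt/cephfs/scratch", 10000, 1 << 30)` would write and then delete roughly 10 TiB; the mount path here is a placeholder.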
