Bug #15628

segfault at 0 ip sp error 4 in libtcmalloc.so.4.1.2

Added by Sergey Jerusalimov almost 8 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello.

I ran into a problem with ceph-osd:

Ubuntu 14.04.
ceph version 9.2.1

Last trace from when the segfault occurred:

2016-04-27 01:08:17.769378 7f9cb9582700 -1 *** Caught signal (Segmentation fault) **
in thread 7f9cb9582700

ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
1: (()+0x7d1aca) [0x7f9ccc76baca]
2: (()+0x10340) [0x7f9ccae7e340]
3: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x103) [0x7f9ccb0af923]
4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long)+0x1b) [0x7f9ccb0af9db]
5: (tc_free()+0x1f8) [0x7f9ccb0bd2c8]
6: (()+0x50451) [0x7f9ccab84451]
7: (PK11_FreeSlotList()+0x9) [0x7f9ccab84479]
8: (PK11_GetAllTokens()+0x1cc) [0x7f9ccab86c5c]
9: (PK11_GetBestSlotMultipleWithAttributes()+0x23b) [0x7f9ccab8706b]
10: (PK11_GetBestSlot()+0x1f) [0x7f9ccab870df]
11: (CryptoAES::get_key_handler(ceph::buffer::ptr const&, std::string&)+0x1f4) [0x7f9ccc78b484]
12: (CryptoKey::_set_secret(int, ceph::buffer::ptr const&)+0xcc) [0x7f9ccc78a5fc]
13: (CryptoKey::decode(ceph::buffer::list::iterator&)+0xa2) [0x7f9ccc78a922]
14: (void decode_decrypt_enc_bl<CephXServiceTicket>(CephContext*, CephXServiceTicket&, CryptoKey, ceph::buffer::list&, std::string&)+0x4a5) [0x7f9ccc778f05]
15: (int decode_decrypt<CephXServiceTicket>(CephContext*, CephXServiceTicket&, CryptoKey const&, ceph::buffer::list::iterator&, std::string&)+0x1cf) [0x7f9ccc7792df]
16: (CephXTicketHandler::verify_service_ticket_reply(CryptoKey&, ceph::buffer::list::iterator&)+0xdb) [0x7f9ccc7735ab]
17: (CephXTicketManager::verify_service_ticket_reply(CryptoKey&, ceph::buffer::list::iterator&)+0x122) [0x7f9ccc775442]
18: (CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)+0xef4) [0x7f9ccc9022b4]
19: (MonClient::handle_auth(MAuthReply*)+0xce) [0x7f9ccc7fd89e]
20: (MonClient::ms_dispatch(Message*)+0x297) [0x7f9ccc7ffb27]
21: (DispatchQueue::entry()+0x63a) [0x7f9ccc90e83a]
22: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f9ccc838ecd]
23: (()+0x8182) [0x7f9ccae76182]
24: (clone()+0x6d) [0x7f9cc91bd47d]
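For anyone comparing traces across hosts, the frames above can be split into fields with a short script. This is only a sketch: the names `FRAME_RE`, `parse_frames` and `first_resolved` are made up for illustration, and the regular expression assumes the exact `N: (symbol+0xOFF) [0xADDR]` layout shown in this report.

```python
import re

# Matches frames such as "5: (tc_free()+0x1f8) [0x7f9ccb0bd2c8]".
# Assumption: every frame follows the layout seen in the trace above.
FRAME_RE = re.compile(r"^\s*(\d+): \((.*)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]\s*$")


def parse_frames(trace_text):
    """Return one dict per backtrace frame found in trace_text."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m is None:
            continue  # skip non-frame lines (version banner, blank lines)
        index, symbol, offset, address = m.groups()
        frames.append({
            "index": int(index),
            # "()" means the symbol could not be resolved
            "symbol": None if symbol == "()" else symbol,
            "offset": int(offset, 16),
            "address": int(address, 16),
        })
    return frames


def first_resolved(frames):
    """Return the innermost frame that has a symbol name, or None."""
    for frame in frames:
        if frame["symbol"] is not None:
            return frame
    return None
```

Applied to the trace above, the innermost resolved frame is number 3, `tcmalloc::ThreadCache::ReleaseToCentralCache`, reached through `tc_free()` from `PK11_FreeSlotList()`.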

History

#1 Updated by Sergey Jerusalimov almost 8 years ago

I can provide more information if needed.

#2 Updated by Sage Weil almost 8 years ago

  • Priority changed from Normal to Urgent

#3 Updated by David Majchrzak almost 8 years ago

Have the same issue with infernalis 9.2.1-1~bpo80+1 amd64 on Debian 8.3. It happens at least once a week now.

May 13 11:50:17 osd11 kernel: [6643608.139884] ceph-osd[5660]: segfault at 0 ip 00007f27e85120f7 sp 00007f27cff9e860 error 4 in libtcmalloc.so.4.2.2[7f27e84c7000+98000]

Is this fixed in Jewel?
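The kernel line quoted in this comment can be decoded mechanically. The sketch below assumes the usual x86 `segfault at ... ip ... sp ... error ...` message layout and the standard meaning of the low three page-fault error-code bits; `SEGFAULT_RE` and `decode_segfault` are illustrative names, not part of any tool.

```python
import re

# Assumption: the x86 kernel message layout seen in the comment above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode a kernel segfault line into its fields, or return None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_address": int(m.group("addr"), 16),
        "library": m.group("lib"),
        # Offset of the faulting instruction inside the mapped region
        "offset_in_mapping": ip - base,
        # x86 page-fault error code: bit 0 = protection violation
        # (else page not present), bit 1 = write (else read),
        # bit 2 = user mode (else kernel mode)
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }
```

For the line above this gives a user-mode read of address 0 (a NULL pointer dereference) at offset 0x4b0f7 into the libtcmalloc.so.4.2.2 mapping. Whether that offset can be passed directly to a symbolizer depends on how the library is mapped, so treat it as a starting point.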

#4 Updated by Paul Emmerich almost 8 years ago

We encountered the same crash today, also running Ceph 9.2.1 on Ubuntu 14.04.

The crash happened on 3 OSDs that are co-located on a single server, but I guess that could be a hardware issue on that server as well (not seeing any ECC or disk IO errors though).

#5 Updated by Loïc Dachary almost 8 years ago

  • Target version deleted (v9.2.1)

#6 Updated by Sergey Jerusalimov over 7 years ago

We see "cephx client: could not verify service_ticket reply" just before the segfault.

#7 Updated by Sergey Jerusalimov over 7 years ago

2016-08-18 22:07:32.298264 7fc88cb11700 0 cephx client: could not verify service_ticket reply
2016-08-18 22:07:32.327659 7fc88cb11700 0 cephx client: could not set rotating key: decode_decrypt failed. error:bad magic
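To check whether these cephx messages consistently precede the crash, an OSD log could be scanned along the following lines. This is a rough sketch: the function names and the 60-second window are arbitrary assumptions, and it relies on each log line starting with a timestamp in the format shown above.

```python
from datetime import datetime, timedelta

CEPHX_MARKER = "cephx client: could not verify service_ticket reply"
CRASH_MARKER = "Caught signal (Segmentation fault)"


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp, or return None."""
    try:
        return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


def cephx_errors_before_crash(lines, window_seconds=60):
    """For each crash line, list cephx verify failures seen within the window before it.

    window_seconds is a hypothetical default, not a value taken from Ceph.
    """
    window = timedelta(seconds=window_seconds)
    cephx_times = []
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # continuation lines such as backtrace frames
        if CEPHX_MARKER in line:
            cephx_times.append(ts)
        elif CRASH_MARKER in line:
            preceding = [t for t in cephx_times
                         if timedelta(0) <= ts - t <= window]
            hits.append((ts, preceding))
    return hits
```

Each returned entry pairs a crash timestamp with the cephx failures that preceded it; an empty list for a crash would argue against the correlation.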

#8 Updated by Sergey Jerusalimov over 7 years ago

One observation about when it happens: as object movement (recovery/misplaced) is finishing, when the monitors compact leveldb.

#9 Updated by Sergey Jerusalimov over 7 years ago

Sergey Jerusalimov wrote:

One observation about when it happens: as object movement (recovery/misplaced) is finishing, when the monitors compact leveldb.

It also happens during ceph auth del operations.

#10 Updated by Sage Weil over 7 years ago

  • Status changed from New to Need More Info

Are you still on infernalis or have you upgraded to jewel? If you've upgraded, do you still see it on jewel? (We haven't seen this..)

#11 Updated by Sergey Jerusalimov over 7 years ago

I'm now on jewel 10.2.3.
The problem no longer reproduces.

Thank you.

#12 Updated by Sage Weil over 7 years ago

  • Status changed from Need More Info to Can't reproduce
