Bug #20272 (closed): Ceph OSD & MDS Failure

Added by Kyle Traff almost 7 years ago. Updated almost 7 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The following error from one of the OSDs in my cluster brought the Ceph MDS server down over the weekend:

2017-06-10 04:04:12.920338 7f67c9f15700 -1 common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f67c9f15700 time 2017-06-10 04:04:12.913865
common/Thread.cc: 160: FAILED assert(ret == 0)

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55da173ae930]
 2: (Thread::create(char const*, unsigned long)+0xba) [0x55da17391e6a]
 3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x55da1738597f]
 4: (Accepter::entry()+0x395) [0x55da17453865]
 5: (()+0x76ba) [0x7f67e236d6ba]
 6: (clone()+0x6d) [0x7f67e03e582d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
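
For context on what that assert means: Thread::create() ultimately calls pthread_create(), so FAILED assert(ret == 0) indicates pthread_create() returned a non-zero error code, most commonly EAGAIN once the process or system has exhausted its thread/PID limits. A minimal illustrative sketch (not Ceph's actual code) of the failing call:

// Minimal sketch (not Ceph's code): the assert above fires when
// pthread_create() returns non-zero, typically EAGAIN when thread/PID
// limits are exhausted on the host.
#include <pthread.h>
#include <cstdio>
#include <cstring>

static void *worker(void *) { return nullptr; }

int main() {
    pthread_t tid;
    int ret = pthread_create(&tid, nullptr, worker, nullptr);
    if (ret != 0) {
        // EAGAIN here means "insufficient resources", which Ceph's
        // assert(ret == 0) turns into the abort shown in the log.
        std::fprintf(stderr, "pthread_create failed: %s\n", std::strerror(ret));
        return 1;
    }
    pthread_join(tid, nullptr);
    return 0;
}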

The message is cryptic, but I'd like to prevent this in the future. It looks like the error forces a replay of the journal and re-initialization of the monitors. This happens cyclically, accompanied by a segmentation fault:

2017-06-10 04:04:58.417337 7f03acf08700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f03acf08700 thread_name:ms_pipe_read

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (()+0x97467e) [0x5651f7c7c67e]
 2: (()+0x11390) [0x7f03c356f390]
 3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x449) [0x5651f7c8c679]
 4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*)+0x30f) [0x5651f7c852df]
 5: (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0xf0) [0x5651f766a350]
 6: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x6c) [0x5651f7d4feec]
 7: (Pipe::accept()+0x1dba) [0x5651f7e838ea]
 8: (Pipe::reader()+0x1d38) [0x5651f7e89058]
 9: (Pipe::Reader::entry()+0xd) [0x5651f7e916ed]
 10: (()+0x76ba) [0x7f03c35656ba]
 11: (clone()+0x6d) [0x7f03c15dd82d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

When checking the health of the cluster, I receive the following messages:

mds cluster is degraded
mds a is laggy

I attempted to restart the MDS service using systemctl; the restart succeeds briefly, but the daemon then exits, and subsequent restart attempts fail. I've attached the MDS and OSD logs for reference.


Files

ceph-mds.derams1.log.2.gz (208 KB) - Kyle Traff, 06/12/2017 10:39 PM
ceph-osd.2.log.3.gz (183 KB) - Kyle Traff, 06/12/2017 10:43 PM
Updated by Greg Farnum almost 7 years ago (#1)

  • Project changed from Ceph to CephFS

You probably need to bump up the number of allowed thread/process IDs on your box if it's crashing there. But that shouldn't be able to cause a crash in the MDS, so this is a Filesystems bug!
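
For reference, the limits referred to here are standard Linux knobs rather than anything Ceph-specific: the per-user process/thread cap (RLIMIT_NPROC, i.e. ulimit -u) and the system-wide kernel.pid_max / kernel.threads-max sysctls. A small sketch, assuming a Linux host, that prints the current values so you can check how close the OSD box is to the ceiling:

// Sketch (Linux-specific, nothing Ceph-specific): print the per-user
// process/thread cap and the kernel-wide PID/thread limits that a busy
// OSD host can run into.
#include <sys/resource.h>
#include <cstdio>
#include <fstream>
#include <string>

static void print_sysctl(const char *path) {
    std::ifstream f(path);
    std::string value;
    if (f >> value)
        std::printf("%s = %s\n", path, value.c_str());
}

int main() {
    rlimit rl{};
    if (getrlimit(RLIMIT_NPROC, &rl) == 0)
        std::printf("RLIMIT_NPROC: soft=%llu hard=%llu\n",
                    (unsigned long long)rl.rlim_cur,
                    (unsigned long long)rl.rlim_max);
    print_sysctl("/proc/sys/kernel/pid_max");     // system-wide PID limit
    print_sysctl("/proc/sys/kernel/threads-max"); // system-wide thread limit
    return 0;
}

Raising these is typically done via sysctl (e.g. kernel.pid_max) and the daemon's nproc ulimit or systemd LimitNPROC setting; the exact mechanism depends on the distro and how the OSDs are started.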

Updated by John Spray almost 7 years ago (#2)

The MDS backtrace is just the same as the OSD one.

Updated by John Spray almost 7 years ago (#3)

  • Status changed from New to Rejected

I don't think there's anything to be done with this right now -- feel free to reopen if there's some other evidence that points to a cephfs bug, as opposed to just the general unfriendliness of Ceph daemons dying when they run out of system resources.
