Bug #20272 (closed): Ceph OSD & MDS Failure

Added by Kyle Traff almost 7 years ago. Updated almost 7 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The following error from one of the OSDs in my cluster brought the Ceph MDS server down over the weekend:

2017-06-10 04:04:12.920338 7f67c9f15700 -1 common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f67c9f15700 time 2017-06-10 04:04:12.913865
common/Thread.cc: 160: FAILED assert(ret == 0)

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55da173ae930]
 2: (Thread::create(char const*, unsigned long)+0xba) [0x55da17391e6a]
 3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x55da1738597f]
 4: (Accepter::entry()+0x395) [0x55da17453865]
 5: (()+0x76ba) [0x7f67e236d6ba]
 6: (clone()+0x6d) [0x7f67e03e582d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
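
For context on what that assert means: Thread::create() ultimately calls pthread_create(), so FAILED assert(ret == 0) indicates pthread_create() returned a non-zero error code, most commonly EAGAIN once the process or system has exhausted its thread/PID limits. A minimal illustrative sketch (not Ceph's actual code) of the failing call:

// Minimal sketch (not Ceph's code): the assert above fires when
// pthread_create() returns non-zero, typically EAGAIN when thread/PID
// limits are exhausted on the host.
#include <pthread.h>
#include <cstdio>
#include <cstring>

static void *worker(void *) { return nullptr; }

int main() {
    pthread_t tid;
    int ret = pthread_create(&tid, nullptr, worker, nullptr);
    if (ret != 0) {
        // EAGAIN here means "insufficient resources", which Ceph's
        // assert(ret == 0) turns into the abort shown in the log.
        std::fprintf(stderr, "pthread_create failed: %s\n", std::strerror(ret));
        return 1;
    }
    pthread_join(tid, nullptr);
    return 0;
}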

The message is cryptic, but I'd like to prevent this in the future. It looks like the error forces a replay of the journal and re-initialization of the monitors. This happens cyclically, accompanied by a segmentation fault:

2017-06-10 04:04:58.417337 7f03acf08700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f03acf08700 thread_name:ms_pipe_read

 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
 1: (()+0x97467e) [0x5651f7c7c67e]
 2: (()+0x11390) [0x7f03c356f390]
 3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x449) [0x5651f7c8c679]
 4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*)+0x30f) [0x5651f7c852df]
 5: (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0xf0) [0x5651f766a350]
 6: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x6c) [0x5651f7d4feec]
 7: (Pipe::accept()+0x1dba) [0x5651f7e838ea]
 8: (Pipe::reader()+0x1d38) [0x5651f7e89058]
 9: (Pipe::Reader::entry()+0xd) [0x5651f7e916ed]
 10: (()+0x76ba) [0x7f03c35656ba]
 11: (clone()+0x6d) [0x7f03c15dd82d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

When checking the health of the cluster, I receive the following messages:

mds cluster is degraded
mds a is laggy

I attempted to restart the MDS service using systemctl; the restart succeeds briefly, but the daemon then exits, and subsequent restart attempts fail. I've attached the MDS and OSD logs for reference.


Files

ceph-mds.derams1.log.2.gz (208 KB) - Kyle Traff, 06/12/2017 10:39 PM
ceph-osd.2.log.3.gz (183 KB) - Kyle Traff, 06/12/2017 10:43 PM
Updated by Greg Farnum almost 7 years ago (#1)

  • Project changed from Ceph to CephFS

You probably need to bump up the number of allowed thread/process IDs on your box if it's crashing there. But that shouldn't be able to cause a crash in the MDS, so this is a Filesystems bug!
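
For reference, the limits referred to here are standard Linux knobs rather than anything Ceph-specific: the per-user process/thread cap (RLIMIT_NPROC, i.e. ulimit -u) and the system-wide kernel.pid_max / kernel.threads-max sysctls. A small sketch, assuming a Linux host, that prints the current values so you can check how close the OSD box is to the ceiling:

// Sketch (Linux-specific, nothing Ceph-specific): print the per-user
// process/thread cap and the kernel-wide PID/thread limits that a busy
// OSD host can run into.
#include <sys/resource.h>
#include <cstdio>
#include <fstream>
#include <string>

static void print_sysctl(const char *path) {
    std::ifstream f(path);
    std::string value;
    if (f >> value)
        std::printf("%s = %s\n", path, value.c_str());
}

int main() {
    rlimit rl{};
    if (getrlimit(RLIMIT_NPROC, &rl) == 0)
        std::printf("RLIMIT_NPROC: soft=%llu hard=%llu\n",
                    (unsigned long long)rl.rlim_cur,
                    (unsigned long long)rl.rlim_max);
    print_sysctl("/proc/sys/kernel/pid_max");     // system-wide PID limit
    print_sysctl("/proc/sys/kernel/threads-max"); // system-wide thread limit
    return 0;
}

Raising these is typically done via sysctl (e.g. kernel.pid_max) and the daemon's nproc ulimit or systemd LimitNPROC setting; the exact mechanism depends on the distro and how the OSDs are started.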

Updated by John Spray almost 7 years ago (#2)

The MDS backtrace is just the same as the OSD one.

Updated by John Spray almost 7 years ago (#3)

  • Status changed from New to Rejected

I don't think there's anything to be done with this right now -- feel free to reopen if there's some other evidence that points to a cephfs bug, as opposed to just the general unfriendliness of Ceph daemons dying when they run out of system resources.
