Bug #20272
Ceph OSD & MDS Failure
Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Description
The following error from one of the OSDs in my cluster brought the Ceph MDS server down over the weekend:
2017-06-10 04:04:12.920338 7f67c9f15700 -1 common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f67c9f15700 time 2017-06-10 04:04:12.913865
common/Thread.cc: 160: FAILED assert(ret == 0)
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55da173ae930]
2: (Thread::create(char const*, unsigned long)+0xba) [0x55da17391e6a]
3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x55da1738597f]
4: (Accepter::entry()+0x395) [0x55da17453865]
5: (()+0x76ba) [0x7f67e236d6ba]
6: (clone()+0x6d) [0x7f67e03e582d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The message is cryptic, but I'd like to prevent this in the future. It looks like the error forces a replay of the journal and re-initialization of the monitors. This repeats cyclically, accompanied by a segmentation fault:
2017-06-10 04:04:58.417337 7f03acf08700 -1 *** Caught signal (Segmentation fault) **
in thread 7f03acf08700 thread_name:ms_pipe_read
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
1: (()+0x97467e) [0x5651f7c7c67e]
2: (()+0x11390) [0x7f03c356f390]
3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x449) [0x5651f7c8c679]
4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*)+0x30f) [0x5651f7c852df]
5: (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0xf0) [0x5651f766a350]
6: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x6c) [0x5651f7d4feec]
7: (Pipe::accept()+0x1dba) [0x5651f7e838ea]
8: (Pipe::reader()+0x1d38) [0x5651f7e89058]
9: (Pipe::Reader::entry()+0xd) [0x5651f7e916ed]
10: (()+0x76ba) [0x7f03c35656ba]
11: (clone()+0x6d) [0x7f03c15dd82d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
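For what it's worth, my reading of the first trace (an assumption on my part, not something the logs confirm) is that Thread::create() asserts whenever the underlying pthread_create() returns non-zero, and an EAGAIN from exhausted thread/PID limits would produce exactly this failure on a busy SimpleMessenger node. A minimal standalone C++ sketch of that check:

#include <cassert>
#include <cstdio>
#include <cstring>
#include <pthread.h>

// Trivial thread entry point for the demonstration.
static void* entry(void*) { return nullptr; }

int main() {
    pthread_t tid;
    // pthread_create() reports failure via its return value (it does not
    // set errno); EAGAIN typically means a thread/PID limit was hit
    // (ulimit -u, kernel.pid_max, kernel.threads-max, or a cgroup cap).
    int ret = pthread_create(&tid, nullptr, entry, nullptr);
    if (ret != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(ret));
    assert(ret == 0);  // mirrors the FAILED assert(ret == 0) in the trace
    pthread_join(tid, nullptr);
    return 0;
}

If that reading is right, raising kernel.pid_max and kernel.threads-max (and any LimitNPROC/TasksMax caps on the OSD units) might prevent a recurrence, though I have not verified this.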
When checking the health of the cluster, I receive the following messages:
mds cluster is degraded
mds a is laggy
I attempted to restart the MDS service using systemctl; it starts briefly, but the daemon then exits, and subsequent restart attempts fail. I've attached the MDS and OSD logs for reference.