Bug #20272
Ceph OSD & MDS Failure (Closed)
Description
The following error from one of the OSDs in my cluster brought the Ceph MDS server down over the weekend:
2017-06-10 04:04:12.920338 7f67c9f15700 -1 common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f67c9f15700 time 2017-06-10 04:04:12.913865
common/Thread.cc: 160: FAILED assert(ret == 0)
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55da173ae930]
2: (Thread::create(char const*, unsigned long)+0xba) [0x55da17391e6a]
3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x55da1738597f]
4: (Accepter::entry()+0x395) [0x55da17453865]
5: (()+0x76ba) [0x7f67e236d6ba]
6: (clone()+0x6d) [0x7f67e03e582d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The message is cryptic, but I'd like to prevent this in the future. It looks like the error forces a replay of the journal and a re-initialization of the monitors. This repeats cyclically, accompanied by a segmentation fault:
2017-06-10 04:04:58.417337 7f03acf08700 -1 *** Caught signal (Segmentation fault) **
in thread 7f03acf08700 thread_name:ms_pipe_read
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
1: (()+0x97467e) [0x5651f7c7c67e]
2: (()+0x11390) [0x7f03c356f390]
3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x449) [0x5651f7c8c679]
4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*)+0x30f) [0x5651f7c852df]
5: (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0xf0) [0x5651f766a350]
6: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x6c) [0x5651f7d4feec]
7: (Pipe::accept()+0x1dba) [0x5651f7e838ea]
8: (Pipe::reader()+0x1d38) [0x5651f7e89058]
9: (Pipe::Reader::entry()+0xd) [0x5651f7e916ed]
10: (()+0x76ba) [0x7f03c35656ba]
11: (clone()+0x6d) [0x7f03c15dd82d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
When checking the health of the cluster, I receive the following messages:
mds cluster is degraded
mds a is laggy
I attempted to restart the MDS service using systemctl; it comes up briefly but then exits, and subsequent restart attempts fail. I've attached the MDS and OSD logs for reference.
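For reference, this is roughly what I ran. My MDS is named "a", so the unit name below assumes the standard ceph-mds@<id> systemd template; your unit name may differ:

  # restart the MDS daemon and check its state
  sudo systemctl restart ceph-mds@a
  sudo systemctl status ceph-mds@a
  # look at why it exited after coming up briefly
  sudo journalctl -u ceph-mds@a --since "1 hour ago"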
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to CephFS
You probably need to bump up the number of allowed thread/process IDs on your box if it's crashing there. But that shouldn't be able to cause a crash in the MDS, so this is a Filesystems bug!
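Roughly what that looks like in practice is sketched below. Whether the limit you're actually hitting is kernel.pid_max, kernel.threads-max, the per-user process limit, or a systemd TasksMax cap depends on your setup, and the values and unit names here are only examples:

  # check the current limits
  cat /proc/sys/kernel/pid_max
  cat /proc/sys/kernel/threads-max
  ulimit -u        # max user processes for the current shell

  # raise the kernel limits (example values)
  sudo sysctl -w kernel.pid_max=4194304
  sudo sysctl -w kernel.threads-max=2097152

  # make the change persistent across reboots
  echo "kernel.pid_max = 4194304" | sudo tee -a /etc/sysctl.d/99-ceph.conf

  # if the daemons run under systemd, TasksMax can also cap thread creation
  # (ceph-osd@0 is an example instance name)
  systemctl show ceph-osd@0 -p TasksMax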
Updated by John Spray almost 7 years ago
The MDS backtrace is just the same as the OSD one.
Updated by John Spray almost 7 years ago
- Status changed from New to Rejected
I don't think there's anything to be done with this right now -- feel free to reopen if there's some other evidence pointing to a CephFS bug, as opposed to just the general unfriendliness of Ceph daemons dying when they run out of system resources.