Bug #18629

open

osd: unclear error when authentication with monitors fails

Added by Patrick Donnelly over 7 years ago. Updated almost 7 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen on a cluster of Linode VMs (16 OSDs, of which 6 fail with this error). Here's the backtrace:

(gdb) bt
#0  0x00007fe58e85423b in raise () from /lib64/libpthread.so.0
#1  0x00005609af8ad9a4 in reraise_fatal (signum=6) at /usr/src/debug/ceph-11.1.0-6893-g5db320c/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-11.1.0-6893-g5db320c/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007fe58d45c1d7 in raise () from /lib64/libc.so.6
#5  0x00007fe58d45d8c8 in abort () from /lib64/libc.so.6
#6  0x00007fe58fd0bf24 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7fe590031b8e "m_workers.empty()", 
    file=file@entry=0x7fe590031a38 "/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6893-g5db320c/rpm/el7/BUILD/ceph-11.1.0-6893-g5db320c/s"..., line=line@entry=46, 
    func=func@entry=0x7fe590031e60 <ceph::HeartbeatMap::~HeartbeatMap()::__PRETTY_FUNCTION__> "ceph::HeartbeatMap::~HeartbeatMap()")
    at /usr/src/debug/ceph-11.1.0-6893-g5db320c/src/common/assert.cc:77
#7  0x00007fe58feb25ec in ceph::HeartbeatMap::~HeartbeatMap (this=0x5609b1660240, __in_chrg=<optimized out>) at /usr/src/debug/ceph-11.1.0-6893-g5db320c/src/common/HeartbeatMap.cc:46
#8  0x00007fe58fe9b27f in CephContext::~CephContext (this=0x5609b1748000, __in_chrg=<optimized out>) at /usr/src/debug/ceph-11.1.0-6893-g5db320c/src/common/ceph_context.cc:611
#9  0x00007fe58fe9b5fc in CephContext::put (this=0x5609b1748000) at /usr/src/debug/ceph-11.1.0-6893-g5db320c/src/common/ceph_context.cc:652
#10 0x00005609af8a7c95 in intrusive_ptr_release (cct=<optimized out>) at /usr/src/debug/ceph-11.1.0-6893-g5db320c/src/global/global_init.cc:341
#11 0x00005609af33092a in ~intrusive_ptr (this=0x7ffd9a0533e0, __in_chrg=<optimized out>) at /usr/src/debug/ceph-11.1.0-6893-g5db320c/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:97
#12 main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/ceph-11.1.0-6893-g5db320c/src/ceph_osd.cc:113

Here's the log:

2017-01-20 15:05:18.729353 7fe5996acc40  0 set uid:gid to 167:167 (ceph:ceph)
2017-01-20 15:05:18.730446 7fe5996acc40  0 ceph version 11.1.0-6893-g5db320c (5db320ce93a335015bf5e2b77ff234e44c764b9a), process ceph-osd, pid 12603
2017-01-20 15:05:18.738306 7fe5996acc40  0 pidfile_write: ignore empty --pid-file
2017-01-20 15:05:18.796123 7fe5996acc40  0 load: jerasure load: lrc load: isa
2017-01-20 15:05:18.800945 7fe5996acc40  0 filestore(/var/lib/ceph/osd/ceph-4) backend xfs (magic 0x58465342)
2017-01-20 15:05:18.802324 7fe5996acc40  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-01-20 15:05:18.802338 7fe5996acc40  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-01-20 15:05:18.802339 7fe5996acc40  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: splice() is disabled via 'filestore splice' config option
2017-01-20 15:05:18.806902 7fe5996acc40  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-01-20 15:05:18.806976 7fe5996acc40  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: extsize is disabled by conf
2017-01-20 15:05:18.810474 7fe5996acc40  0 filestore(/var/lib/ceph/osd/ceph-4) start omap initiation
2017-01-20 15:05:18.826942 7fe5996acc40  1 leveldb: Recovering log #1793
2017-01-20 15:05:18.899648 7fe5996acc40  1 leveldb: Level-0 table #1796: started
2017-01-20 15:05:18.949737 7fe5996acc40  1 leveldb: Level-0 table #1796: 865556 bytes OK
2017-01-20 15:05:18.952992 7fe5996acc40  1 leveldb: Delete type=3 #8

2017-01-20 15:05:18.953413 7fe5996acc40  1 leveldb: Delete type=0 #1793

2017-01-20 15:05:18.954736 7fe5996acc40  0 filestore(/var/lib/ceph/osd/ceph-4) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-01-20 15:05:19.019568 7fe5996acc40  1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 27: 104857600 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-01-20 15:05:19.021536 7fe5996acc40  1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 27: 104857600 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-01-20 15:05:19.030159 7fe5996acc40  1 filestore(/var/lib/ceph/osd/ceph-4) upgrade
2017-01-20 15:05:19.058455 7fe5996acc40  0 _get_class not permitted to load lua
2017-01-20 15:05:19.080692 7fe5996acc40  0 <cls> /mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6893-g5db320c/rpm/el7/BUILD/ceph-11.1.0-6893-g5db320c/src/cls/hello/cls_hello.cc:296: loading cls_hello
2017-01-20 15:05:19.084272 7fe5996acc40  0 _get_class not permitted to load kvs
2017-01-20 15:05:19.084620 7fe5996acc40  0 <cls> /mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6893-g5db320c/rpm/el7/BUILD/ceph-11.1.0-6893-g5db320c/src/cls/cephfs/cls_cephfs.cc:198: loading cephfs
2017-01-20 15:05:19.087462 7fe5996acc40  0 osd.4 12 crush map has features 2200130813952, adjusting msgr requires for clients
2017-01-20 15:05:19.088100 7fe5996acc40  0 osd.4 12 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2017-01-20 15:05:19.088273 7fe5996acc40  0 osd.4 12 crush map has features 2200130813952, adjusting msgr requires for osds
2017-01-20 15:05:22.298089 7fe5996acc40  0 osd.4 12 load_pgs
2017-01-20 15:05:31.265103 7fe5996acc40  0 osd.4 12 load_pgs opened 551 pgs
2017-01-20 15:05:31.265564 7fe5996acc40  0 osd.4 12 using 1 op queue with priority op cut off at 64.
2017-01-20 15:05:31.267875 7fe5996acc40 -1 osd.4 12 log_to_monitors {default=true}
2017-01-20 15:05:31.278199 7fe5996acc40  1 journal close /var/lib/ceph/osd/ceph-4/journal
2017-01-20 15:05:31.348586 7fe5996acc40 -1  ** ERROR: osd init failed: (1) Operation not permitted
2017-01-20 15:05:31.354780 7fe5996acc40 -1 /mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6893-g5db320c/rpm/el7/BUILD/ceph-11.1.0-6893-g5db320c/src/common/HeartbeatMap.cc: In function 'ceph::HeartbeatMap::~HeartbeatMap()' thread 7fe5996acc40 time 2017-01-20 15:05:31.350426
/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6893-g5db320c/rpm/el7/BUILD/ceph-11.1.0-6893-g5db320c/src/common/HeartbeatMap.cc: 46: FAILED assert(m_workers.empty())

 ceph version 11.1.0-6893-g5db320c (5db320ce93a335015bf5e2b77ff234e44c764b9a)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7fe58fd0bdb0]
 2: (ceph::HeartbeatMap::~HeartbeatMap()+0x13c) [0x7fe58feb25ec]
 3: (CephContext::~CephContext()+0x37f) [0x7fe58fe9b27f]
 4: (CephContext::put()+0x17c) [0x7fe58fe9b5fc]
 5: (main()+0xc4a) [0x5609af33092a]
 6: (__libc_start_main()+0xf5) [0x7fe58d448b35]
 7: (()+0x2c3a69) [0x5609af3aaa69]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Log attached.


Files

ceph-osd.13.log.gz (538 KB), added by Patrick Donnelly, 01/21/2017 12:14 AM
#1

Updated by Josh Durgin over 7 years ago

  • Project changed from rgw to Ceph
  • Subject changed from osd: m_workers.empty() assertion failure in HeartbeatMap to osd: unclear error when authentication with monitors fails

The full log includes:

   -34> 2017-01-20 15:05:19.147939 7f274da9b700  1 -- 192.168.132.144:6800/12683 <== mon.1 192.168.211.143:6789/0 3 ==== auth_reply(proto 2 -1 (1) Operation not permitted) v1 ==== 24+0+0 (1263862845 0 0) 0x555da4b698c0 con 0x555da4797000

This should surface as a more admin-friendly message about the auth failure, logged at the error level.

#2

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Administration/Usability
  • Component(RADOS) OSD added