Bug #15533

closed

osdmap get_map/add_map_bl cycles cause broken OSD

Added by Jan Krcmar about 8 years ago. Updated about 7 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-deploy
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I'm experiencing a problem with a few OSDs. It started after a complete cluster restart. ceph -s shows HEALTH_WARN. The affected OSDs consume a lot of CPU time for a long time and then go <defunct>. Restarting the OSDs did not help.

Here is how the debugging started:
ceph-osd -i 190 --cluster ceph --setuser ceph --setgroup ceph --debug_osd 10 --debug_ms 1

Going through the log file showed that it keeps cycling through get_map/add_map_bl over maps 9503 to 49123, forever.

grep -E '(add_map_bl|get_map) (9503|49123)' ceph-osd.190.log.1
    2016-04-18 09:21:41.465447 7ff059391700 20 osd.190 49124 get_map 9503 - loading and decoding 0x7ff0752d8400
    2016-04-18 09:21:41.465838 7ff059391700 10 osd.190 49124 add_map_bl 9503 218449 bytes
    2016-04-18 09:30:57.349379 7ff059391700 20 osd.190 49124 get_map 49123 - loading and decoding 0x7ff07952eb40
    2016-04-18 09:30:57.367868 7ff059391700 10 osd.190 49124 add_map_bl 49123 191920 bytes
    2016-04-18 09:30:57.375023 7ff059391700 20 osd.190 49124 get_map 9503 - loading and decoding 0x7ff075aea480
    2016-04-18 09:30:57.407427 7ff059391700 10 osd.190 49124 add_map_bl 9503 218449 bytes
    2016-04-18 09:41:55.233828 7ff059391700 20 osd.190 49124 get_map 49123 - loading and decoding 0x7ff07a404900
    2016-04-18 09:41:55.252833 7ff059391700 10 osd.190 49124 add_map_bl 49123 191920 bytes
    ...
    2016-04-18 10:38:49.334151 7ff059391700 20 osd.190 49143 get_map 49123 - loading and decoding 0x7ff08aef9840
    2016-04-18 10:38:49.335127 7ff059391700 10 osd.190 49143 add_map_bl 49123 191920 bytes
    2016-04-18 10:38:49.531940 7ff059391700 20 osd.190 49143 get_map 9503 - loading and decoding 0x7ff08fe48fc0
    2016-04-18 10:38:49.598200 7ff059391700 10 osd.190 49143 add_map_bl 9503 218449 bytes

The full log (1 GB) is here:
http://home.zcu.cz/~honza801/ceph-logs/ceph-osd.190.log.1
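To see whether the OSD is actually making progress through that map range, its admin socket can be queried (a sketch; this assumes the default admin socket path is reachable inside the container):

    ceph daemon osd.190 status

The oldest_map and newest_map fields in the output show how far behind the cluster's current epoch the OSD still is.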

I have tried setting combinations of the noout, noin, noup, and nodown flags for the cluster, as hinted by
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10187.html
but nothing helped.
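For reference, these flags are set and cleared with the usual commands (a sketch; I did not record the exact combinations tried):

    ceph osd set noout
    ceph osd set noin
    ceph osd set noup
    ceph osd set nodown

and the corresponding ceph osd unset <flag> to clear each one again.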

This problem could somehow be related to this resolved bug:
http://tracker.ceph.com/issues/8387

Some other OSDs are complaining about:
2016-04-18 11:26:29.944201 7fe01242c700 0 -- xx:0/18 >> yy:6834/25 pipe(0x7fe090346000 sd=592 :41368 s=1 pgs=0 cs=0 l=1 c=0x7fe070e0eb00).connect claims to be yy:6834/20 not yy:6834/25 - wrong node!
auth: could not find secret_id=zz
cephx: verify_authorizer could not get service secret for service osd secret_id=zz
accept: got bad authorizer
cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

I think this is only a consequence of the previous problem
(NTP is configured properly).
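The clock side matters here because cephx "could not find secret_id" errors are a classic symptom of clock skew between daemons. It can be double-checked with something like (a sketch; both are standard commands):

    ntpq -p            # on each node: peers reachable, offsets small
    ceph health detail # would call out clock skew between the monitors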

Currently I'm running 12 OSDs on each of 24 machines, with 5 mons.

osd stat
osdmap e49153: 288 osds: 288 up, 288 in
flags nodown,noout,noin,sortbitwise

pg stat
v1416986: 6408 pgs: 18 creating+down+peering, 276 creating, 3 stale+peering, 142 creating+peering, 5481 peering, 488 down+peering; 342 GB data, 306 TB used, 721 TB / 1027 TB avail; 3/93767 unfound (0.003%)

Ceph version 9.2.1 (from packages at http://ceph.com/debian-infernalis/)
kernel 3.16
Mons and OSDs are running in Docker containers on Debian Jessie.
The OSDs share an XFS filesystem with a Hadoop cluster.

The disks show no errors.

Please confirm this as a bug or suggest a solution.

thanks
have a nice day
fous

Actions #1

Updated by Jan Krcmar about 8 years ago

The ceph-osd debug parameter was actually --debug_osd 20.
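So the full command from the description should have read:

    ceph-osd -i 190 --cluster ceph --setuser ceph --setgroup ceph --debug_osd 20 --debug_ms 1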

Actions #2

Updated by Jan Krcmar about 8 years ago

Following up on the previous message, some OSDs are doing this:

terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f5e80c3c700
 ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
 1: (()+0x82cd15) [0x7f5e97825d15]
 2: (()+0xf8d0) [0x7f5e95e738d0]
 3: (gsignal()+0x37) [0x7f5e940d5067]
 4: (abort()+0x148) [0x7f5e940d6448]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f5e949c2b3d]
 6: (()+0x5ebb6) [0x7f5e949c0bb6]
 7: (()+0x5ec01) [0x7f5e949c0c01]
 8: (()+0x5ee19) [0x7f5e949c0e19]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x269) [0x7f5e97922b39]
 10: (Thread::create(unsigned long)+0x8a) [0x7f5e979064ba]
 11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f5e978faedf]
 12: (Accepter::entry()+0x34b) [0x7f5e979c2bab]
 13: (()+0x80a4) [0x7f5e95e6c0a4]
 14: (clone()+0x6d) [0x7f5e9418887d]
2016-04-20 11:59:05.502295 7f5e80c3c700 -1 *** Caught signal (Aborted) ***
 in thread 7f5e80c3c700
ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
1: (()+0x82cd15) [0x7f5e97825d15]
2: (()+0xf8d0) [0x7f5e95e738d0]
3: (gsignal()+0x37) [0x7f5e940d5067]
4: (abort()+0x148) [0x7f5e940d6448]
5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f5e949c2b3d]
6: (()+0x5ebb6) [0x7f5e949c0bb6]
7: (()+0x5ec01) [0x7f5e949c0c01]
8: (()+0x5ee19) [0x7f5e949c0e19]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x269) [0x7f5e97922b39]
10: (Thread::create(unsigned long)+0x8a) [0x7f5e979064ba]
11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f5e978faedf]
12: (Accepter::entry()+0x34b) [0x7f5e979c2bab]
13: (()+0x80a4) [0x7f5e95e6c0a4]
14: (clone()+0x6d) [0x7f5e9418887d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
0> 2016-04-20 11:59:05.502295 7f5e80c3c700 -1 *** Caught signal (Aborted) ***
in thread 7f5e80c3c700

ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
1: (()+0x82cd15) [0x7f5e97825d15]
2: (()+0xf8d0) [0x7f5e95e738d0]
3: (gsignal()+0x37) [0x7f5e940d5067]
4: (abort()+0x148) [0x7f5e940d6448]
5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f5e949c2b3d]
6: (()+0x5ebb6) [0x7f5e949c0bb6]
7: (()+0x5ec01) [0x7f5e949c0c01]
8: (()+0x5ee19) [0x7f5e949c0e19]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x269) [0x7f5e97922b39]
10: (Thread::create(unsigned long)+0x8a) [0x7f5e979064ba]
11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f5e978faedf]
12: (Accepter::entry()+0x34b) [0x7f5e979c2bab]
13: (()+0x80a4) [0x7f5e95e6c0a4]
14: (clone()+0x6d) [0x7f5e9418887d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
-2/-2 (syslog threshold)
99/99 (stderr threshold)
max_recent 10000
max_new 1000
log_file
--- end dump of recent events ---

The objdump output (124 MB) is available at
http://home.zcu.cz/~honza801/ceph-logs/objdump-rdS.ceph-osd
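That disassembly is what the NOTE in the trace asks for: each frame's +0x offset can be looked up in it directly. For example, to see the code around frame 1 of the trace (a sketch; assumes the dump has been downloaded to the current directory under that name):

    grep -n -A 5 '82cd15:' objdump-rdS.ceph-osd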

Actions #3

Updated by Greg Farnum about 7 years ago

  • Status changed from New to Can't reproduce

Sorry this bug got lost :( but Infernalis is done and I'm not aware of any similar reports.
