Bug #15533

closed

osdmap get_map/add_map_bl cycles cause broken osd

Added by Jan Krcmar about 8 years ago. Updated about 7 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-deploy
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I'm experiencing a problem with a few OSDs. It started after a complete cluster restart. ceph -s shows HEALTH_WARN. The affected OSDs consume a lot of CPU for a long time, then go <defunct>. Restarting the OSDs did not help.

Here is where the debugging starts:
ceph-osd -i 190 --cluster ceph --setuser ceph --setgroup ceph --debug_osd 10 --debug_ms 1
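
(As a side note, the same debug levels can also be raised on a daemon that is still responsive with the standard tell command; the foreground run above is what was actually used here:)

    ceph tell osd.190 injectargs '--debug_osd 10 --debug_ms 1'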

Going through the log file shows that the OSD keeps cycling through get_map/add_map_bl between maps 9503 and 49123, seemingly forever.

  grep -E '(add_map_bl|get_map) (9503|49123)' ceph-osd.190.log.1
    2016-04-18 09:21:41.465447 7ff059391700 20 osd.190 49124 get_map 9503 - loading and decoding 0x7ff0752d8400
    2016-04-18 09:21:41.465838 7ff059391700 10 osd.190 49124 add_map_bl 9503 218449 bytes
    2016-04-18 09:30:57.349379 7ff059391700 20 osd.190 49124 get_map 49123 - loading and decoding 0x7ff07952eb40
    2016-04-18 09:30:57.367868 7ff059391700 10 osd.190 49124 add_map_bl 49123 191920 bytes
    2016-04-18 09:30:57.375023 7ff059391700 20 osd.190 49124 get_map 9503 - loading and decoding 0x7ff075aea480
    2016-04-18 09:30:57.407427 7ff059391700 10 osd.190 49124 add_map_bl 9503 218449 bytes
    2016-04-18 09:41:55.233828 7ff059391700 20 osd.190 49124 get_map 49123 - loading and decoding 0x7ff07a404900
    2016-04-18 09:41:55.252833 7ff059391700 10 osd.190 49124 add_map_bl 49123 191920 bytes
    ...
    2016-04-18 10:38:49.334151 7ff059391700 20 osd.190 49143 get_map 49123 - loading and decoding 0x7ff08aef9840
    2016-04-18 10:38:49.335127 7ff059391700 10 osd.190 49143 add_map_bl 49123 191920 bytes
    2016-04-18 10:38:49.531940 7ff059391700 20 osd.190 49143 get_map 9503 - loading and decoding 0x7ff08fe48fc0
    2016-04-18 10:38:49.598200 7ff059391700 10 osd.190 49143 add_map_bl 9503 218449 bytes

The full log (~1 GB) is here:
http://home.zcu.cz/~honza801/ceph-logs/ceph-osd.190.log.1
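
To get a rough idea of how often the OSD cycles between these two epochs, a simple count over that log works (epoch numbers taken from the excerpt above):

    grep -c 'get_map 9503 ' ceph-osd.190.log.1
    grep -c 'get_map 49123 ' ceph-osd.190.log.1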

I have tried setting combinations of the noout, noin, noup, and nodown flags on the cluster, as hinted at in
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10187.html
but nothing helped.
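
(These flags were set and cleared with the usual commands, e.g.:)

    ceph osd set noout
    ceph osd set nodown
    ceph osd unset noout
    ceph osd unset nodown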

This problem could be somehow related to this resolved bug:
http://tracker.ceph.com/issues/8387

Some other OSDs are complaining about:
2016-04-18 11:26:29.944201 7fe01242c700 0 -- xx:0/18 >> yy:6834/25 pipe(0x7fe090346000 sd=592 :41368 s=1 pgs=0 cs=0 l=1 c=0x7fe070e0eb00).connect claims to be yy:6834/20 not yy:6834/25 - wrong node!
auth: could not find secret_id=zz
cephx: verify_authorizer could not get service secret for service osd secret_id=zz
accept: got bad authorizer
cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

I think this is only a consequence of the previous problem
(NTP is configured properly).
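
For reference, clock sync and the OSD's cephx key can be double-checked with something like the following (default keyring path assumed; it may differ inside the containers):

    ntpq -p                                  # NTP peers/offsets on mon and OSD hosts
    ceph auth get osd.190                    # key the monitors hold for this OSD
    cat /var/lib/ceph/osd/ceph-190/keyring   # key this OSD is actually using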

Currently I'm running 12 OSDs on each of 24 machines, with 5 mons.
osd stat
osdmap e49153: 288 osds: 288 up, 288 in
flags nodown,noout,noin,sortbitwise
pg stat
v1416986: 6408 pgs: 18 creating+down+peering, 276 creating, 3 stale+peering, 142 creating+peering, 5481 peering, 488 down+peering; 342 GB data, 306 TB used, 721 TB / 1027 TB avail; 3/93767 unfound (0.003%)
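
More detail on the peering/stuck PGs can be pulled with the usual commands, e.g.:

    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg dump_stuck stale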

Ceph version 9.2.1 (from packages at http://ceph.com/debian-infernalis/)
kernel 3.16
Mons and OSDs are running in Docker containers on Debian Jessie.
The OSDs share an XFS filesystem with a Hadoop cluster.

The disks show no errors.

Please confirm this as a bug or suggest a solution.

thanks
have a nice day
fous
