Bug #15533
osdmap get_map, add_map_bl cycles cause broken osd
Status: Closed
Description
Hi,
I'm experiencing a problem with a few OSDs. It started after a complete cluster restart. ceph -s shows HEALTH_WARN. The affected OSDs consume a lot of CPU for a long time, then go <defunct>. Restarting the OSDs did not help.
Here is some debugging:
ceph-osd -i 190 --cluster ceph --setuser ceph --setgroup ceph --debug_osd 10 --debug_ms 1
Going through the log file shows that it cycles through get_map/add_map_bl from map 9503 to 49123 forever:
grep -E '(add_map_bl|get_map) (9503|49123)' ceph-osd.190.log.1
2016-04-18 09:21:41.465447 7ff059391700 20 osd.190 49124 get_map 9503 - loading and decoding 0x7ff0752d8400
2016-04-18 09:21:41.465838 7ff059391700 10 osd.190 49124 add_map_bl 9503 218449 bytes
2016-04-18 09:30:57.349379 7ff059391700 20 osd.190 49124 get_map 49123 - loading and decoding 0x7ff07952eb40
2016-04-18 09:30:57.367868 7ff059391700 10 osd.190 49124 add_map_bl 49123 191920 bytes
2016-04-18 09:30:57.375023 7ff059391700 20 osd.190 49124 get_map 9503 - loading and decoding 0x7ff075aea480
2016-04-18 09:30:57.407427 7ff059391700 10 osd.190 49124 add_map_bl 9503 218449 bytes
2016-04-18 09:41:55.233828 7ff059391700 20 osd.190 49124 get_map 49123 - loading and decoding 0x7ff07a404900
2016-04-18 09:41:55.252833 7ff059391700 10 osd.190 49124 add_map_bl 49123 191920 bytes
...
2016-04-18 10:38:49.334151 7ff059391700 20 osd.190 49143 get_map 49123 - loading and decoding 0x7ff08aef9840
2016-04-18 10:38:49.335127 7ff059391700 10 osd.190 49143 add_map_bl 49123 191920 bytes
2016-04-18 10:38:49.531940 7ff059391700 20 osd.190 49143 get_map 9503 - loading and decoding 0x7ff08fe48fc0
2016-04-18 10:38:49.598200 7ff059391700 10 osd.190 49143 add_map_bl 9503 218449 bytes
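For reference, the repeating loads can be summarized straight from the log. This is just a sketch: the field position is an assumption based on the line format in the excerpt above, and `summarize_map_loads` is an illustrative helper name, not a Ceph tool.

```shell
# Count how often each osdmap epoch is reloaded. Assumes the
# "get_map <epoch> - loading and decoding" line format shown above,
# where the epoch is the 8th whitespace-separated field.
summarize_map_loads() {
  grep ' get_map ' | awk '{print $8}' | sort -n | uniq -c | sort -rn
}

# usage:
# summarize_map_loads < ceph-osd.190.log.1
```

On my log this shows the same two epochs (9503 and 49123) being loaded over and over.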
The full log (1 GB) is here:
http://home.zcu.cz/~honza801/ceph-logs/ceph-osd.190.log.1
I have tried setting combinations of the noout, noin, noup, and nodown flags on the cluster, as hinted at in
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10187.html
but nothing helped.
This problem could be related to the resolved bug
http://tracker.ceph.com/issues/8387
Some other OSDs are complaining about:
2016-04-18 11:26:29.944201 7fe01242c700 0 -- xx:0/18 >> yy:6834/25 pipe(0x7fe090346000 sd=592 :41368 s=1 pgs=0 cs=0 l=1 c=0x7fe070e0eb00).connect claims to be yy:6834/20 not yy:6834/25 - wrong node!
auth: could not find secret_id=zz
cephx: verify_authorizer could not get service secret for service osd secret_id=zz
accept: got bad authorizer
cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
I think these messages are only a consequence of the previous problem (NTP is configured properly).
Currently I'm running 12 OSDs on each of 24 machines, with 5 mons.
osd stat
osdmap e49153: 288 osds: 288 up, 288 in
flags nodown,noout,noin,sortbitwise
pg stat
v1416986: 6408 pgs: 18 creating+down+peering, 276 creating, 3 stale+peering, 142 creating+peering, 5481 peering, 488 down+peering; 342 GB data, 306 TB used, 721 TB / 1027 TB avail; 3/93767 unfound (0.003%)
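For completeness, once the OSDs stabilize, the flags shown in the osd stat above would need to be cleared again; a sketch of the standard Ceph CLI commands (to be run only after the OSDs stay up):

```shell
# Clear the recovery flags set earlier (nodown, noout, noin).
ceph osd unset nodown
ceph osd unset noout
ceph osd unset noin
```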
ceph version 9.2.1 (from packages on http://ceph.com/debian-infernalis/)
kernel 3.16
The mons and OSDs are running in Docker containers on Debian jessie.
The OSDs share an XFS filesystem with a Hadoop cluster.
The disks show no errors.
Please confirm this as a bug or suggest a solution.
Thanks, have a nice day,
fous