Bug #4967

Misbehaving OSD sets over half of the cluster as down despite "osd min down reporters = 14"

Added by Faidon Liambotis almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

My Ceph 0.56.4 cluster (mons @ 0.56.6) had a few misbehaving OSDs yesterday, which escalated into a full-blown outage for more than an hour.

What seems to have happened is:

1) Several OSDs on multiple boxes leaked multiple gigabytes of RAM. The cluster was idle, with nothing in ceph.log but pgmap updates showing very low traffic, and the leaking had stopped by the time I got to it, so I don't have any heap dumps :( The heap stats of two OSDs that remained stable were:

2013-05-08 14:51:49.696411 osd.4 [INF] MALLOC: +  15129030656 (14428.2 MB) Bytes in page heap freelist
2013-05-08 14:53:38.211083 osd.113 [INF] MALLOC: +  13321252864 (12704.1 MB) Bytes in page heap freelist
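
(These stats come from tcmalloc; assuming the usual OSD heap admin commands are available in this release, something along these lines could grab stats and a heap dump the next time it happens, with osd.4 just reused from the example above:)

# ceph tell osd.4 heap stats
# ceph tell osd.4 heap start_profiler
# ceph tell osd.4 heap dump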

One of them, osd.25, went up to 42GB of RSIZE and sent the box into a swap-death spiral until the OOM killer killed it. This resulted in osd.25 first being marked down, followed by the rest of the OSDs on that box (24-35):

2013-05-08 12:09:23.861086 mon.0 10.64.0.167:6789/0 54623 : [INF] osdmap e178808: 144 osds: 144 up, 144 in
2013-05-08 12:10:14.029879 mon.0 10.64.0.167:6789/0 54773 : [INF] osdmap e178809: 144 osds: 143 up, 144 in
[...]
2013-05-08 12:10:29.113060 mon.0 10.64.0.167:6789/0 54785 : [INF] osdmap e178814: 144 osds: 143 up, 144 in
2013-05-08 12:10:32.526681 mon.0 10.64.0.167:6789/0 54815 : [INF] osdmap e178815: 144 osds: 133 up, 144 in
2013-05-08 12:10:35.081063 mon.0 10.64.0.167:6789/0 54817 : [INF] osdmap e178816: 144 osds: 133 up, 144 in
2013-05-08 12:10:37.323234 mon.0 10.64.0.167:6789/0 54825 : [INF] osdmap e178817: 144 osds: 132 up, 144 in
[...]
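
(For reference, the up/in summary and the failure domain layout can also be checked on demand with the usual commands:)

# ceph osd stat
# ceph osd tree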

2) Another OSD on a different box, osd.105, which did not leak memory, started reporting several OSDs as failed:

2013-05-08 12:10:49.040633 mon.0 10.64.0.167:6789/0 54836 : [DBG] osd.1 10.64.0.173:6826/13302 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.040737 mon.0 10.64.0.167:6789/0 54837 : [DBG] osd.2 10.64.0.173:6835/13327 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.040828 mon.0 10.64.0.167:6789/0 54838 : [DBG] osd.3 10.64.0.173:6839/13356 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.040925 mon.0 10.64.0.167:6789/0 54839 : [DBG] osd.6 10.64.0.173:6849/13483 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.041017 mon.0 10.64.0.167:6789/0 54840 : [DBG] osd.7 10.64.0.173:6852/13508 reported failed by osd.105 10.64.32.14:6854/2966

resulting in:

2013-05-08 12:10:51.333276 mon.0 10.64.0.167:6789/0 54931 : [INF] osdmap e178823: 144 osds: 52 up, 144 in

with OSDs gradually coming back up, then flapping down/up again. The cluster remained unstable for a while, until it mostly settled down about an hour later.
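
(For anyone redoing this analysis from the attached log, a rough one-liner to count how many failure reports each OSD filed, assuming the ceph.log format shown above:)

# zgrep 'reported failed by' ceph.log.1.gz | awk '{print $(NF-1)}' | sort | uniq -c | sort -rn | head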

Now, a single OSD being able to mark 80 OSDs as down is possibly #4552 (the proposed fix hasn't been backported to bobtail).

However, as a mitigation for that bug, I had set osd min down reporters to 14 (each box has 12 OSDs). I did this via ceph osd tell \* injectargs '--osd-min-down-reporters 14', and of course by updating ceph.conf so that it persists across restarts.
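
(For reference, the ceph.conf side is just a snippet along these lines; which section the option actually needs to live in is part of what I'm unsure about, so take the [osd] placement below as an assumption:)

[osd]
    osd min down reporters = 14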

Unless I'm misunderstanding the setting, this doesn't seem to have had any effect: osd.105 successfully marked 80 OSDs as down, and I have log lines stating:

# zgrep ' reports ' ceph.log.1.gz
2013-05-08 12:10:13.182478 mon.0 10.64.0.167:6789/0 54772 : [INF] osd.25 10.64.0.175:6808/25281 failed (119 reports from 48 peers after 29.887556 >= grace 29.803195)
2013-05-08 12:47:11.867688 mon.0 10.64.0.167:6789/0 57037 : [INF] osd.22 10.64.0.174:6802/29161 failed (3 reports from 3 peers after 2013-05-08 12:47:32.867639 >= grace 41.569522)
2013-05-08 12:47:16.891521 mon.0 10.64.0.167:6789/0 57051 : [INF] osd.19 10.64.0.174:6850/16067 failed (4 reports from 4 peers after 2013-05-08 12:47:36.891482 >= grace 68.088332)
2013-05-08 12:47:16.891718 mon.0 10.64.0.167:6789/0 57055 : [INF] osd.13 10.64.0.174:6809/15854 failed (4 reports from 4 peers after 2013-05-08 12:47:37.891686 >= grace 62.545798)
2013-05-08 12:47:16.891946 mon.0 10.64.0.167:6789/0 57057 : [INF] osd.5 10.64.0.173:6846/13454 failed (3 reports from 3 peers after 2013-05-08 12:47:36.891914 >= grace 76.727729)
2013-05-08 12:47:16.892152 mon.0 10.64.0.167:6789/0 57061 : [INF] osd.23 10.64.0.174:6861/18535 failed (3 reports from 3 peers after 2013-05-08 12:47:36.892122 >= grace 80.171125)
2013-05-08 12:47:30.080620 mon.0 10.64.0.167:6789/0 57214 : [INF] osd.10 10.64.0.173:6866/13593 failed (4 reports from 4 peers after 2013-05-08 12:47:51.080584 >= grace 40.875931)
2013-05-08 12:49:43.181465 mon.0 10.64.0.167:6789/0 57525 : [INF] osd.29 10.64.0.175:6851/25480 failed (3 reports from 2 peers after 2013-05-08 12:50:30.181386 >= grace 30.536260)
2013-05-08 13:05:53.437425 mon.0 10.64.0.167:6789/0 58451 : [INF] osd.20 10.64.0.174:6854/16519 failed (10 reports from 10 peers after 2013-05-08 13:06:16.437374 >= grace 63.747966)
2013-05-08 13:06:11.784766 mon.0 10.64.0.167:6789/0 58481 : [INF] osd.15 10.64.0.174:6827/15952 failed (3 reports from 3 peers after 2013-05-08 13:06:32.784710 >= grace 46.405326)
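
(A quick tally of the "N peers" counts across those lines, assuming the same log format, makes the pattern easy to see:)

# zgrep ' reports ' ceph.log.1.gz | sed 's/.*reports from \([0-9]*\) peers.*/\1/' | sort -n | uniq -c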

3) A third, unrelated (and now impossible to debug) bug, which you might notice from the logs, is that 2 pgs remained in an active but unclean state for a while. The pgs were on different sets of OSDs; at about 14:20 I marked two of the OSDs (one from each set) out and then back in, and full recovery followed 13 minutes later.
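
(For completeness, the out/in cycle was just the standard commands; <id> below is a placeholder rather than the actual OSD numbers:)

# ceph osd out <id>
# ceph osd in <id>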

Attached are the ceph.log from the incident and the OSD tree. osd.105's log file had nothing interesting apart from heartbeat lines. Let me know if you need any more information.


Files

osd-tree (3.83 KB): ceph osd tree, Faidon Liambotis, 05/09/2013 02:43 AM
ceph.log.1.gz (2.93 MB): ceph.log, Faidon Liambotis, 05/09/2013 02:43 AM
ceph.log.2013-05-19-nopgmap (90.5 KB): Faidon Liambotis, 05/22/2013 06:27 AM