Bug #4967


Misbehaving OSD sets over half of the cluster as down despite "osd min down reporters = 14"

Added by Faidon Liambotis almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

My Ceph 0.56.4 (mons @ 0.56.6) cluster had a few misbehaving OSDs yesterday, which escalated into a full-blown outage for more than an hour.

What seems to have happened is:

1) Several OSDs on multiple boxes leaked multiple gigabytes of RAM. The cluster was idle, with nothing in ceph.log but pgmap updates and very low traffic, and the leaking had stopped by the time I got to it, so I don't have any heap dumps :( The heap stats for two of them, which remained stable, were:

2013-05-08 14:51:49.696411 osd.4 [INF] MALLOC: +  15129030656 (14428.2 MB) Bytes in page heap freelist
2013-05-08 14:53:38.211083 osd.113 [INF] MALLOC: +  13321252864 (12704.1 MB) Bytes in page heap freelist

One of them, osd.25, went up to 42GB of RSIZE and sent the box into a swap-death spiral, until the OOM killer killed it. This resulted in osd.25 first being marked down, then the rest of the OSDs on that system (24-35):

2013-05-08 12:09:23.861086 mon.0 10.64.0.167:6789/0 54623 : [INF] osdmap e178808: 144 osds: 144 up, 144 in
2013-05-08 12:10:14.029879 mon.0 10.64.0.167:6789/0 54773 : [INF] osdmap e178809: 144 osds: 143 up, 144 in
[...]
2013-05-08 12:10:29.113060 mon.0 10.64.0.167:6789/0 54785 : [INF] osdmap e178814: 144 osds: 143 up, 144 in
2013-05-08 12:10:32.526681 mon.0 10.64.0.167:6789/0 54815 : [INF] osdmap e178815: 144 osds: 133 up, 144 in
2013-05-08 12:10:35.081063 mon.0 10.64.0.167:6789/0 54817 : [INF] osdmap e178816: 144 osds: 133 up, 144 in
2013-05-08 12:10:37.323234 mon.0 10.64.0.167:6789/0 54825 : [INF] osdmap e178817: 144 osds: 132 up, 144 in
[...]

2) Another OSD on a different system, osd.105, which did not leak memory, started reporting several other OSDs as failed:

2013-05-08 12:10:49.040633 mon.0 10.64.0.167:6789/0 54836 : [DBG] osd.1 10.64.0.173:6826/13302 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.040737 mon.0 10.64.0.167:6789/0 54837 : [DBG] osd.2 10.64.0.173:6835/13327 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.040828 mon.0 10.64.0.167:6789/0 54838 : [DBG] osd.3 10.64.0.173:6839/13356 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.040925 mon.0 10.64.0.167:6789/0 54839 : [DBG] osd.6 10.64.0.173:6849/13483 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.041017 mon.0 10.64.0.167:6789/0 54840 : [DBG] osd.7 10.64.0.173:6852/13508 reported failed by osd.105 10.64.32.14:6854/2966

resulting in:

2013-05-08 12:10:51.333276 mon.0 10.64.0.167:6789/0 54931 : [INF] osdmap e178823: 144 osds: 52 up, 144 in

with OSDs gradually coming back up, then going down and up again. The cluster remained unstable for a while, until it settled down about an hour later.

Now, the fact that one OSD could mark 80 OSDs down is possibly #4552 (the proposed fix hasn't been backported to bobtail).

However, as mitigation for that bug, I had set osd min down reporters to 14 (each box has 12 OSDs). I did this via ceph osd tell \* injectargs '--osd-min-down-reporters 14' and, of course, by updating ceph.conf so that the setting persists across restarts.
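For reference, the persistent part looked roughly like this in ceph.conf (a sketch; this is the [osd] placement the documentation suggests):

[osd]
    osd min down reporters = 14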

Unless I'm misunderstanding the setting, this doesn't seem to have had any effect: osd.105 successfully marked 80 OSDs as down, and I have log lines stating:

# zgrep ' reports ' ceph.log.1.gz
2013-05-08 12:10:13.182478 mon.0 10.64.0.167:6789/0 54772 : [INF] osd.25 10.64.0.175:6808/25281 failed (119 reports from 48 peers after 29.887556 >= grace 29.803195)
2013-05-08 12:47:11.867688 mon.0 10.64.0.167:6789/0 57037 : [INF] osd.22 10.64.0.174:6802/29161 failed (3 reports from 3 peers after 2013-05-08 12:47:32.867639 >= grace 41.569522)
2013-05-08 12:47:16.891521 mon.0 10.64.0.167:6789/0 57051 : [INF] osd.19 10.64.0.174:6850/16067 failed (4 reports from 4 peers after 2013-05-08 12:47:36.891482 >= grace 68.088332)
2013-05-08 12:47:16.891718 mon.0 10.64.0.167:6789/0 57055 : [INF] osd.13 10.64.0.174:6809/15854 failed (4 reports from 4 peers after 2013-05-08 12:47:37.891686 >= grace 62.545798)
2013-05-08 12:47:16.891946 mon.0 10.64.0.167:6789/0 57057 : [INF] osd.5 10.64.0.173:6846/13454 failed (3 reports from 3 peers after 2013-05-08 12:47:36.891914 >= grace 76.727729)
2013-05-08 12:47:16.892152 mon.0 10.64.0.167:6789/0 57061 : [INF] osd.23 10.64.0.174:6861/18535 failed (3 reports from 3 peers after 2013-05-08 12:47:36.892122 >= grace 80.171125)
2013-05-08 12:47:30.080620 mon.0 10.64.0.167:6789/0 57214 : [INF] osd.10 10.64.0.173:6866/13593 failed (4 reports from 4 peers after 2013-05-08 12:47:51.080584 >= grace 40.875931)
2013-05-08 12:49:43.181465 mon.0 10.64.0.167:6789/0 57525 : [INF] osd.29 10.64.0.175:6851/25480 failed (3 reports from 2 peers after 2013-05-08 12:50:30.181386 >= grace 30.536260)
2013-05-08 13:05:53.437425 mon.0 10.64.0.167:6789/0 58451 : [INF] osd.20 10.64.0.174:6854/16519 failed (10 reports from 10 peers after 2013-05-08 13:06:16.437374 >= grace 63.747966)
2013-05-08 13:06:11.784766 mon.0 10.64.0.167:6789/0 58481 : [INF] osd.15 10.64.0.174:6827/15952 failed (3 reports from 3 peers after 2013-05-08 13:06:32.784710 >= grace 46.405326)

3) A third, unrelated (and now impossible to debug) bug, which you might notice from the logs, is that 2 pgs remained in an active but unclean state for a while. The pgs were on different sets of OSDs; at about 14:20 I marked one OSD from each set out and then back in (commands sketched below), and full recovery happened 13 minutes later.
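The out/in cycle itself was just the standard commands, roughly (the OSD id below is a placeholder):

ceph osd out <id>
# wait for peering/recovery to settle, then
ceph osd in <id>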

Attached are the ceph.log from the incident and the OSD tree. osd.105's log file had nothing interesting apart from heartbeat lines. Let me know if you need any more information.


Files

osd-tree (3.83 KB) - ceph osd tree - Faidon Liambotis, 05/09/2013 02:43 AM
ceph.log.1.gz (2.93 MB) - ceph.log - Faidon Liambotis, 05/09/2013 02:43 AM
ceph.log.2013-05-19-nopgmap (90.5 KB) - Faidon Liambotis, 05/22/2013 06:27 AM
Actions #1

Updated by Anonymous almost 11 years ago

  • Priority changed from High to Urgent
Actions #2

Updated by Samuel Just almost 11 years ago

  • Assignee set to David Zafman
Actions #3

Updated by David Zafman almost 11 years ago

Regarding item #2: the following command should have worked. It tells the MONs, rather than the OSDs, to require 14 reporters.

ceph mon tell \* injectargs '--osd-min-down-reporters 14'

Actions #4

Updated by Faidon Liambotis almost 11 years ago

I didn't do this, and in retrospect it makes a lot of sense. However, my mons had been restarted because of the 0.56.6 upgrade and the setting was present in ceph.conf (under the [osd] section). So, this is not it.
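(For what it's worth, one way to double-check what a running daemon actually ended up with would be something like the following, assuming this version's admin socket supports "config show"; the socket path is a placeholder:)

ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok config show | grep min_down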

I see that the min down reporters setting is used in OSDMonitor::check_failure, where it outputs:

    mon->clog.info() << osdmap.get_inst(target_osd) << " failed (" 
                     << fi.num_reports << " reports from " << (int)fi.reporters.size() << " peers after " 
                     << failed_for << " >= grace " << grace << ")\n";

While I do have a few of them in my logs (quoted above), when osd.105 marked everything as down, I just got:

2013-05-08 12:10:49.040633 mon.0 10.64.0.167:6789/0 54836 : [DBG] osd.1 10.64.0.173:6826/13302 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.040737 mon.0 10.64.0.167:6789/0 54837 : [DBG] osd.2 10.64.0.173:6835/13327 reported failed by osd.105 10.64.32.14:6854/2966
2013-05-08 12:10:49.040828 mon.0 10.64.0.167:6789/0 54838 : [DBG] osd.3 10.64.0.173:6839/13356 reported failed by osd.105 10.64.32.14:6854/2966
[...]
2013-05-08 12:10:49.049352 mon.0 10.64.0.167:6789/0 54915 : [DBG] osd.143 10.64.0.178:6853/3994 reported failed by osd.105 10.64.32.14:6854/2966

This seems to be emitted by OSDMonitor::prepare_failure, which in turn calls check_failure; in this case check_failure didn't output anything, presumably because the grace period hadn't been reached, as the reports were cancelled a second after that.

However, a second after that, this happened:

2013-05-08 12:10:51.333276 mon.0 10.64.0.167:6789/0 54931 : [INF] osdmap e178823: 144 osds: 52 up, 144 in

I don't really understand how this could have happened, without seeing the "failed ... reports from ... grace" error message.

Actions #5

Updated by David Zafman almost 11 years ago

You need "osd min down reporters = 14" in the [mon] section not the [osd] section of ceph.conf.
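In other words, something like this in ceph.conf (a minimal sketch):

[mon]
    osd min down reporters = 14

together with the ceph mon tell \* injectargs command from #3 above for the running monitors.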

Actions #6

Updated by Faidon Liambotis almost 11 years ago

Docs say "You can change the minimum number of osd down reports by adding an osd min down reports setting under the [osd] section of your Ceph configuration file, or by setting the value at runtime. By default, only one OSD is required to report another OSD down. You can change the number of OSDs required to report a monitor down by adding an osd min down reporters setting under the [osd] section of your Ceph configuration file, or by setting the value at runtime." The setting is also listed under the OSD settings on the same page, and in the OSD config reference. So if that's the case, the documentation is definitely one bug here.

Besides that, did you see my reply above about the error messages? Are we sure the reporters setting is the culprit here?

Actions #7

Updated by David Zafman almost 11 years ago

  • Status changed from New to In Progress
Actions #8

Updated by David Zafman almost 11 years ago

  • Status changed from In Progress to Can't reproduce

We are not sure why the logging didn't show OSDs getting marked down. It is possible that OSDs were restarted or nodes were rebooted, which would have caused what we saw. I believe that setting osd min down reporters properly will correct the main issue here. If OSDs are going out of the osdmap without reason, you could reopen this bug or file a new one.

Actions #9

Updated by Faidon Liambotis almost 11 years ago

Nodes were not rebooted, and the OSDs that were marked down weren't restarted (ps shows them as started on Apr 3rd). Even if that were the case, I don't see how it would change anything -- the mons would still report the " reports down ... grace ..." message, wouldn't they?

Actions #10

Updated by Faidon Liambotis almost 11 years ago

Moments ago I had a failed disk, and seven more OSDs were marked as down. They were all neighbouring (67 failed; 60, 62, 64, 65, 66, 69 and 70 were marked down), so this time it might have been a controller issue.

However, out of the seven, I saw the "14 reports from 14 peers" message for only three of them:

2013-05-19 18:11:37.401277 mon.0 10.64.0.167:6789/0 612258 : [INF] osd.66 10.64.32.11:6841/10880 failed (14 reports from 14 peers after 2013-05-19 18:11:57.401231 >= grace 20.000000)
2013-05-19 18:11:45.237810 mon.0 10.64.0.167:6789/0 612297 : [INF] osd.65 10.64.32.11:6838/9550 failed (14 reports from 14 peers after 2013-05-19 18:12:05.237762 >= grace 20.000000)
2013-05-19 18:12:21.469473 mon.0 10.64.0.167:6789/0 612352 : [INF] osd.70 10.64.32.11:6858/19352 failed (14 reports from 14 peers after 2013-05-19 18:12:41.469413 >= grace 20.000000)

The rest had very few reports and no failed/grace messages:

root@ms-fe1001:/var/log/ceph# grep osd.60 ceph.log |grep reported
2013-05-19 18:11:47.789098 mon.0 10.64.0.167:6789/0 612304 : [DBG] osd.60 10.64.32.11:6800/8782 reported failed by osd.44 10.64.0.176:6847/22395
root@ms-fe1001:/var/log/ceph# grep osd.62 ceph.log |grep reported
2013-05-19 18:11:59.035514 mon.0 10.64.0.167:6789/0 612319 : [DBG] osd.62 10.64.32.11:6826/8932 reported failed by osd.14 10.64.0.174:6822/15881
root@ms-fe1001:/var/log/ceph# grep osd.64 ceph.log |grep reported
2013-05-19 18:11:53.785710 mon.0 10.64.0.167:6789/0 612309 : [DBG] osd.64 10.64.32.11:6802/2768 reported failed by osd.10 10.64.0.173:6866/13593
2013-05-19 18:11:53.785770 mon.0 10.64.0.167:6789/0 612310 : [DBG] osd.64 10.64.32.11:6802/2768 reported failed by osd.36 10.64.0.176:6802/22207
2013-05-19 18:11:53.786403 mon.0 10.64.0.167:6789/0 612311 : [DBG] osd.64 10.64.32.11:6802/2768 reported failed by osd.11 10.64.0.173:6869/13701
2013-05-19 18:11:53.786797 mon.0 10.64.0.167:6789/0 612312 : [DBG] osd.64 10.64.32.11:6802/2768 reported failed by osd.8 10.64.0.173:6857/13543
2013-05-19 18:11:53.786920 mon.0 10.64.0.167:6789/0 612314 : [DBG] osd.64 10.64.32.11:6802/2768 reported failed by osd.4 10.64.0.173:6816/19207
root@ms-fe1001:/var/log/ceph# grep osd.69 ceph.log |grep reported
2013-05-19 18:11:43.986445 mon.0 10.64.0.167:6789/0 612277 : [DBG] osd.69 10.64.32.11:6854/5334 reported failed by osd.90 10.64.0.177:6846/29410
2013-05-19 18:11:45.234445 mon.0 10.64.0.167:6789/0 612279 : [DBG] osd.69 10.64.32.11:6854/5334 reported failed by osd.139 10.64.0.178:6839/3893
2013-05-19 18:11:45.234806 mon.0 10.64.0.167:6789/0 612282 : [DBG] osd.69 10.64.32.11:6854/5334 reported failed by osd.143 10.64.0.178:6853/3994
2013-05-19 18:11:45.236059 mon.0 10.64.0.167:6789/0 612288 : [DBG] osd.69 10.64.32.11:6854/5334 reported failed by osd.114 10.64.32.15:6851/28569
2013-05-19 18:11:45.236295 mon.0 10.64.0.167:6789/0 612289 : [DBG] osd.69 10.64.32.11:6854/5334 reported failed by osd.131 10.64.32.16:6854/1276
2013-05-19 18:11:45.237266 mon.0 10.64.0.167:6789/0 612293 : [DBG] osd.69 10.64.32.11:6854/5334 reported failed by osd.39 10.64.0.176:6800/15650
2013-05-19 18:11:45.238547 mon.0 10.64.0.167:6789/0 612301 : [DBG] osd.69 10.64.32.11:6854/5334 reported failed by osd.123 10.64.32.16:6823/1078

I suspect there may have been 14 reports for these as well, like the first three, but the mon somehow did not write those messages to its log. Any ideas why that would happen?

Actions #11

Updated by Sage Weil almost 11 years ago

  • Status changed from Can't reproduce to In Progress
Actions #12

Updated by Sage Weil almost 11 years ago

Faidon Liambotis wrote:

Moments ago I had a failed disk, and seven more OSDs were marked as down. They were all neighbouring (67 failed; 60, 62, 64, 65, 66, 69 and 70 were marked down), so this time it might have been a controller issue.

One other path that can mark the daemon down is when the ceph-osd daemon cleanly shuts down. Is that a possibility?

Do you have any monitor logs you can attach for this period that might shed more light? I'm confused as to how this can happen without a message in the log, unless you have a very poorly-timed ceph-mon crash too.

Actions #13

Updated by Sage Weil almost 11 years ago

  • Assignee changed from David Zafman to Sage Weil
Actions #14

Updated by Faidon Liambotis almost 11 years ago

There was no shutdown of those OSDs at the time:

root      8782  9.0  0.8 1320092 415176 ?      Ssl  Apr03 6363:46 /usr/bin/ceph-osd --cluster=ceph -i 60 -f
root      8826  8.6  0.8 1800600 437252 ?      Ssl  Apr03 6123:43 /usr/bin/ceph-osd --cluster=ceph -i 61 -f
root      8932  8.2  0.8 1967892 408344 ?      Ssl  Apr03 5846:47 /usr/bin/ceph-osd --cluster=ceph -i 62 -f
root      9005  7.8  0.8 1505236 400100 ?      Ssl  Apr03 5579:37 /usr/bin/ceph-osd --cluster=ceph -i 63 -f
root      2768  8.1  0.8 2153736 414088 ?      Ssl  Apr24 3365:36 /usr/bin/ceph-osd --cluster=ceph -i 64 -f
root      9550  8.6  0.8 2362872 420340 ?      Ssl  Apr03 6127:21 /usr/bin/ceph-osd --cluster=ceph -i 65 -f
root     10880  8.9  0.8 1954868 430924 ?      Ssl  Apr03 6302:27 /usr/bin/ceph-osd --cluster=ceph -i 66 -f
root      4814  9.0  0.8 2001012 429712 ?      Ssl  Apr03 6417:42 /usr/bin/ceph-osd --cluster=ceph -i 68 -f
root      5334  8.6  0.8 2109340 423368 ?      Ssl  Apr03 6121:35 /usr/bin/ceph-osd --cluster=ceph -i 69 -f
root     19352  9.6  0.9 2365280 465080 ?      Ssl  Apr03 6838:03 /usr/bin/ceph-osd --cluster=ceph -i 70 -f
root     12441  9.0  0.8 1874376 434668 ?      Ssl  Apr03 6410:10 /usr/bin/ceph-osd --cluster=ceph -i 71 -f

nor a mon outage (a ceph.log with slow requests & pgmap lines stripped out for easier viewing is attached).

I also looked in the code for conditions under which OSDs get marked down and couldn't find any that would explain this.

Based on the last incident, where two of the seven had 14 reporters, my very unscientific guess is that the mon somehow loses messages from its log output.

Actions #15

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Need More Info
Actions #16

Updated by Greg Farnum almost 11 years ago

I don't suppose you're using syslog? That will lose log messages quite easily.

Actions #17

Updated by Faidon Liambotis almost 11 years ago

No, I'm not. This is a standard Ubuntu/bobtail config.

Actions #18

Updated by Sage Weil almost 11 years ago

Pushed a patch to the bobtail branch that logs on the only other OSD-down path in the mon.

Actions #19

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High
Actions #20

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to 7
Actions #21

Updated by Sage Weil almost 11 years ago

  • Status changed from 7 to Resolved

Committed to next, backported to cuttlefish.
