<p>Ceph - Feature #11443: Elector: throttle election attempts from DoSing peers</p>
<p>Sage Weil (sage@newdream.net), 2015-09-11T17:30:43Z (https://tracker.ceph.com/issues/11443?journal_id=58395)</p>
<ul></ul><p>What about a model like</p>
<pre><code>void behaved();
void disappointed();
bool should_ignore();</code></pre>
<p>with a couple of exponential backoff tunables (initial backoff = 30 seconds, backoff multiplier = 2). If a peer does something bad, we start ignoring it for backoff seconds and then double the backoff for next time. If it does something good, we reset the backoff duration to the initial value.</p>
<p>Bad things would be:<br /> - they were the leader and we were forced to call an election<br /> - they called an election and then failed to complete it</p>
<p>Good things would be:<br /> - joined a quorum (hrm, maybe not enough...)<br /> - committed a value (as leader or as peon)</p>
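<p>For illustration, the model above could be sketched roughly as follows. This is only a sketch and not code from the Ceph tree: the class name <code>PeerBackoff</code>, the <code>next_backoff()</code> accessor, the explicit <code>now</code> parameters, and the upper bound on the backoff are all assumptions added for the example. Only the three calls and the two tunables (30 seconds, multiplier 2) come from the proposal.</p>

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical per-peer tracker; only behaved(), disappointed() and
// should_ignore() are taken from the proposal, everything else is assumed.
class PeerBackoff {
public:
  using Clock = std::chrono::steady_clock;
  using Seconds = std::chrono::seconds;

  explicit PeerBackoff(Seconds initial = Seconds(30), int multiplier = 2,
                       Seconds max_backoff = Seconds(3600))  // cap is an assumption
    : initial_(initial), multiplier_(multiplier), max_(max_backoff),
      backoff_(initial) {}

  // The peer did something good: reset the backoff duration to its initial
  // value. An ignore window that is already running is left untouched.
  void behaved() { backoff_ = initial_; }

  // The peer did something bad: ignore it for the current backoff, then
  // multiply the backoff for next time (bounded by the assumed cap).
  void disappointed(Clock::time_point now = Clock::now()) {
    ignore_until_ = now + backoff_;
    backoff_ = std::min(backoff_ * multiplier_, max_);
  }

  // True while the peer's election messages should be dropped.
  bool should_ignore(Clock::time_point now = Clock::now()) const {
    return now < ignore_until_;
  }

  // Duration that the next disappointed() call would apply.
  Seconds next_backoff() const { return backoff_; }

private:
  Seconds initial_;
  int multiplier_;
  Seconds max_;
  Seconds backoff_;
  Clock::time_point ignore_until_{};  // clock epoch: not ignoring anyone yet
};
```

<p>A monitor would keep one such object per peer, call <code>disappointed()</code> on the bad events listed above and <code>behaved()</code> on the good ones, and check <code>should_ignore()</code> before acting on an election message from that peer.</p>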
<p>I think the delicate part is getting the good and bad behaviors right.</p>
<p>Sage Weil (sage@newdream.net), 2015-09-14T20:43:45Z (https://tracker.ceph.com/issues/11443?journal_id=58529)</p>
<ul></ul><p>A trivial workaround here would be backporting the respawn thresholds for upstart?</p>
<p>Sage Weil (sage@newdream.net), 2015-09-14T20:47:22Z (https://tracker.ceph.com/issues/11443?journal_id=58530)</p>
<ul></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/5930">https://github.com/ceph/ceph/pull/5930</a></p>
<p>Greg Farnum (gfarnum@redhat.com), 2015-09-14T22:27:09Z (https://tracker.ceph.com/issues/11443?journal_id=58538)</p>
<ul></ul><p>Yes, getting the good and bad behaviors right is hard. Remember this needs to be something that each monitor can track with only local information, but that will still lead to a healthy quorum if one is possible.</p>
<p>For instance, we can't ignore a monitor that fails to complete an election: it might be healthy but not have gotten enough acknowledgements from peers. Blocking it leads to an easy ongoing election failure when booting monitors slowly.</p>
<p>I think we'd need to do something like:<br />1) Only blacklist a monitor if we can, ourselves, detect an actual failure from them. For leaders: failure to extend paxos leases (or to ack a known-good victory? Is there somewhere in the stage we can know that's happened?). For peons, failure to ack a commit.<br />2) Ignore election start messages from blacklisted monitors.<br />3) When voting in an election, exclude blacklisted monitors from the set of potential leaders.<br />3b) If a blacklisted monitor becomes leader (without us, because we didn't vote for them!) and we are not in quorum, try to join quorum and leave them in the voting set, but do not remove them from the blacklist.</p>
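<p>As a rough illustration of rules 1 to 3, a purely local blacklist with a decay might look like the sketch below. Everything in it is hypothetical: the class <code>LocalBlacklist</code>, its method names, the use of integer ranks, and the "lowest non-blacklisted rank is preferred" policy are assumptions made for the example, not the elector's actual interface. Rule 3b (joining a quorum led by a blacklisted monitor) is not modeled.</p>

```cpp
#include <chrono>
#include <map>
#include <set>

// Hypothetical sketch: each monitor tracks blacklisted peers using only
// failures it detected itself, and every entry expires after a ttl.
class LocalBlacklist {
public:
  using Clock = std::chrono::steady_clock;

  // Rule 1: record a failure this monitor observed locally (for example a
  // missed lease extension or a missing commit ack). The entry decays after ttl.
  void blacklist(int rank, Clock::time_point now, std::chrono::seconds ttl) {
    expires_[rank] = now + ttl;
  }

  bool is_blacklisted(int rank, Clock::time_point now) const {
    auto it = expires_.find(rank);
    return it != expires_.end() && now < it->second;
  }

  // Rule 2: election start messages from blacklisted monitors are ignored.
  bool accept_election_start(int from, Clock::time_point now) const {
    return !is_blacklisted(from, now);
  }

  // Rule 3: when voting, skip blacklisted monitors. Assumes the lowest rank
  // is preferred as leader; returns -1 if every candidate is blacklisted.
  int preferred_leader(const std::set<int>& candidates,
                       Clock::time_point now) const {
    for (int rank : candidates) {  // std::set iterates in ascending order
      if (!is_blacklisted(rank, now)) {
        return rank;
      }
    }
    return -1;
  }

private:
  std::map<int, Clock::time_point> expires_;  // rank -> end of blacklisting
};
```

<p>Because the state is local, two monitors can disagree about who is blacklisted, which is why the convergence argument still needs to be worked out carefully.</p>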
<p>I think something like that would let the cluster converge on agreement, but somebody would need to sketch it out pretty thoroughly. Obviously each blacklisting event would need to come with a decay.</p>
<p>Kefu Chai (tchaikov@gmail.com), 2015-09-15T13:38:38Z (https://tracker.ceph.com/issues/11443?journal_id=58564)</p>
<ul></ul><p>Sage,</p>
<p>Backporting the respawn thresholds for upstart helps in some cases, but not in the case where we originally spotted the DoS, where a monitor was just slow but did not crash or restart itself.</p>