Feature #11443
openElector: throttle election attempts from DoSing peers
Description
In at least one cluster, we've seen a situation where a badly-behaved monitor can DoS the entire cluster: it continually calls elections through one mechanism or another but can't accept any writes, and so the healthy quorum members are unable to make forward progress.
(Due to the nature of the failure, we don't have very helpful logs. For the purposes of this discussion, let's assume that the monitor was asserting out on every failed write, but then getting restarted by its init system every time. The precise nature of the failure shouldn't actually matter; similar failure modes are possible if the network can't maintain a stable connection, etc.)
We want to prevent this misbehaving monitor from disrupting the rest of the cluster. That leads me to think each monitor needs some kind of registry of "disappointing" peers: those who were elected leader but failed to ack in time, or those who participated in an election but then timed out on a commit or ratification. Perhaps even those who participate in one election and then start another one that we locally don't think needs to be called.
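To make the registry idea concrete, here is a minimal sketch in Python rather than in the monitor's own code. Everything in it is hypothetical: the `Offense` categories simply mirror the cases listed above, and `DisappointmentRegistry`, `record`, and `count` are illustrative names, not part of the existing Elector.

```python
from collections import defaultdict
from enum import Enum


class Offense(Enum):
    # Hypothetical categories, taken from the cases described above.
    LEADER_NO_ACK = "elected leader but failed to ack in time"
    COMMIT_TIMEOUT = "timed out on a commit or ratification"
    SPURIOUS_ELECTION = "called an election we saw no need for"


class DisappointmentRegistry:
    """Local, per-monitor record of how each peer has misbehaved."""

    def __init__(self):
        # peer address -> {Offense -> number of times observed}
        self._counts = defaultdict(lambda: defaultdict(int))

    def record(self, addr, offense):
        """Note one more instance of `offense` by the peer at `addr`."""
        self._counts[addr][offense] += 1

    def count(self, addr, offense=None):
        """Return the count for one offense, or the total if none is given."""
        per_peer = self._counts.get(addr, {})
        if offense is None:
            return sum(per_peer.values())
        return per_peer.get(offense, 0)
```

Keeping the offenses separate, rather than one counter per peer, would let a later policy weigh a spurious election differently from a missed ack.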
The difficulty, of course, is that we can't blacklist them forever, and we need to let them back in once the administrator has resolved the issue. Even worse, monitors always set their nonce values to zero, so we can't identify new daemon instances or anything; all we have are addresses and the epoch values they claim to have. So do we just do some kind of decay on how disappointing we find each peer? Do we prevent them from starting the next election, but let them participate in one that somebody else begins?
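One possible answer to both questions, sketched in Python under stated assumptions: each peer carries a penalty score that halves every `half_life` seconds, a peer above `threshold` may not start an election, and any peer may still join an election someone else began. The class and method names, the exponential decay, and the default values are all assumptions for illustration; none of this is existing monitor behaviour. Peers are keyed by address only, since the nonce can't distinguish daemon instances.

```python
import math


class DecayingScore:
    """Per-peer penalty that fades over time, so no peer is banned forever."""

    def __init__(self, half_life=300.0, threshold=3.0):
        self.half_life = half_life  # seconds for a score to halve (arbitrary)
        self.threshold = threshold  # score at which we stop honouring elections
        self._scores = {}           # peer address -> (score, time of last update)

    def _current(self, addr, now):
        # Apply exponential decay for the time elapsed since the last update.
        score, last = self._scores.get(addr, (0.0, now))
        elapsed = max(0.0, now - last)
        return score * math.pow(0.5, elapsed / self.half_life)

    def penalize(self, addr, now, weight=1.0):
        """Add `weight` to the peer's decayed score."""
        self._scores[addr] = (self._current(addr, now) + weight, now)

    def may_start_election(self, addr, now):
        """Ignore election calls from peers whose score reached the threshold."""
        return self._current(addr, now) < self.threshold

    def may_participate(self, addr, now):
        """Peers may always join an election that somebody else began."""
        return True
```

With `half_life=10` and `threshold=3`, a peer penalized three times at `now=0` is throttled immediately and allowed to call elections again by `now=10`, once its score has decayed to 1.5. A restarted, repaired monitor therefore recovers without administrator action, at the cost of a delay proportional to how badly it behaved.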