Mon - Independently dispatch non-conflicting messages

Summary

The monitors have a long tradition of dispatching one message at a time, holding a Big Freaking Monitor Lock (BFML) while handling each message in order to guarantee serializability of operations. This can hurt throughput and is bound to hurt scalability.
Implementing a message dispatching approach similar to the OSD's, allowing the monitor to dispatch different, non-conflicting message types concurrently, should improve both in the long run.

Owners

  • Joao Eduardo Luis (Inktank)
  • Name (Affiliation)
  • Name

Interested Parties

  • Name (Affiliation)
  • Name (Affiliation)
  • Name

Current Status

There are already implementations of dispatch queues, work queues and thread pools. The OSD, for instance, uses them, and adopting them in the monitor should not take much effort.
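
For reference, a minimal sketch of the work-queue-plus-thread-pool pattern is shown below. This is purely illustrative (the monitor would reuse the existing infrastructure in the tree, not this ad-hoc class):

    // Illustrative work queue backed by a small thread pool; worker threads
    // pull queued items and handle them one at a time per thread.
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class WorkQueue {
    public:
      explicit WorkQueue(size_t nthreads) {
        for (size_t i = 0; i < nthreads; ++i)
          workers.emplace_back([this] { run(); });
      }
      ~WorkQueue() {
        {
          std::lock_guard<std::mutex> l(lock);
          stopping = true;
        }
        cond.notify_all();
        for (auto& t : workers) t.join();
      }
      void queue(std::function<void()> item) {
        {
          std::lock_guard<std::mutex> l(lock);
          items.push(std::move(item));
        }
        cond.notify_one();
      }
    private:
      void run() {
        for (;;) {
          std::function<void()> item;
          {
            std::unique_lock<std::mutex> l(lock);
            cond.wait(l, [this] { return stopping || !items.empty(); });
            if (stopping && items.empty()) return;
            item = std::move(items.front());
            items.pop();
          }
          item();  // handle one queued work item (e.g. one message)
        }
      }
      std::mutex lock;
      std::condition_variable cond;
      std::queue<std::function<void()>> items;
      std::vector<std::thread> workers;
      bool stopping = false;
    };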

Detailed Description

There are three main message types in the monitor:
  • Client
    • Messages sent by clients to the monitors, be they user commands or OSD/MDS messages
    • These come in two kinds: read and write
      • Read messages are all those that do not change the state, directly or indirectly
      • Write messages change the state, directly or indirectly
  • Paxos
    • Lease renewals: letting the other monitors in the quorum know that a given version is still valid
    • Proposals: responsible for spreading and committing a new state throughout the monitor cluster
  • Monitor-specific
    • Clock-skew detection
    • Monitor store disk health (currently, available space)
    • Elector messages
    • Probes
    • Data Store Synchronization
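
To make the classification concrete, a purely hypothetical classifier might bucket incoming messages into independent dispatch classes along these lines (the Message stand-in and the string-typed message names are made up for the sketch and do not match the monitor's real message hierarchy):

    #include <string>

    // Hypothetical dispatch classes; one queue/lock per class would let
    // non-conflicting messages be handled concurrently.
    enum class DispatchClass {
      ClientRead,     // client messages that do not change state, directly or indirectly
      ClientWrite,    // client messages that do change state
      PaxosLease,     // lease renewals within the quorum
      PaxosProposal,  // spreading and committing a new state
      MonInternal,    // clock-skew, store health, elector, probes, sync
    };

    struct Message {        // stand-in for the real monitor message types
      std::string type;     // e.g. "client_read", "paxos_lease", ...
    };

    DispatchClass classify(const Message& m) {
      if (m.type == "paxos_lease")    return DispatchClass::PaxosLease;
      if (m.type == "paxos_proposal") return DispatchClass::PaxosProposal;
      if (m.type == "client_write")   return DispatchClass::ClientWrite;
      if (m.type == "client_read")    return DispatchClass::ClientRead;
      return DispatchClass::MonInternal;  // elector, probe, sync, health, clock-skew
    }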
We can handle the three different types independently (a sketch of such a dispatch scheme follows the list below).
  • Read-only Client messages can be handled as is, provided the monitor belongs to the quorum.
  • Write Client messages should be forwarded to the quorum leader.
    • Merely forwarding a message should not block progress of other operations
  • When handling write Client messages, the monitor will not trigger proposals automatically, but will instead queue a proposal to be dispatched. This is what the monitor currently does, but while holding the BFML.
  • Paxos lease renewal messages could be handled independently from Paxos proposal messages. As long as the quorum stays healthy, monitors should not time out just because there are a lot of messages to be handled in between lease extensions.
    • We can dispatch paxos lease messages promptly, as long as we keep paxos commit timeouts in place (hard-capping the amount of time any given proposal may take to finish)
    • This prevents a lease timeout from being triggered just because some proposal takes a little too long to be committed to leveldb.
    • There is no reason not to refresh the lease on the latest paxos version just because the current paxos proposal is underway and taking its sweet time to finish.
    • The one argument against this approach is that lease messages are used as a health indicator for the monitor quorum: if a lease timeout is triggered, the monitor is assumed to be unwell and a new election is triggered. However, if all the monitors are alive and well aside from taking their time committing a value to the store, this may just be an indication that the value being committed is too big or that there is contention in accessing leveldb. As long as the quorum is not waiting on a monitor to finish committing, and as long as the monitors are responsive and provide the latest state on request, we should be okay -- if we are not, that should not be the lease extension mechanism's responsibility to assess.
  • Paxos proposals should happen independently from all other message handling. Proposals are started by queuing a new value on the Paxos proposal queue (on the leader), and from then on they should not wait on any other message to be handled in order to make progress.
    • The monitor already queues proposals, serializing them, but holds the BFML in the process, thus hindering progress on the rest of the monitor.
  • Monitor-specific messages are all those used to maintain the monitor cluster's state; they are independent of any client interaction with the monitors.
    • Clock-skew detection and Data Store Health tracking, for instance, are performed while the monitor belongs to a quorum and do not rely on leveldb access, nor do they conflict with any Client or Paxos messages. These are secondary messages meant to check the monitor cluster's health and should not hinder, or be hindered by, other activity in the cluster.
    • Elector messages are responsible for establishing a ranked quorum and only happen in two scenarios:
      • A monitor started and wants to join the quorum. While joining, it should not disrupt ongoing activity, nor should it be held back by on-going, time-consuming operations.
      • A monitor died, a timeout was triggered, and an election is called.
      • Dispatching these messages independently from other messages simply means we can abort any on-going operation earlier in the run, instead of, e.g., waiting for a value to be fully committed before we start or finish an election.
      • Handling these messages independently from Paxos proposals should be discussed, as there may be some non-obvious corner cases.
    • Probes are used to figure out which monitors are up, and who they are. A monitor just starting, and probing the rest of the monitors, should not be hindered by on-going, lengthy operations or by a lot of messages waiting to be dispatched.
    • The Synchronization process may be quick or it may take a while. A monitor acting as the provider for some new monitor (wanting to join the cluster) should dispatch and handle synchronization messages independently from other requests, so that it can promptly provide all the values the new monitor needs to get into the quorum as soon as possible.
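
Putting the pieces together, below is a minimal sketch of per-class dispatch, reusing the hypothetical WorkQueue, DispatchClass, Message and classify() helpers sketched above. Each class gets its own queue and thread, so only messages of the same (conflicting) class serialize behind each other, and a lease renewal no longer has to wait behind a slow proposal commit the way it would under the BFML:

    #include <map>
    #include <memory>

    // Reuses the WorkQueue, DispatchClass, Message and classify() sketches above.

    class MonDispatcher {
    public:
      MonDispatcher() {
        for (auto c : {DispatchClass::ClientRead, DispatchClass::ClientWrite,
                       DispatchClass::PaxosLease, DispatchClass::PaxosProposal,
                       DispatchClass::MonInternal})
          queues[c] = std::make_unique<WorkQueue>(1);   // one thread per class
      }

      // Entry point: route the message to its class's queue. Only messages of
      // the same class serialize behind each other; there is no global lock.
      void dispatch(Message m) {
        DispatchClass c = classify(m);
        queues.at(c)->queue([this, m, c] { handle(m, c); });
      }

    private:
      void handle(const Message& m, DispatchClass c) {
        switch (c) {
          case DispatchClass::ClientRead:
            // serve from the latest committed state; requires quorum membership
            break;
          case DispatchClass::ClientWrite:
            // peon: forward to the leader without blocking anything else;
            // leader: queue a proposal rather than triggering one synchronously
            break;
          case DispatchClass::PaxosLease:
            // extend the lease promptly, even while a proposal commit is in
            // flight; the commit itself is bounded by its own commit timeout
            break;
          case DispatchClass::PaxosProposal:
            // proposals serialize only against each other on this queue
            break;
          case DispatchClass::MonInternal:
            // clock-skew, store health, elector, probe and sync handling
            break;
        }
      }

      std::map<DispatchClass, std::unique_ptr<WorkQueue>> queues;
    };

In practice the per-class handlers would still need finer-grained locking around shared monitor state, which is where the non-obvious corner cases mentioned above (e.g. elections versus in-flight proposals) would have to be worked out.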

Work items

Coding tasks

  1. TBD very shortly
  2. TBD very shortly
  3. TBD very shortly

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3
