Mon - dispatch messages while waiting for IO to complete

Summary

During the Giant CDS we discussed dispatching monitor messages independently, increasing concurrency while preserving the serializability of operations. After some testing with write-intensive workloads, we realized that what we really need is to dispatch some monitor messages while the monitor is waiting for IO to complete. This will minimize monitor flapping caused by leveldb not being able to keep up with the write workload, especially when compactions are triggered.

Owners

  • Joao Eduardo Luis (Inktank)

Interested Parties

  • Name (Affiliation)
  • Name (Affiliation)
  • Name

Current Status

The previous blueprint from the Giant CDS didn't get much love, so the current state is pretty much the same as it was back then: dispatch queues, work queues and thread pools already exist and are used in other portions of Ceph, and we can reuse them in the monitors. We may not even need any of that if we choose to keep handling just a single message at a time while relinquishing control of the big monitor lock.

Detailed Description

We will have to make sure that some messages, especially those related to elections and leases, are still dispatched while IO is being performed. The majority of IO is performed when finishing a paxos transaction and applying the state to leveldb.

Currently, the ideal approach would be to introduce a new state to Paxos indicating that we are waiting (or about to wait) for a paxos transaction to complete. We should wait on a condition, relinquish the mon lock, and let the monitor go back to handling messages. We must ensure that lease extensions are still propagated to all the monitors, even if the current proposal has not been fully committed (this is okay because, until the transaction is fully committed, the old value is still valid).
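The idea of waiting on a condition while giving up the monitor lock can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: MiniMon, its members, and the message type strings are made up for illustration and are not Ceph's actual Monitor or Paxos code. It assumes that only lease and election messages are safe to handle while a commit is pending, and that anything else is refused so the caller can requeue it.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical, heavily simplified stand-in for the monitor. None of these
// names are Ceph's real classes or members.
struct MiniMon {
  std::mutex lock;               // plays the role of the big monitor lock
  std::condition_variable cond;  // signalled on every state change
  bool committing = false;       // the proposed "waiting for commit" state
  bool io_done = false;          // set once the store has applied the write
  std::vector<std::string> handled;  // messages dispatched so far, in order

  // Handle one message. While a commit is pending, only lease and election
  // traffic is let through; anything else is refused so the caller can
  // requeue it (assumed policy, for illustration only).
  bool dispatch(const std::string& type) {
    std::lock_guard<std::mutex> g(lock);
    if (committing && type != "lease" && type != "election") return false;
    handled.push_back(type);
    return true;
  }

  // Enter the commit state and block until the IO completes. cond.wait()
  // releases `lock` while blocked, so dispatch() can run in the meantime.
  void commit_and_wait() {
    std::unique_lock<std::mutex> l(lock);
    committing = true;
    io_done = false;
    cond.notify_all();  // let observers know the commit state was entered
    cond.wait(l, [this] { return io_done; });
    committing = false;
  }

  // Called by whatever performs the IO once the transaction is durable.
  void io_complete() {
    {
      std::lock_guard<std::mutex> g(lock);
      io_done = true;
    }
    cond.notify_all();
  }

  // Demo helper: block until the commit state has been entered.
  void wait_until_committing() {
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return committing; });
  }
};

// Drives one commit and returns the messages that were actually handled.
std::vector<std::string> run_demo() {
  MiniMon mon;
  std::thread committer([&mon] { mon.commit_and_wait(); });
  mon.wait_until_committing();
  mon.dispatch("lease");         // accepted while the commit is pending
  mon.dispatch("client_write");  // refused while the commit is pending
  mon.io_complete();
  committer.join();
  mon.dispatch("client_write");  // accepted once the commit has finished
  return mon.handled;
}
```

In this sketch the message handler still processes one message at a time under the lock, which matches the option of not introducing work queues or thread pools at all; the only change is that the lock is not held for the duration of the IO.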

We may also have to look into lease timeouts and adjusting them properly (or even temporarily disabling them) during a transaction commit, as sometimes a transaction can take up to one minute to commit (a write may have to wait for a compaction to finish, for instance) and all our timeouts are considerably shorter than that by default (longest is, iirc, 10 seconds).
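One possible way to temporarily disable a lease timeout during a commit is sketched below. LeaseTimer and its methods are hypothetical names, not existing Ceph code, and the policy shown (suspend the expiry check while a commit is in flight, then restart the countdown when it finishes) is only one of the options mentioned above. Time is passed in explicitly as seconds since an arbitrary starting point so that the behaviour is easy to follow.

```cpp
#include <chrono>

using Seconds = std::chrono::seconds;

// Hypothetical lease timer; the names and the policy are illustrative only.
class LeaseTimer {
 public:
  explicit LeaseTimer(Seconds timeout) : timeout_(timeout) {}

  // Record that the lease was renewed at time `now`.
  void renew(Seconds now) { last_renewal_ = now; }

  // Suspend expiry checks while a transaction commit is in flight.
  void commit_started() { commit_in_flight_ = true; }

  // Resume expiry checks and restart the countdown, so that time spent
  // waiting on the store is not counted against the lease.
  void commit_finished(Seconds now) {
    commit_in_flight_ = false;
    last_renewal_ = now;
  }

  // True if the lease has timed out; never true during a commit.
  bool expired(Seconds now) const {
    if (commit_in_flight_) return false;
    return now - last_renewal_ > timeout_;
  }

 private:
  Seconds timeout_;
  Seconds last_renewal_{0};
  bool commit_in_flight_ = false;
};
```

With a 10 second timeout, a commit that starts at t=0 and takes a full minute would not expire the lease at t=60; after commit_finished(Seconds(60)) the lease would next expire once more than 10 seconds have passed without a renewal.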

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3