Improve tail latency

Tail latency (e.g. the 99.99th percentile) is important for some online serving scenarios. This blueprint summarizes the tail latency issues we observed on our production cluster.

  • OSD ungraceful shutdown. An OSD might crash due to a broken disk, a software bug, etc. Currently the crash of an OSD is detected by its peers, and it can take tens of seconds (20 seconds by default) to trigger an osdmap change, which in turn leads clients to retry the in-flight requests associated with this OSD. We could preemptively tell the MON that the OSD is going down when an assertion failure occurs, as is done for a graceful shutdown.
  • Peering speed improvements
  • Slow OSDs. An OSD can become slow for various reasons, and currently the client latency is determined by the slowest OSD in the PG serving the request.
    • For EC pools, we tested a patch that reads k + m chunks and uses the first k chunks returned to serve the client. It significantly improved latency (by 30%), especially tail latency. However, there are still a couple of problems:
      • If the primary is stuck, the patch does not help.
      • The patch brings no benefit for writes (if anything it may hurt them, since it adds load).
      • It does not benefit replicated pools.
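
The effect of reading k + m chunks and using the first k to return can be illustrated with a small simulation. This is only a sketch, not the patch itself: the function names, the latency distribution (1 to 5 ms normally, 50 to 500 ms for a slow OSD) and the slow-OSD probability are all assumptions made up for illustration. It also ignores the limitations listed above (a stuck primary, the extra load of the additional reads).

```python
import math
import random


def read_latency_wait_k(chunk_latencies, k):
    """Baseline: read only the first k chunks and wait for all of them."""
    if not 0 < k <= len(chunk_latencies):
        raise ValueError("k must be between 1 and the number of chunks")
    return max(chunk_latencies[:k])


def read_latency_first_k(chunk_latencies, k):
    """Read all k + m chunks and finish as soon as any k have returned."""
    if not 0 < k <= len(chunk_latencies):
        raise ValueError("k must be between 1 and the number of chunks")
    # The request completes at the k-th smallest chunk latency.
    return sorted(chunk_latencies)[k - 1]


def percentile(samples, p):
    """Nearest-rank percentile of a list of samples (0 < p <= 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def simulate(k=4, m=2, trials=100_000, slow_prob=0.01, seed=0):
    """Return (baseline_p99, first_k_p99) in milliseconds.

    The latency model is hypothetical: each OSD is independently slow
    with probability slow_prob.
    """
    rng = random.Random(seed)
    baseline, first_k = [], []
    for _ in range(trials):
        latencies = [
            rng.uniform(50, 500) if rng.random() < slow_prob else rng.uniform(1, 5)
            for _ in range(k + m)
        ]
        baseline.append(read_latency_wait_k(latencies, k))
        first_k.append(read_latency_first_k(latencies, k))
    return percentile(baseline, 99), percentile(first_k, 99)


if __name__ == "__main__":
    base_p99, first_k_p99 = simulate()
    print(f"p99 waiting for k chunks:     {base_p99:.1f} ms")
    print(f"p99 using first k of k + m:   {first_k_p99:.1f} ms")
```

With k + m reads in flight, a request is delayed only when more than m of the OSDs involved are slow at the same time, which is why the tail benefits most. The numbers printed depend entirely on the assumed distribution and should not be read as a prediction for a real cluster.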