Tail latency improvements » History » Version 2

Samuel Just, 06/29/2015 06:47 PM

Improve tail latency

Summary

Tail latency (e.g. the 99.99th percentile) is important for some online serving scenarios. This blueprint summarizes some tail latency issues we observed on our production cluster.

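To make the summary concrete: the tail of a latency distribution can look very different from the median. A minimal nearest-rank percentile sketch over synthetic latency samples (the data is made up; this is not Ceph code):

```python
import math

def percentile(samples, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))  # 1-based nearest rank
    return ranked[rank - 1]

# 10,000 synthetic request latencies: almost all fast, two slow outliers.
latencies_ms = [1.0] * 9998 + [50.0, 500.0]
print(percentile(latencies_ms, 50))     # 1.0  -- the median looks healthy
print(percentile(latencies_ms, 99.99))  # 50.0 -- the tail sees the outliers
```

Two outliers in 10,000 requests barely move the median but dominate p99.99, which is why the OSD-level issues below matter even when average latency is fine.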
h1. OSD ungraceful shutdown.

An OSD might crash due to a broken disk, a software bug, etc. Currently the crash/down of an OSD is detected by its peers, and it can take tens of seconds (20 seconds by default) to trigger an osdmap change, which in turn forces clients to retry the in-flight requests associated with that OSD.

* Preemptively tell the MON that the OSD is going down (crashing) when there is an assertion failure, as with a graceful shutdown
* Peering speed improvements
* Slow OSDs. An OSD can become slow for various reasons, and currently client latency is determined by the slowest OSD in the PG serving the request.
  * For EC pools, we tested a patch that reads all k + m chunks and serves the client with the first k chunks returned; it significantly improved latency (by about 30%), especially at the tail. However, a couple of problems remain:
    * If the primary is stuck, the patch does not help.
    * The patch does not benefit WRITEs (if anything it hurts, since it adds more load).
    * It does not benefit replicated pools.
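The first bullet above (preemptively reporting a crash to the MON instead of waiting out the ~20-second peer-detection timeout) can be sketched generically; `notify_monitor_down` and `run_with_crash_report` are hypothetical names for illustration, not actual Ceph interfaces:

```python
def notify_monitor_down(osd_id):
    """Hypothetical stand-in for sending a 'mark me down' message to the MON."""
    print(f"MON notified: osd.{osd_id} is going down")

def run_with_crash_report(osd_id, op):
    """Run an operation; on assertion failure, tell the MON before dying."""
    try:
        return op()
    except AssertionError:
        # Report immediately rather than waiting ~20 s for peers to notice,
        # mirroring what a graceful shutdown would do.
        notify_monitor_down(osd_id)
        raise

def faulty_op():
    assert False, "simulated internal invariant violation"

try:
    run_with_crash_report(3, faulty_op)
except AssertionError:
    print("daemon would now abort; MON already knows")
```

The point is only the ordering: the down-report races ahead of the peer-timeout path, so clients can be redirected as soon as the next osdmap is published.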
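The EC read-side idea (issue reads for all k + m chunks and serve the client from the first k to return) can be sketched with a thread pool; `fetch_shard` and its delays are invented stand-ins for OSD shard reads:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

K, M = 4, 2  # k data chunks, m coding chunks

def fetch_shard(shard_id, delay):
    """Stand-in for reading one shard from an OSD; delay models OSD speed."""
    time.sleep(delay)
    return shard_id

def read_first_k(pool, delays):
    """Issue reads for all k + m shards and return once any k arrive."""
    futures = [pool.submit(fetch_shard, i, d) for i, d in enumerate(delays)]
    done = []
    for fut in as_completed(futures):
        done.append(fut.result())
        if len(done) == K:  # any k of the k + m chunks suffice to decode
            return sorted(done)

pool = ThreadPoolExecutor(max_workers=K + M)
start = time.monotonic()
# Shard 5 is stuck for 2 seconds; the read still completes quickly.
chunks = read_first_k(pool, [0.01, 0.02, 0.01, 0.03, 0.02, 2.0])
elapsed = time.monotonic() - start
pool.shutdown(wait=False)
print(chunks, round(elapsed, 3))
```

This hides one slow shard on READ, which matches the ~30% tail improvement observed, but it also shows the limits listed above: if the primary coordinating the read is itself stuck nothing is issued at all, and WRITEs must still reach all shards, so the trick adds load without cutting the write tail. (The interpreter lingers until the stuck worker thread finishes, which a real daemon would instead cancel.)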