Tail latency improvements » History » Version 2
Samuel Just, 06/29/2015 06:47 PM
h1. Summary

Tail latency (e.g. the 99.99th percentile) is important for some online serving scenarios. This blueprint summarizes some of the tail latency issues we observed on our production cluster.
h1. OSD ungraceful shutdown
An OSD might crash due to a broken disk, a software bug, etc. Currently, a crashed or down OSD is detected by its peers, and it can take tens of seconds (20 seconds by default) to trigger an osdmap change, which in turn forces clients to retry the in-flight requests associated with that OSD.
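The 20-second default mentioned above comes from the OSD heartbeat grace period. As a sketch, the relevant settings with their default values (assuming a standard ceph.conf layout):

```ini
[osd]
# Peers report this OSD down after it has missed heartbeats for this long (seconds)
osd heartbeat grace = 20
# How often an OSD pings its peers (seconds)
osd heartbeat interval = 6
```

Lowering the grace period shortens detection time but raises the risk of flapping under transient load, which is why preemptive notification (below) is attractive.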
* Preemptively tell the MON that the OSD is going down when it crashes (e.g. on an assertion failure), just as with a graceful shutdown
* Peering speed improvements
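The preemptive-notification idea could look roughly like the following: install handlers that fire on fatal errors and immediately ask the monitor to mark the OSD down, instead of waiting for peers to time it out. This is a hypothetical Python sketch only (the real OSD is C++); `notify_mon_down` and `OSD_ID` are invented for illustration.

```python
import signal
import sys

OSD_ID = 3  # hypothetical OSD id, for illustration only

def notify_mon_down(osd_id, reason):
    """Hypothetical stand-in for telling the monitor to mark this OSD down."""
    print(f"mon <- mark osd.{osd_id} down ({reason})")

def excepthook(exc_type, exc, tb):
    # Unhandled exceptions (the analogue of an assertion failure): tell the
    # MON before dying, so clients see the osdmap change immediately instead
    # of after the ~20 s heartbeat grace period.
    notify_mon_down(OSD_ID, f"crash: {exc_type.__name__}")
    sys.__excepthook__(exc_type, exc, tb)

sys.excepthook = excepthook

def on_fatal_signal(signum, frame):
    # Fatal signals (e.g. SIGABRT from a failed assert) get the same treatment,
    # then the default action is re-raised so the process still dies.
    notify_mon_down(OSD_ID, f"signal {signum}")
    signal.signal(signum, signal.SIG_DFL)
    signal.raise_signal(signum)

signal.signal(signal.SIGABRT, on_fatal_signal)
```

The point of the sketch is ordering: the mark-down message goes out before the process exits, so the osdmap change and client redirection are not gated on peer failure detection.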
h1. Slow OSDs

An OSD can become slow for various reasons, and currently client latency is determined by the slowest OSD in the PG serving the request.
* For an EC pool, we tested a patch that reads all k + m chunks and serves the client from the first k chunks returned; it significantly improved latency (by about 30%), especially at the tail. However, a couple of problems remain:
** If the primary is stuck, the patch does not help.
** The patch does not benefit WRITEs (if anything it hurts them, since it adds load).
** It does not benefit replicated pools.
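Why reading k + m chunks and keeping the fastest k cuts tail latency can be seen with a small simulation. This is a sketch, not the patch itself: the EC profile (k=8, m=3), the 1% slow-read probability, and the latency distributions are all assumed for illustration.

```python
import random

random.seed(42)

K, M = 8, 3          # assumed EC profile: k data chunks + m coding chunks
TRIALS = 10_000

def chunk_latency():
    """One chunk read: usually fast, occasionally hits a slow OSD."""
    if random.random() < 0.01:                # 1% of reads land on a slow OSD
        return random.uniform(100.0, 500.0)   # ms
    return random.uniform(1.0, 5.0)           # ms

def read_exact_k(k):
    """Current behavior: issue exactly k reads; done when the slowest returns."""
    return max(chunk_latency() for _ in range(k))

def read_first_k(k, m):
    """Patched behavior: issue k + m reads in parallel; done when the fastest
    k have returned (any k chunks suffice to reconstruct the object)."""
    lat = sorted(chunk_latency() for _ in range(k + m))
    return lat[k - 1]

def percentile(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p * len(xs)))]

baseline = [read_exact_k(K) for _ in range(TRIALS)]
firstk   = [read_first_k(K, M) for _ in range(TRIALS)]

for p in (0.50, 0.99, 0.9999):
    print(f"p{p * 100:g}: exact-k {percentile(baseline, p):7.1f} ms,"
          f" first-k-of-(k+m) {percentile(firstk, p):7.1f} ms")
```

With these assumptions the baseline tail is dominated by the chance that any one of the k reads is slow, while the patched read stays fast unless more than m of the k + m reads are slow at once, which is far rarer. It also shows why the trick has no analogue for WRITEs: a write must reach all k + m chunks, so there is no "fastest k" to wait for.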