Ceph 0 day for performance regression
From Dumpling to Hammer we have made great progress on performance improvements such as the sharded workqueue, message fast dispatch, FileStore optimizations for the SSD case, and caching of important metadata structures. But we cannot easily see the benefit from the Dumpling release to the Hammer release with the default config (or a specified config). In other words, alongside many improvements we have also merged many commits that regress performance. It is not easy to pick these bad commits out of a pile of commits, and it is also difficult to benchmark against other versions and find the root cause by hand. Ceph aims to be unified storage, so it provides lots of interfaces and use cases. For RBD use cases we may find regressions easily, since we have lots of expert users (see tracker.ceph.com/issues/10956). But for other cases like CephFS, RadosGW, the RADOS API, or even the internal ObjectStore and Messenger APIs, we have nearly no guarantee against performance regressions. Even when a serious regression happens, new users and even developers may think it is normal, because they may know little or nothing about the previous detailed performance numbers.
I think we can do something like the Linux kernel's 0-day testing (https://lwn.net/Articles/514278/): we always keep an active cluster that is responsible for tracking the performance behavior of critical use cases. Like the current unit tests in Jenkins, if we mark a PR with a "performance-test" label, we pull that PR, build it, and run the performance regression tests.
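As a minimal sketch of that trigger, assuming a hypothetical helper fed the PR's labels by the CI system (the "performance-test" label is from this proposal; the function name and stage list are illustrative, not real Jenkins or Ceph tooling):

```python
# Hypothetical CI gate sketch: only the "performance-test" label is
# from the proposal; everything else here is illustrative.
PERF_LABEL = "performance-test"

def perf_pipeline(pr_labels, pr_sha, base_sha):
    """Return the ordered stages the perf job would run for this PR,
    or an empty list if the PR is not labeled for performance testing.
    The base commit is built and benchmarked first so the PR's results
    have something to be compared against."""
    if PERF_LABEL not in pr_labels:
        return []
    return [
        ("checkout", base_sha), ("build", base_sha), ("benchmark", base_sha),
        ("checkout", pr_sha), ("build", pr_sha), ("benchmark", pr_sha),
        ("compare", (base_sha, pr_sha)),
    ]

# An unlabeled PR triggers nothing:
print(perf_pipeline(["bug"], "headsha", "basesha"))  # → []
```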
Haomai Wang (Affiliation)
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
We currently have only some standalone, uncoordinated performance tests such as the ceph_perf_* programs.
Since we want Ceph's performance work to be effective and to behave as users expect, we can't keep doing small ad-hoc performance checks for each PR by hand; by the time someone checks whether a change regresses performance in some case, it is always too late.
I think we can divide this job into three parts:
1. Test cluster build and run: we may use teuthology or a lightweight framework.
2. Performance metrics from low level to high level: "ceph perf dump", the runtimes of some critical unit test programs, benchmark results (fio, iozone, radosgw benchmark, rados bench, rbd bench), and external monitoring (iostat, kernel slabinfo, interrupt counts (a potential sign of increasing system calls), netperf, CPU usage, memory usage, disk statistics, blktrace, ftrace, NUMA stats, vmstat, uptime, softirqs).
3. Realtime analysis that compares the performance metrics to the previous (base commit) results; we need to show the regression and the detailed numbers (and cases) to developers in a short time.
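For the low-level counters in part 2, "ceph perf dump" returns nested JSON sections of counters; a minimal sketch of flattening them into comparable metric names (the sample structure is abbreviated and the values are invented):

```python
import json

def flatten_perf_dump(node, prefix=""):
    """Flatten the nested JSON from a perf counter dump into
    {"section.counter": value}. Counters that are dicts of
    sub-values (e.g. avgcount/sum pairs) are flattened the same way."""
    flat = {}
    for key, value in node.items():
        name = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_perf_dump(value, name))
        else:
            flat[name] = value
    return flat

# Abbreviated example of a dump's shape (counter values invented):
sample = json.loads("""{
  "osd": {"op_w": 1200, "op_r": 3400},
  "filestore": {"journal_latency": {"avgcount": 10, "sum": 0.42}}
}""")
print(flatten_perf_dump(sample))
```

Flat metric names like "filestore.journal_latency.sum" can then be stored per commit and diffed directly.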
Finally, we want these performance regressions to be revealed ASAP.
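The realtime analysis in part 3 could start as simply as comparing each metric against the base-commit run and flagging relative drops beyond a threshold. A sketch, assuming "higher is better" metrics (latency-style metrics would need the sign flipped); the 5% threshold and metric names are illustrative:

```python
def find_regressions(base, current, threshold=0.05):
    """Compare two {metric: value} dicts (higher = better assumed)
    and return {metric: relative_change} for metrics whose relative
    drop versus the base commit exceeds `threshold`."""
    regressions = {}
    for name, base_value in base.items():
        if name not in current or base_value == 0:
            continue  # metric missing or no baseline to compare against
        change = (current[name] - base_value) / base_value
        if change < -threshold:
            regressions[name] = change
    return regressions

# Invented numbers: write bandwidth drops 16%, read IOPS are flat.
base = {"rados_bench_write_MBps": 250.0, "fio_randread_iops": 12000}
pr   = {"rados_bench_write_MBps": 210.0, "fio_randread_iops": 12100}
print(find_regressions(base, pr))  # → {'rados_bench_write_MBps': -0.16}
```

Reporting the relative change alongside the raw base/PR numbers gives developers both the "what regressed" and the "by how much" in one pass.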
This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.
Build / release tasks