Bug #63014
osd perf: Effects of osd_op_num_shards_[hdd/ssd] on op latency and bandwidth when using mclock_scheduler
Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
v18.1.0
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
09/28/2023
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
background:
- add patch1: https://github.com/ceph/ceph/pull/53417 (ref: https://tracker.ceph.com/issues/62812)
- add patch2: https://github.com/ceph/ceph/pull/52809 (ref: https://tracker.ceph.com/issues/62293)
# ceph daemon osd.0 config show | grep -e osd_mclock -e shards_hdd -e shard_hdd -e memstore_
    "memstore_debug_omit_block_device_write": "false",
    "memstore_device_bytes": "32212254720",
    "memstore_page_set": "false",
    "memstore_page_size": "65536",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "240.000000",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "251658240",
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.000000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.000000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "1.000000",   // 100%
    "osd_mclock_scheduler_client_res": "1.000000",   // 100%
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true",
    "osd_op_num_shards_hdd": "5",
    "osd_op_num_threads_per_shard_hdd": "1",
- In scenarios where the expected limits are not exceeded, multiple mclock shard queues can increase the latency of a single op.
- Here’s why:
case-1:
* osd_op_num_shards_hdd = 5 / osd_op_num_threads_per_shard_hdd = 1
* osd_bandwidth_capacity_per_shard = 240 / 5 = 48 MBps
* osd_bandwidth_cost_per_io = 240 / 240 = 1 MB
* IO buffer size: 200K --> aligned to 1M in calc_scaled_cost
* client_res = 1 / client_lim = 1 / client_wgt = 60
* res_bw = 48 MBps / lim_bw = 48 MBps
* op delay latency = 1M / 48 MBps = 0.02083 s = 20.83 ms
* expected bandwidth = 1000 ms / 20.83 ms * 200K / 1024K * 5 (shards) = 46.88 MBps
* every shard dequeues ops at an interval of 20.83 ms, so each op has to wait at least 20.83 ms in the mclock queue
* In fact, since the shard queue an op enters is based on the hash of the pgid that op.oid maps to, absolute balance cannot be achieved. On a macro scale the number of ops processed by each shard queue is basically balanced, with no order-of-magnitude difference; but on a micro scale, at any given moment some shard queues have few ops queued while others have many, which directly increases op latency.
* test-case-1:
  * res=1/lim=1/wgt=60/bw=240M/iops=240, num_shards=5, concurrency="-t 5"
  * rados bench
  * avg_lat = 0.0272721 s = 27.27 ms // 27.27 - 20.83 = 6.44 ms, avg_lat is higher than expected by (6.44/20.83 = 0.3091) 30.91%, bad!
  * bw = 35.77 MBps // 46.88 - 35.77 = 11.11 MBps, bw is lower than expected by (11.11/46.88 = 0.2369) 23.69%, bad!
case-2:
* osd_op_num_shards_hdd = 1 / osd_op_num_threads_per_shard_hdd = 5
* osd_bandwidth_capacity_per_shard = 240 MBps
* osd_bandwidth_cost_per_io = 240 / 240 = 1 MB
* IO buffer size: 200K --> aligned to 1M in calc_scaled_cost
* client_res = 1 / client_lim = 1 / client_wgt = 60
* res_bw = 240 MBps / lim_bw = 240 MBps
* op delay latency = 1M / 240 MBps = 0.00416 s = 4.16 ms
* the single shard dequeues ops at an interval of 4.16 ms, so each op has to wait at least 4.16 ms in the mclock queue
* 4.16 ms is only 1/5 of 20.83 ms
* with concurrency "-t 5" against a single queue draining ~240 ops/s, the expected avg latency is still 5 / 240 = 20.83 ms
* test-case-2:
  * res=1/lim=1/wgt=60/bw=240M/iops=240, num_shards=1, num_threads_per_shard=5, concurrency="-t 5"
  * rados bench
  * avg_lat = 0.0209406 s = 20.94 ms // 20.94 - 20.83 = 0.11 ms, 0.11/20.83 = 0.52%, very close, good!
  * bw = 46.61 MBps // 46.88 - 46.61 = 0.27 MBps, 0.27/46.88 = 0.57%, very close, good!
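The expected numbers in case-1 and case-2 can be reproduced with a short sketch of the cost model described above (per-IO cost rounded up to osd_bandwidth_cost_per_io, and the OSD bandwidth capacity split evenly across shards when client res = lim = 1). This is an illustration only; the function and variable names below are hypothetical, not Ceph code:

# illustrative sketch only, not Ceph code
def expected_latency_and_bw(num_shards, osd_bw_mb=240.0, osd_iops=240.0, io_size_kb=200.0):
    cost_per_io_mb = osd_bw_mb / osd_iops                       # 240 / 240 = 1 MB
    scaled_cost_mb = max(io_size_kb / 1024.0, cost_per_io_mb)   # 200K aligned up to 1M
    shard_bw_mb = osd_bw_mb / num_shards                        # per-shard capacity (res = lim = 1)
    dequeue_interval_s = scaled_cost_mb / shard_bw_mb           # minimum wait per op in a shard queue
    ops_per_sec = num_shards / dequeue_interval_s               # one op per shard per interval
    expected_bw_mb = ops_per_sec * io_size_kb / 1024.0          # payload actually written
    return dequeue_interval_s * 1000.0, expected_bw_mb

for shards in (5, 1):
    lat_ms, bw = expected_latency_and_bw(shards)
    print("num_shards=%d: min queue wait ~%.2f ms, expected bw ~%.2f MB/s" % (shards, lat_ms, bw))
# num_shards=5: min queue wait ~20.83 ms, expected bw ~46.88 MB/s
# num_shards=1: min queue wait ~4.17 ms, expected bw ~46.88 MB/s

With fewer shards the per-shard dequeue interval shrinks proportionally (20.83 ms vs 4.16 ms above) while the theoretical bandwidth ceiling stays the same in both cases.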
reproduce:
1. MON=1 MGR=1 OSD=1 FS=0 MDS=0 RGW=0 ../src/vstart.sh -n -l -X -b --msgr2 --memstore
2. modify ceph.conf
memstore_device_bytes = 32212254720
osd_op_queue = "mclock_scheduler"
osd_mclock_skip_benchmark = true
osd_mclock_max_capacity_iops_hdd = 240
osd_mclock_max_sequential_bandwidth_hdd = 251658240
osd_mclock_profile = "custom"
osd_mclock_scheduler_client_wgt = 60
osd_mclock_scheduler_client_res = 1
osd_mclock_scheduler_client_lim = 1
osd_mclock_scheduler_background_recovery_wgt = 20
osd_mclock_scheduler_background_recovery_res = 0
osd_mclock_scheduler_background_recovery_lim = 0.5
osd_mclock_scheduler_background_best_effort_wgt = 20
osd_mclock_scheduler_background_best_effort_res = 0
osd_mclock_scheduler_background_best_effort_lim = 0.1
3. create pool
ceph osd pool create test-pool 128 128 replicated replicated_rule 0 1 off
ceph osd pool application enable test-pool rgw
4. ceph osd pool ls detail
pool 1 'test-pool' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 274 flags hashpspool stripe_width 0 application rgw
5. ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 0.00099 - 30 GiB 11 GiB 0 B 0 B 0 B 19 GiB 35.83 1.00 - root default
-3 0.00099 - 30 GiB 11 GiB 0 B 0 B 0 B 19 GiB 35.83 1.00 - host zjw-q-dev
0 hdd 0.00099 1.00000 30 GiB 11 GiB 0 B 0 B 0 B 19 GiB 35.83 1.00 128 up osd.0
TOTAL 30 GiB 11 GiB 0 B 0 B 0 B 19 GiB 35.83
MIN/MAX VAR: 1.00/1.00 STDDEV: 0
6. ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 30 GiB 19 GiB 11 GiB 11 GiB 35.83
TOTAL 30 GiB 19 GiB 11 GiB 11 GiB 35.83
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
test-pool 1 128 11 GiB 56.36k 11 GiB 37.72 18 GiB
7. recreate pgs after an osd restart (required when using memstore)
cat force-create-pgid.sh
#!/bin/sh
set -x
# "ceph pg ls" rows contain a +0800 timestamp, so the grep keeps only PG rows (drops the header)
for i in $(ceph pg ls | grep +0800 | awk '{print $1}'); do
ceph osd force-create-pg $i --yes-i-really-mean-it &
done
wait
8. rados bench test
rados -c ./ceph.conf --osd_client_op_priority=63 bench 60 write --no-cleanup -t 5 -p test-pool -b 204800 --show-time
test-case-1 : osd_op_num_shards_hdd = 5 / osd_op_num_threads_per_shard_hdd = 1 (default value)
* res=1/lim=1/wgt=60/bw=240M/iops=240 ,num_shards=5,concurrency="-t 5"
* rados bench
* avg_lat=0.0272721s = 27.27ms
* bw = 35.77 MBps
# rados -c ./ceph.conf --osd_client_op_priority=63 bench 60 write --no-cleanup -t 5 -p test-pool -b 204800 --show-time
2023-09-27T14:51:49.401628+0800 min lat: 0.00063724 max lat: 0.1238 avg lat: 0.0272496 lat p50: 0.0197331 lat p90: 0.0657348 lat p99: 0.104353 lat p999: 0.109506 lat p100: 0.1238
2023-09-27T14:51:49.401628+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-27T14:51:49.401628+0800 40 5 7336 7331 35.7908 33.5938 0.00264146 0.0272496
2023-09-27T14:51:50.401830+0800 41 5 7527 7522 35.8276 37.3047 0.00134314 0.0272309
2023-09-27T14:51:51.401970+0800 42 5 7728 7723 35.9091 39.2578 0.00109788 0.0271673
2023-09-27T14:51:52.402096+0800 43 5 7918 7913 35.9369 37.1094 0.0024957 0.0271583
2023-09-27T14:51:53.402254+0800 44 5 8108 8103 35.9634 37.1094 0.0418387 0.0271351
2023-09-27T14:51:54.402424+0800 45 5 8285 8280 35.9323 34.5703 0.000866966 0.0271543
2023-09-27T14:51:55.402541+0800 46 5 8464 8459 35.9111 34.9609 0.000913727 0.0271692
2023-09-27T14:51:56.402654+0800 47 5 8639 8634 35.8742 34.1797 0.000983625 0.0271867
2023-09-27T14:51:57.402782+0800 48 5 8825 8820 35.8836 36.3281 0.0219999 0.0271934
2023-09-27T14:51:58.402913+0800 49 5 9012 9007 35.8965 36.5234 0.00206822 0.0271933
2023-09-27T14:51:59.403064+0800 50 5 9183 9178 35.8465 33.3984 0.0914843 0.0271902
2023-09-27T14:52:00.403192+0800 51 5 9371 9366 35.8635 36.7188 0.00104644 0.0272126
2023-09-27T14:52:01.403348+0800 52 5 9555 9550 35.8648 35.9375 0.00179941 0.0272075
2023-09-27T14:52:02.403497+0800 53 5 9733 9728 35.844 34.7656 0.0218633 0.0272322
2023-09-27T14:52:03.403625+0800 54 5 9917 9912 35.8456 35.9375 0.0202004 0.0272326
2023-09-27T14:52:04.403778+0800 55 5 10101 10096 35.8472 35.9375 0.00126869 0.0272176
2023-09-27T14:52:05.403914+0800 56 5 10293 10288 35.8766 37.5 0.00117897 0.0271972
2023-09-27T14:52:06.404071+0800 57 5 10463 10458 35.8296 33.2031 0.00110588 0.0272368
2023-09-27T14:52:07.404198+0800 58 5 10639 10634 35.8044 34.375 0.00136368 0.0272497
2023-09-27T14:52:08.404305+0800 59 5 10826 10821 35.8166 36.5234 0.021253 0.0272467
2023-09-27T14:52:09.404413+0800 min lat: 0.000626773 max lat: 0.173957 avg lat: 0.0272559 lat p50: 0.0199041 lat p90: 0.0649641 lat p99: 0.104244 lat p999: 0.10962 lat p100: 0.173957
2023-09-27T14:52:09.404413+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-27T14:52:09.404413+0800 60 5 11005 11000 35.8022 34.9609 0.0443305 0.0272559
2023-09-27T14:52:10.404683+0800 Total time run: 60.0778
Total writes made: 11005
Write size: 204800
Object size: 204800
Bandwidth (MB/sec): 35.7772
Stddev Bandwidth: 1.69037
Max bandwidth (MB/sec): 39.2578
Min bandwidth (MB/sec): 31.6406
Average IOPS: 183
Stddev IOPS: 8.65471
Max IOPS: 201
Min IOPS: 162
Average Latency(s): 0.0272721
Stddev Latency(s): 0.0246501
Max latency(s): 0.173957
Min latency(s): 0.000626773
Latency P50(s): 0.0199112
Latency P90(s): 0.0649971
Latency P99(s): 0.10425
Latency P99.9(s): 0.10962
Latency P100(s): 0.173957
Every 2.0s: ceph daemon osd.0 dump_op_pq_state | grep scheduler
zjw-q-dev: Wed Sep 27 15:41:37 2023
"scheduler": 0
"scheduler": 0
"scheduler": 2
"scheduler": 3
"scheduler": 0
Every 2.0s: ceph daemon osd.0 dump_op_pq_state | grep scheduler
zjw-q-dev: Wed Sep 27 15:41:49 2023
"scheduler": 5
"scheduler": 0
"scheduler": 0
"scheduler": 0
"scheduler": 0
Every 2.0s: ceph daemon osd.0 dump_op_pq_state | grep scheduler
zjw-q-dev: Wed Sep 27 15:41:58 2023
"scheduler": 0
"scheduler": 0
"scheduler": 1
"scheduler": 4
"scheduler": 0
test-case-2 : osd_op_num_shards_hdd = 1 / osd_op_num_threads_per_shard_hdd = 5 (new value)
* res=1/lim=1/wgt=60/bw=240M/iops=240, num_shards=1, num_threads_per_shard=5, concurrency="-t 5"
* rados bench
* avg_lat = 0.0209406s = 20.94 ms
* bw = 46.61 MBps
# rados -c ./ceph.conf --osd_client_op_priority=63 bench 60 write --no-cleanup -t 5 -p test-pool -b 204800 --show-time
2023-09-27T17:11:02.876277+0800 min lat: 0.00288863 max lat: 0.0895693 avg lat: 0.0209096 lat p50: 0.0180092 lat p90: 0.0212207 lat p99: 0.0219432 lat p999: 0.071195 lat p100: 0.0895693
2023-09-27T17:11:02.876277+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-27T17:11:02.876277+0800 40 5 9566 9561 46.6765 46.875 0.0209049 0.0209096
2023-09-27T17:11:03.876505+0800 41 5 9806 9801 46.6811 46.875 0.0209343 0.0209077
2023-09-27T17:11:04.876666+0800 42 5 10046 10041 46.6855 46.875 0.020779 0.0209057
2023-09-27T17:11:05.876832+0800 43 5 10286 10281 46.6897 46.875 0.0209019 0.0209038
2023-09-27T17:11:06.877002+0800 44 5 10526 10521 46.6938 46.875 0.0209936 0.0209021
2023-09-27T17:11:07.877171+0800 45 5 10766 10761 46.6976 46.875 0.0207428 0.0209004
2023-09-27T17:11:08.877355+0800 46 5 11006 11001 46.7013 46.875 0.0208264 0.0208988
2023-09-27T17:11:09.877532+0800 47 5 11246 11241 46.7048 46.875 0.0208299 0.0208972
2023-09-27T17:11:10.877715+0800 48 5 11486 11481 46.7082 46.875 0.0207058 0.0208957
2023-09-27T17:11:11.877908+0800 49 4 11726 11722 46.7154 47.0703 0.0209205 0.0208942
2023-09-27T17:11:12.878075+0800 50 5 11967 11962 46.7184 46.875 0.020859 0.0208928
2023-09-27T17:11:13.878280+0800 51 5 12207 12202 46.7213 46.875 0.0207391 0.0208915
2023-09-27T17:11:14.878477+0800 52 5 12406 12401 46.5701 38.8672 0.0207016 0.0209585
2023-09-27T17:11:15.878683+0800 53 5 12646 12641 46.5757 46.875 0.0207462 0.020956
2023-09-27T17:11:16.878854+0800 54 5 12886 12881 46.5811 46.875 0.0208307 0.0209536
2023-09-27T17:11:17.879013+0800 55 5 13126 13121 46.5863 46.875 0.0208078 0.0209513
2023-09-27T17:11:18.879170+0800 56 5 13366 13361 46.5913 46.875 0.0207605 0.020949
2023-09-27T17:11:19.879299+0800 57 5 13606 13601 46.5962 46.875 0.0208917 0.0209468
2023-09-27T17:11:20.879458+0800 58 5 13846 13841 46.6009 46.875 0.0208574 0.0209447
2023-09-27T17:11:21.879600+0800 59 5 14086 14081 46.6054 46.875 0.0208224 0.0209426
2023-09-27T17:11:22.879773+0800 min lat: 0.00288863 max lat: 0.20594 avg lat: 0.0209407 lat p50: 0.0180095 lat p90: 0.0212212 lat p99: 0.0219438 lat p999: 0.07239 lat p100: 0.20594
2023-09-27T17:11:22.879773+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-27T17:11:22.879773+0800 60 2 14324 14322 46.613 47.0703 0.0208395 0.0209407
2023-09-27T17:11:23.880038+0800 Total time run: 60.0186
Total writes made: 14324
Write size: 204800
Object size: 204800
Bandwidth (MB/sec): 46.6132
Stddev Bandwidth: 1.2887
Max bandwidth (MB/sec): 47.0703
Min bandwidth (MB/sec): 38.8672
Average IOPS: 238
Stddev IOPS: 6.59815
Max IOPS: 241
Min IOPS: 199
Average Latency(s): 0.0209406
Stddev Latency(s): 0.00376967
Max latency(s): 0.20594
Min latency(s): 0.00288863
Latency P50(s): 0.0180095
Latency P90(s): 0.0212212
Latency P99(s): 0.0219438
Latency P99.9(s): 0.07238
Latency P100(s): 0.20594