Bug #63014


osd perf : Effects of osd_op_num_shards_[hdd/ssd] on op latency and bandwidth when using mclock_scheduler

Added by jianwei zhang 8 months ago. Updated 8 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
v18.1.0
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
09/28/2023
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

background:
  • add patch1: https://github.com/ceph/ceph/pull/53417 (ref: https://tracker.ceph.com/issues/62812)
  • add patch2: https://github.com/ceph/ceph/pull/52809 (ref: https://tracker.ceph.com/issues/62293)
    # ceph daemon osd.0 config show | grep -e osd_mclock -e shards_hdd -e shard_hdd -e memstore_
        "memstore_debug_omit_block_device_write": "false",
        "memstore_device_bytes": "32212254720",
        "memstore_page_set": "false",
        "memstore_page_size": "65536",
        "osd_mclock_force_run_benchmark_on_init": "false",
        "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
        "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
        "osd_mclock_max_capacity_iops_hdd": "240.000000",
        "osd_mclock_max_capacity_iops_ssd": "21500.000000",
        "osd_mclock_max_sequential_bandwidth_hdd": "251658240",
        "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
        "osd_mclock_override_recovery_settings": "false",
        "osd_mclock_profile": "custom",
        "osd_mclock_scheduler_anticipation_timeout": "0.000000",
        "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
        "osd_mclock_scheduler_background_best_effort_res": "0.000000",
        "osd_mclock_scheduler_background_best_effort_wgt": "20",
        "osd_mclock_scheduler_background_recovery_lim": "0.500000",
        "osd_mclock_scheduler_background_recovery_res": "0.000000",
        "osd_mclock_scheduler_background_recovery_wgt": "20",
        "osd_mclock_scheduler_client_lim": "1.000000", //100%
        "osd_mclock_scheduler_client_res": "1.000000", //100%
        "osd_mclock_scheduler_client_wgt": "60",       
        "osd_mclock_skip_benchmark": "true",
        "osd_op_num_shards_hdd": "5",            
        "osd_op_num_threads_per_shard_hdd": "1",
    
problem:
  • Even when the configured limits are not exceeded, splitting the mclock scheduler across multiple shard queues can increase the latency of a single op.
  • Here's why (the arithmetic for both cases is recomputed in a sketch after this list):
    case-1:
       * osd_op_num_shards_hdd = 5 / osd_op_num_threads_per_shard_hdd = 1
       * osd_bandwidth_capacity_per_shard = 240 / 5 = 48 MBps
       * osd_bandwidth_cost_per_io = 240 / 240 = 1 MB
       * IO buffer size: 200K --> aligned up to 1M in calc_scaled_cost
       * client_res = 1 / client_lim = 1 / client_wgt = 60
       * res_bw = 48 MBps / lim_bw = 48 MBps
       * op delay latency = 1M / 48 MBps = 0.02083 s = 20.83 ms
       * expected bandwidth = 1000 ms / 20.83 ms * 200K / 1024K * 5 (shards) = 46.88 MBps
         * every shard dequeues one op every 20.83 ms
         * this means each op has to wait at least 20.83 ms in its mclock queue
           * In fact, since the shard queue an op enters is chosen by hashing the pgid that op.oid maps to,
             perfect balance cannot be achieved.
           * On a macro scale, the number of ops processed by each shard queue is roughly balanced,
             with no order-of-magnitude difference between shards.
           * But on a micro scale, at any given instant, some shard queues have few ops accumulated
             while others have many, which directly inflates op latency
             (see the simulation sketch after the dump_op_pq_state samples below).
       * test-case-1 :
         * res=1/lim=1/wgt=60/bw=240M/iops=240, num_shards=5, concurrency="-t 5"
         * rados bench
           * avg_lat = 0.0272721 s = 27.27 ms  // 27.27 - 20.83 = 6.44 ms: avg_lat is 30.91% (6.44/20.83) higher than expected, bad!
           * bw = 35.77 MBps  // 46.88 - 35.77 = 11.11 MBps: bw is 23.69% (11.11/46.88) lower than expected, bad!
    

    case-2:
       * osd_op_num_shards_hdd = 1 / osd_op_num_threads_per_shard_hdd = 5
       * osd_bandwidth_capacity_per_shard = 240 / 1 = 240 MBps
       * osd_bandwidth_cost_per_io = 240 / 240 = 1 MB
       * IO buffer size: 200K --> aligned up to 1M in calc_scaled_cost
       * client_res = 1 / client_lim = 1 / client_wgt = 60
       * res_bw = 240 MBps / lim_bw = 240 MBps
       * op delay latency = 1M / 240 MBps = 0.00416 s = 4.16 ms
         * the single shard dequeues one op every 4.16 ms
         * this means each op has to wait at least 4.16 ms in the mclock queue
           * 4.16 ms is only 1/5 of 20.83 ms
       * test-case-2
         * res=1/lim=1/wgt=60/bw=240M/iops=240, num_shards=1, num_threads_per_shard=5, concurrency="-t 5"
         * rados bench
           * avg_lat = 0.0209406 s = 20.94 ms  // 20.94 - 20.83 = 0.11 ms: only 0.52% (0.11/20.83) above expected, very close, good!
           * bw = 46.61 MBps  // 46.88 - 46.61 = 0.27 MBps: only 0.57% (0.27/46.88) below expected, very close, good!
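
To make the arithmetic above easy to check, here is a minimal Python sketch that recomputes the expected per-op dequeue interval and aggregate bandwidth for both cases from the configured values. The max(io_size, cost_per_io) rounding only approximates mclock's calc_scaled_cost alignment; it is an assumption for illustration, not the exact Ceph code.

#!/usr/bin/env python3
# Recompute the expected per-op delay and aggregate bandwidth for
# case-1 (5 shards) and case-2 (1 shard) from the configured values.

bw_hdd = 240.0               # osd_mclock_max_sequential_bandwidth_hdd, MB/s
iops_hdd = 240.0             # osd_mclock_max_capacity_iops_hdd
io_size_mb = 200.0 / 1024.0  # rados bench -b 204800 (200 KiB)

def expected(num_shards):
    per_shard_bw = bw_hdd / num_shards   # osd_bandwidth_capacity_per_shard
    cost_per_io = bw_hdd / iops_hdd      # osd_bandwidth_cost_per_io, MB
    cost = max(io_size_mb, cost_per_io)  # 200K aligned up to 1M (approximation)
    delay_ms = cost / per_shard_bw * 1000.0   # per-shard dequeue interval
    agg_bw = (1000.0 / delay_ms) * io_size_mb * num_shards
    return delay_ms, agg_bw

for shards in (5, 1):
    delay, bw = expected(shards)
    print(f"num_shards={shards}: dequeue interval {delay:.2f} ms, expected bw {bw:.2f} MB/s")

# num_shards=5: dequeue interval 20.83 ms, expected bw 46.88 MB/s
# num_shards=1: dequeue interval 4.17 ms, expected bw 46.88 MB/s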
    

reproduce:

1. MON=1 MGR=1 OSD=1 FS=0 MDS=0 RGW=0 ../src/vstart.sh -n -l -X -b --msgr2 --memstore
2. modify ceph.conf
  memstore_device_bytes = 32212254720
  osd_op_queue = "mclock_scheduler" 
  osd_mclock_skip_benchmark          = true
  osd_mclock_max_capacity_iops_hdd                = 240
  osd_mclock_max_sequential_bandwidth_hdd         = 251658240
  osd_mclock_profile                              = "custom" 
  osd_mclock_scheduler_client_wgt                 = 60
  osd_mclock_scheduler_client_res                 = 1
  osd_mclock_scheduler_client_lim                 = 1
  osd_mclock_scheduler_background_recovery_wgt    = 20
  osd_mclock_scheduler_background_recovery_res    = 0
  osd_mclock_scheduler_background_recovery_lim    = 0.5
  osd_mclock_scheduler_background_best_effort_wgt = 20
  osd_mclock_scheduler_background_best_effort_res = 0
  osd_mclock_scheduler_background_best_effort_lim = 0.1
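
Note: the ceph.conf above does not include the two shard options that the test cases vary; presumably they were set per run, e.g.:

  # test-case-1 (default values)
  osd_op_num_shards_hdd            = 5
  osd_op_num_threads_per_shard_hdd = 1

  # test-case-2 (new values)
  osd_op_num_shards_hdd            = 1
  osd_op_num_threads_per_shard_hdd = 5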
3. create pool
ceph osd pool create test-pool 128 128 replicated replicated_rule 0 1 off
ceph osd pool application enable test-pool rgw

4. ceph osd pool ls detail
pool 1 'test-pool' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 274 flags hashpspool stripe_width 0 application rgw

5. ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA  OMAP  META  AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME         
-1         0.00099         -  30 GiB   11 GiB   0 B   0 B   0 B  19 GiB  35.83  1.00    -          root default      
-3         0.00099         -  30 GiB   11 GiB   0 B   0 B   0 B  19 GiB  35.83  1.00    -              host zjw-q-dev
 0    hdd  0.00099   1.00000  30 GiB   11 GiB   0 B   0 B   0 B  19 GiB  35.83  1.00  128      up          osd.0     
                       TOTAL  30 GiB   11 GiB   0 B   0 B   0 B  19 GiB  35.83                                       
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

6. ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
hdd    30 GiB  19 GiB  11 GiB    11 GiB      35.83
TOTAL  30 GiB  19 GiB  11 GiB    11 GiB      35.83

--- POOLS ---
POOL       ID  PGS  STORED  OBJECTS    USED  %USED  MAX AVAIL
test-pool   1  128  11 GiB   56.36k  11 GiB  37.72     18 GiB

7. recreate pgs after an osd restart (memstore is not persistent, so the pgs must be force-created again)
cat force-create-pgid.sh 
#!/bin/sh
set -x

# The data rows of "ceph pg ls" contain scrub timestamps in the local
# +0800 timezone, so grepping for "+0800" keeps only the per-PG rows
# (skipping the header); the first column is the pgid.
for i in $(ceph pg ls | grep +0800 | awk '{print $1}'); do
    ceph osd force-create-pg $i --yes-i-really-mean-it &
done

wait

8. rados bench test (-b 204800 writes 200 KiB objects, matching the 200K IO size used in the calculations above; -t 5 keeps 5 ops in flight)
rados -c ./ceph.conf --osd_client_op_priority=63 bench 60 write --no-cleanup -t 5 -p test-pool -b 204800 --show-time

test-case-1 : osd_op_num_shards_hdd = 5 / osd_op_num_threads_per_shard_hdd = 1 (default values)

* res=1/lim=1/wgt=60/bw=240M/iops=240, num_shards=5, concurrency="-t 5"
* rados bench 
  * avg_lat=0.0272721s = 27.27ms
  * bw = 35.77 MBps

# rados -c ./ceph.conf --osd_client_op_priority=63 bench 60 write --no-cleanup -t 5 -p test-pool -b 204800 --show-time
2023-09-27T14:51:49.401628+0800 min lat: 0.00063724 max lat: 0.1238 avg lat: 0.0272496 lat p50: 0.0197331 lat p90: 0.0657348 lat p99: 0.104353 lat p999: 0.109506 lat p100: 0.1238
2023-09-27T14:51:49.401628+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-27T14:51:49.401628+0800    40       5      7336      7331   35.7908   33.5938  0.00264146   0.0272496
2023-09-27T14:51:50.401830+0800    41       5      7527      7522   35.8276   37.3047  0.00134314   0.0272309
2023-09-27T14:51:51.401970+0800    42       5      7728      7723   35.9091   39.2578  0.00109788   0.0271673
2023-09-27T14:51:52.402096+0800    43       5      7918      7913   35.9369   37.1094   0.0024957   0.0271583
2023-09-27T14:51:53.402254+0800    44       5      8108      8103   35.9634   37.1094   0.0418387   0.0271351
2023-09-27T14:51:54.402424+0800    45       5      8285      8280   35.9323   34.5703 0.000866966   0.0271543
2023-09-27T14:51:55.402541+0800    46       5      8464      8459   35.9111   34.9609 0.000913727   0.0271692
2023-09-27T14:51:56.402654+0800    47       5      8639      8634   35.8742   34.1797 0.000983625   0.0271867
2023-09-27T14:51:57.402782+0800    48       5      8825      8820   35.8836   36.3281   0.0219999   0.0271934
2023-09-27T14:51:58.402913+0800    49       5      9012      9007   35.8965   36.5234  0.00206822   0.0271933
2023-09-27T14:51:59.403064+0800    50       5      9183      9178   35.8465   33.3984   0.0914843   0.0271902
2023-09-27T14:52:00.403192+0800    51       5      9371      9366   35.8635   36.7188  0.00104644   0.0272126
2023-09-27T14:52:01.403348+0800    52       5      9555      9550   35.8648   35.9375  0.00179941   0.0272075
2023-09-27T14:52:02.403497+0800    53       5      9733      9728    35.844   34.7656   0.0218633   0.0272322
2023-09-27T14:52:03.403625+0800    54       5      9917      9912   35.8456   35.9375   0.0202004   0.0272326
2023-09-27T14:52:04.403778+0800    55       5     10101     10096   35.8472   35.9375  0.00126869   0.0272176
2023-09-27T14:52:05.403914+0800    56       5     10293     10288   35.8766      37.5  0.00117897   0.0271972
2023-09-27T14:52:06.404071+0800    57       5     10463     10458   35.8296   33.2031  0.00110588   0.0272368
2023-09-27T14:52:07.404198+0800    58       5     10639     10634   35.8044    34.375  0.00136368   0.0272497
2023-09-27T14:52:08.404305+0800    59       5     10826     10821   35.8166   36.5234    0.021253   0.0272467
2023-09-27T14:52:09.404413+0800 min lat: 0.000626773 max lat: 0.173957 avg lat: 0.0272559 lat p50: 0.0199041 lat p90: 0.0649641 lat p99: 0.104244 lat p999: 0.10962 lat p100: 0.173957
2023-09-27T14:52:09.404413+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-27T14:52:09.404413+0800    60       5     11005     11000   35.8022   34.9609   0.0443305   0.0272559
2023-09-27T14:52:10.404683+0800 Total time run:         60.0778
Total writes made:      11005
Write size:             204800
Object size:            204800
Bandwidth (MB/sec):     35.7772
Stddev Bandwidth:       1.69037
Max bandwidth (MB/sec): 39.2578
Min bandwidth (MB/sec): 31.6406
Average IOPS:           183
Stddev IOPS:            8.65471
Max IOPS:               201
Min IOPS:               162
Average Latency(s):     0.0272721
Stddev Latency(s):      0.0246501
Max latency(s):         0.173957
Min latency(s):         0.000626773
Latency P50(s):         0.0199112
Latency P90(s):         0.0649971
Latency P99(s):         0.10425
Latency P99.9(s):       0.10962
Latency P100(s):        0.173957
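
The per-shard scheduler queue depths sampled below (every 2 s via dump_op_pq_state) show the micro-imbalance directly: at any instant, most of the five in-flight ops are piled up in one or two of the five shard queues while the others sit empty.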

Every 2.0s: ceph daemon osd.0 dump_op_pq_state | grep scheduler
zjw-q-dev: Wed Sep 27 15:41:37 2023
            "scheduler": 0
            "scheduler": 0
            "scheduler": 2
            "scheduler": 3
            "scheduler": 0

Every 2.0s: ceph daemon osd.0 dump_op_pq_state | grep scheduler
zjw-q-dev: Wed Sep 27 15:41:49 2023
            "scheduler": 5
            "scheduler": 0
            "scheduler": 0
            "scheduler": 0
            "scheduler": 0

Every 2.0s: ceph daemon osd.0 dump_op_pq_state | grep scheduler
zjw-q-dev: Wed Sep 27 15:41:58 2023
            "scheduler": 0
            "scheduler": 0
            "scheduler": 1
            "scheduler": 4
            "scheduler": 0

test-case-2 : osd_op_num_shards_hdd = 1 / osd_op_num_threads_per_shard_hdd = 5 (new values)

* res=1/lim=1/wgt=60/bw=240M/iops=240, num_shards=1, num_threads_per_shard=5, concurrency="-t 5"
* rados bench
  * avg_lat = 0.0209406s = 20.94 ms
  * bw = 46.61 MBps

# rados -c ./ceph.conf --osd_client_op_priority=63 bench 60 write --no-cleanup -t 5 -p test-pool -b 204800 --show-time
2023-09-27T17:11:02.876277+0800 min lat: 0.00288863 max lat: 0.0895693 avg lat: 0.0209096 lat p50: 0.0180092 lat p90: 0.0212207 lat p99: 0.0219432 lat p999: 0.071195 lat p100: 0.0895693
2023-09-27T17:11:02.876277+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-27T17:11:02.876277+0800    40       5      9566      9561   46.6765    46.875   0.0209049   0.0209096
2023-09-27T17:11:03.876505+0800    41       5      9806      9801   46.6811    46.875   0.0209343   0.0209077
2023-09-27T17:11:04.876666+0800    42       5     10046     10041   46.6855    46.875    0.020779   0.0209057
2023-09-27T17:11:05.876832+0800    43       5     10286     10281   46.6897    46.875   0.0209019   0.0209038
2023-09-27T17:11:06.877002+0800    44       5     10526     10521   46.6938    46.875   0.0209936   0.0209021
2023-09-27T17:11:07.877171+0800    45       5     10766     10761   46.6976    46.875   0.0207428   0.0209004
2023-09-27T17:11:08.877355+0800    46       5     11006     11001   46.7013    46.875   0.0208264   0.0208988
2023-09-27T17:11:09.877532+0800    47       5     11246     11241   46.7048    46.875   0.0208299   0.0208972
2023-09-27T17:11:10.877715+0800    48       5     11486     11481   46.7082    46.875   0.0207058   0.0208957
2023-09-27T17:11:11.877908+0800    49       4     11726     11722   46.7154   47.0703   0.0209205   0.0208942
2023-09-27T17:11:12.878075+0800    50       5     11967     11962   46.7184    46.875    0.020859   0.0208928
2023-09-27T17:11:13.878280+0800    51       5     12207     12202   46.7213    46.875   0.0207391   0.0208915
2023-09-27T17:11:14.878477+0800    52       5     12406     12401   46.5701   38.8672   0.0207016   0.0209585
2023-09-27T17:11:15.878683+0800    53       5     12646     12641   46.5757    46.875   0.0207462    0.020956
2023-09-27T17:11:16.878854+0800    54       5     12886     12881   46.5811    46.875   0.0208307   0.0209536
2023-09-27T17:11:17.879013+0800    55       5     13126     13121   46.5863    46.875   0.0208078   0.0209513
2023-09-27T17:11:18.879170+0800    56       5     13366     13361   46.5913    46.875   0.0207605    0.020949
2023-09-27T17:11:19.879299+0800    57       5     13606     13601   46.5962    46.875   0.0208917   0.0209468
2023-09-27T17:11:20.879458+0800    58       5     13846     13841   46.6009    46.875   0.0208574   0.0209447
2023-09-27T17:11:21.879600+0800    59       5     14086     14081   46.6054    46.875   0.0208224   0.0209426
2023-09-27T17:11:22.879773+0800 min lat: 0.00288863 max lat: 0.20594 avg lat: 0.0209407 lat p50: 0.0180095 lat p90: 0.0212212 lat p99: 0.0219438 lat p999: 0.07239 lat p100: 0.20594
2023-09-27T17:11:22.879773+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-27T17:11:22.879773+0800    60       2     14324     14322    46.613   47.0703   0.0208395   0.0209407
2023-09-27T17:11:23.880038+0800 Total time run:         60.0186
Total writes made:      14324
Write size:             204800
Object size:            204800
Bandwidth (MB/sec):     46.6132
Stddev Bandwidth:       1.2887
Max bandwidth (MB/sec): 47.0703
Min bandwidth (MB/sec): 38.8672
Average IOPS:           238
Stddev IOPS:            6.59815
Max IOPS:               241
Min IOPS:               199
Average Latency(s):     0.0209406
Stddev Latency(s):      0.00376967
Max latency(s):         0.20594
Min latency(s):         0.00288863
Latency P50(s):         0.0180095
Latency P90(s):         0.0212212
Latency P99(s):         0.0219438
Latency P99.9(s):       0.07238
Latency P100(s):        0.20594
