Bug #62293 (closed)

osd mclock QoS : osd_mclock_scheduler_client_lim is not limited

Added by jianwei zhang 9 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Category:
OSD
Target version:
% Done: 0%
Source:
Community (user)
Tags:
v18.1.0 backport_processed
Backport:
quincy, reef
Regression:
No
Severity:
3 - minor
Reviewed:
08/03/2023
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


First of all, I want to confirm whether my understanding is correct:
(1) each class is configured with reservation / weight / limit
(2) the limit should cap the maximum IOPS/bandwidth that a given class of clients can consume
   - osd_mclock_max_capacity_iops_hdd            180
   - osd_mclock_scheduler_client_lim             0.25    ==> 180 * 0.25  = 45   IOPS
   - osd_mclock_max_sequential_bandwidth_hdd  157286400  ==> 150M * 0.25 = 37.5 MiB/s

For example, if I limit the client class to a maximum of 100 IOPS, then no matter
how many rados bench instances I run, they should never exceed 100 IOPS in
aggregate, and the IOPS/util of the HDD should stay at roughly the same level.

If my understanding of limit is correct, the load on the OSD should not increase
as more rados bench instances are added.
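To make this arithmetic concrete, here is a minimal standalone sketch (illustration
only, not the Ceph implementation; it just multiplies the configured capacity by the
client fractions, and the option names are reused as plain variable names):

  // Illustration only: how the fractional client res/lim settings translate
  // into absolute IOPS and bandwidth caps.
  #include <iostream>

  int main() {
    const double osd_mclock_max_capacity_iops_hdd = 180.0;               // IOPS capacity
    const double osd_mclock_max_sequential_bandwidth_hdd = 157286400.0;  // bytes/s (150 MiB/s)
    const double client_res = 0.25;   // osd_mclock_scheduler_client_res
    const double client_lim = 0.25;   // osd_mclock_scheduler_client_lim

    std::cout << "client reservation:     "
              << osd_mclock_max_capacity_iops_hdd * client_res << " IOPS\n";      // 45
    std::cout << "client limit:           "
              << osd_mclock_max_capacity_iops_hdd * client_lim << " IOPS\n";      // 45
    std::cout << "client bandwidth limit: "
              << osd_mclock_max_sequential_bandwidth_hdd * client_lim / (1024 * 1024)
              << " MiB/s\n";                                                      // 37.5
    return 0;
  }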

Problem:
In practice, the more rados bench clients there are, the greater the load on
the OSD and the more HDD IOPS/bandwidth is consumed.

WHY?

reproduce:

ceph cluster:
# ceph -s
  cluster:
    id:     dcc749b4-c686-453b-8f25-6b965cdb360f
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 22h)
    mgr: x(active, since 22h)
    osd: 1 osds: 1 up (since 20m), 1 in (since 23h)

  data:
    pools:   2 pools, 129 pgs
    objects: 17.80M objects, 1.7 TiB
    usage:   2.0 TiB used, 7.4 TiB / 9.4 TiB avail
    pgs:     129 active+clean

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                             STATUS  REWEIGHT  PRI-AFF
-1         9.39259  root default                                                   
-3         9.39259      host SZJD-YFQ-PM-OS01-BCONEST-06                           
 0    hdd  9.39259          osd.0                             up   1.00000  1.00000

# ceph osd df 
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META    AVAIL    %USE   VAR   PGS  STATUS
 0    hdd  9.39259   1.00000  9.4 TiB  2.0 TiB  1.7 TiB    1 KiB  12 GiB  7.4 TiB  21.52  1.00  129      up
                       TOTAL  9.4 TiB  2.0 TiB  1.7 TiB  1.2 KiB  12 GiB  7.4 TiB  21.52                   
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    9.4 TiB  7.4 TiB  2.0 TiB   2.0 TiB      21.52
TOTAL  9.4 TiB  7.4 TiB  2.0 TiB   2.0 TiB      21.52

--- POOLS ---
POOL       ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr        1    1  577 KiB        2  580 KiB      0    6.9 TiB
test-pool   2  128  1.7 TiB   17.80M  1.7 TiB  19.99    6.9 TiB

# ceph osd pool ls detail
pool 1 '.mgr' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 8 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 1.00
pool 2 'test-pool' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 20 flags hashpspool stripe_width 0 application rgw read_balance_score 1.00

# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default" 
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "osd" 
            },
            {
                "op": "emit" 
            }
        ]
    }
]

# ls -l dev/osd0/
total 72
-rw------- 1 root root 11 Aug  2 17:43 bfm_blocks
-rw------- 1 root root  4 Aug  2 17:43 bfm_blocks_per_key
-rw------- 1 root root  5 Aug  2 17:43 bfm_bytes_per_block
-rw------- 1 root root 15 Aug  2 17:43 bfm_size
lrwxrwxrwx 1 root root  8 Aug  2 17:43 block -> /dev/sdh
lrwxrwxrwx 1 root root 10 Aug  2 17:43 block.db -> /dev/sdag1
lrwxrwxrwx 1 root root 10 Aug  2 17:43 block.wal -> /dev/sdag2
-rw------- 1 root root  2 Aug  2 17:43 bluefs
-rw------- 1 root root 37 Aug  2 17:43 ceph_fsid
-rw------- 1 root root 75 Aug  2 17:43 ceph_version_when_created
-rw------- 1 root root 28 Aug  2 17:43 created_at
-rw-r--r-- 1 root root 37 Aug  2 17:43 fsid
-rw-r--r-- 1 root root 63 Aug  2 17:43 keyring
-rw------- 1 root root  8 Aug  2 17:43 kv_backend
-rw------- 1 root root 21 Aug  2 17:43 magic
-rw------- 1 root root  4 Aug  2 17:43 mkfs_done
-rw------- 1 root root 41 Aug  2 17:43 osd_key
-rw------- 1 root root  6 Aug  2 17:43 ready
-rw------- 1 root root  3 Aug  2 17:44 require_osd_release
-rw------- 1 root root 10 Aug  2 17:43 type
-rw------- 1 root root  2 Aug  2 17:43 whoami

[root@SZJD-YFQ-PM-OS01-BCONEST-06 b]# lsblk /dev/sdh /dev/sdag
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdh       8:112  0   9.1T  0 disk   //hdd block
sdag     66:0    0 447.1G  0 disk   //ssd
|-sdag1  66:1    0 304.1G  0 part   //db
`-sdag2  66:2    0 102.9G  0 part   //wal

Thu Aug  3 14:57:21 UTC 2023
    "debug_mclock": "1/5",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "180.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "180.000000",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.100000",
    "osd_mclock_scheduler_background_best_effort_wgt": "1",
    "osd_mclock_scheduler_background_recovery_lim": "1.000000",
    "osd_mclock_scheduler_background_recovery_res": "0.480000",
    "osd_mclock_scheduler_background_recovery_wgt": "17",
    "osd_mclock_scheduler_client_lim": "0.450000",       //lim = 180*0.45 = 81 iops
    "osd_mclock_scheduler_client_res": "0.250000",       //res = 180*0.25 = 45 iops
    "osd_mclock_scheduler_client_wgt": "3",
    "osd_mclock_skip_benchmark": "true",
    "osd_op_queue": "mclock_scheduler",

# cat write-name4.txt 
89b8a2e4-cdbc-4a96-8432-0a0c28ebe847%25.779912.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000009d76cf627ab3a_WX_YCY0801_475279953_0
f5e1cec2-3027-49f6-8299-b0334e633178%25.780012.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000003d0fbf2875c33_WX_YCY0801_714510436_0
b29bd483-dac7-48ae-be13-ac28351ccdd1%25.780112.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000004870c438a5c82_WX_YCY0801_010795134_0
6daeab64-417e-4117-b5be-293e862255c2%25.780212.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000003a3b6c4cbd4a8_WX_YCY0801_853521873_0

# cat test-4read.sh
for name in `cat ./write-name4.txt` ; do
    rados bench 3600 rand -t 2 -p test-pool --osd_client_op_priority 47 --show-time --run-name "$name" >> readlog  &
done

test-case-1: one-rados-bench : Disk utilization 40%
iostat -xtm 1 -d /dev/sdag /dev/sdh
08/03/23 15:06:31
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            52.00    0.00      1.80      0.00     0.00     0.00   0.00   0.00    0.35    0.00   0.02    35.38     0.00   0.27   1.40
sdh             45.00    0.00      4.57      0.00    89.00     0.00  66.42   0.00    8.89    0.00   0.40   104.00     0.00   7.69  34.60

08/03/23 15:06:32
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            52.00    0.00      1.31      0.00     0.00     0.00   0.00   0.00    0.38    0.00   0.02    25.85     0.00   0.35   1.80
sdh             48.00    0.00      4.88      0.00    95.00     0.00  66.43   0.00   11.56    0.00   0.56   104.00     0.00   8.98  43.10

08/03/23 15:06:33
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            51.00    0.00      0.84      0.00     0.00     0.00   0.00   0.00    0.20    0.00   0.01    16.94     0.00   0.20   1.00
sdh             49.00    0.00      4.98      0.00    94.00     0.00  65.73   0.00    9.78    0.00   0.48   104.00     0.00   8.65  42.40

08/03/23 15:06:34
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            53.00    0.00      0.74      0.00     0.00     0.00   0.00   0.00    0.25    0.00   0.01    14.26     0.00   0.25   1.30
sdh             51.00    0.00      5.18      0.00   100.00     0.00  66.23   0.00    9.33    0.00   0.48   104.00     0.00   8.25  42.10

08/03/23 15:06:35
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            49.00    0.00      0.39      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.01     8.08     0.00   0.18   0.90
sdh             49.00    0.00      4.98      0.00    98.00     0.00  66.67   0.00   10.02    0.00   0.49   104.00     0.00   8.73  42.80

08/03/23 15:06:50
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            46.00    0.00      0.36      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.01     8.09     0.00   0.17   0.80
sdh             44.00    0.00      4.47      0.00    90.00     0.00  67.16   0.00   10.11    0.00   0.46   104.00     0.00   8.93  39.30

08/03/23 15:06:51
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            48.00    0.00      0.38      0.00     0.00     0.00   0.00   0.00    0.12    0.00   0.01     8.08     0.00   0.12   0.60
sdh             48.00    0.00      4.88      0.00    93.00     0.00  65.96   0.00   10.04    0.00   0.48   104.00     0.00   8.50  40.80

08/03/23 15:06:52
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            48.00    0.00      0.38      0.00     0.00     0.00   0.00   0.00    0.27    0.00   0.01     8.17     0.00   0.27   1.30
sdh             49.00    0.00      4.98      0.00    92.00     0.00  65.25   0.00    8.78    0.00   0.42   104.00     0.00   7.67  37.60

08/03/23 15:06:53
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            50.00    0.00      0.40      0.00     0.00     0.00   0.00   0.00    0.10    0.00   0.01     8.16     0.00   0.10   0.50
sdh             51.00    0.00      5.18      0.00    99.00     0.00  66.00   0.00   10.51    0.00   0.53   104.00     0.00   8.31  42.40

test-case-2: two-rados-bench : Disk utilization 70%~80%
iostat -xtm 1 -d /dev/sdag /dev/sdh
08/03/23 15:07:16
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            97.00    0.00      0.76      0.00     0.00     0.00   0.00   0.00    0.19    0.00   0.02     8.04     0.00   0.19   1.80
sdh             99.00    0.00     10.05      0.00   191.00     0.00  65.86   0.00   11.63    0.00   1.13   104.00     0.00   7.13  70.60

08/03/23 15:07:17
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            98.02    0.00      0.78      0.00     0.00     0.00   0.00   0.00    0.13    0.00   0.01     8.16     0.00   0.13   1.29
sdh             97.03    0.00      9.85      0.00   191.09     0.00  66.32   0.00   12.73    0.00   1.24   104.00     0.00   8.12  78.81

08/03/23 15:07:18
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           100.00    0.00      0.80      0.00     0.00     0.00   0.00   0.00    0.20    0.00   0.02     8.16     0.00   0.20   2.00
sdh            101.00    0.00     10.26      0.00   195.00     0.00  65.88   0.00   12.27    0.00   1.23   104.00     0.00   7.54  76.20

08/03/23 15:07:19
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           102.00    0.00      0.82      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.02     8.24     0.00   0.22   2.20
sdh            100.00    0.00     10.16      0.00   200.00     0.00  66.67   0.00   13.50    0.00   1.37   104.00     0.00   8.15  81.50

08/03/23 15:07:20
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            93.00    0.00      0.74      0.00     0.00     0.00   0.00   0.00    0.14    0.00   0.01     8.13     0.00   0.13   1.20
sdh             95.00    0.00      9.65      0.00   183.00     0.00  65.83   0.00   13.46    0.00   1.26   104.00     0.00   7.98  75.80

08/03/23 15:07:21
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            95.00    0.00      0.76      0.00     0.00     0.00   0.00   0.00    0.13    0.00   0.01     8.21     0.00   0.12   1.10
sdh             96.00    0.00      9.75      0.00   187.00     0.00  66.08   0.00   13.05    0.00   1.25   104.00     0.00   8.19  78.60

08/03/23 15:07:22
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           101.00    0.00      0.82      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.02     8.28     0.00   0.17   1.70
sdh            100.00    0.00     10.16      0.00   193.00     0.00  65.87   0.00   10.55    0.00   1.06   104.00     0.00   7.90  79.00

test-case-3: three-rados-bench : Disk utilization 95%~99%
iostat -xtm 1 -d /dev/sdag /dev/sdh
08/03/23 15:08:03
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           127.00    0.00      1.01      0.00     0.00     0.00   0.00   0.00    0.18    0.00   0.02     8.16     0.00   0.17   2.20
sdh            126.00    0.00     12.80      0.00   248.00     0.00  66.31   0.00   16.83    0.00   2.14   104.00     0.00   7.52  94.70

08/03/23 15:08:04
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           127.00    0.00      1.01      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.02     8.16     0.00   0.17   2.20
sdh            130.00    0.00     13.20      0.00   250.00     0.00  65.79   0.00   17.62    0.00   2.29   104.00     0.00   7.58  98.50

08/03/23 15:08:05
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           135.00    0.00      1.08      0.00     0.00     0.00   0.00   0.00    0.19    0.00   0.02     8.18     0.00   0.18   2.40
sdh            136.00    0.00     13.81      0.00   261.00     0.00  65.74   0.00   14.79    0.00   2.00   104.00     0.00   7.08  96.30

08/03/23 15:08:06
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           144.00    0.00      1.14      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.03     8.08     0.00   0.22   3.10
sdh            144.00    0.00     14.62      0.00   280.00     0.00  66.04   0.00   17.28    0.00   2.49   104.00     0.00   6.83  98.40

08/03/23 15:08:07
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           136.00    0.00      1.07      0.00     0.00     0.00   0.00   0.00    0.15    0.00   0.02     8.06     0.00   0.15   2.00
sdh            137.00    0.00     13.91      0.00   269.00     0.00  66.26   0.00   17.08    0.00   2.35   104.00     0.00   7.22  98.90

test-case-4: four-rados-bench : Disk utilization 99%~100%
iostat -xtm 1 -d /dev/sdag /dev/sdh
08/03/23 15:08:55
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           140.00    0.00      1.11      0.00     0.00     0.00   0.00   0.00    0.24    0.00   0.03     8.09     0.00   0.24   3.30
sdh            145.00    0.00     14.73      0.00   279.00     0.00  65.80   0.00   20.74    0.00   3.00   104.00     0.00   6.88  99.80

08/03/23 15:08:56
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           143.00    0.00      1.14      0.00     0.00     0.00   0.00   0.00    0.16    0.00   0.02     8.17     0.00   0.16   2.30
sdh            146.00    0.00     14.83      0.00   285.00     0.00  66.13   0.00   21.55    0.00   3.11   104.00     0.00   6.84  99.80

08/03/23 15:08:57
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           151.00    0.00      1.20      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.03     8.11     0.00   0.18   2.70
sdh            153.00    0.00     15.54      0.00   296.00     0.00  65.92   0.00   20.97    0.00   3.24   104.00     0.00   6.54 100.10

08/03/23 15:08:58
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           155.00    0.00      1.22      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.03     8.05     0.00   0.17   2.60
sdh            160.00    0.00     16.25      0.00   310.00     0.00  65.96   0.00   20.99    0.00   3.33   104.00     0.00   6.24  99.90

08/03/23 15:08:59
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           151.00    0.00      1.20      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.03     8.11     0.00   0.22   3.30
sdh            154.00    0.00     15.64      0.00   298.00     0.00  65.93   0.00   21.37    0.00   3.27   104.00     0.00   6.49 100.00


Related issues 2 (0 open, 2 closed)

Copied to Ceph - Backport #62546: reef: osd mclock QoS : osd_mclock_scheduler_client_lim is not limited (Resolved, Sridhar Seshasayee)
Copied to Ceph - Backport #62547: quincy: osd mclock QoS : osd_mclock_scheduler_client_lim is not limited (Resolved, Sridhar Seshasayee)
Actions #1

Updated by xu wang 9 months ago

Testing shows that the QoS limit does not take effect with multiple clients:

1. With a single client, the QoS limit takes effect.

2. With two clients, the QoS limit does not take effect: the combined IOPS and disk util are twice those of a single client.

reproduce:

1) Single client, QoS not limited (default configuration); run different stress levels and observe the IOPS and util of the HDD device (sdh)


# check configuration
Thu Aug  3 16:43:02 UTC 2023
    "debug_mclock": "1/5",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "315.000000",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "balanced",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.900000",
    "osd_mclock_scheduler_background_best_effort_res": "0.000000",
    "osd_mclock_scheduler_background_best_effort_wgt": "1",
    "osd_mclock_scheduler_background_recovery_lim": "0.000000",
    "osd_mclock_scheduler_background_recovery_res": "0.500000",
    "osd_mclock_scheduler_background_recovery_wgt": "1",
    "osd_mclock_scheduler_client_lim": "0.000000",
    "osd_mclock_scheduler_client_res": "0.500000",
    "osd_mclock_scheduler_client_wgt": "1",
    "osd_mclock_skip_benchmark": "false",
    "osd_op_queue": "mclock_scheduler",

test-case-1: one-rados-bench, rand, 102434 Bytes, -t 1 ; iops 105~110 ;Disk utilization 90%

08/03/23 16:44:02
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           105.00    0.00      1.12      0.00     0.00     0.00   0.00   0.00    0.25    0.00   0.03    10.93     0.00   0.25   2.60
sdh            102.00    0.00     10.36      0.00   206.00     0.00  66.88   0.00    8.62    0.00   0.89   104.00     0.00   8.69  88.60

08/03/23 16:44:03
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           107.00    0.00      1.07      0.00     0.00     0.00   0.00   0.00    0.18    0.00   0.02    10.21     0.00   0.18   1.90
sdh            106.00    0.00     10.77      0.00   212.00     0.00  66.67   0.00    8.55    0.00   0.90   104.00     0.00   8.52  90.30

test-case-2: one-rados-bench, rand, 102434 Bytes, -t 10 ; iops 160~170 ;Disk utilization 99%~100%

08/03/23 16:45:17
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           171.00    0.00      1.37      0.00     0.00     0.00   0.00   0.00    0.16    0.00   0.03     8.21     0.00   0.16   2.70
sdh            172.00    0.00     17.47      0.00   348.00     0.00  66.92   0.00   21.72    0.00   3.73   104.00     0.00   5.81 100.00

08/03/23 16:45:18
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           159.00    0.00      1.26      0.00     0.00     0.00   0.00   0.00    0.18    0.00   0.03     8.10     0.00   0.18   2.90
sdh            163.00    0.00     16.55      0.00   322.00     0.00  66.39   0.00   22.42    0.00   3.75   104.00     0.00   6.13  99.90

test-case-3: one-rados-bench, rand, 102434 Bytes, -t 100 ; iops 170~180 ;Disk utilization 100%

08/03/23 16:46:51
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           156.00    0.00      1.25      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.03     8.21     0.00   0.22   3.40
sdh            177.00    0.00     17.98      0.00   354.00     0.00  66.67   0.00   27.64    0.00   4.89   104.00     0.00   5.66 100.10

08/03/23 16:46:52
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           159.00    0.00      1.27      0.00     0.00     0.00   0.00   0.00    0.18    0.00   0.03     8.15     0.00   0.18   2.90
sdh            183.00    0.00     18.59      0.00   366.00     0.00  66.67   0.00   26.42    0.00   4.84   104.00     0.00   5.47 100.10

test-case-4: one-rados-bench, rand, 102434 Bytes, -t 400 ; iops 170~190 ;Disk utilization 100%

08/03/23 16:48:11
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           148.00    0.00      1.18      0.00     0.00     0.00   0.00   0.00    0.11    0.00   0.02     8.14     0.00   0.11   1.60
sdh            175.00    0.00     17.77      0.00   349.00     0.00  66.60   0.00   27.75    0.00   4.92   104.00     0.00   5.72 100.10

08/03/23 16:48:12
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag           132.00    0.00      1.05      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.02     8.15     0.00   0.17   2.30
sdh            165.00    0.00     16.76      0.00   329.00     0.00  66.60   0.00   29.18    0.00   4.91   104.00     0.00   6.05  99.80

2) Single client, QoS limited; run different stress levels and observe the IOPS and util of the HDD device (sdh)


# check configuration
Thu Aug  3 16:50:50 UTC 2023
    "debug_mclock": "1/5",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "180.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "180.000000",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.100000",
    "osd_mclock_scheduler_background_best_effort_wgt": "1",
    "osd_mclock_scheduler_background_recovery_lim": "1.000000",
    "osd_mclock_scheduler_background_recovery_res": "0.480000",
    "osd_mclock_scheduler_background_recovery_wgt": "17",
    "osd_mclock_scheduler_client_lim": "0.250000",
    "osd_mclock_scheduler_client_res": "0.150000",
    "osd_mclock_scheduler_client_wgt": "3",
    "osd_mclock_skip_benchmark": "true",
    "osd_op_queue": "mclock_scheduler",

test-case-1: one-rados-bench, rand, 102434 Bytes, -t 1 ; iops 20~25 ;Disk utilization 15~25%
08/03/23 16:51:55
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            27.00    0.00      0.66      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.01    24.89     0.00   0.22   0.60
sdh             24.00    0.00      2.44      0.00    48.00     0.00  66.67   0.00    9.54    0.00   0.23   104.00     0.00   9.54  22.90

08/03/23 16:51:56
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            23.00    0.00      0.40      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.01    17.74     0.00   0.22   0.50
sdh             20.00    0.00      2.03      0.00    42.00     0.00  67.74   0.00    8.40    0.00   0.17   104.00     0.00   8.60  17.20

test-case-2: one-rados-bench, rand, 102434 Bytes, -t 10 ; iops 30~45 ;Disk utilization 20%~35%

08/03/23 16:55:08
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            33.00    0.00      0.27      0.00     0.00     0.00   0.00   0.00    0.18    0.00   0.01     8.24     0.00   0.18   0.60
sdh             34.00    0.00      3.45      0.00    66.00     0.00  66.00   0.00   12.24    0.00   0.40   104.00     0.00   9.32  31.70

08/03/23 16:55:09
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            38.00    0.00      0.30      0.00     0.00     0.00   0.00   0.00    0.18    0.00   0.01     8.21     0.00   0.18   0.70
sdh             38.00    0.00      3.86      0.00    76.00     0.00  66.67   0.00   10.21    0.00   0.39   104.00     0.00   8.29  31.50

08/03/23 16:55:10
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            43.00    0.00      0.34      0.00     0.00     0.00   0.00   0.00    0.16    0.00   0.01     8.09     0.00   0.16   0.70
sdh             43.00    0.00      4.37      0.00    86.00     0.00  66.67   0.00    9.63    0.00   0.41   104.00     0.00   8.53  36.70

test-case-3: one-rados-bench, rand, 102434 Bytes, -t 100 ; iops 40~48 ;Disk utilization 30%~35%

08/03/23 16:56:07
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            44.00    0.00      0.34      0.00     0.00     0.00   0.00   0.00    0.09    0.00   0.00     8.00     0.00   0.09   0.40
sdh             45.00    0.00      4.57      0.00    85.00     0.00  65.38   0.00   22.89    0.00   1.03   104.00     0.00   7.24  32.60

08/03/23 16:56:08
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            42.00    0.00      0.34      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.01     8.29     0.00   0.10   0.40
sdh             47.00    0.00      4.77      0.00    85.00     0.00  64.39   0.00   21.98    0.00   0.96   104.00     0.00   7.47  35.10

08/03/23 16:56:09
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            43.00    0.00      0.34      0.00     0.00     0.00   0.00   0.00    0.09    0.00   0.00     8.09     0.00   0.07   0.30
sdh             41.00    0.00      4.16      0.00    88.00     0.00  68.22   0.00   22.71    0.00   0.98   104.00     0.00   7.78  31.90

test-case-4: one-rados-bench, rand, 102434 Bytes, -t 400 ; iops 40~50 ;Disk utilization 30%~35%

08/03/23 16:58:34
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            42.00    0.00      0.33      0.00     0.00     0.00   0.00   0.00    0.21    0.00   0.01     8.10     0.00   0.17   0.70
sdh             41.00    0.00      4.16      0.00    90.00     0.00  68.70   0.00   20.10    0.00   0.86   104.00     0.00   6.90  28.30

08/03/23 16:58:35
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            38.00    0.00      0.30      0.00     0.00     0.00   0.00   0.00    0.21    0.00   0.01     8.11     0.00   0.18   0.70
sdh             44.00    0.00      4.47      0.00    88.00     0.00  66.67   0.00   20.91    0.00   0.92   104.00     0.00   6.91  30.40

3) Two clients, QoS limited; run different stress levels and observe the IOPS and util of the HDD device (sdh)


# check configuration
Thu Aug  3 16:59:14 UTC 2023
    "debug_mclock": "1/5",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "180.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "180.000000",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.100000",
    "osd_mclock_scheduler_background_best_effort_wgt": "1",
    "osd_mclock_scheduler_background_recovery_lim": "1.000000",
    "osd_mclock_scheduler_background_recovery_res": "0.480000",
    "osd_mclock_scheduler_background_recovery_wgt": "17",
    "osd_mclock_scheduler_client_lim": "0.250000",
    "osd_mclock_scheduler_client_res": "0.150000",
    "osd_mclock_scheduler_client_wgt": "3",
    "osd_mclock_skip_benchmark": "true",
    "osd_op_queue": "mclock_scheduler",

test-case-1: two-rados-bench, rand, 102434 Bytes, -t 1 ; iops 44~50 ;Disk utilization 35%~40%

root      151037  138194  1 17:07 pts/45   00:00:00 rados bench 3600 rand -t 1 -p test-pool --osd_client_op_priority 47 --show-time --run-name 211ce56b-594b-4330-b5b5-1b90f841c85a%25.789512.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000000756fe2b30bee_WX_YCY0802_338517848_0
root      151058  138194  2 17:07 pts/45   00:00:00 rados bench 3600 rand -t 1 -p test-pool --osd_client_op_priority 47 --show-time --run-name 8e6b7f42-6a2a-4f00-9e7e-691e2cc6c1ac%25.785312.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_00000000000000000000000000000000f48d00b574fbe_WX_YCY0802_170044609_0

08/03/23 17:08:11
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            43.00    0.00      0.34      0.00     0.00     0.00   0.00   0.00    0.14    0.00   0.01     8.19     0.00   0.14   0.60
sdh             50.00    0.00      5.08      0.00    99.00     0.00  66.44   0.00   10.46    0.00   0.53   104.00     0.00   8.76  43.80

08/03/23 17:08:12
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            39.00    0.00      0.31      0.00     0.00     0.00   0.00   0.00    0.18    0.00   0.01     8.21     0.00   0.18   0.70
sdh             47.00    0.00      4.77      0.00    92.00     0.00  66.19   0.00    9.09    0.00   0.42   104.00     0.00   7.91  37.20

test-case-2: two-rados-bench, rand, 102434 Bytes, -t 10 ; iops 56~80 ;Disk utilization 55%~70%

root      150991  138194  1 17:05 pts/45   00:00:00 rados bench 3600 rand -t 10 -p test-pool --osd_client_op_priority 47 --show-time --run-name 8e6b7f42-6a2a-4f00-9e7e-691e2cc6c1ac%25.785312.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_00000000000000000000000000000000f48d00b574fbe_WX_YCY0802_170044609_0
root      151012  138194  2 17:05 pts/45   00:00:00 rados bench 3600 rand -t 10 -p test-pool --osd_client_op_priority 47 --show-time --run-name 211ce56b-594b-4330-b5b5-1b90f841c85a%25.789512.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000000756fe2b30bee_WX_YCY0802_338517848_0

08/03/23 17:06:24
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            61.00    0.00      0.49      0.00     0.00     0.00   0.00   0.00    0.20    0.00   0.01     8.26     0.00   0.20   1.20
sdh             72.00    0.00      7.31      0.00   148.00     0.00  67.27   0.00   15.33    0.00   1.11   104.00     0.00   8.53  61.40

08/03/23 17:06:25
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            61.00    0.00      0.48      0.00     0.00     0.00   0.00   0.00    0.20    0.00   0.01     8.07     0.00   0.16   1.00
sdh             70.00    0.00      7.11      0.00   142.00     0.00  66.98   0.00   13.90    0.00   0.99   104.00     0.00   8.39  58.70

test-case-3: two-rados-bench, rand, 102434 Bytes, -t 100 ; iops 80~90 ;Disk utilization 55%~65%

root      150942  138194  1 17:03 pts/45   00:00:00 rados bench 3600 rand -t 100 -p test-pool --osd_client_op_priority 47 --show-time --run-name 211ce56b-594b-4330-b5b5-1b90f841c85a%25.789512.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000000756fe2b30bee_WX_YCY0802_338517848_0
root      150963  138194  1 17:03 pts/45   00:00:00 rados bench 3600 rand -t 100 -p test-pool --osd_client_op_priority 47 --show-time --run-name 8e6b7f42-6a2a-4f00-9e7e-691e2cc6c1ac%25.785312.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_00000000000000000000000000000000f48d00b574fbe_WX_YCY0802_170044609_0

08/03/23 17:04:24
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            79.00    0.00      0.62      0.00     0.00     0.00   0.00   0.00    0.18    0.00   0.01     8.10     0.00   0.15   1.20
sdh             88.00    0.00      8.94      0.00   178.00     0.00  66.92   0.00   15.40    0.00   1.37   104.00     0.00   7.39  65.00

08/03/23 17:04:25
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            80.00    0.00      0.63      0.00     0.00     0.00   0.00   0.00    0.20    0.00   0.02     8.05     0.00   0.19   1.50
sdh             92.00    0.00      9.34      0.00   184.00     0.00  66.67   0.00   15.49    0.00   1.47   104.00     0.00   7.21  66.30

08/03/23 17:04:26
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            81.00    0.00      0.64      0.00     0.00     0.00   0.00   0.00    0.21    0.00   0.02     8.05     0.00   0.21   1.70
sdh             87.00    0.00      8.84      0.00   172.00     0.00  66.41   0.00   14.87    0.00   1.24   104.00     0.00   7.52  65.40

test-case-4: two-rados-bench, rand, 102434 Bytes, -t 400 ; iops 86~92 ;Disk utilization 60%~65%

root      150880  138194  1 16:57 pts/45   00:00:03 rados bench 3600 rand -t 400 -p test-pool --osd_client_op_priority 47 --show-time --run-name 211ce56b-594b-4330-b5b5-1b90f841c85a%25.789512.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000000756fe2b30bee_WX_YCY0802_338517848_0
root      150916  138194  2 16:59 pts/45   00:00:00 rados bench 3600 rand -t 400 -p test-pool --osd_client_op_priority 47 --show-time --run-name 8e6b7f42-6a2a-4f00-9e7e-691e2cc6c1ac%25.785312.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_00000000000000000000000000000000f48d00b574fbe_WX_YCY0802_170044609_0

08/03/23 17:00:43
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            88.00    0.00      0.69      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.02     8.05     0.00   0.17   1.50
sdh             89.00    0.00      9.04      0.00   177.00     0.00  66.54   0.00   20.02    0.00   1.78   104.00     0.00   7.04  62.70

08/03/23 17:00:44
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            85.00    0.00      0.67      0.00     0.00     0.00   0.00   0.00    0.15    0.00   0.01     8.05     0.00   0.12   1.00
sdh             91.00    0.00      9.24      0.00   178.00     0.00  66.17   0.00   20.41    0.00   1.83   104.00     0.00   6.80  61.90

08/03/23 17:00:45
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdag            82.00    0.00      0.64      0.00     0.00     0.00   0.00   0.00    0.17    0.00   0.01     8.05     0.00   0.16   1.30
sdh             83.00    0.00      8.43      0.00   176.00     0.00  67.95   0.00   19.39    0.00   1.67   104.00     0.00   7.24  60.10
Actions #2

Updated by Sridhar Seshasayee 9 months ago

  • Status changed from New to In Progress
  • Assignee set to Sridhar Seshasayee

Hi jianwei,
The current implementation of mClock scheduler does not implement fairness between
external clients. In other words the scheduler currently does not utilize the
distributed feature of the mClock algorithm (dmClock).

As a result, all the clients are put into the same bucket and each client will get the
specified limit. Therefore, the disk utilization will increase as you increase the number of
clients. This is expected at this point.

The distributed version is under implementation currently.

Therefore, could you please reduce the severity? I will update this tracker with the
PR once ready.

Sridhar

Actions #3

Updated by jianwei zhang 9 months ago

Sridhar Seshasayee wrote:

Hi jianwei,
The current implementation of mClock scheduler does not implement fairness between
external clients. In other words the scheduler currently does not utilize the
distributed feature of the mClock algorithm (dmClock).

As a result, all the clients are put into the same bucket and each client will get the
specified limit. Therefore, the disk utilization will increase as you increase the number of
clients. This is expected at this point.

The distributed version is under implementation currently.

Therefore, could you please reduce the severity? I will update this tracker with the
PR once ready.

Sridhar

1. currently does not utilize the distributed feature of the mClock algorithm (dmClock).
2. As a result, all the clients are put into the same bucket and each client will get the specified limit.

If the OSD mClock scheduler does not use the distributed feature, then all external
clients fall into the same bucket, and mclock_opclass treats each I/O type as one
"client".

Shouldn't all external clients therefore be treated as a single client?
Shouldn't their aggregate be limited by osd_mclock_scheduler_client_lim?

I'm confused about how the OSD applies QoS based on mclock_opclass.

Could you explain in more detail, ideally with an example? Since the OSD applies
QoS per mclock_opclass, shouldn't two clients be regarded as the same type of
request and enter the same queue?
Actions #4

Updated by jianwei zhang 9 months ago

Sridhar Seshasayee wrote:

Hi jianwei,
The current implementation of mClock scheduler does not implement fairness between
external clients. In other words the scheduler currently does not utilize the
distributed feature of the mClock algorithm (dmClock).

As a result, all the clients are put into the same bucket and each client will get the
specified limit. Therefore, the disk utilization will increase as you increase the number of
clients. This is expected at this point.

The distributed version is under implementation currently.

Therefore, could you please reduce the severity? I will update this tracker with the
PR once ready.

Sridhar

Another question, about the internal client class (background_recovery) and
osd_mclock_scheduler_background_recovery_lim:

I redirected the rados bench client IO to op_scheduler_class::background_recovery
by lowering its priority:

  // PGOpItem::get_scheduler_class
  op_scheduler_class get_scheduler_class() const final {
    auto type = op->get_req()->get_type();
    if (type == CEPH_MSG_OSD_OP ||
        type == CEPH_MSG_OSD_BACKOFF) {
      /// default osd_client_op_priority value is (CEPH_MSG_PRIO_LOW - 1)
      auto pri = op->get_req()->get_priority();
      if (pri >= (CEPH_MSG_PRIO_LOW - 1)) {
        return op_scheduler_class::client;
      } else {
        return op_scheduler_class::background_recovery;  /// REDIRECT here
      }
    } else {
      return op_scheduler_class::immediate;
    }
  }

In the same test, with multiple rados bench clients reading from the same OSD at
the same time, the HDD IOPS/util of the OSD is not capped by
osd_mclock_scheduler_background_recovery_lim either.

I don't quite understand why.

Actions #5

Updated by jianwei zhang 9 months ago

For example, on osd.0:

rados bench client1 OP \\
rados bench client2 OP ==> op_scheduler_class::client ==> osd_mclock_scheduler_client_lim (assume 1 IOPS) ==> osd.0 should see at most 1 IOPS in total
rados bench client3 OP //
Actions #6

Updated by jianwei zhang 9 months ago

An OSD is always read and written by multiple clients (such as rados bench).

If osd_mclock_scheduler_client_lim cannot cap their aggregate IOPS/bandwidth,
then internal IO such as background_recovery will be starved by a flood of
client IO.

Conversely, if osd_mclock_scheduler_background_recovery_lim cannot cap internal
IO, then external client IO will be starved by a flood of internal IO.

This is the point of my confusion.

Actions #7

Updated by jianwei zhang 9 months ago

balanced

  /**
   * balanced
   *
   * Client Allocation:
   *   reservation: 50% | weight: 1 | limit: 0 (max) |
   * Background Recovery Allocation:
   *   reservation: 50% | weight: 1 | limit: 0 (max) |
   * Background Best Effort Allocation:
   *   reservation: 0 (min) | weight: 1 | limit: 90% |
   */

high_recovery_ops

  /**
   * high_recovery_ops
   *
   * Client Allocation:
   *   reservation: 30% | weight: 1 | limit: 0 (max) |
   * Background Recovery Allocation:
   *   reservation: 70% | weight: 2 | limit: 0 (max) |
   * Background Best Effort Allocation:
   *   reservation: 0 (min) | weight: 1 | limit: 0 (max) |
   */

high_client_ops

  /**
   * high_client_ops
   *
   * Client Allocation:
   *   reservation: 60% | weight: 2 | limit: 0 (max) |
   * Background Recovery Allocation:
   *   reservation: 40% | weight: 1 | limit: 0 (max) |
   * Background Best Effort Allocation:
   *   reservation: 0 (min) | weight: 1 | limit: 70% |
   */

Do the limits configured in these profiles still take effect?

Actions #8

Updated by jianwei zhang 9 months ago

Sridhar Seshasayee,
Thank you

could you please reduce the severity?

Sorry, I do not have permission

Actions #9

Updated by Sridhar Seshasayee 9 months ago

  • Severity changed from 1 - critical to 3 - minor

I have changed the severity to 3, considering that client fairness is a yet to be implemented feature.

The mClock profiles allocate max limit to client Ops. To prevent one type of client from overwhelming
another, reservation is set which guarantees minimum bandwidth allocation for those clients.

Regarding the observation about limits not being realized, I am looking into it to find out if
there's a bug. I will get back with my findings.

Actions #10

Updated by jianwei zhang 9 months ago

Sridhar Seshasayee wrote:

I have changed the severity to 3, considering that client fairness is a yet to be implemented feature.

The mClock profiles allocate max limit to client Ops. To prevent one type of client from overwhelming
another, reservation is set which guarantees minimum bandwidth allocation for those clients.

Regarding the observation about limits not being realized, I am looking into it to find out if
there's a bug. I will get back with my findings.

Thank you

I am concerned about the role of limit in mClock. If limit does not cap anything
on a single OSD with the opclass-based scheduler (client / background_recovery /
best_effort), and IO type is what distinguishes "clients", then how should limit
be understood? The rough mapping is:

clients/background_recovery/best_effort ==> client op/scrub op/recovery op/pg delete op
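
For reference, a rough illustration of that class-based view (simplified; the
names follow this thread and the snippet in #4, not the exact Ceph declarations):

  // Illustration only: the scheduler classes discussed in this thread and the
  // kinds of ops that typically map to them.
  enum class op_scheduler_class {
    client,                  // external client ops (e.g. rados bench / RGW / RBD)
    background_recovery,     // recovery / backfill ops
    background_best_effort,  // scrub, PG delete and similar best-effort work
    immediate,               // ops dispatched without mClock queueing
  };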

Actions #11

Updated by Samuel Just 9 months ago

I'm with the original reporter -- I'd expect our current implementation to group all clients into the same class. The fact that doubling the number of clients doubles the client limit seems like a bug with the current implementation.

Actions #12

Updated by Samuel Just 9 months ago

ClientRegistry::get_external_client seems to always return default_external_client_info, but the id comes from get_scheduler_id which uses item.get_owner() which, in turn, uses the client id. I think the fix would probably be to use the same id for all clients.
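
A minimal, self-contained sketch of that idea (hypothetical simplified types, not
the actual Ceph declarations; the real change is tracked in the PRs referenced
below). Keying the dmClock tracker by scheduler class rather than by the
per-connection owner makes every external client share one res/wgt/lim bucket:

  #include <cstdint>
  #include <iostream>

  enum class op_scheduler_class { client, background_recovery, background_best_effort, immediate };

  struct scheduler_id_t {
    op_scheduler_class class_id;
    uint64_t client_profile_id;
  };

  struct op_item_t {               // stand-in for the queued op item
    op_scheduler_class class_id;
    uint64_t owner;                // per-connection client id
  };

  scheduler_id_t get_scheduler_id(const op_item_t& item) {
    return scheduler_id_t{
      item.class_id,
      0  // was item.owner: each client got its own dmClock entry (and its own
         // limit); a constant id groups all external clients into one bucket
    };
  }

  int main() {
    op_item_t a{op_scheduler_class::client, /*owner=*/101};
    op_item_t b{op_scheduler_class::client, /*owner=*/202};
    // Both rados bench instances now resolve to the same QoS bucket.
    std::cout << (get_scheduler_id(a).client_profile_id ==
                  get_scheduler_id(b).client_profile_id) << "\n";  // prints 1
    return 0;
  }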

Actions #13

Updated by Sridhar Seshasayee 9 months ago

Samuel Just wrote:

ClientRegistry::get_external_client seems to always return default_external_client_info, but the id comes from get_scheduler_id which uses item.get_owner() which, in turn, uses the client id. I think the fix would probably be to use the same id for all clients.

Yes, that's correct and I made the exact change that you mentioned and can confirm that it fixes the issue. I will perform a few more tests before opening the PR to fix this.

Actions #14

Updated by jianwei zhang 9 months ago

Sridhar Seshasayee wrote:

Samuel Just wrote:

ClientRegistry::get_external_client seems to always return default_external_client_info, but the id comes from get_scheduler_id which uses item.get_owner() which, in turn, uses the client id. I think the fix would probably be to use the same id for all clients.

Yes, that's correct and I made the exact change that you mentioned and can confirm that it fixes the issue. I will perform a few more tests before opening the PR to fix this.

Hi Sridhar Seshasayee and Samuel Just,
Thank you very much for the discussion and the tips.
I need to fix this urgently and may not be able to wait for your PR; please help review whether my modification is correct:

https://github.com/ceph/ceph/pull/52808

Actions #15

Updated by Sridhar Seshasayee 9 months ago

Hi Sridhar Seshasayee and Samuel Just,
Thank you very much for the discussion and the tips.
I need to fix this urgently and may not be able to wait for your PR; please help review whether my modification is correct:

https://github.com/ceph/ceph/pull/52808

Hi Jianwei,

I see that you raised a PR to fix this.
I raised https://github.com/ceph/ceph/pull/52809, keeping in mind the upcoming
changes related to distributed QoS.

I hope it's okay with you if we go with my PR, considering the point above.
In any case, the PR must undergo in-house review and CI testing before it's
merged, which may take a few days.

-Sridhar

Actions #16

Updated by Sridhar Seshasayee 9 months ago

  • Pull request ID set to 52809
Actions #17

Updated by Sridhar Seshasayee 9 months ago

  • Status changed from In Progress to Pending Backport
  • Backport set to quincy, reef
Actions #18

Updated by Backport Bot 9 months ago

  • Copied to Backport #62546: reef: osd mclock QoS : osd_mclock_scheduler_client_lim is not limited added
Actions #19

Updated by Backport Bot 9 months ago

  • Copied to Backport #62547: quincy: osd mclock QoS : osd_mclock_scheduler_client_lim is not limited added
Actions #20

Updated by Backport Bot 9 months ago

  • Tags changed from v18.1.0 to v18.1.0 backport_processed
Actions #21

Updated by Sridhar Seshasayee 6 months ago

  • Status changed from Pending Backport to Resolved