Bug #62293
Status: closed
osd mclock QoS : osd_mclock_scheduler_client_lim is not limited
Description
First of all, I want to confirm whether my understanding is correct:
(1) the mClock QoS parameters are reservation / weight / limit
(2) the limit is an upper bound on the IOPS/bandwidth that a given class of clients can consume, e.g.:
- osd_mclock_max_capacity_iops_hdd 180
- osd_mclock_scheduler_client_lim 0.25 ==> 180 * 0.25 = 45 iops
- osd_mclock_max_sequential_bandwidth_hdd 157286400 ==> 150 MiB/s * 0.25 = 37.5 MiB/s
For example, if I limit client IOPS to a maximum of 100,
then no matter how many rados bench instances I run, the total should never exceed 100 iops,
and the iops/util of the HDD should stay at the same stable level.
If my understanding of limit is correct,
the pressure on the osd should not grow as more rados bench instances are added.
Problem:
In practice, the more rados bench clients there are,
the greater the pressure on the osd and the higher the HDD iops/bandwidth consumption.
WHY?
reproduce:
ceph cluster:
# ceph -s
cluster:
id: dcc749b4-c686-453b-8f25-6b965cdb360f
health: HEALTH_OK
services:
mon: 1 daemons, quorum a (age 22h)
mgr: x(active, since 22h)
osd: 1 osds: 1 up (since 20m), 1 in (since 23h)
data:
pools: 2 pools, 129 pgs
objects: 17.80M objects, 1.7 TiB
usage: 2.0 TiB used, 7.4 TiB / 9.4 TiB avail
pgs: 129 active+clean
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 9.39259 root default
-3 9.39259 host SZJD-YFQ-PM-OS01-BCONEST-06
0 hdd 9.39259 osd.0 up 1.00000 1.00000
# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 9.39259 1.00000 9.4 TiB 2.0 TiB 1.7 TiB 1 KiB 12 GiB 7.4 TiB 21.52 1.00 129 up
TOTAL 9.4 TiB 2.0 TiB 1.7 TiB 1.2 KiB 12 GiB 7.4 TiB 21.52
MIN/MAX VAR: 1.00/1.00 STDDEV: 0
# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 9.4 TiB 7.4 TiB 2.0 TiB 2.0 TiB 21.52
TOTAL 9.4 TiB 7.4 TiB 2.0 TiB 2.0 TiB 21.52
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 577 KiB 2 580 KiB 0 6.9 TiB
test-pool 2 128 1.7 TiB 17.80M 1.7 TiB 19.99 6.9 TiB
# ceph osd pool ls detail
pool 1 '.mgr' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 8 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 1.00
pool 2 'test-pool' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 20 flags hashpspool stripe_width 0 application rgw read_balance_score 1.00
# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "choose_firstn",
"num": 0,
"type": "osd"
},
{
"op": "emit"
}
]
}
]
# ls -l dev/osd0/
total 72
-rw------- 1 root root 11 Aug 2 17:43 bfm_blocks
-rw------- 1 root root 4 Aug 2 17:43 bfm_blocks_per_key
-rw------- 1 root root 5 Aug 2 17:43 bfm_bytes_per_block
-rw------- 1 root root 15 Aug 2 17:43 bfm_size
lrwxrwxrwx 1 root root 8 Aug 2 17:43 block -> /dev/sdh
lrwxrwxrwx 1 root root 10 Aug 2 17:43 block.db -> /dev/sdag1
lrwxrwxrwx 1 root root 10 Aug 2 17:43 block.wal -> /dev/sdag2
-rw------- 1 root root 2 Aug 2 17:43 bluefs
-rw------- 1 root root 37 Aug 2 17:43 ceph_fsid
-rw------- 1 root root 75 Aug 2 17:43 ceph_version_when_created
-rw------- 1 root root 28 Aug 2 17:43 created_at
-rw-r--r-- 1 root root 37 Aug 2 17:43 fsid
-rw-r--r-- 1 root root 63 Aug 2 17:43 keyring
-rw------- 1 root root 8 Aug 2 17:43 kv_backend
-rw------- 1 root root 21 Aug 2 17:43 magic
-rw------- 1 root root 4 Aug 2 17:43 mkfs_done
-rw------- 1 root root 41 Aug 2 17:43 osd_key
-rw------- 1 root root 6 Aug 2 17:43 ready
-rw------- 1 root root 3 Aug 2 17:44 require_osd_release
-rw------- 1 root root 10 Aug 2 17:43 type
-rw------- 1 root root 2 Aug 2 17:43 whoami
[root@SZJD-YFQ-PM-OS01-BCONEST-06 b]# lsblk /dev/sdh /dev/sdag
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdh 8:112 0 9.1T 0 disk //hdd block
sdag 66:0 0 447.1G 0 disk //ssd
|-sdag1 66:1 0 304.1G 0 part //db
`-sdag2 66:2 0 102.9G 0 part //wal
Thu Aug 3 14:57:21 UTC 2023
"debug_mclock": "1/5",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "180.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "180.000000",
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "157286400",
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.100000",
"osd_mclock_scheduler_background_best_effort_wgt": "1",
"osd_mclock_scheduler_background_recovery_lim": "1.000000",
"osd_mclock_scheduler_background_recovery_res": "0.480000",
"osd_mclock_scheduler_background_recovery_wgt": "17",
"osd_mclock_scheduler_client_lim": "0.450000", //lim = 180*0.45 = 81 iops
"osd_mclock_scheduler_client_res": "0.250000", //res = 180*0.25 = 45 iops
"osd_mclock_scheduler_client_wgt": "3",
"osd_mclock_skip_benchmark": "true",
"osd_op_queue": "mclock_scheduler",
# cat write-name4.txt
89b8a2e4-cdbc-4a96-8432-0a0c28ebe847%25.779912.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000009d76cf627ab3a_WX_YCY0801_475279953_0
f5e1cec2-3027-49f6-8299-b0334e633178%25.780012.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000003d0fbf2875c33_WX_YCY0801_714510436_0
b29bd483-dac7-48ae-be13-ac28351ccdd1%25.780112.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000004870c438a5c82_WX_YCY0801_010795134_0
6daeab64-417e-4117-b5be-293e862255c2%25.780212.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000003a3b6c4cbd4a8_WX_YCY0801_853521873_0
# cat test-4read.sh
for name in `cat ./write-name4.txt` ; do
rados bench 3600 rand -t 2 -p test-pool --osd_client_op_priority 47 --show-time --run-name "$name" >> readlog &
done
test-case-1: one-rados-bench : Disk utilization 40%
iostat -xtm 1 -d /dev/sdag /dev/sdh
08/03/23 15:06:31
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 52.00 0.00 1.80 0.00 0.00 0.00 0.00 0.00 0.35 0.00 0.02 35.38 0.00 0.27 1.40
sdh 45.00 0.00 4.57 0.00 89.00 0.00 66.42 0.00 8.89 0.00 0.40 104.00 0.00 7.69 34.60
08/03/23 15:06:32
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 52.00 0.00 1.31 0.00 0.00 0.00 0.00 0.00 0.38 0.00 0.02 25.85 0.00 0.35 1.80
sdh 48.00 0.00 4.88 0.00 95.00 0.00 66.43 0.00 11.56 0.00 0.56 104.00 0.00 8.98 43.10
08/03/23 15:06:33
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 51.00 0.00 0.84 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.01 16.94 0.00 0.20 1.00
sdh 49.00 0.00 4.98 0.00 94.00 0.00 65.73 0.00 9.78 0.00 0.48 104.00 0.00 8.65 42.40
08/03/23 15:06:34
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 53.00 0.00 0.74 0.00 0.00 0.00 0.00 0.00 0.25 0.00 0.01 14.26 0.00 0.25 1.30
sdh 51.00 0.00 5.18 0.00 100.00 0.00 66.23 0.00 9.33 0.00 0.48 104.00 0.00 8.25 42.10
08/03/23 15:06:35
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 49.00 0.00 0.39 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.01 8.08 0.00 0.18 0.90
sdh 49.00 0.00 4.98 0.00 98.00 0.00 66.67 0.00 10.02 0.00 0.49 104.00 0.00 8.73 42.80
08/03/23 15:06:50
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 46.00 0.00 0.36 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.01 8.09 0.00 0.17 0.80
sdh 44.00 0.00 4.47 0.00 90.00 0.00 67.16 0.00 10.11 0.00 0.46 104.00 0.00 8.93 39.30
08/03/23 15:06:51
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 48.00 0.00 0.38 0.00 0.00 0.00 0.00 0.00 0.12 0.00 0.01 8.08 0.00 0.12 0.60
sdh 48.00 0.00 4.88 0.00 93.00 0.00 65.96 0.00 10.04 0.00 0.48 104.00 0.00 8.50 40.80
08/03/23 15:06:52
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 48.00 0.00 0.38 0.00 0.00 0.00 0.00 0.00 0.27 0.00 0.01 8.17 0.00 0.27 1.30
sdh 49.00 0.00 4.98 0.00 92.00 0.00 65.25 0.00 8.78 0.00 0.42 104.00 0.00 7.67 37.60
08/03/23 15:06:53
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 50.00 0.00 0.40 0.00 0.00 0.00 0.00 0.00 0.10 0.00 0.01 8.16 0.00 0.10 0.50
sdh 51.00 0.00 5.18 0.00 99.00 0.00 66.00 0.00 10.51 0.00 0.53 104.00 0.00 8.31 42.40
test-case-2: two-rados-bench : Disk utilization 70%~80%
iostat -xtm 1 -d /dev/sdag /dev/sdh
08/03/23 15:07:16
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 97.00 0.00 0.76 0.00 0.00 0.00 0.00 0.00 0.19 0.00 0.02 8.04 0.00 0.19 1.80
sdh 99.00 0.00 10.05 0.00 191.00 0.00 65.86 0.00 11.63 0.00 1.13 104.00 0.00 7.13 70.60
08/03/23 15:07:17
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 98.02 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.01 8.16 0.00 0.13 1.29
sdh 97.03 0.00 9.85 0.00 191.09 0.00 66.32 0.00 12.73 0.00 1.24 104.00 0.00 8.12 78.81
08/03/23 15:07:18
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 100.00 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.02 8.16 0.00 0.20 2.00
sdh 101.00 0.00 10.26 0.00 195.00 0.00 65.88 0.00 12.27 0.00 1.23 104.00 0.00 7.54 76.20
08/03/23 15:07:19
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 102.00 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.02 8.24 0.00 0.22 2.20
sdh 100.00 0.00 10.16 0.00 200.00 0.00 66.67 0.00 13.50 0.00 1.37 104.00 0.00 8.15 81.50
08/03/23 15:07:20
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 93.00 0.00 0.74 0.00 0.00 0.00 0.00 0.00 0.14 0.00 0.01 8.13 0.00 0.13 1.20
sdh 95.00 0.00 9.65 0.00 183.00 0.00 65.83 0.00 13.46 0.00 1.26 104.00 0.00 7.98 75.80
08/03/23 15:07:21
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 95.00 0.00 0.76 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.01 8.21 0.00 0.12 1.10
sdh 96.00 0.00 9.75 0.00 187.00 0.00 66.08 0.00 13.05 0.00 1.25 104.00 0.00 8.19 78.60
08/03/23 15:07:22
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 101.00 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.02 8.28 0.00 0.17 1.70
sdh 100.00 0.00 10.16 0.00 193.00 0.00 65.87 0.00 10.55 0.00 1.06 104.00 0.00 7.90 79.00
test-case-3: three-rados-bench : Disk utilization 95%~99%
iostat -xtm 1 -d /dev/sdag /dev/sdh
08/03/23 15:08:03
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 127.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.02 8.16 0.00 0.17 2.20
sdh 126.00 0.00 12.80 0.00 248.00 0.00 66.31 0.00 16.83 0.00 2.14 104.00 0.00 7.52 94.70
08/03/23 15:08:04
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 127.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.02 8.16 0.00 0.17 2.20
sdh 130.00 0.00 13.20 0.00 250.00 0.00 65.79 0.00 17.62 0.00 2.29 104.00 0.00 7.58 98.50
08/03/23 15:08:05
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 135.00 0.00 1.08 0.00 0.00 0.00 0.00 0.00 0.19 0.00 0.02 8.18 0.00 0.18 2.40
sdh 136.00 0.00 13.81 0.00 261.00 0.00 65.74 0.00 14.79 0.00 2.00 104.00 0.00 7.08 96.30
08/03/23 15:08:06
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 144.00 0.00 1.14 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.03 8.08 0.00 0.22 3.10
sdh 144.00 0.00 14.62 0.00 280.00 0.00 66.04 0.00 17.28 0.00 2.49 104.00 0.00 6.83 98.40
08/03/23 15:08:07
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 136.00 0.00 1.07 0.00 0.00 0.00 0.00 0.00 0.15 0.00 0.02 8.06 0.00 0.15 2.00
sdh 137.00 0.00 13.91 0.00 269.00 0.00 66.26 0.00 17.08 0.00 2.35 104.00 0.00 7.22 98.90
test-case-4: four-rados-bench : Disk utilization 99%~100%
iostat -xtm 1 -d /dev/sdag /dev/sdh
08/03/23 15:08:55
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 140.00 0.00 1.11 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.03 8.09 0.00 0.24 3.30
sdh 145.00 0.00 14.73 0.00 279.00 0.00 65.80 0.00 20.74 0.00 3.00 104.00 0.00 6.88 99.80
08/03/23 15:08:56
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 143.00 0.00 1.14 0.00 0.00 0.00 0.00 0.00 0.16 0.00 0.02 8.17 0.00 0.16 2.30
sdh 146.00 0.00 14.83 0.00 285.00 0.00 66.13 0.00 21.55 0.00 3.11 104.00 0.00 6.84 99.80
08/03/23 15:08:57
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 151.00 0.00 1.20 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.03 8.11 0.00 0.18 2.70
sdh 153.00 0.00 15.54 0.00 296.00 0.00 65.92 0.00 20.97 0.00 3.24 104.00 0.00 6.54 100.10
08/03/23 15:08:58
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 155.00 0.00 1.22 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.03 8.05 0.00 0.17 2.60
sdh 160.00 0.00 16.25 0.00 310.00 0.00 65.96 0.00 20.99 0.00 3.33 104.00 0.00 6.24 99.90
08/03/23 15:08:59
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 151.00 0.00 1.20 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.03 8.11 0.00 0.22 3.30
sdh 154.00 0.00 15.64 0.00 298.00 0.00 65.93 0.00 21.37 0.00 3.27 104.00 0.00 6.49 100.00
Updated by xu wang 9 months ago
The tests show that the qos limit does not take effect with multiple clients:
1. With a single client, the qos limit takes effect.
2. With two clients, it does not: the combined iops and util are twice those of a single client.
reproduce:
1) Single client, no qos limit (default configuration): run different stress tests and observe the iops and util of the hdd device (sdh)
# check configuration
Thu Aug 3 16:43:02 UTC 2023
"debug_mclock": "1/5",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "315.000000",
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "157286400",
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "balanced",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.900000",
"osd_mclock_scheduler_background_best_effort_res": "0.000000",
"osd_mclock_scheduler_background_best_effort_wgt": "1",
"osd_mclock_scheduler_background_recovery_lim": "0.000000",
"osd_mclock_scheduler_background_recovery_res": "0.500000",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "0.000000",
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "1",
"osd_mclock_skip_benchmark": "false",
"osd_op_queue": "mclock_scheduler",
test-case-1: one-rados-bench, rand, 102434 Bytes, -t 1 ; iops 105~110 ;Disk utilization 90%
08/03/23 16:44:02
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 105.00 0.00 1.12 0.00 0.00 0.00 0.00 0.00 0.25 0.00 0.03 10.93 0.00 0.25 2.60
sdh 102.00 0.00 10.36 0.00 206.00 0.00 66.88 0.00 8.62 0.00 0.89 104.00 0.00 8.69 88.60
08/03/23 16:44:03
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 107.00 0.00 1.07 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.02 10.21 0.00 0.18 1.90
sdh 106.00 0.00 10.77 0.00 212.00 0.00 66.67 0.00 8.55 0.00 0.90 104.00 0.00 8.52 90.30
test-case-2: one-rados-bench, rand, 102434 Bytes, -t 10 ; iops 160~170 ;Disk utilization 99%~100%
08/03/23 16:45:17
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 171.00 0.00 1.37 0.00 0.00 0.00 0.00 0.00 0.16 0.00 0.03 8.21 0.00 0.16 2.70
sdh 172.00 0.00 17.47 0.00 348.00 0.00 66.92 0.00 21.72 0.00 3.73 104.00 0.00 5.81 100.00
08/03/23 16:45:18
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 159.00 0.00 1.26 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.03 8.10 0.00 0.18 2.90
sdh 163.00 0.00 16.55 0.00 322.00 0.00 66.39 0.00 22.42 0.00 3.75 104.00 0.00 6.13 99.90
test-case-3: one-rados-bench, rand, 102434 Bytes, -t 100 ; iops 170~180 ;Disk utilization 100%
08/03/23 16:46:51
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 156.00 0.00 1.25 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.03 8.21 0.00 0.22 3.40
sdh 177.00 0.00 17.98 0.00 354.00 0.00 66.67 0.00 27.64 0.00 4.89 104.00 0.00 5.66 100.10
08/03/23 16:46:52
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 159.00 0.00 1.27 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.03 8.15 0.00 0.18 2.90
sdh 183.00 0.00 18.59 0.00 366.00 0.00 66.67 0.00 26.42 0.00 4.84 104.00 0.00 5.47 100.10
test-case-4: one-rados-bench, rand, 102434 Bytes, -t 400 ; iops 170~190 ;Disk utilization 100%
08/03/23 16:48:11
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 148.00 0.00 1.18 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.02 8.14 0.00 0.11 1.60
sdh 175.00 0.00 17.77 0.00 349.00 0.00 66.60 0.00 27.75 0.00 4.92 104.00 0.00 5.72 100.10
08/03/23 16:48:12
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 132.00 0.00 1.05 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.02 8.15 0.00 0.17 2.30
sdh 165.00 0.00 16.76 0.00 329.00 0.00 66.60 0.00 29.18 0.00 4.91 104.00 0.00 6.05 99.80
2) Single client with qos limit: run different stress tests and observe the iops and util of the hdd device (sdh)
# check configuration
Thu Aug 3 16:50:50 UTC 2023
"debug_mclock": "1/5",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "180.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "180.000000",
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "157286400",
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.100000",
"osd_mclock_scheduler_background_best_effort_wgt": "1",
"osd_mclock_scheduler_background_recovery_lim": "1.000000",
"osd_mclock_scheduler_background_recovery_res": "0.480000",
"osd_mclock_scheduler_background_recovery_wgt": "17",
"osd_mclock_scheduler_client_lim": "0.250000",
"osd_mclock_scheduler_client_res": "0.150000",
"osd_mclock_scheduler_client_wgt": "3",
"osd_mclock_skip_benchmark": "true",
"osd_op_queue": "mclock_scheduler",
test-case-1: one-rados-bench, rand, 102434 Bytes, -t 1 ; iops 20~25 ;Disk utilization 15~25%
08/03/23 16:51:55
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 27.00 0.00 0.66 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.01 24.89 0.00 0.22 0.60
sdh 24.00 0.00 2.44 0.00 48.00 0.00 66.67 0.00 9.54 0.00 0.23 104.00 0.00 9.54 22.90
08/03/23 16:51:56
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 23.00 0.00 0.40 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.01 17.74 0.00 0.22 0.50
sdh 20.00 0.00 2.03 0.00 42.00 0.00 67.74 0.00 8.40 0.00 0.17 104.00 0.00 8.60 17.20
test-case-2: one-rados-bench, rand, 102434 Bytes, -t 10 ; iops 30~45 ;Disk utilization 20%~35%
08/03/23 16:55:08
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 33.00 0.00 0.27 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.01 8.24 0.00 0.18 0.60
sdh 34.00 0.00 3.45 0.00 66.00 0.00 66.00 0.00 12.24 0.00 0.40 104.00 0.00 9.32 31.70
08/03/23 16:55:09
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 38.00 0.00 0.30 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.01 8.21 0.00 0.18 0.70
sdh 38.00 0.00 3.86 0.00 76.00 0.00 66.67 0.00 10.21 0.00 0.39 104.00 0.00 8.29 31.50
08/03/23 16:55:10
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 43.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.16 0.00 0.01 8.09 0.00 0.16 0.70
sdh 43.00 0.00 4.37 0.00 86.00 0.00 66.67 0.00 9.63 0.00 0.41 104.00 0.00 8.53 36.70
test-case-3: one-rados-bench, rand, 102434 Bytes, -t 100 ; iops 40~48 ;Disk utilization 30%~35%
08/03/23 16:56:07
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 44.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.09 0.00 0.00 8.00 0.00 0.09 0.40
sdh 45.00 0.00 4.57 0.00 85.00 0.00 65.38 0.00 22.89 0.00 1.03 104.00 0.00 7.24 32.60
08/03/23 16:56:08
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 42.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.01 8.29 0.00 0.10 0.40
sdh 47.00 0.00 4.77 0.00 85.00 0.00 64.39 0.00 21.98 0.00 0.96 104.00 0.00 7.47 35.10
08/03/23 16:56:09
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 43.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.09 0.00 0.00 8.09 0.00 0.07 0.30
sdh 41.00 0.00 4.16 0.00 88.00 0.00 68.22 0.00 22.71 0.00 0.98 104.00 0.00 7.78 31.90
test-case-4: one-rados-bench, rand, 102434 Bytes, -t 400 ; iops 40~50 ;Disk utilization 30%~35%
08/03/23 16:58:34
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 42.00 0.00 0.33 0.00 0.00 0.00 0.00 0.00 0.21 0.00 0.01 8.10 0.00 0.17 0.70
sdh 41.00 0.00 4.16 0.00 90.00 0.00 68.70 0.00 20.10 0.00 0.86 104.00 0.00 6.90 28.30
08/03/23 16:58:35
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 38.00 0.00 0.30 0.00 0.00 0.00 0.00 0.00 0.21 0.00 0.01 8.11 0.00 0.18 0.70
sdh 44.00 0.00 4.47 0.00 88.00 0.00 66.67 0.00 20.91 0.00 0.92 104.00 0.00 6.91 30.40
3) Two clients with qos limit: run different stress tests and observe the iops and util of the hdd device (sdh)
# check configuration
Thu Aug 3 16:59:14 UTC 2023
"debug_mclock": "1/5",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "180.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "180.000000",
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "157286400",
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.100000",
"osd_mclock_scheduler_background_best_effort_wgt": "1",
"osd_mclock_scheduler_background_recovery_lim": "1.000000",
"osd_mclock_scheduler_background_recovery_res": "0.480000",
"osd_mclock_scheduler_background_recovery_wgt": "17",
"osd_mclock_scheduler_client_lim": "0.250000",
"osd_mclock_scheduler_client_res": "0.150000",
"osd_mclock_scheduler_client_wgt": "3",
"osd_mclock_skip_benchmark": "true",
"osd_op_queue": "mclock_scheduler",
test-case-1: two-rados-bench, rand, 102434 Bytes, -t 1 ; iops 44~50 ;Disk utilization 35%~40%
root 151037 138194 1 17:07 pts/45 00:00:00 rados bench 3600 rand -t 1 -p test-pool --osd_client_op_priority 47 --show-time --run-name 211ce56b-594b-4330-b5b5-1b90f841c85a%25.789512.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000000756fe2b30bee_WX_YCY0802_338517848_0
root 151058 138194 2 17:07 pts/45 00:00:00 rados bench 3600 rand -t 1 -p test-pool --osd_client_op_priority 47 --show-time --run-name 8e6b7f42-6a2a-4f00-9e7e-691e2cc6c1ac%25.785312.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_00000000000000000000000000000000f48d00b574fbe_WX_YCY0802_170044609_0
08/03/23 17:08:11
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 43.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.14 0.00 0.01 8.19 0.00 0.14 0.60
sdh 50.00 0.00 5.08 0.00 99.00 0.00 66.44 0.00 10.46 0.00 0.53 104.00 0.00 8.76 43.80
08/03/23 17:08:12
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 39.00 0.00 0.31 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.01 8.21 0.00 0.18 0.70
sdh 47.00 0.00 4.77 0.00 92.00 0.00 66.19 0.00 9.09 0.00 0.42 104.00 0.00 7.91 37.20
test-case-2: two-rados-bench, rand, 102434 Bytes, -t 10 ; iops 56~80 ;Disk utilization 55%~70%
root 150991 138194 1 17:05 pts/45 00:00:00 rados bench 3600 rand -t 10 -p test-pool --osd_client_op_priority 47 --show-time --run-name 8e6b7f42-6a2a-4f00-9e7e-691e2cc6c1ac%25.785312.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_00000000000000000000000000000000f48d00b574fbe_WX_YCY0802_170044609_0
root 151012 138194 2 17:05 pts/45 00:00:00 rados bench 3600 rand -t 10 -p test-pool --osd_client_op_priority 47 --show-time --run-name 211ce56b-594b-4330-b5b5-1b90f841c85a%25.789512.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000000756fe2b30bee_WX_YCY0802_338517848_0
08/03/23 17:06:24
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 61.00 0.00 0.49 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.01 8.26 0.00 0.20 1.20
sdh 72.00 0.00 7.31 0.00 148.00 0.00 67.27 0.00 15.33 0.00 1.11 104.00 0.00 8.53 61.40
08/03/23 17:06:25
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 61.00 0.00 0.48 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.01 8.07 0.00 0.16 1.00
sdh 70.00 0.00 7.11 0.00 142.00 0.00 66.98 0.00 13.90 0.00 0.99 104.00 0.00 8.39 58.70
test-case-3: two-rados-bench, rand, 102434 Bytes, -t 100 ; iops 80~90 ;Disk utilization 55%~65%
root 150942 138194 1 17:03 pts/45 00:00:00 rados bench 3600 rand -t 100 -p test-pool --osd_client_op_priority 47 --show-time --run-name 211ce56b-594b-4330-b5b5-1b90f841c85a%25.789512.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000000756fe2b30bee_WX_YCY0802_338517848_0
root 150963 138194 1 17:03 pts/45 00:00:00 rados bench 3600 rand -t 100 -p test-pool --osd_client_op_priority 47 --show-time --run-name 8e6b7f42-6a2a-4f00-9e7e-691e2cc6c1ac%25.785312.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_00000000000000000000000000000000f48d00b574fbe_WX_YCY0802_170044609_0
08/03/23 17:04:24
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 79.00 0.00 0.62 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.01 8.10 0.00 0.15 1.20
sdh 88.00 0.00 8.94 0.00 178.00 0.00 66.92 0.00 15.40 0.00 1.37 104.00 0.00 7.39 65.00
08/03/23 17:04:25
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 80.00 0.00 0.63 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.02 8.05 0.00 0.19 1.50
sdh 92.00 0.00 9.34 0.00 184.00 0.00 66.67 0.00 15.49 0.00 1.47 104.00 0.00 7.21 66.30
08/03/23 17:04:26
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 81.00 0.00 0.64 0.00 0.00 0.00 0.00 0.00 0.21 0.00 0.02 8.05 0.00 0.21 1.70
sdh 87.00 0.00 8.84 0.00 172.00 0.00 66.41 0.00 14.87 0.00 1.24 104.00 0.00 7.52 65.40
test-case-4: two-rados-bench, rand, 102434 Bytes, -t 400 ; iops 86~92 ;Disk utilization 60%~65%
root 150880 138194 1 16:57 pts/45 00:00:03 rados bench 3600 rand -t 400 -p test-pool --osd_client_op_priority 47 --show-time --run-name 211ce56b-594b-4330-b5b5-1b90f841c85a%25.789512.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_000000000000000000000000000000000756fe2b30bee_WX_YCY0802_338517848_0
root 150916 138194 2 16:59 pts/45 00:00:00 rados bench 3600 rand -t 400 -p test-pool --osd_client_op_priority 47 --show-time --run-name 8e6b7f42-6a2a-4f00-9e7e-691e2cc6c1ac%25.785312.1__shadow_.TOh6_uWIYrPYx5bqObztEFip9uFKxz1_00000000000000000000000000000000f48d00b574fbe_WX_YCY0802_170044609_0
08/03/23 17:00:43
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 88.00 0.00 0.69 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.02 8.05 0.00 0.17 1.50
sdh 89.00 0.00 9.04 0.00 177.00 0.00 66.54 0.00 20.02 0.00 1.78 104.00 0.00 7.04 62.70
08/03/23 17:00:44
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 85.00 0.00 0.67 0.00 0.00 0.00 0.00 0.00 0.15 0.00 0.01 8.05 0.00 0.12 1.00
sdh 91.00 0.00 9.24 0.00 178.00 0.00 66.17 0.00 20.41 0.00 1.83 104.00 0.00 6.80 61.90
08/03/23 17:00:45
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdag 82.00 0.00 0.64 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.01 8.05 0.00 0.16 1.30
sdh 83.00 0.00 8.43 0.00 176.00 0.00 67.95 0.00 19.39 0.00 1.67 104.00 0.00 7.24 60.10
Updated by Sridhar Seshasayee 9 months ago
- Status changed from New to In Progress
- Assignee set to Sridhar Seshasayee
Hi jianwei,
The current implementation of the mClock scheduler does not implement fairness between
external clients. In other words, the scheduler currently does not utilize the
distributed feature of the mClock algorithm (dmClock).
As a result, all the clients are put into the same bucket and each client will get the
specified limit. Therefore, the disk utilization will increase as you increase the number of
clients. This is expected at this point.
The distributed version is currently under implementation.
Therefore, could you please reduce the severity? I will update this tracker with the
PR once it is ready.
Sridhar
Updated by jianwei zhang 9 months ago
Sridhar Seshasayee wrote:
Hi jianwei,
The current implementation of mClock scheduler does not implement fairness between
external clients. In other words the scheduler currently does not utilize the
distributed feature of the mClock algorithm (dmClock). As a result, all the clients are put into the same bucket and each client will get the
specified limit. Therefore, the disk utilization will increase as you increase the number of
clients. This is expected at this point. The distributed version is under implementation currently.
Therefore, could you please reduce the severity? I will update this tracker with the
PR once ready.
Sridhar
1. currently does not utilize the distributed feature of the mClock algorithm (dmClock).
2. As a result, all the clients are put into the same bucket and each client will get the specified limit.
If the osd mclock scheduler does not use the distributed feature,
and all external clients go into the same one bucket,
and mclock_opclass treats each I/O type as one client,
then shouldn't all clients be treated as the same client?
Shouldn't they all together be bounded by osd_mclock_scheduler_client_lim?
I'm confused about how the osd implements QoS based on mclock_opclass.
Can you explain in more detail?
An example would be best,
because this is confusing:
with QoS based on mclock_opclass, shouldn't two clients be regarded as the same type of request and enter the same queue?
Updated by jianwei zhang 9 months ago
Sridhar Seshasayee wrote:
Hi jianwei,
The current implementation of mClock scheduler does not implement fairness between
external clients. In other words the scheduler currently does not utilize the
distributed feature of the mClock algorithm (dmClock). As a result, all the clients are put into the same bucket and each client will get the
specified limit. Therefore, the disk utilization will increase as you increase the number of
clients. This is expected at this point. The distributed version is under implementation currently.
Therefore, could you please reduce the severity? I will update this tracker with the
PR once ready.
Sridhar
Another question, about the internal client (background_recovery) and
osd_mclock_scheduler_background_recovery_lim:
I redirected the I/O of the rados bench client to op_scheduler_class::background_recovery in
PGOpItem::get_scheduler_class:
PGOpItem::get_scheduler_class
op_scheduler_class get_scheduler_class() const final {
  auto type = op->get_req()->get_type();
  if (type == CEPH_MSG_OSD_OP ||
      type == CEPH_MSG_OSD_BACKOFF) {
    /// default osd_client_op_priority value is (CEPH_MSG_PRIO_LOW - 1)
    auto pri = op->get_req()->get_priority();
    if (pri >= (CEPH_MSG_PRIO_LOW - 1)) {
      return op_scheduler_class::client;
    } else {
      return op_scheduler_class::background_recovery; /// REDIRECT here
    }
  } else {
    return op_scheduler_class::immediate;
  }
}
In the same test, multiple rados bench clients read from the same OSD at the same time,
yet the IOPS/util of the OSD's HDD is not limited by osd_mclock_scheduler_background_recovery_lim.
I don't quite understand why.
Updated by jianwei zhang 9 months ago
For example:
osd.0
rados bench client1 OP \\
rados bench client2 OP ==> op_scheduler_class::client ==> osd_mclock_scheduler_client_lim (assuming 1 iops) ==> osd.0 should 1 iops
rados bench client3 OP //
Updated by jianwei zhang 9 months ago
An OSD is inevitably read and written by multiple clients (such as rados bench).
If osd_mclock_scheduler_client_lim cannot limit IOPS/BW,
then internal I/O such as background_recovery will be starved by a large volume of client I/O.
Conversely,
if osd_mclock_scheduler_background_recovery_lim cannot limit IOPS/BW,
then the clients' external I/O will be starved by a large volume of internal I/O.
This is my point of confusion.
Updated by jianwei zhang 9 months ago
balanced
/**
* balanced
*
* Client Allocation:
* reservation: 50% | weight: 1 | limit: 0 (max) |
* Background Recovery Allocation:
* reservation: 50% | weight: 1 | limit: 0 (max) |
* Background Best Effort Allocation:
* reservation: 0 (min) | weight: 1 | limit: 90% |
*/
high_recovery_ops
/**
* high_recovery_ops
*
* Client Allocation:
* reservation: 30% | weight: 1 | limit: 0 (max) |
* Background Recovery Allocation:
* reservation: 70% | weight: 2 | limit: 0 (max) |
* Background Best Effort Allocation:
* reservation: 0 (min) | weight: 1 | limit: 0 (max) |
*/
high_client_ops
/**
* high_client_ops
*
* Client Allocation:
* reservation: 60% | weight: 2 | limit: 0 (max) |
* Background Recovery Allocation:
* reservation: 40% | weight: 1 | limit: 0 (max) |
* Background Best Effort Allocation:
* reservation: 0 (min) | weight: 1 | limit: 70% |
*/
Do these configured limits still work?
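For reference, here is one reading of the limit fields in the profiles above (a sketch under the assumption that a limit of 0 means "no cap" and any other value is a fraction of the OSD's measured capacity; the helper name is hypothetical):

```python
# Hypothetical helper illustrating how a profile 'limit' field could map
# to an IOPS cap; not the real Ceph implementation.
def profile_limit_iops(limit_fraction, capacity_iops):
    """A limit of 0 means unlimited ('max' in the profile comments);
    otherwise the limit is a fraction of the OSD's measured capacity."""
    if limit_fraction == 0:
        return float("inf")
    return limit_fraction * capacity_iops

print(profile_limit_iops(0, 180))     # inf -- 'limit: 0 (max)'
print(profile_limit_iops(0.90, 180))  # 162.0 -- 90% of a 180 IOPS capacity
```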
Updated by jianwei zhang 9 months ago
Sridhar Seshasayee,
Thank you.
could you please reduce the severity?
Sorry, I do not have permission to change it.
Updated by Sridhar Seshasayee 9 months ago
- Severity changed from 1 - critical to 3 - minor
I have changed the severity to 3, considering that client fairness is a yet-to-be-implemented feature.
The mClock profiles allocate max limit to client Ops. To prevent one type of client from overwhelming
another, reservation is set which guarantees minimum bandwidth allocation for those clients.
Regarding the observation about limits not being realized, I am looking into it to find out if
there's a bug. I will get back with my findings.
Updated by jianwei zhang 9 months ago
Sridhar Seshasayee wrote:
I have changed the severity to 3, considering that client fairness is a yet to be implemented feature.
The mClock profiles allocate max limit to client Ops. To prevent one type of client from overwhelming
another, reservation is set which guarantees minimum bandwidth allocation for those clients.
Regarding the observation about limits not being realized, I am looking into it to find out if
there's a bug. I will get back with my findings.
Thank you
I am concerned about the role of limit in mClock.
If limit does not work on a single OSD with the opclass-based (clients/background_recovery/best_effort) scheduler,
and the I/O type is our criterion for distinguishing clients, then how should we understand the limit?
clients/background_recovery/best_effort ==> client op / scrub op / recovery op / pg delete op
Updated by Samuel Just 9 months ago
I'm with the original reporter -- I'd expect our current implementation to group all clients into the same class. The fact that doubling the number of clients doubles the client limit seems like a bug with the current implementation.
Updated by Samuel Just 9 months ago
ClientRegistry::get_external_client seems to always return default_external_client_info, but the id comes from get_scheduler_id which uses item.get_owner() which, in turn, uses the client id. I think the fix would probably be to use the same id for all clients.
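A toy model of the id collapsing described above (illustrative only; the real change lives in the Ceph C++ scheduler, and the names below are hypothetical):

```python
# Hypothetical model of the scheduler-id fix; names are illustrative,
# not the real Ceph identifiers.
from collections import defaultdict

def make_scheduler_id(owner, collapse_clients):
    # With the fix, every external client maps to one constant id,
    # so they all share a single dmClock limit tracker.
    return 0 if collapse_clients else owner

def tracker_count(owners, collapse_clients):
    """Count how many distinct limit trackers a set of clients produces."""
    trackers = defaultdict(int)
    for owner in owners:
        trackers[make_scheduler_id(owner, collapse_clients)] += 1
    return len(trackers)

owners = [101, 102, 103]             # three rados bench clients
print(tracker_count(owners, False))  # 3 -- limit enforced per client (the bug)
print(tracker_count(owners, True))   # 1 -- limit enforced across all clients
```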
Updated by Sridhar Seshasayee 9 months ago
Samuel Just wrote:
ClientRegistry::get_external_client seems to always return default_external_client_info, but the id comes from get_scheduler_id which uses item.get_owner() which, in turn, uses the client id. I think the fix would probably be to use the same id for all clients.
Yes, that's correct and I made the exact change that you mentioned and can confirm that it fixes the issue. I will perform a few more tests before opening the PR to fix this.
Updated by jianwei zhang 9 months ago
Sridhar Seshasayee wrote:
Samuel Just wrote:
ClientRegistry::get_external_client seems to always return default_external_client_info, but the id comes from get_scheduler_id which uses item.get_owner() which, in turn, uses the client id. I think the fix would probably be to use the same id for all clients.
Yes, that's correct and I made the exact change that you mentioned and can confirm that it fixes the issue. I will perform a few more tests before opening the PR to fix this.
Hi Sridhar Seshasayee and Samuel Just,
Thank you very much for your communication and tips
I am in a hurry to solve this problem and may not be able to wait for your PR. Please help me review whether my modification is correct.
Updated by Sridhar Seshasayee 9 months ago
Hi Sridhar Seshasayee and Samuel Just,
Thank you very much for your communication and tips
I am in a hurry to solve this problem and may not be able to wait for your PR. Please help me review whether my modification is correct.
Hi Jianwei,
I see that you raised a PR to fix this.
I raised https://github.com/ceph/ceph/pull/52809 keeping in mind the
upcoming changes related to distributed QoS.
I hope it's okay with you if we go with my PR considering my point above.
In any case, the PR must undergo in-house reviews and CI testing before
it's merged which may take a few days.
-Sridhar
Updated by Sridhar Seshasayee 9 months ago
- Status changed from In Progress to Pending Backport
- Backport set to quincy, reef
Updated by Backport Bot 9 months ago
- Copied to Backport #62546: reef: osd mclock QoS : osd_mclock_scheduler_client_lim is not limited added
Updated by Backport Bot 9 months ago
- Copied to Backport #62547: quincy: osd mclock QoS : osd_mclock_scheduler_client_lim is not limited added
Updated by Backport Bot 9 months ago
- Tags changed from v18.1.0 to v18.1.0 backport_processed
Updated by Sridhar Seshasayee 6 months ago
- Status changed from Pending Backport to Resolved