Bug #62812
Status: Closed
osd: Is it necessary to unconditionally increase osd_bandwidth_cost_per_io in mClockScheduler::calc_scaled_cost?
Description
In the following PR, the IOPS-based QoS cost calculation was removed and a bandwidth-based QoS cost calculation was introduced:
- https://github.com/ceph/ceph/commit/514cb598fb616dc96f143b0b3a8cc708c212d556
- https://tracker.ceph.com/issues/58529
- https://tracker.ceph.com/issues/59080
- https://github.com/ceph/ceph/pull/49975
uint32_t mClockScheduler::calc_scaled_cost(int item_cost)
{
  auto cost = static_cast<uint32_t>(
    std::max<int>(
      1, // ensure cost is non-zero and positive
      item_cost));
  auto cost_per_io = static_cast<uint32_t>(osd_bandwidth_cost_per_io);

  // Calculate total scaled cost in bytes
  return cost_per_io + cost;
}
The osd_bandwidth_cost_per_io parameter used in the function is explained as follows:
/**
 * osd_bandwidth_cost_per_io
 *
 * mClock expects all queued items to have a uniform expression of
 * "cost". However, IO devices generally have quite different capacity
 * for sequential IO vs small random IO. This implementation handles this
 * by expressing all costs as a number of sequential bytes written, adding
 * additional cost for each random IO equal to osd_bandwidth_cost_per_io.
 *
 * Thus, an IO operation requiring a total of <size> bytes to be written
 * across <iops> different locations will have a cost of
 * <size> + (osd_bandwidth_cost_per_io * <iops>) bytes.
 *
 * Set in set_osd_capacity_params_from_config in the constructor and upon
 * config change.
 *
 * Has units bytes/io.
 */
double osd_bandwidth_cost_per_io;
osd_bandwidth_cost_per_io is calculated as follows:
void mClockScheduler::set_osd_capacity_params_from_config()
{
  uint64_t osd_bandwidth_capacity;
  double osd_iop_capacity;

  std::tie(osd_bandwidth_capacity, osd_iop_capacity) = [&, this] {
    if (is_rotational) {
      return std::make_tuple(
        cct->_conf.get_val<Option::size_t>("osd_mclock_max_sequential_bandwidth_hdd"),
        cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_hdd"));
    } else {
      return std::make_tuple(
        cct->_conf.get_val<Option::size_t>("osd_mclock_max_sequential_bandwidth_ssd"),
        cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_ssd"));
    }
  }();

  osd_bandwidth_capacity = std::max<uint64_t>(1, osd_bandwidth_capacity);
  osd_iop_capacity = std::max<double>(1.0, osd_iop_capacity);

  osd_bandwidth_cost_per_io =
    static_cast<double>(osd_bandwidth_capacity) / osd_iop_capacity;
  osd_bandwidth_capacity_per_shard =
    static_cast<double>(osd_bandwidth_capacity) / static_cast<double>(num_shards);
}
To illustrate the problem, here is an example:
Preconditions:
- osd_mclock_max_sequential_bandwidth_hdd = 100MB/s
- osd_mclock_max_capacity_iops_hdd = 100 io/s
- osd_op_num_threads_per_shard = 5
- osd_mclock_scheduler_client_res = 1.0 // 100%
- osd_mclock_scheduler_client_lim = 1.0 // 100%
- osd_mclock_scheduler_client_wgt = 2
- write 200KB per IO
osd_bandwidth_cost_per_io = 209715.2 bytes/io = 204.8 KB/IO
osd_bandwidth_capacity_per_shard = 100MB/s / 5 = 20 MB/s = 20480 KB/s
There are two scenarios:
- Without adding the osd_bandwidth_cost_per_io cost:
  - a 200KB IO costs 9.76 ms
- With the osd_bandwidth_cost_per_io cost added:
  - 200KB/IO + 204.8KB/IO costs 19.76 ms (cost doubled)
Question:
Is it necessary to increase osd_bandwidth_cost_per_io for each IO?
Updated by jianwei zhang 8 months ago
Updated by jianwei zhang 8 months ago
The incremental tag calculation works as follows:
void mClockScheduler::ClientRegistry::update_from_config(
  const ConfigProxy &conf,
  const double capacity_per_shard)
{
  auto get_res = [&](double res) {
    if (res) {
      return res * capacity_per_shard;
    } else {
      return default_min; // min reservation --> constexpr double default_min = 0.0;
    }
  };

  auto get_lim = [&](double lim) {
    if (lim) { // if osd_mclock_scheduler_client_lim is 0, use infinity as the upper limit
      return lim * capacity_per_shard;
    } else {
      return default_max; // high limit --> constexpr double default_max =
                          //   std::numeric_limits<double>::is_iec559 ?
                          //   std::numeric_limits<double>::infinity() :
                          //   std::numeric_limits<double>::max();
    }
  };

  // Set external client infos
  double res = conf.get_val<double>("osd_mclock_scheduler_client_res");
  double lim = conf.get_val<double>("osd_mclock_scheduler_client_lim");
  uint64_t wgt = conf.get_val<uint64_t>("osd_mclock_scheduler_client_wgt");
  default_external_client_info.update(get_res(res), wgt, get_lim(lim));
}
// order parameters -- min, "normal", max
ClientInfo(double _reservation, double _weight, double _limit)
{
  update(_reservation, _weight, _limit);
}

inline void update(double _reservation, double _weight, double _limit)
{
  reservation = _reservation;
  weight = _weight;
  limit = _limit;
  reservation_inv = (0.0 == reservation) ? 0.0 : 1.0 / reservation;
  weight_inv = (0.0 == weight) ? 0.0 : 1.0 / weight;
  limit_inv = (0.0 == limit) ? 0.0 : 1.0 / limit;
}
When an IO is added to the mclock queue, its tag is initialized as follows:
// data_mtx must be held by caller
RequestTag initial_tag(DelayedTagCalc delayed, ClientRec& client,
                       const ReqParams& params, Time time, Cost cost)
{
  RequestTag tag(0, 0, 0, time, 0, 0, cost);

  // only calculate a tag if the request is going straight to the front
  if (!client.has_request()) {
    const ClientInfo* client_info = get_cli_info(client);
    assert(client_info);
    tag = RequestTag(client.get_req_tag(), *client_info,
                     params, time, cost, anticipation_timeout);

    // copy tag to previous tag for client
    client.update_req_tag(tag, tick);
  }
  return tag;
}
// inline crimson::dmclock::RequestTag::RequestTag(double _res, double _prop, double _lim,
//   crimson::dmclock::Time _arrival, uint32_t _delta = 0U, uint32_t _rho = 0U,
//   crimson::dmclock::Cost _cost = 1U)
RequestTag(const double _res, const double _prop, const double _lim,
           const Time _arrival,
           const uint32_t _delta = 0,
           const uint32_t _rho = 0,
           const Cost _cost = 1u) :
  reservation(_res), // 0
  proportion(_prop), // 0
  limit(_lim),       // 0
  delta(_delta),     // 0
  rho(_rho),         // 0
  cost(_cost),       // non-zero
  ready(false),      // false
  arrival(_arrival)  // non-zero
{
  assert(cost > 0);
  assert(reservation < max_tag || proportion < max_tag);
}
RequestTag(const RequestTag& prev_tag,
           const ClientInfo& client,
           const uint32_t _delta,
           const uint32_t _rho,
           const Time time,
           const Cost _cost = 1u,
           const double anticipation_timeout = 0.0) :
  delta(_delta),
  rho(_rho),
  cost(_cost),
  ready(false),
  arrival(time)
{
  assert(cost > 0);
  Time max_time = time;
  if (time - anticipation_timeout < prev_tag.arrival)
    max_time -= anticipation_timeout;

  reservation = tag_calc(max_time, prev_tag.reservation,
                         client.reservation_inv, rho, true, cost);
  proportion = tag_calc(max_time, prev_tag.proportion,
                        client.weight_inv, delta, true, cost);
  limit = tag_calc(max_time, prev_tag.limit,
                   client.limit_inv, delta, false, cost);
  assert(reservation < max_tag || proportion < max_tag);
}
static double tag_calc(const Time time,
                       const double prev,
                       const double increment,
                       const uint32_t dist_req_val,
                       const bool extreme_is_high,
                       const Cost cost)
{
  if (0.0 == increment) {
    return extreme_is_high ? max_tag : min_tag;
  } else {
    // ensure 64-bit arithmetic before conversion to double
    double tag_increment = increment * (uint64_t(dist_req_val) + cost);
    return std::max(time, prev + tag_increment);
  }
}
Updated by jianwei zhang 8 months ago
Please help me check whether there are any errors in this cost calculation process.
If there are none, please discuss whether it is necessary to add osd_bandwidth_cost_per_io.
Updated by jianwei zhang 8 months ago
Updated by jianwei zhang 8 months ago
- File hdd_iops.png added
Another question:
How should osd_mclock_max_sequential_bandwidth_hdd and osd_mclock_max_capacity_iops_hdd be tested and obtained?
The community recommends using:
ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]
The code logic of osd bench is as follows:
1. Prewrite NUM_OBJS objects of size OBJ_SIZE
2. start_time = now
3. Repeatedly pick one of the prewritten objects at random and write BYTES_PER_WRITE
   bytes to a random aligned offset within it, until TOTAL_BYTES have been written
4. end_time = now
5. elapsed = end_time - start_time
6. bw = TOTAL_BYTES / elapsed
7. iops = bw / BYTES_PER_WRITE
Question:
If the store is bluestore, it does not overwrite the original object in place;
instead, newly allocated disk space is used for the write.
In this case, the measured IOPS deviate greatly.
For example, with:
bluestore_throttle_deferred_bytes = 0
bluestore_prefer_deferred_size_hdd = 0
an HDD can report 10000 IOPS.
- name: osd_mclock_max_sequential_bandwidth_hdd
type: size
level: basic
desc: The maximum sequential bandwidth in bytes/second of the OSD (for rotational media)
long_desc: This option specifies the maximum sequential bandwidth to consider
for an OSD whose underlying device type is rotational media. This is
considered by the mclock scheduler to derive the cost factor to be used in
QoS calculations. Only considered for osd_op_queue = mclock_scheduler
fmt_desc: The maximum sequential bandwidth in bytes/second to consider for the
OSD (for rotational media)
default: 150_M
flags:
- runtime
- name: osd_mclock_max_capacity_iops_hdd
type: float
level: basic
desc: Max random write IOPS capacity (at 4KiB block size) to consider per OSD (for rotational media)
long_desc: This option specifies the max OSD random write IOPS capacity per
OSD. Contributes in QoS calculations when enabling a dmclock profile. Only
considered for osd_op_queue = mclock_scheduler
fmt_desc: Max random write IOPS capacity (at 4 KiB block size) to consider per
OSD (for rotational media)
default: 315
flags:
- runtime
https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
else if (prefix == "bench") {
  // default count 1G, size 4MB
  int64_t count = cmd_getval_or<int64_t>(cmdmap, "count", 1LL << 30);
  int64_t bsize = cmd_getval_or<int64_t>(cmdmap, "size", 4LL << 20);
  int64_t osize = cmd_getval_or<int64_t>(cmdmap, "object_size", 0);
  int64_t onum = cmd_getval_or<int64_t>(cmdmap, "object_num", 0);
  double elapsed = 0.0;

  ret = run_osd_bench_test(count, bsize, osize, onum, &elapsed, ss);
  if (ret != 0) {
    goto out;
  }

  double rate = count / elapsed;
  double iops = rate / bsize;
  f->open_object_section("osd_bench_results");
  f->dump_int("bytes_written", count);
  f->dump_int("blocksize", bsize);
  f->dump_float("elapsed_sec", elapsed);
  f->dump_float("bytes_per_sec", rate);
  f->dump_float("iops", iops);
  f->close_section();
}
int OSD::run_osd_bench_test(
  int64_t count,
  int64_t bsize,
  int64_t osize,
  int64_t onum,
  double *elapsed,
  ostream &ss)
{
  int ret = 0;
  ... ...
  if (osize && onum) {
    bufferlist bl;
    bufferptr bp(osize);
    memset(bp.c_str(), 'a', bp.length());
    bl.push_back(std::move(bp));
    bl.rebuild_page_aligned();
    for (int i = 0; i < onum; ++i) {
      char nm[30];
      snprintf(nm, sizeof(nm), "disk_bw_test_%d", i);
      object_t oid(nm);
      hobject_t soid(sobject_t(oid, 0));
      ObjectStore::Transaction t;
      t.write(coll_t(), ghobject_t(soid), 0, osize, bl);
      store->queue_transaction(service.meta_ch, std::move(t), nullptr);
      cleanupt.remove(coll_t(), ghobject_t(soid));
    }
  }
  ... ...
  bufferlist bl;
  utime_t start = ceph_clock_now();
  for (int64_t pos = 0; pos < count; pos += bsize) {
    char nm[34];
    unsigned offset = 0;
    bufferptr bp(bsize);
    memset(bp.c_str(), rand() & 0xff, bp.length());
    bl.push_back(std::move(bp));
    bl.rebuild_page_aligned();
    if (onum && osize) {
      snprintf(nm, sizeof(nm), "disk_bw_test_%d", (int)(rand() % onum));
      offset = rand() % (osize / bsize) * bsize;
    } else {
      snprintf(nm, sizeof(nm), "disk_bw_test_%lld", (long long)pos);
    }
    object_t oid(nm);
    hobject_t soid(sobject_t(oid, 0));
    ObjectStore::Transaction t;
    t.write(coll_t::meta(), ghobject_t(soid), offset, bsize, bl);
    store->queue_transaction(service.meta_ch, std::move(t), nullptr);
    if (!onum || !osize) {
      cleanupt.remove(coll_t::meta(), ghobject_t(soid));
    }
    bl.clear();
  }
  {
    C_SaferCond waiter;
    if (!service.meta_ch->flush_commit(&waiter)) {
      waiter.wait();
    }
  }
  utime_t end = ceph_clock_now();
  *elapsed = end - start;
  ... ...
  return ret;
}
Updated by jianwei zhang 8 months ago
- File rados_bench_pr_52809.png added
Updated by jianwei zhang 8 months ago
osd/scheduler/mClockScheduler: Use same profile and client ids for all clients to ensure allocated QoS limit consumption.
https://github.com/ceph/ceph/pull/52809
Hi sseshasa,
How did you test OSD IOPS and bandwidth Capacity?
Updated by jianwei zhang 8 months ago
I have pushed a modified version, please review.
For the cost calculation, the core idea is to take the larger of the user's item_cost and osd_bandwidth_cost_per_io:
https://github.com/ceph/ceph/pull/53417
Updated by Sridhar Seshasayee 8 months ago
Responses to your questions.
Q: how to test and get osd_mclock_max_sequential_bandwidth_hdd and osd_mclock_max_capacity_iops_hdd ?
osd_mclock_max_capacity_iops_hdd is determined during OSD boot up by running an OSD bench test using
4 KiB writes. Although we write to random offsets within an object, the results do vary sometimes.
This could be due to drive specific settings and/or optimizations. Due to these deviations,
osd_mclock_iops_capacity_threshold_hdd was introduced to fallback to saner settings. These options
are configurable. If the default settings do not accurately represent the capability of the device, then
it's recommended to run benchmark tests using other tools (fio for e.g.) and then set the OSD IOPS
capacity. We do log cluster warnings if the threshold values are exceeded so that further steps can be
taken by the user. At this point, the OSD bench is the only tool we can run during OSD boot-up until
another alternative can be identified.
For osd_mclock_max_sequential_bandwidth_hdd (default: 150 MiB/s), the thought is that this is a
reasonable generic setting to use. We currently do not measure this. But this can be changed to
reflect the actual capability of the device by measuring using Fio or other tools.
Q: in https://github.com/ceph/ceph/pull/52809 How did you test OSD IOPS and bandwidth Capacity?
In our test environment, the OSD bench reported IOPS at 4 KiB randw is close to the actual
capability of the device (~375 IOPS). Tests with Fio too reported close to the IOPS value shown
in the graph. For the test, the custom mClock profile was enabled and
osd_mclock_scheduler_client_lim was set to 30% of the OSD's IOPS capacity. With these settings,
5 Rados Bench instances were started and the graph shows the average IOPS reported by each Rados
Bench instance.
Thoughts About Your Proposed Fix
The osd_bandwidth_cost_per_io is currently calculated using the IOPS capacity at 4 KiB
block size. This represents the base cost per IO. For progressively larger IO sizes, the
idea is that the cost should be increased appropriately. This is the reason for adding
the item cost to the base cost_per_io parameter in calc_scaled_cost(). But this approach
as you have noted results in lower than expected IOPS for an item whose cost is lower
than the cost_per_io parameter.
Therefore, your proposed fix to pass only the cost_per_io in the tag calculation and
passing the item cost only if it's greater than the cost_per_io seems good to me.
However, I would also like to hear thoughts from Sam Just on this proposed change.
Updated by jianwei zhang 8 months ago
Thanks for your reply.
The difference between the fio and osd bench results is still too big:
fio (200KB buffer, direct=1, libaio) can achieve 200MB/s of bandwidth,
while osd bench reaches at most 100MB/s.
Updated by jianwei zhang 8 months ago
Our plan:
use osd bench or rados bench to test the bandwidth of the OSD,
and use fio to test the random IOPS of the HDD.
Updated by jianwei zhang 8 months ago
hi Sridhar Seshasayee,
I am still quite confused about cost calculation and latency tag.
For HDD,
The problem scenario is as follows:
Preconditions:
- bandwidth = 100 MiB/s
- IOPS = 100
1. IO buffer size = 1 MiB
2. cost = 1 / 100 = 0.01 s = 10 ms
   These 10 ms are a reference value for the cost of transferring the IO to disk
3. This IO will wait in the mclock queue for 10 ms before being scheduled
4. This IO then actually takes about 10 ms to execute on the disk
Confusing points:
In fact, the total IO latency is about 20 ms (mclock queue wait + execution on disk),
so IO latency almost doubles.
If we want to limit recovery to 100 MB/s, what we actually get may be 50 MB/s, which is lower than expected.
mClock's assumption:
IOs can be scheduled fairly and evenly on the timeline.
For example, at 100 IOPS, it expects to schedule an IO every 10 ms.
Assume the hardware is CPU, memory, and NVMe: since these are all high-speed devices, the execution time of an IO on them can be ignored, so IO latency is basically the mclock scheduling/queueing delay.
Back to a low-speed device such as an HDD:
The basis for time-tagging an IO is its cost, and the cost is calculated from bandwidth/IOPS.
Since the device is slow, the time an IO actually spends executing on the disk cannot be ignored.
How do you think about this?
Updated by Sridhar Seshasayee 8 months ago
In addition to the op queue which is managed by mClock, items get transferred to the
operation sequencer at the objectstore layer. Once mClock dequeues an item from the
op queue, it no longer has control. The time an op spends in the operation sequencer
must also be factored in the latency calculation.
The above is also mentioned in this section:
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#caveats
A subset of the options that influence the items in the operation sequencer are:
bluestore_throttle_bytes and bluestore_throttle_deferred_bytes.
To figure out if the above is contributing to the latency, the options may be
tuned to ensure items spend as little time as possible in the operation sequencer.
One way to tune this is mentioned here:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#benchmarking-test-steps-using-osd-bench
You can use the IO tool of your choice with block size 1 MiB. The idea is that
for each iteration you set the bluestore throttle options and measure the
throughput and compare it with the baseline (measured with default bluestore throttles).
The throttle values are incremented in each iteration until the throughput matches
the baseline. At this point the throttles can be considered as optimal.
The steps can be easily automated and the optimal bluestore throttle options determined.
The idea with the above exercise is to figure out if operation sequencer is the cause of
the additional latency you are observing.
For HDDs, I expect the throttle values to be on the higher side.
How do you think about this?
Let me investigate this a bit from my side as well and get back to you.
Updated by jianwei zhang 8 months ago
Applied this patch: https://github.com/ceph/ceph/pull/53417
bluestore_throttle_bytes = 0
bluestore_throttle_deferred_bytes = 0
test-0 : osd bench
* bsize=1M
* IOPS=237
* BW=237M
test-1 : client_limit = 1.0 / client_res = 0.5 / client_wgt = 60 / iops=200 / bw=200M
* bsize = 1M
* BW = 160M / 200M = 80%
test-2 : client_lim = 0 / client_res = 0.5 / client_wgt = 60 / iops=200 / bw=200M
* bsize = 1M
* 235M / 200M = 117.5%
test-3 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M
* bsize = 1M
* 190M / 240M = 79%
test-4 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M
* bsize = 8K
* BW = 1.6M
* IOPS = 1.6M * 1024 / 8K = 204 IOPS ==> 204 / 240 = 85%
test-5 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=100 / bw=100M
* bsize = 1M
* BW = 84M / 100M = 84%
test-0 : osd bench ==> bsize=1M, IOPS=237 BW=237M
osd_bench_duration = 300
osd_bench_large_size_max_throughput = 104857600
osd_bench_max_block_size = 67108864
osd_bench_small_size_max_iops = 100
# ceph tell osd.0 cache drop
# ceph tell osd.0 bench 10737418240 1048576 1048576 10240
{
"bytes_written": 10737418240,
"blocksize": 1048576,
"elapsed_sec": 43.103751181,
"bytes_per_sec": 249106352.59821704,
"iops": 237.56633052655891
}
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 09:44:43 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 904.00 0.00 2.12 4.81 0.10 0.11 0.00 0.11 0.11 10.00
sdf 0.00 2667.00 0.00 908.00 0.00 227.00 512.00 140.87 154.84 0.00 154.84 1.10 100.00
09/21/2023 09:44:44 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 949.00 0.00 2.23 4.82 0.09 0.10 0.00 0.10 0.10 9.30
sdf 0.00 2913.00 0.00 948.00 0.00 237.00 512.00 142.02 149.23 0.00 149.23 1.05 100.00
09/21/2023 09:44:45 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 968.00 0.00 2.27 4.80 0.10 0.10 0.00 0.10 0.10 9.80
sdf 0.00 2904.00 0.00 964.00 0.00 241.00 512.00 143.00 148.77 0.00 148.77 1.04 100.00
09/21/2023 09:44:46 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 952.00 0.00 2.24 4.82 0.10 0.11 0.00 0.11 0.11 10.50
sdf 0.00 2784.00 0.00 952.00 0.00 238.00 512.00 143.32 149.99 0.00 149.99 1.05 100.00
09/21/2023 09:44:47 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 900.00 0.00 2.12 4.82 0.09 0.10 0.00 0.10 0.10 8.60
sdf 0.00 2784.00 0.00 900.00 0.00 225.00 512.00 143.46 158.80 0.00 158.80 1.11 100.00
test-1 : client_limit = 1.0 / client_res = 0.5 / client_wgt = 60 / iops=200 / bw=200M ==> bsize=1M, BW = 160M / 200M = 80%
ceph cluster:
# ceph -s
cluster:
id: 0348ad4a-7f88-4cfe-b49f-b3bd80856b79
health: HEALTH_OK
services:
mon: 1 daemons, quorum a (age 7m)
mgr: x(active, since 6m)
osd: 1 osds: 1 up (since 6m), 1 in (since 6h)
flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub
data:
pools: 1 pools, 128 pgs
objects: 7.76k objects, 7.6 GiB
usage: 186 GiB used, 9.1 TiB / 9.3 TiB avail
pgs: 128 active+clean
# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 9.27039 - 9.3 TiB 189 GiB 10 GiB 1 KiB 52 MiB 9.1 TiB 1.99 1.00 - root default
-3 9.27039 - 9.3 TiB 189 GiB 10 GiB 1 KiB 52 MiB 9.1 TiB 1.99 1.00 - host zjw-q-dev
0 hdd 9.27039 1.00000 9.3 TiB 189 GiB 10 GiB 1 KiB 52 MiB 9.1 TiB 1.99 1.00 128 up osd.0
TOTAL 9.3 TiB 189 GiB 10 GiB 1.1 KiB 52 MiB 9.1 TiB 1.99
MIN/MAX VAR: 1.00/1.00 STDDEV: 0
# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 9.3 TiB 9.1 TiB 189 GiB 189 GiB 1.99
TOTAL 9.3 TiB 9.1 TiB 189 GiB 189 GiB 1.99
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
test-pool 1 128 10 GiB 10.22k 10 GiB 0.11 8.6 TiB
# ceph osd pool ls detail
pool 1 'test-pool' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 26 flags hashpspool stripe_width 0 application rgw
# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "choose_firstn",
"num": 0,
"type": "osd"
},
{
"op": "emit"
}
]
}
]
Preconditions:
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
"bluestore_throttle_bytes": "0", //unlimited
"bluestore_throttle_cost_per_io": "0",
"bluestore_throttle_cost_per_io_hdd": "670000",
"bluestore_throttle_cost_per_io_ssd": "4000",
"bluestore_throttle_deferred_bytes": "0", //unlimited
"bluestore_throttle_trace_rate": "0.000000",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "200.000000", //200 IOPS
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "209715200", //200_M
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.200000",
"osd_mclock_scheduler_background_best_effort_wgt": "20",
"osd_mclock_scheduler_background_recovery_lim": "0.500000",
"osd_mclock_scheduler_background_recovery_res": "0.300000",
"osd_mclock_scheduler_background_recovery_wgt": "20",
"osd_mclock_scheduler_client_lim": "1.000000", //100% --> 200_M
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "60",
"osd_mclock_skip_benchmark": "true"
# cat test-bench-write-1M.sh
> writelog
for i in {1..1} ; do
name1=$(echo $RANDOM)
name2=$(echo $RANDOM)
echo "test-bench-write-$name1-$name2"
nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done
# rados bench output. ==> 160M / 200M = 80%
2023-09-21T22:02:59.047725+0800 min lat: 0.00659005 max lat: 0.340699 avg lat: 0.0613749 lat p50: 0.0499426 lat p90: 0.131342 lat p99: 0.226897 lat p999: 0.249803 lat p100: 0.340699
2023-09-21T22:02:59.047725+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T22:02:59.047725+0800 40 10 6514 6504 162.571 156 0.0380342 0.0613749
2023-09-21T22:03:00.047956+0800 41 10 6682 6672 162.703 168 0.0173561 0.0613827
2023-09-21T22:03:01.048102+0800 42 10 6850 6840 162.828 168 0.0370583 0.0613393
2023-09-21T22:03:02.048281+0800 43 10 7027 7017 163.157 177 0.0119378 0.0612306
2023-09-21T22:03:03.048453+0800 44 10 7187 7177 163.085 160 0.0123712 0.0612312
2023-09-21T22:03:04.048619+0800 45 10 7354 7344 163.171 167 0.131077 0.0612135
2023-09-21T22:03:05.048763+0800 46 10 7514 7504 163.102 160 0.0284153 0.0612553
2023-09-21T22:03:06.048931+0800 47 10 7640 7630 162.312 126 0.0126964 0.0614655
2023-09-21T22:03:07.049076+0800 48 10 7786 7776 161.972 146 0.124256 0.0616731
2023-09-21T22:03:08.049236+0800 49 10 7953 7943 162.074 167 0.0389563 0.0616579
2023-09-21T22:03:09.049372+0800 50 10 8102 8092 161.812 149 0.0578777 0.0617027
2023-09-21T22:03:10.049539+0800 51 10 8267 8257 161.874 165 0.0878024 0.0617294
2023-09-21T22:03:11.049708+0800 52 10 8436 8426 162.01 169 0.0568071 0.0616733
2023-09-21T22:03:12.049884+0800 53 10 8598 8588 162.01 162 0.0551456 0.0616418
2023-09-21T22:03:13.050060+0800 54 10 8752 8742 161.861 154 0.0323046 0.0617172
2023-09-21T22:03:14.050238+0800 55 10 8909 8899 161.772 157 0.0292404 0.0617236
2023-09-21T22:03:15.050418+0800 56 10 9070 9060 161.758 161 0.0476941 0.0617766
2023-09-21T22:03:16.050595+0800 57 10 9223 9213 161.603 153 0.072164 0.061812
2023-09-21T22:03:17.050767+0800 58 10 9390 9380 161.696 167 0.141416 0.0617799
2023-09-21T22:03:18.050900+0800 59 10 9552 9542 161.701 162 0.0166232 0.0617668
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:02:55 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 656.00 0.00 1.54 4.79 0.07 0.11 0.00 0.11 0.11 6.90
sdf 0.00 2064.00 0.00 685.00 0.00 171.25 512.00 13.54 19.60 0.00 19.60 1.40 96.20
09/21/2023 10:02:56 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 652.00 0.00 1.51 4.74 0.06 0.10 0.00 0.10 0.10 6.20
sdf 0.00 1956.00 0.00 655.00 0.00 163.75 512.00 7.11 11.07 0.00 11.07 1.42 93.20
09/21/2023 10:02:57 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 673.00 0.00 1.56 4.75 0.07 0.11 0.00 0.11 0.11 7.50
sdf 0.00 2040.00 0.00 678.00 0.00 169.50 512.00 7.72 11.36 0.00 11.36 1.40 94.70
09/21/2023 10:02:58 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 700.00 0.00 1.64 4.81 0.08 0.11 0.00 0.11 0.11 7.70
sdf 0.00 2088.00 0.00 703.00 0.00 175.75 512.00 10.00 14.28 0.00 14.28 1.37 96.30
test-2: client_lim = 0 / client_res = 0.5 / client_wgt = 60 / iops=200 / bw=200M > bsize=1M > 235M / 200M = 117.5%
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
"bluestore_throttle_bytes": "0",
"bluestore_throttle_cost_per_io": "0",
"bluestore_throttle_cost_per_io_hdd": "670000",
"bluestore_throttle_cost_per_io_ssd": "4000",
"bluestore_throttle_deferred_bytes": "0",
"bluestore_throttle_trace_rate": "0.000000",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "200.000000", // 200 iops
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "209715200", // 200_M
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.200000",
"osd_mclock_scheduler_background_best_effort_wgt": "20",
"osd_mclock_scheduler_background_recovery_lim": "0.500000",
"osd_mclock_scheduler_background_recovery_res": "0.300000",
"osd_mclock_scheduler_background_recovery_wgt": "20",
"osd_mclock_scheduler_client_lim": "0.000000", //0 unlimited
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "60",
"osd_mclock_skip_benchmark": "true",
# cat test-bench-write-1M.sh
> writelog
for i in {1..1} ; do
name1=$(echo $RANDOM)
name2=$(echo $RANDOM)
echo "test-bench-write-$name1-$name2"
nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done
# rados bench output. ==> 235M / 200M = 117.5%
2023-09-21T22:00:09.629327+0800 min lat: 0.0080338 max lat: 1.44649 avg lat: 0.0423585 lat p50: 0.0406062 lat p90: 0.0475816 lat p99: 0.0622597 lat p999: 1.0747 lat p100: 1.44649
2023-09-21T22:00:09.629327+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T22:00:09.629327+0800 20 10 4726 4716 235.758 225 0.0437467 0.0423585
2023-09-21T22:00:10.629573+0800 21 10 4954 4944 235.386 228 0.0422658 0.0424362
2023-09-21T22:00:11.629745+0800 22 10 5196 5186 235.685 242 0.0122676 0.0416933
2023-09-21T22:00:12.629926+0800 23 10 5433 5423 235.74 237 0.0409134 0.0423776
2023-09-21T22:00:13.630107+0800 24 10 5660 5650 235.374 227 0.042527 0.0424412
2023-09-21T22:00:14.630281+0800 25 10 5893 5883 235.278 233 0.0415217 0.0424618
2023-09-21T22:00:15.630466+0800 26 10 6139 6129 235.688 246 0.0416429 0.0423908
2023-09-21T22:00:16.630638+0800 27 10 6383 6373 235.995 244 0.0302444 0.0423147
2023-09-21T22:00:17.630801+0800 28 10 6616 6606 235.886 233 0.0264553 0.0423603
2023-09-21T22:00:18.630967+0800 29 10 6842 6832 235.544 226 0.0239288 0.042045
2023-09-21T22:00:19.631136+0800 30 10 7079 7069 235.591 237 0.0171055 0.04134
2023-09-21T22:00:20.631330+0800 31 10 7320 7310 235.764 241 0.042195 0.0423773
2023-09-21T22:00:21.631554+0800 32 10 7558 7548 235.833 238 0.0437976 0.0423641
2023-09-21T22:00:22.631735+0800 33 10 7773 7763 235.2 215 0.041803 0.042481
2023-09-21T22:00:23.631900+0800 34 10 8014 8004 235.37 241 0.0424048 0.0424514
2023-09-21T22:00:24.632100+0800 35 10 8244 8234 235.215 230 0.0440427 0.042478
2023-09-21T22:00:25.632269+0800 36 10 8485 8475 235.374 241 0.0495573 0.0424507
2023-09-21T22:00:26.632443+0800 37 10 8717 8707 235.282 232 0.0474286 0.0424687
2023-09-21T22:00:27.632604+0800 38 10 8949 8939 235.195 232 0.041791 0.0424856
2023-09-21T22:00:28.632757+0800 39 10 9194 9184 235.445 245 0.0405403 0.0424448
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:00:06 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 1.00 972.00 0.01 2.27 4.79 0.09 0.10 0.00 0.10 0.10 9.40
sdf 0.00 2904.00 0.00 972.00 0.00 243.00 512.00 34.37 35.43 0.00 35.43 1.03 100.10
09/21/2023 10:00:07 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 988.00 0.00 2.30 4.77 0.10 0.10 0.00 0.10 0.10 9.80
sdf 0.00 2964.00 0.00 988.00 0.00 247.00 512.00 34.42 34.79 0.00 34.79 1.01 100.00
09/21/2023 10:00:08 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 952.00 0.00 2.23 4.79 0.10 0.11 0.00 0.11 0.11 10.00
sdf 0.00 2868.00 0.00 956.00 0.00 239.00 512.00 34.60 36.30 0.00 36.30 1.04 99.90
09/21/2023 10:00:09 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 904.00 0.00 2.11 4.78 0.09 0.10 0.00 0.10 0.10 9.20
sdf 0.00 2712.00 0.00 900.00 0.00 225.00 512.00 34.99 38.67 0.00 38.67 1.11 100.10
09/21/2023 10:00:10 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 893.00 0.00 2.08 4.77 0.09 0.10 0.00 0.10 0.10 8.90
sdf 0.00 2712.00 0.00 908.00 0.00 227.00 512.00 33.09 36.49 0.00 36.49 1.10 100.00
test-3 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M > bsize=1M > 190M / 240M = 79%
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
"bluestore_throttle_bytes": "0", //unlimited
"bluestore_throttle_cost_per_io": "0",
"bluestore_throttle_cost_per_io_hdd": "670000",
"bluestore_throttle_cost_per_io_ssd": "4000",
"bluestore_throttle_deferred_bytes": "0", //unlimited
"bluestore_throttle_trace_rate": "0.000000",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "240.000000", //240 IOPS
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "251658240", //240_M
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.200000",
"osd_mclock_scheduler_background_best_effort_wgt": "20",
"osd_mclock_scheduler_background_recovery_lim": "0.500000",
"osd_mclock_scheduler_background_recovery_res": "0.300000",
"osd_mclock_scheduler_background_recovery_wgt": "20",
"osd_mclock_scheduler_client_lim": "1.000000",
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "60",
"osd_mclock_skip_benchmark": "true",
# rados bench output --> 190M / 240M = 79%
2023-09-21T22:16:55.688680+0800 min lat: 0.00692427 max lat: 0.280151 avg lat: 0.0521959 lat p50: 0.0444432 lat p90: 0.102672 lat p99: 0.157707 lat p999: 0.243541 lat p100: 0.280151
2023-09-21T22:16:55.688680+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T22:16:55.688680+0800 40 10 7670 7660 191.468 193 0.0258948 0.0521959
2023-09-21T22:16:56.688877+0800 41 10 7848 7838 191.139 178 0.0298819 0.0522755
2023-09-21T22:16:57.689070+0800 42 10 8045 8035 191.277 197 0.0209257 0.0522216
2023-09-21T22:16:58.689211+0800 43 10 8238 8228 191.317 193 0.107852 0.0522338
2023-09-21T22:16:59.689373+0800 44 10 8429 8419 191.309 191 0.103653 0.0522236
2023-09-21T22:17:00.689522+0800 45 10 8615 8605 191.19 186 0.0454725 0.0522409
2023-09-21T22:17:01.689676+0800 46 10 8817 8807 191.425 202 0.0469294 0.0521794
2023-09-21T22:17:02.689815+0800 47 10 8999 8989 191.224 182 0.0383939 0.0522514
2023-09-21T22:17:03.689984+0800 48 10 9198 9188 191.385 199 0.0326067 0.0522025
2023-09-21T22:17:04.690152+0800 49 10 9395 9385 191.499 197 0.0737887 0.0521808
2023-09-21T22:17:05.690299+0800 50 10 9595 9585 191.668 200 0.0132583 0.0521397
2023-09-21T22:17:06.690476+0800 51 10 9792 9782 191.772 197 0.0432396 0.0521141
2023-09-21T22:17:07.690624+0800 52 10 9964 9954 191.391 172 0.0432432 0.0522076
2023-09-21T22:17:08.690746+0800 53 10 10158 10148 191.44 194 0.0164307 0.0521984
2023-09-21T22:17:09.690859+0800 54 10 10364 10354 191.709 206 0.00817016 0.0521235
2023-09-21T22:17:10.690969+0800 55 10 10568 10558 191.932 204 0.0361452 0.0520669
2023-09-21T22:17:11.691108+0800 56 10 10742 10732 191.612 174 0.0464498 0.0521503
2023-09-21T22:17:12.691265+0800 57 10 10931 10921 191.565 189 0.0620719 0.0521665
2023-09-21T22:17:13.691406+0800 58 10 11115 11105 191.435 184 0.0567793 0.0522109
2023-09-21T22:17:14.691570+0800 59 10 11310 11300 191.494 195 0.0529321 0.0521737
# cat test-bench-write-1M.sh
> writelog
for i in {1..1} ; do
name1=$(echo $RANDOM)
name2=$(echo $RANDOM)
echo "test-bench-write-$name1-$name2"
nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:17:03 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 800.00 0.00 1.87 4.79 0.08 0.11 0.00 0.11 0.10 8.40
sdf 0.00 2364.00 0.00 804.00 0.00 201.00 512.00 12.27 15.72 0.00 15.72 1.22 98.40
09/21/2023 10:17:04 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 797.00 0.00 1.86 4.78 0.08 0.10 0.00 0.10 0.09 7.50
sdf 0.00 2364.00 0.00 793.00 0.00 198.25 512.00 9.10 11.55 0.00 11.55 1.22 96.80
09/21/2023 10:17:05 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 772.00 0.00 1.79 4.76 0.08 0.10 0.00 0.10 0.10 7.70
sdf 0.00 2376.00 0.00 779.00 0.00 194.75 512.00 13.60 17.24 0.00 17.24 1.26 98.50
09/21/2023 10:17:06 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 792.00 0.00 1.84 4.77 0.08 0.10 0.00 0.10 0.10 7.70
sdf 0.00 2424.00 0.00 809.00 0.00 202.25 512.00 13.76 17.00 0.00 17.00 1.21 98.20
test-4 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M > bsize=8K > 1.6M * 1024 / 8K = 204 IOPS ==> 204 / 240 = 85%
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
"bluestore_throttle_bytes": "0", //unlimited
"bluestore_throttle_cost_per_io": "0",
"bluestore_throttle_cost_per_io_hdd": "670000",
"bluestore_throttle_cost_per_io_ssd": "4000",
"bluestore_throttle_deferred_bytes": "0", //unlimited
"bluestore_throttle_trace_rate": "0.000000",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "240.000000", //240 IOPS
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "251658240", //240M
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.200000",
"osd_mclock_scheduler_background_best_effort_wgt": "20",
"osd_mclock_scheduler_background_recovery_lim": "0.500000",
"osd_mclock_scheduler_background_recovery_res": "0.300000",
"osd_mclock_scheduler_background_recovery_wgt": "20",
"osd_mclock_scheduler_client_lim": "1.000000",
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "60",
"osd_mclock_skip_benchmark": "true",
# cat test-bench-write-8K.sh
> writelog
for i in {1..1} ; do
name1=$(echo $RANDOM)
name2=$(echo $RANDOM)
echo "test-bench-write-$name1-$name2"
nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool -b 8192 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done
# rados bench output ==> 1.6M * 1024 / 8K = 204 IOPS ==> 204 / 240 = 85%
2023-09-21T22:22:22.328084+0800 min lat: 0.00096772 max lat: 0.205785 avg lat: 0.048328 lat p50: 0.0399051 lat p90: 0.106316 lat p99: 0.178427 lat p999: 0.205785 lat p100: 0.205785
2023-09-21T22:22:22.328084+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T22:22:22.328084+0800 60 10 12416 12406 1.6151 1.63281 0.0623918 0.048328
2023-09-21T22:22:23.328281+0800 61 10 12616 12606 1.61423 1.5625 0.019324 0.0483596
2023-09-21T22:22:24.328438+0800 62 10 12817 12807 1.61352 1.57031 0.127102 0.0483708
2023-09-21T22:22:25.328597+0800 63 10 13008 12998 1.61159 1.49219 0.0204207 0.0484451
2023-09-21T22:22:26.328779+0800 64 10 13228 13218 1.61326 1.71875 0.00126544 0.0483676
2023-09-21T22:22:27.328941+0800 65 10 13446 13436 1.61464 1.70312 0.00127368 0.0483457
2023-09-21T22:22:28.329112+0800 66 10 13660 13650 1.6155 1.67188 0.0207621 0.0483041
2023-09-21T22:22:29.329278+0800 67 10 13866 13856 1.61541 1.60938 0.00123517 0.0483221
2023-09-21T22:22:30.329441+0800 68 10 14044 14034 1.6121 1.39062 0.041497 0.0484098
2023-09-21T22:22:31.329607+0800 69 10 14248 14238 1.61183 1.59375 0.0417745 0.0484319
2023-09-21T22:22:32.329779+0800 70 10 14439 14429 1.61012 1.49219 0.0218883 0.0484935
2023-09-21T22:22:33.329948+0800 71 10 14639 14629 1.60944 1.5625 0.0210867 0.0485107
2023-09-21T22:22:34.330113+0800 72 10 14856 14846 1.61063 1.69531 0.0415444 0.048472
2023-09-21T22:22:35.330280+0800 73 10 15057 15047 1.61007 1.57031 0.0871025 0.0484789
2023-09-21T22:22:36.330444+0800 74 10 15271 15261 1.61091 1.67188 0.0630531 0.0484612
2023-09-21T22:22:37.330584+0800 75 10 15476 15466 1.61078 1.60156 0.0637163 0.0484811
2023-09-21T22:22:38.330745+0800 76 10 15661 15651 1.6086 1.44531 0.0160527 0.0485229
2023-09-21T22:22:39.330905+0800 77 10 15861 15851 1.608 1.5625 0.105571 0.0485428
2023-09-21T22:22:40.331072+0800 78 10 16053 16043 1.60661 1.5 0.0014667 0.0485908
2023-09-21T22:22:41.331241+0800 79 10 16271 16261 1.60783 1.70312 0.00116457 0.0485708
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:22:14 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 852.00 0.00 3.63 8.73 0.09 0.11 0.00 0.11 0.11 9.40
sdf 0.00 0.00 0.00 216.00 0.00 1.69 16.00 0.08 0.37 0.00 0.37 0.24 5.10
09/21/2023 10:22:14 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 788.00 0.00 3.38 8.79 0.07 0.10 0.00 0.10 0.10 7.50
sdf 0.00 0.00 0.00 202.00 0.00 1.58 16.00 0.01 0.06 0.00 0.06 0.06 1.30
09/21/2023 10:22:15 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 801.00 0.00 3.51 8.98 0.08 0.10 0.00 0.10 0.10 8.00
sdf 0.00 0.00 0.00 214.00 0.00 1.67 16.00 0.01 0.07 0.00 0.07 0.07 1.50
test-5 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=100 / bw=100M > bsize=1M > 84M / 100M = 84%
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
"bluestore_throttle_bytes": "0", // unlimited
"bluestore_throttle_cost_per_io": "0",
"bluestore_throttle_cost_per_io_hdd": "670000",
"bluestore_throttle_cost_per_io_ssd": "4000",
"bluestore_throttle_deferred_bytes": "0", // unlimited
"bluestore_throttle_trace_rate": "0.000000",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "100.000000", // 100 iops
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "104857600", // 100_M
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.200000",
"osd_mclock_scheduler_background_best_effort_wgt": "20",
"osd_mclock_scheduler_background_recovery_lim": "0.500000",
"osd_mclock_scheduler_background_recovery_res": "0.300000",
"osd_mclock_scheduler_background_recovery_wgt": "20",
"osd_mclock_scheduler_client_lim": "1.000000",
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "60",
"osd_mclock_skip_benchmark": "true",
# cat test-bench-write-1M.sh
> writelog
for i in {1..1} ; do
name1=$(echo $RANDOM)
name2=$(echo $RANDOM)
echo "test-bench-write-$name1-$name2"
nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done
# rados bench output ==> 84M / 100M = 84%
2023-09-21T22:11:07.590247+0800 min lat: 0.00665102 max lat: 0.506691 avg lat: 0.117194 lat p50: 0.0938236 lat p90: 0.263444 lat p99: 0.46757 lat p999: 0.506691 lat p100: 0.506691
2023-09-21T22:11:07.590247+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T22:11:07.590247+0800 60 10 5118 5108 85.1189 89 0.0576973 0.117194
2023-09-21T22:11:08.590451+0800 61 10 5185 5175 84.8217 67 0.041496 0.117657
2023-09-21T22:11:09.590623+0800 62 10 5260 5250 84.6631 75 0.0879328 0.117885
2023-09-21T22:11:10.590789+0800 63 10 5346 5336 84.684 86 0.0369602 0.117808
2023-09-21T22:11:11.590975+0800 64 10 5434 5424 84.7356 88 0.139532 0.117881
2023-09-21T22:11:12.591107+0800 65 10 5523 5513 84.801 89 0.055929 0.117786
2023-09-21T22:11:13.591272+0800 66 10 5618 5608 84.9553 95 0.0881378 0.117566
2023-09-21T22:11:14.591439+0800 67 10 5707 5697 85.0155 89 0.134108 0.117414
2023-09-21T22:11:15.591563+0800 68 10 5785 5775 84.9122 78 0.391141 0.11746
2023-09-21T22:11:16.591743+0800 69 10 5855 5845 84.6959 70 0.0336723 0.11791
2023-09-21T22:11:17.591912+0800 70 10 5945 5935 84.7714 90 0.164689 0.117806
2023-09-21T22:11:18.592058+0800 71 10 6031 6021 84.7885 86 0.204223 0.117755
2023-09-21T22:11:19.592226+0800 72 10 6111 6101 84.7218 80 0.0878715 0.117879
2023-09-21T22:11:20.592402+0800 73 10 6190 6180 84.6433 79 0.0121704 0.117972
2023-09-21T22:11:21.592573+0800 74 10 6270 6260 84.5803 80 0.0120983 0.117998
2023-09-21T22:11:22.592716+0800 75 10 6356 6346 84.5991 86 0.0696619 0.117987
2023-09-21T22:11:23.592886+0800 76 10 6433 6423 84.4989 77 0.124536 0.118201
2023-09-21T22:11:24.593056+0800 77 10 6508 6498 84.3754 75 0.360682 0.118245
2023-09-21T22:11:25.593214+0800 78 10 6590 6580 84.3448 82 0.22629 0.118426
2023-09-21T22:11:26.593396+0800 79 10 6674 6664 84.3402 84 0.0130458 0.118399
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:10:59 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 356.00 0.00 0.83 4.79 0.03 0.10 0.00 0.10 0.10 3.40
sdf 0.00 1080.00 0.00 353.00 0.00 88.25 512.00 2.93 8.28 0.00 8.28 1.82 64.40
09/21/2023 10:11:00 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 1.00 352.00 0.01 0.82 4.83 0.04 0.11 1.00 0.11 0.11 4.00
sdf 0.00 1056.00 0.00 359.00 0.00 89.75 512.00 3.36 9.38 0.00 9.38 1.83 65.70
09/21/2023 10:11:01 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 276.00 0.00 0.64 4.78 0.03 0.09 0.00 0.09 0.09 2.60
sdf 0.00 828.00 0.00 273.00 0.00 68.25 512.00 2.00 7.37 0.00 7.37 2.07 56.50
09/21/2023 10:11:02 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 316.00 0.00 0.73 4.73 0.04 0.11 0.00 0.11 0.11 3.60
sdf 0.00 960.00 0.00 319.00 0.00 79.75 512.00 2.62 8.14 0.00 8.14 2.01 64.10
Updated by jianwei zhang 8 months ago
- bsize=200K
- BW = 39M / 240M = 16.25%
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
"bluestore_throttle_bytes": "0", //unlimited
"bluestore_throttle_cost_per_io": "0",
"bluestore_throttle_cost_per_io_hdd": "670000",
"bluestore_throttle_cost_per_io_ssd": "4000",
"bluestore_throttle_deferred_bytes": "0", //unlimited
"bluestore_throttle_trace_rate": "0.000000",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "240.000000", //240IOPS
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "251658240", //240M
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.200000",
"osd_mclock_scheduler_background_best_effort_wgt": "20",
"osd_mclock_scheduler_background_recovery_lim": "0.500000",
"osd_mclock_scheduler_background_recovery_res": "0.300000",
"osd_mclock_scheduler_background_recovery_wgt": "20",
"osd_mclock_scheduler_client_lim": "1.000000",
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "60",
"osd_mclock_skip_benchmark": "true",
# cat test-bench-write-200K.sh
> writelog
for i in {1..1} ; do
name1=$(echo $RANDOM)
name2=$(echo $RANDOM)
echo "test-bench-write-$name1-$name2"
nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool -b 204800 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done
# rados bench output ==> 39 MB/s
2023-09-21T22:40:30.038698+0800 min lat: 0.00208069 max lat: 0.28836 avg lat: 0.0499285 lat p50: 0.0399047 lat p90: 0.10739 lat p99: 0.178679 lat p999: 0.246243 lat p100: 0.28836
2023-09-21T22:40:30.038698+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T22:40:30.038698+0800 100 10 20029 20019 39.0932 37.1094 0.0365631 0.0499285
2023-09-21T22:40:31.038925+0800 101 10 20222 20212 39.0792 37.6953 0.0972425 0.0499445
2023-09-21T22:40:32.039082+0800 102 10 20415 20405 39.0656 37.6953 0.0497988 0.0499663
2023-09-21T22:40:33.039240+0800 103 10 20617 20607 39.0693 39.4531 0.010729 0.0499684
2023-09-21T22:40:34.039408+0800 104 10 20808 20798 39.0523 37.3047 0.0417104 0.0499912
2023-09-21T22:40:35.039584+0800 105 10 21005 20995 39.0467 38.4766 0.0305145 0.0499838
2023-09-21T22:40:36.039737+0800 106 10 21193 21183 39.0247 36.7188 0.0196327 0.0500246
2023-09-21T22:40:37.039912+0800 107 10 21380 21370 39.0013 36.5234 0.164901 0.0500521
2023-09-21T22:40:38.040083+0800 108 10 21579 21569 39 38.8672 0.154728 0.0500468
2023-09-21T22:40:39.040256+0800 109 10 21760 21750 38.9665 35.3516 0.124795 0.0500964
2023-09-21T22:40:40.040425+0800 110 10 21949 21939 38.9478 36.9141 0.0408142 0.0501265
2023-09-21T22:40:41.040597+0800 111 10 22152 22142 38.954 39.6484 0.0419623 0.0501202
2023-09-21T22:40:42.040772+0800 112 10 22353 22343 38.9567 39.2578 0.009937 0.0501097
2023-09-21T22:40:43.040942+0800 113 10 22554 22544 38.9593 39.2578 0.00886829 0.0500964
2023-09-21T22:40:44.041090+0800 114 10 22759 22749 38.9687 40.0391 0.0634586 0.050102
2023-09-21T22:40:45.041214+0800 115 10 22961 22951 38.9728 39.4531 0.0403335 0.0500961
2023-09-21T22:40:46.041394+0800 116 10 23160 23150 38.9719 38.8672 0.0608101 0.0500976
2023-09-21T22:40:47.041570+0800 117 10 23371 23361 38.991 41.2109 0.137616 0.0500709
2023-09-21T22:40:48.041740+0800 118 10 23585 23575 39.0147 41.7969 0.0301397 0.0500334
2023-09-21T22:40:49.041916+0800 119 10 23790 23780 39.0232 40.0391 0.0495194 0.050023
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:42:01 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 840.00 0.00 1.92 4.69 0.09 0.11 0.00 0.11 0.11 8.90
sdf 0.00 627.00 0.00 210.00 0.00 41.02 400.00 1.32 6.26 0.00 6.26 3.53 74.10
09/21/2023 10:42:02 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 748.00 0.00 1.73 4.73 0.08 0.11 0.00 0.11 0.11 8.10
sdf 0.00 561.00 0.00 187.00 0.00 36.52 400.00 1.11 5.93 0.00 5.93 3.78 70.60
09/21/2023 10:42:03 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 832.00 0.00 1.91 4.71 0.07 0.09 0.00 0.09 0.09 7.10
sdf 0.00 624.00 0.00 208.00 0.00 40.62 400.00 1.07 5.16 0.00 5.16 2.76 57.50
Updated by jianwei zhang 8 months ago
cost_per_io = bw / iops = 240M / 240 = 1M
scaled_cost = cost_per_io + item_cost = 1M + 200K
bw = 240M
mclock_queue_delay = 1M / 240M = 0.0041 s
200K_on_disk_lat = 200K / 240M = 0.0008 s
all_lat = 0.0041 + 0.0008 = 0.0049 s
iops = 1 / 0.0049 = 204
bw = 204 * 200K = 39 MB/s
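The arithmetic above can be checked with a small back-of-envelope model (an illustrative script, not Ceph source code; `predicted_client_bw` is a hypothetical helper name): each IO is charged osd_bandwidth_cost_per_io plus its size in bytes against the bandwidth budget, which paces the achievable IOPS and therefore the client bandwidth.

```python
MiB = 1024 * 1024

def predicted_client_bw(item_cost: int, max_bw: int, max_iops: float) -> float:
    """Predicted client bandwidth (MiB/s) for back-to-back IOs of item_cost bytes."""
    cost_per_io = max_bw / max_iops        # osd_bandwidth_cost_per_io, in bytes
    scaled_cost = cost_per_io + item_cost  # what calc_scaled_cost() charges
    delay = scaled_cost / max_bw           # seconds mClock charges per IO
    iops = 1.0 / delay
    return iops * item_cost / MiB

# test-6 parameters: 200K writes, bw=240M, iops=240
bw = predicted_client_bw(200 * 1024, 240 * MiB, 240.0)
print(f"{bw:.1f} MiB/s")  # ~39 MiB/s, matching the observed rados bench result
```

This reproduces why a 200K workload lands near 39 MB/s: the fixed 1M per-IO surcharge dominates the 200K payload in the scaled cost.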
Updated by jianwei zhang 8 months ago
test-7: client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M
* write : buffer=1M, bw=192MB/s, iops=192
* randread : buffer=1M, bw=108MB/s (45%), iops=108
* seqread : buffer=1M, bw=126MB/s (52.5%), iops=126
* mclock_queue_delay = 1 / 240 = 0.0041 s
* 1M_on_disk_lat = 1 / 240 = 0.0041 s
* 1 / (0.0041 * 2) = 121 IOPS ==> 121 * 1M = 121 MB/s
1. lat1 = the queuing delay mClock computes from the byte-based cost
2. lat2 = the disk seek + transfer time, which cannot be ignored
3. Even with limit = 1 (meaning the full 240 MB/s of disk bandwidth may be used), there is still a loss in read and write bandwidth, especially read bandwidth
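The lat1 + lat2 estimate can be sketched numerically (an illustrative script, not Ceph code; `cost_per_io` mirrors osd_bandwidth_cost_per_io, and the serial-latency assumption is the author's model above):

```python
MiB = 1024 * 1024

# test-7 parameters: bw=240M, iops=240, 1 MiB IOs, client_lim = 1
bw_cap = 240 * MiB
iops_cap = 240.0
size = 1 * MiB

cost_per_io = bw_cap / iops_cap         # osd_bandwidth_cost_per_io = 1 MiB
lat1 = cost_per_io / bw_cap             # mClock queuing delay: 1/240 s
lat2 = size / bw_cap                    # on-disk transfer time: 1/240 s
iops = 1.0 / (lat1 + lat2)              # serial per-IO latency caps IOPS at 120
print(f"{iops * size / MiB:.0f} MB/s")  # ~120 MB/s (the note above rounds to 121)
```

This is close to the observed seqread 126 MB/s and randread 108 MB/s: when lat1 and lat2 are paid serially, the 240 MB/s budget is roughly halved for 1 MiB IOs.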
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
"bluestore_throttle_bytes": "0",
"bluestore_throttle_cost_per_io": "0",
"bluestore_throttle_cost_per_io_hdd": "670000",
"bluestore_throttle_cost_per_io_ssd": "4000",
"bluestore_throttle_deferred_bytes": "0",
"bluestore_throttle_trace_rate": "0.000000",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "240.000000",
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "251658240",
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.200000",
"osd_mclock_scheduler_background_best_effort_wgt": "20",
"osd_mclock_scheduler_background_recovery_lim": "0.500000",
"osd_mclock_scheduler_background_recovery_res": "0.300000",
"osd_mclock_scheduler_background_recovery_wgt": "20",
"osd_mclock_scheduler_client_lim": "1.000000",
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "60",
"osd_mclock_skip_benchmark": "true",
# cat test-bench-randread-1M.sh
> readlog
> writelog
ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time > writelog 2>&1
ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 rand -t 10 -p test-pool --show-time > readlog 2>&1
ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 seq -t 10 -p test-pool --show-time > seq-readlog 2>&1
# rados -c ./ceph.conf bench 300 write --no-cleanup -t 10 -p test-pool -b 1048576
2023-09-21T23:17:32.018512+0800 min lat: 0.00677637 max lat: 0.345545 avg lat: 0.0520317 lat p50: 0.0441953 lat p90: 0.103738 lat p99: 0.158788 lat p999: 0.240164 lat p100: 0.345545
2023-09-21T23:17:32.018512+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T23:17:32.018512+0800 180 10 34594 34584 192.105 188 0.143394 0.0520317
2023-09-21T23:17:33.018704+0800 181 10 34778 34768 192.06 184 0.102056 0.0520474
2023-09-21T23:17:34.018820+0800 182 10 34966 34956 192.038 188 0.0288972 0.0520533
2023-09-21T23:17:35.018992+0800 183 10 35171 35161 192.108 205 0.0199566 0.0520353
2023-09-21T23:17:36.019171+0800 184 10 35349 35339 192.031 178 0.0525519 0.0520571
2023-09-21T23:17:37.019338+0800 185 10 35538 35528 192.015 189 0.0338643 0.0520639
2023-09-21T23:17:38.019502+0800 186 10 35740 35730 192.068 202 0.0236666 0.0520508
2023-09-21T23:17:39.019640+0800 187 10 35931 35921 192.062 191 0.0177018 0.0520489
2023-09-21T23:17:40.019789+0800 188 10 36123 36113 192.062 192 0.0333076 0.0520442
2023-09-21T23:17:41.019926+0800 189 10 36326 36316 192.12 203 0.0302826 0.0520373
2023-09-21T23:17:42.020063+0800 190 10 36521 36511 192.135 195 0.0300076 0.05203
2023-09-21T23:17:43.020209+0800 191 10 36711 36701 192.123 190 0.0613027 0.0520309
2023-09-21T23:17:44.020331+0800 192 10 36903 36893 192.123 192 0.0208707 0.052036
2023-09-21T23:17:45.020486+0800 193 10 37097 37087 192.132 194 0.0458238 0.0520295
2023-09-21T23:17:46.020604+0800 194 10 37284 37274 192.106 187 0.0207243 0.0520379
2023-09-21T23:17:47.020814+0800 195 10 37474 37464 192.095 190 0.00812529 0.0520349
2023-09-21T23:17:48.020959+0800 196 10 37677 37667 192.15 203 0.02463 0.0520271
2023-09-21T23:17:49.021133+0800 197 10 37875 37865 192.18 198 0.0600757 0.0520202
2023-09-21T23:17:50.021214+0800 198 10 38047 38037 192.078 172 0.0397751 0.0520491
2023-09-21T23:17:51.021411+0800 199 10 38241 38231 192.087 194 0.0580198 0.0520454
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 11:17:56 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 760.00 0.00 1.78 4.79 0.08 0.10 0.00 0.10 0.10 7.90
sdf 0.00 2184.00 0.00 757.00 0.00 189.25 512.00 11.65 16.39 0.00 16.39 1.24 93.90
09/21/2023 11:17:57 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 1.00 725.00 0.01 1.68 4.77 0.07 0.09 0.00 0.09 0.09 6.50
sdf 0.00 2256.00 0.00 734.00 0.00 183.50 512.00 9.09 12.08 0.00 12.08 1.30 95.70
09/21/2023 11:17:58 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 1.00 728.00 0.01 1.69 4.77 0.08 0.11 0.00 0.11 0.11 7.70
sdf 0.00 2142.00 0.00 727.00 0.00 181.75 512.00 10.13 14.18 0.00 14.18 1.29 94.10
09/21/2023 11:17:59 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 820.00 0.00 1.92 4.80 0.09 0.10 0.00 0.10 0.10 8.60
sdf 0.00 2502.00 0.00 825.00 0.00 206.25 512.00 12.20 14.79 0.00 14.79 1.20 98.70
# rados -c ./ceph.conf bench 300 rand -t 10 -p test-pool --show-time
2023-09-21T23:21:14.801963+0800 min lat: 0.000950983 max lat: 0.329227 avg lat: 0.0918224 lat p50: 0.079395 lat p90: 0.188812 lat p99: 0.2499 lat p999: 0.329227 lat p100: 0.329227
2023-09-21T23:21:14.801963+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T23:21:14.801963+0800 100 10 10870 10860 108.582 112 0.0475709 0.0918224
2023-09-21T23:21:15.802203+0800 101 10 10982 10972 108.616 112 0.0714889 0.0918054
2023-09-21T23:21:16.802396+0800 102 10 11088 11078 108.59 106 0.19222 0.0918347
2023-09-21T23:21:17.802598+0800 103 10 11197 11187 108.594 109 0.0498854 0.0918089
2023-09-21T23:21:18.802744+0800 104 10 11299 11289 108.53 102 0.148587 0.0918716
2023-09-21T23:21:19.802888+0800 105 10 11412 11402 108.573 113 0.139777 0.091857
2023-09-21T23:21:20.803039+0800 106 10 11520 11510 108.567 108 0.080589 0.0918555
2023-09-21T23:21:21.803180+0800 107 10 11626 11616 108.543 106 0.0585344 0.0918476
2023-09-21T23:21:22.803323+0800 108 10 11733 11723 108.528 107 0.0725719 0.0918789
2023-09-21T23:21:23.803399+0800 109 10 11841 11831 108.524 108 0.0258373 0.0918775
2023-09-21T23:21:24.803523+0800 110 10 11949 11939 108.519 108 0.0499517 0.0918971
2023-09-21T23:21:25.803697+0800 111 10 12060 12050 108.541 111 0.0370205 0.0918196
2023-09-21T23:21:26.803851+0800 112 10 12169 12159 108.545 109 0.0773861 0.0918556
2023-09-21T23:21:27.803986+0800 113 10 12279 12269 108.558 110 0.0938997 0.0918722
2023-09-21T23:21:28.804130+0800 114 10 12387 12377 108.553 108 0.0322604 0.091867
2023-09-21T23:21:29.804282+0800 115 10 12495 12485 108.548 108 0.176533 0.0918611
2023-09-21T23:21:30.804605+0800 116 9 12605 12596 108.568 111 0.163263 0.0918711
2023-09-21T23:21:31.804745+0800 117 10 12715 12705 108.572 109 0.111872 0.0918562
2023-09-21T23:21:32.804906+0800 118 10 12823 12813 108.567 108 0.180895 0.0918464
2023-09-21T23:21:33.805088+0800 119 10 12936 12926 108.604 113 0.0160037 0.0918286
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 11:21:36 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1278.00 0.00 526.00 0.00 114.13 0.00 444.35 16.03 31.38 31.38 0.00 1.90 99.90
09/21/2023 11:21:37 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1225.00 0.00 515.00 0.00 108.88 0.00 432.96 16.40 31.86 31.86 0.00 1.94 100.00
09/21/2023 11:21:38 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1221.00 0.00 515.00 0.00 107.81 0.00 428.74 15.82 30.66 30.66 0.00 1.94 100.10
# rados -c ./ceph.conf bench 300 seq -t 10 -p test-pool --show-time
2023-09-21T23:28:31.738419+0800 min lat: 0.00553556 max lat: 0.32825 avg lat: 0.0785171 lat p50: 0.0663674 lat p90: 0.154047 lat p99: 0.244373 lat p999: 0.32825 lat p100: 0.32825
2023-09-21T23:28:31.738419+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-21T23:28:31.738419+0800 40 10 5086 5076 126.878 123 0.114363 0.0785171
2023-09-21T23:28:32.738621+0800 41 10 5218 5208 127.003 132 0.0976044 0.0784495
2023-09-21T23:28:33.738798+0800 42 10 5345 5335 127.002 127 0.0842019 0.0784338
2023-09-21T23:28:34.738972+0800 43 10 5476 5466 127.095 131 0.20592 0.0783696
2023-09-21T23:28:35.739144+0800 44 10 5612 5602 127.296 136 0.0140386 0.0782273
2023-09-21T23:28:36.739313+0800 45 10 5742 5732 127.356 130 0.20805 0.0781625
2023-09-21T23:28:37.739507+0800 46 10 5860 5850 127.152 118 0.0301462 0.0782675
2023-09-21T23:28:38.739688+0800 47 10 5986 5976 127.127 126 0.177597 0.0783592
2023-09-21T23:28:39.739859+0800 48 10 6112 6102 127.103 126 0.100173 0.0784004
2023-09-21T23:28:40.740046+0800 49 10 6243 6233 127.182 131 0.063912 0.078341
2023-09-21T23:28:41.740213+0800 50 10 6380 6370 127.378 137 0.012598 0.078227
2023-09-21T23:28:42.740375+0800 51 10 6514 6504 127.508 134 0.20194 0.0781202
2023-09-21T23:28:43.740552+0800 52 10 6649 6639 127.651 135 0.13733 0.0780518
2023-09-21T23:28:44.740710+0800 53 10 6775 6765 127.62 126 0.106213 0.0780625
2023-09-21T23:28:45.740873+0800 54 10 6891 6881 127.404 116 0.114592 0.0782213
2023-09-21T23:28:46.741043+0800 55 10 7012 7002 127.287 121 0.0441335 0.0782742
2023-09-21T23:28:47.741225+0800 56 10 7136 7126 127.228 124 0.0276655 0.0783007
2023-09-21T23:28:48.741405+0800 57 10 7267 7257 127.294 131 0.0299355 0.0782478
2023-09-21T23:28:49.741577+0800 58 10 7396 7386 127.323 129 0.192391 0.0782472
2023-09-21T23:28:50.741762+0800 59 10 7525 7515 127.351 129 0.0426686 0.078246
# iostat -xmt 1 -d /dev/sdb /dev/sdf
09/21/2023 11:29:01 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1403.00 0.00 626.00 0.00 126.31 0.00 413.24 14.60 23.03 23.03 0.00 1.60 100.00
09/21/2023 11:29:02 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1426.00 0.00 647.00 0.00 130.50 0.00 413.08 15.45 24.34 24.34 0.00 1.55 100.00
09/21/2023 11:29:03 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1431.00 0.00 642.00 0.00 128.94 0.00 411.31 15.26 23.68 23.68 0.00 1.56 100.00
Updated by jianwei zhang 8 months ago
test-8 : client_lim = 0 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M
* write : buffer=1M, bw=238MB/s, iops=238
* randread : buffer=1M, bw=106MB/s, iops=106
* seqread : buffer=1M, bw=127MB/s (52.5%), iops=127
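The roughly halved read throughput in these runs follows directly from the cost formula quoted at the top of this issue. A minimal sketch of the arithmetic (plain Python mirroring `mClockScheduler::calc_scaled_cost`, not actual Ceph code; the capacity constants are the ones from the config dump below):

```python
MiB = 1048576

# Capacity values from this test's config:
#   osd_mclock_max_sequential_bandwidth_hdd = 240 MiB/s
#   osd_mclock_max_capacity_iops_hdd        = 240
osd_bandwidth_capacity = 240 * MiB   # bytes/s
osd_iop_capacity = 240.0             # IOPS

# osd_bandwidth_cost_per_io = bandwidth / iops, in bytes per IO
cost_per_io = osd_bandwidth_capacity / osd_iop_capacity   # 1 MiB per IO

def calc_scaled_cost(item_cost: int) -> int:
    """Mirror of calc_scaled_cost: every op is unconditionally
    charged cost_per_io on top of its own size in bytes."""
    return int(cost_per_io) + max(1, item_cost)

# A 1 MiB read is charged 2 MiB of "sequential bytes", so at a
# 240 MiB/s capacity the scheduler only admits ~120 such ops/s.
scaled = calc_scaled_cost(1 * MiB)                 # 2 MiB
ops_per_sec = osd_bandwidth_capacity / scaled      # 120.0
```

Under these assumptions the scheduler caps 1 MiB reads at ~120 MB/s, which is consistent with the ~106-127 MB/s observed in the randread/seqread logs above and is the motivation for asking whether the `cost_per_io` term should be added unconditionally.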
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
"bluestore_throttle_bytes": "0", //unlimited
"bluestore_throttle_cost_per_io": "0",
"bluestore_throttle_cost_per_io_hdd": "670000",
"bluestore_throttle_cost_per_io_ssd": "4000",
"bluestore_throttle_deferred_bytes": "0", //unlimited
"bluestore_throttle_trace_rate": "0.000000",
"osd_mclock_force_run_benchmark_on_init": "false",
"osd_mclock_iops_capacity_threshold_hdd": "500.000000",
"osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
"osd_mclock_max_capacity_iops_hdd": "240.000000", // iops=240
"osd_mclock_max_capacity_iops_ssd": "21500.000000",
"osd_mclock_max_sequential_bandwidth_hdd": "251658240", //bw=240M
"osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
"osd_mclock_override_recovery_settings": "false",
"osd_mclock_profile": "custom",
"osd_mclock_scheduler_anticipation_timeout": "0.000000",
"osd_mclock_scheduler_background_best_effort_lim": "0.100000",
"osd_mclock_scheduler_background_best_effort_res": "0.200000",
"osd_mclock_scheduler_background_best_effort_wgt": "20",
"osd_mclock_scheduler_background_recovery_lim": "0.500000",
"osd_mclock_scheduler_background_recovery_res": "0.300000",
"osd_mclock_scheduler_background_recovery_wgt": "20",
"osd_mclock_scheduler_client_lim": "0.000000", //unlimited
"osd_mclock_scheduler_client_res": "0.500000",
"osd_mclock_scheduler_client_wgt": "60",
"osd_mclock_skip_benchmark": "true",
# cat test-bench-read-1M.sh
#> readlog
#> writelog
ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time > writelog 2>&1
ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 rand -t 10 -p test-pool --show-time > readlog 2>&1
ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 seq -t 10 -p test-pool --show-time > seq-readlog 2>&1
# rados -c ./ceph.conf bench 300 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time > writelog 2>&1
2023-09-22T15:32:17.676876+0800 min lat: 0.00747302 max lat: 2.55199 avg lat: 0.0420207 lat p50: 0.0395603 lat p90: 0.0472786 lat p99: 0.0521989 lat p999: 1.4363 lat p100: 2.55199
2023-09-22T15:32:17.676876+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-22T15:32:17.676876+0800 60 9 14283 14274 237.862 244 0.0407354 0.0420207
2023-09-22T15:32:18.677013+0800 61 10 14519 14509 237.814 235 0.0468292 0.0420267
2023-09-22T15:32:19.677158+0800 62 10 14751 14741 237.72 232 0.0412701 0.0420435
2023-09-22T15:32:20.677295+0800 63 10 14997 14987 237.851 246 0.0402811 0.042023
2023-09-22T15:32:21.677467+0800 64 10 15243 15233 237.978 246 0.0432071 0.0419993
2023-09-22T15:32:22.677597+0800 65 10 15479 15469 237.947 236 0.0465882 0.0420044
2023-09-22T15:32:23.677787+0800 66 10 15710 15700 237.841 231 0.0412817 0.0420231
2023-09-22T15:32:24.677899+0800 67 10 15952 15942 237.903 242 0.0417247 0.0420128
2023-09-22T15:32:25.678044+0800 68 10 16199 16189 238.036 247 0.0422215 0.04199
2023-09-22T15:32:26.678153+0800 69 10 16435 16425 238.006 236 0.0411166 0.0419948
2023-09-22T15:32:27.678284+0800 70 10 16663 16653 237.863 228 0.0432414 0.0420201
2023-09-22T15:32:28.678437+0800 71 10 16903 16893 237.892 240 0.0414538 0.0420163
2023-09-22T15:32:29.678612+0800 72 10 17151 17141 238.032 248 0.0402458 0.0419915
2023-09-22T15:32:30.678786+0800 73 10 17394 17384 238.099 243 0.0406311 0.0419796
2023-09-22T15:32:31.678931+0800 74 10 17623 17613 237.976 229 0.0422751 0.0420012
2023-09-22T15:32:32.679100+0800 75 10 17863 17853 238.002 240 0.0407005 0.0419974
2023-09-22T15:32:33.679242+0800 76 10 18107 18097 238.081 244 0.0391934 0.0419851
2023-09-22T15:32:34.679354+0800 77 10 18351 18341 238.157 244 0.0407713 0.0419697
2023-09-22T15:32:35.679497+0800 78 10 18588 18578 238.142 237 0.042393 0.0419723
2023-09-22T15:32:36.679640+0800 79 10 18817 18807 238.026 229 0.0424886 0.0419926
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/22/2023 03:33:41 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 956.00 0.00 2.23 4.77 0.10 0.10 0.00 0.10 0.10 9.80
sdf 0.00 2868.00 0.00 956.00 0.00 239.00 512.00 33.66 35.21 0.00 35.21 1.05 100.00
09/22/2023 03:33:42 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 982.00 0.00 2.29 4.77 0.10 0.10 0.00 0.10 0.10 9.70
sdf 0.00 2952.00 0.00 984.00 0.00 246.00 512.00 33.45 34.08 0.00 34.08 1.02 100.00
09/22/2023 03:33:43 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 983.00 0.00 2.29 4.78 0.10 0.10 0.00 0.10 0.10 9.80
sdf 0.00 2940.00 0.00 980.00 0.00 245.00 512.00 33.37 33.98 0.00 33.98 1.02 100.00
09/22/2023 03:33:44 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 960.00 0.00 2.25 4.80 0.10 0.10 0.00 0.10 0.10 9.80
sdf 0.00 2880.00 0.00 960.00 0.00 240.00 512.00 33.55 34.96 0.00 34.96 1.04 100.00
# rados -c ./ceph.conf bench 300 rand -t 10 -p test-pool --show-time > readlog 2>&1
2023-09-22T15:37:59.274959+0800 min lat: 0.00140976 max lat: 0.365673 avg lat: 0.0932798 lat p50: 0.0796965 lat p90: 0.194479 lat p99: 0.296663 lat p999: 0.362666 lat p100: 0.365673
2023-09-22T15:37:59.274959+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-22T15:37:59.274959+0800 100 10 10705 10695 106.931 106 0.140561 0.0932798
2023-09-22T15:38:00.275178+0800 101 10 10807 10797 106.882 102 0.0374205 0.0933135
2023-09-22T15:38:01.275309+0800 102 10 10908 10898 106.824 101 0.113367 0.0933688
2023-09-22T15:38:02.275474+0800 103 10 11016 11006 106.836 108 0.187362 0.0933498
2023-09-22T15:38:03.275637+0800 104 10 11118 11108 106.789 102 0.086711 0.0934097
2023-09-22T15:38:04.275834+0800 105 10 11223 11213 106.772 105 0.22285 0.0933999
2023-09-22T15:38:05.275989+0800 106 10 11327 11317 106.745 104 0.03781 0.0934276
2023-09-22T15:38:06.276172+0800 107 10 11437 11427 106.776 110 0.143004 0.0934115
2023-09-22T15:38:07.276310+0800 108 10 11543 11533 106.768 106 0.0481894 0.0934059
2023-09-22T15:38:08.276455+0800 109 10 11651 11641 106.779 108 0.087842 0.0934107
2023-09-22T15:38:09.276657+0800 110 10 11758 11748 106.781 107 0.0620523 0.0934106
2023-09-22T15:38:10.276819+0800 111 10 11866 11856 106.792 108 0.059065 0.0934023
2023-09-22T15:38:11.276980+0800 112 10 11975 11965 106.812 109 0.0368395 0.093396
2023-09-22T15:38:12.277138+0800 113 10 12087 12077 106.857 112 0.0540547 0.0933415
2023-09-22T15:38:13.277300+0800 114 10 12195 12185 106.867 108 0.0432903 0.0933042
2023-09-22T15:38:14.277465+0800 115 10 12298 12288 106.834 103 0.0805061 0.0933639
2023-09-22T15:38:15.277632+0800 116 10 12406 12396 106.843 108 0.0938381 0.0933551
2023-09-22T15:38:16.277800+0800 117 10 12516 12506 106.87 110 0.0220266 0.0933227
2023-09-22T15:38:17.277964+0800 118 10 12624 12614 106.88 108 0.0594453 0.0933304
2023-09-22T15:38:18.278136+0800 119 10 12730 12720 106.872 106 0.161019 0.0933295
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/22/2023 03:38:39 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1211.00 0.00 547.00 0.00 109.00 0.00 408.10 16.74 30.15 30.15 0.00 1.83 100.10
09/22/2023 03:38:40 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1145.00 0.00 512.00 0.00 105.00 0.00 420.00 15.84 31.92 31.92 0.00 1.95 100.00
09/22/2023 03:38:41 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1188.00 0.00 522.00 0.00 106.75 0.00 418.82 16.19 30.71 30.71 0.00 1.92 100.00
# rados -c ./ceph.conf bench 300 seq -t 10 -p test-pool --show-time > seq-readlog 2>&1
2023-09-22T15:43:41.147441+0800 min lat: 0.00518757 max lat: 0.365872 avg lat: 0.0782036 lat p50: 0.0608217 lat p90: 0.174392 lat p99: 0.253574 lat p999: 0.358357 lat p100: 0.365872
2023-09-22T15:43:41.147441+0800 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
2023-09-22T15:43:41.147441+0800 140 10 17862 17852 127.491 124 0.0459142 0.0782036
2023-09-22T15:43:42.147654+0800 141 10 17990 17980 127.495 128 0.0692941 0.0782011
2023-09-22T15:43:43.147826+0800 142 10 18120 18110 127.512 130 0.0235289 0.0781802
2023-09-22T15:43:44.148011+0800 143 10 18249 18239 127.523 129 0.0882931 0.0781882
2023-09-22T15:43:45.148191+0800 144 10 18378 18368 127.533 129 0.0375746 0.0781872
2023-09-22T15:43:46.148362+0800 145 10 18502 18492 127.508 124 0.0173108 0.0781833
2023-09-22T15:43:47.148554+0800 146 10 18631 18621 127.518 129 0.218181 0.0781818
2023-09-22T15:43:48.148746+0800 147 10 18761 18751 127.535 130 0.0258726 0.0781515
2023-09-22T15:43:49.148908+0800 148 10 18896 18886 127.585 135 0.104194 0.07815
2023-09-22T15:43:50.149076+0800 149 10 19021 19011 127.568 125 0.0986657 0.078166
2023-09-22T15:43:51.149213+0800 150 10 19151 19141 127.584 130 0.139945 0.0781536
2023-09-22T15:43:52.149362+0800 151 10 19278 19268 127.58 127 0.116396 0.078159
2023-09-22T15:43:53.149531+0800 152 10 19412 19402 127.622 134 0.0435113 0.0781177
2023-09-22T15:43:54.149691+0800 153 10 19540 19530 127.624 128 0.208505 0.0781201
2023-09-22T15:43:55.149859+0800 154 10 19659 19649 127.568 119 0.0754471 0.0781465
2023-09-22T15:43:56.149997+0800 155 10 19795 19785 127.622 136 0.0374644 0.0781163
2023-09-22T15:43:57.150164+0800 156 10 19919 19909 127.599 124 0.189175 0.0781295
2023-09-22T15:43:58.150328+0800 157 10 20055 20045 127.652 136 0.236116 0.078095
2023-09-22T15:43:59.150477+0800 158 10 20194 20184 127.724 139 0.0241363 0.0780588
2023-09-22T15:44:00.150636+0800 159 10 20328 20318 127.763 134 0.0271561 0.0780272
# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/22/2023 03:44:20 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1508.00 0.00 640.00 0.00 134.94 0.00 431.80 14.70 22.97 22.97 0.00 1.56 99.90
09/22/2023 03:44:21 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1327.00 0.00 581.00 0.00 120.00 0.00 422.99 15.30 26.18 26.18 0.00 1.72 100.00
09/22/2023 03:44:22 PM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 1551.00 0.00 662.00 0.00 136.44 0.00 422.09 15.50 23.64 23.64 0.00 1.51 100.00
Updated by Laura Flores 8 months ago
- Status changed from New to Pending Backport
- Backport set to quincy,reef
Updated by Laura Flores 8 months ago
- Copied to Backport #63125: reef: osd: Is it necessary to unconditionally increase osd_bandwidth_cost_per_io in mClockScheduler::calc_scaled_cost? added
Updated by Laura Flores 8 months ago
- Copied to Backport #63126: quincy: osd: Is it necessary to unconditionally increase osd_bandwidth_cost_per_io in mClockScheduler::calc_scaled_cost? added
Updated by Ilya Dryomov 5 months ago
- Status changed from Pending Backport to Resolved
- Target version deleted (v18.2.0)