Bug #62812


osd: Is it necessary to unconditionally increase osd_bandwidth_cost_per_io in mClockScheduler::calc_scaled_cost?

Added by jianwei zhang 8 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
backport_processed
Backport:
quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The following PR removes the IOPS-based QoS cost calculation and replaces it with a bandwidth-based cost calculation:
- https://github.com/ceph/ceph/commit/514cb598fb616dc96f143b0b3a8cc708c212d556
- https://tracker.ceph.com/issues/58529
- https://tracker.ceph.com/issues/59080
- https://github.com/ceph/ceph/pull/49975

uint32_t mClockScheduler::calc_scaled_cost(int item_cost)
{
   auto cost = static_cast<uint32_t>(
      std::max<int>(
          1, // ensure cost is non-zero and positive
          item_cost));

   auto cost_per_io = static_cast<uint32_t>(osd_bandwidth_cost_per_io);

   // Calculate total scaled cost in bytes
   return cost_per_io + cost;
}

The osd_bandwidth_cost_per_io parameter used in the function is explained as follows:

  /**
   * osd_bandwidth_cost_per_io
   *
   * mClock expects all queued items to have a uniform expression of
   * "cost".  However, IO devices generally have quite different capacity
   * for sequential IO vs small random IO.  This implementation handles this
   * by expressing all costs as a number of sequential bytes written adding
   * additional cost for each random IO equal to osd_bandwidth_cost_per_io.
   *
   * Thus, an IO operation requiring a total of <size> bytes to be written
   * across <iops> different locations will have a cost of
   * <size> + (osd_bandwidth_cost_per_io * <iops>) bytes.
   *
   * Set in set_osd_capacity_params_from_config in the constructor and upon
   * config change.
   *
   * Has units bytes/io.
   */
  double osd_bandwidth_cost_per_io;

osd_bandwidth_cost_per_io is calculated as follows:

void mClockScheduler::set_osd_capacity_params_from_config()
{
  uint64_t osd_bandwidth_capacity;
  double osd_iop_capacity;

  std::tie(osd_bandwidth_capacity, osd_iop_capacity) = [&, this] {
    if (is_rotational) {
      return std::make_tuple(
        cct->_conf.get_val<Option::size_t>("osd_mclock_max_sequential_bandwidth_hdd"),
        cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_hdd"));
    } else {
      return std::make_tuple(
        cct->_conf.get_val<Option::size_t>("osd_mclock_max_sequential_bandwidth_ssd"),
        cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_ssd"));
    }
  }();

  osd_bandwidth_capacity = std::max<uint64_t>(1, osd_bandwidth_capacity);
  osd_iop_capacity = std::max<double>(1.0, osd_iop_capacity);

  osd_bandwidth_cost_per_io = static_cast<double>(osd_bandwidth_capacity) / osd_iop_capacity;
  osd_bandwidth_capacity_per_shard = static_cast<double>(osd_bandwidth_capacity) / static_cast<double>(num_shards);
}
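
Below is a minimal standalone sketch (not Ceph code) that reproduces the calculation above using the documented HDD defaults (osd_mclock_max_sequential_bandwidth_hdd = 150_M, osd_mclock_max_capacity_iops_hdd = 315); the shard count of 5 is an assumption for illustration only.

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  // HDD defaults from the option definitions quoted later in this issue
  uint64_t osd_bandwidth_capacity = 150ull << 20;  // 150_M bytes/s
  double   osd_iop_capacity       = 315.0;         // IOPS at 4 KiB
  const unsigned num_shards       = 5;             // assumed shard count (illustration only)

  osd_bandwidth_capacity = std::max<uint64_t>(1, osd_bandwidth_capacity);
  osd_iop_capacity       = std::max<double>(1.0, osd_iop_capacity);

  // Same arithmetic as set_osd_capacity_params_from_config()
  const double cost_per_io        = static_cast<double>(osd_bandwidth_capacity) / osd_iop_capacity;
  const double capacity_per_shard = static_cast<double>(osd_bandwidth_capacity) / num_shards;

  // cost_per_io ~= 499322 bytes/io (~487.6 KiB), capacity_per_shard = 31457280 bytes/s (30 MiB/s)
  std::printf("cost_per_io=%.1f bytes/io, capacity_per_shard=%.1f bytes/s\n",
              cost_per_io, capacity_per_shard);
  return 0;
}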

To illustrate the problem, consider the following example:

Preconditions:
- osd_mclock_max_sequential_bandwidth_hdd = 100MB/s
- osd_mclock_max_capacity_iops_hdd = 100 io/s
- osd_op_num_threads_per_shard = 5
- osd_mclock_scheduler_client_res = 1. //100%
- osd_mclock_scheduler_client_lim = 1. //100%
- osd_mclock_scheduler_client_wgt = 2
- write a 200KB/IO

osd_bandwidth_cost_per_io = 209715.2 bytes/io = 204.8 KB/IO
osd_bandwidth_capacity_per_shard = 100MB/s / 5 = 20 MB/s = 20480 KB/s

There are two scenarios (see the sketch below):
- Without adding the osd_bandwidth_cost_per_io cost:
  - a 200KB IO costs 9.76 ms
- With the osd_bandwidth_cost_per_io cost added:
  - 200KB + 204.8KB costs 19.76 ms (the cost doubles)
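
A minimal sketch (plain arithmetic, not Ceph code) reproducing the figures in this example, taking the osd_bandwidth_cost_per_io of 204.8 KB and the per-shard capacity of 20480 KB/s stated above as given:

#include <cstdio>

int main() {
  const double capacity_per_shard_kb = 20480.0;  // 20 MB/s per shard, from the example above
  const double io_size_kb            = 200.0;    // one 200 KB write
  const double cost_per_io_kb        = 204.8;    // osd_bandwidth_cost_per_io, as stated above

  const double without_ms = io_size_kb / capacity_per_shard_kb * 1000.0;                    // ~9.76 ms
  const double with_ms    = (io_size_kb + cost_per_io_kb) / capacity_per_shard_kb * 1000.0; // ~19.76 ms

  std::printf("cost without per-io add-on: %.2f ms\n", without_ms);
  std::printf("cost with per-io add-on:    %.2f ms\n", with_ms);
  return 0;
}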

One is not to add osd_bandwidth_cost_per_io cost (see attached not_add_osd_bandwidth_cost_per_io.png):

One is to add osd_bandwidth_cost_per_io cost (see attached add_osd_bandwidth_cost_per_io.png):

Question:
Is it necessary to increase osd_bandwidth_cost_per_io for each IO?


Files

add_osd_bandwidth_cost_per_io.png (414 KB), jianwei zhang, 09/12/2023 05:26 AM
not_add_osd_bandwidth_cost_per_io.png (550 KB), jianwei zhang, 09/12/2023 05:26 AM
hdd_iops.png (191 KB), jianwei zhang, 09/12/2023 07:15 AM
tell_bench.png (428 KB), jianwei zhang, 09/12/2023 07:17 AM
rados_bench_pr_52809.png (350 KB), jianwei zhang, 09/12/2023 07:32 AM

Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #63125: reef: osd: Is it necessary to unconditionally increase osd_bandwidth_cost_per_io in mClockScheduler::calc_scaled_cost? (Resolved, Laura Flores)
Copied to RADOS - Backport #63126: quincy: osd: Is it necessary to unconditionally increase osd_bandwidth_cost_per_io in mClockScheduler::calc_scaled_cost? (Resolved, Laura Flores)
Actions #1

Updated by jianwei zhang 8 months ago

One is not to add osd_bandwidth_cost_per_io cost (see attached not_add_osd_bandwidth_cost_per_io.png):

One is to add osd_bandwidth_cost_per_io cost (see attached add_osd_bandwidth_cost_per_io.png):

Actions #2

Updated by jianwei zhang 8 months ago

How the tag increment step is calculated from the configuration:

void mClockScheduler::ClientRegistry::update_from_config(const ConfigProxy &conf, const double capacity_per_shard)
{
  auto get_res = [&](double res) {
    if (res) {
      return res * capacity_per_shard;
    } else {
      return default_min; // min reservation --> constexpr double default_min = 0.0;
    }
  };

  auto get_lim = [&](double lim) {
    if (lim) { // if osd_mclock_scheduler_client_lim is 0, use infinity as the limit
      return lim * capacity_per_shard;
    } else {
      return default_max; // high limit --> constexpr double default_max = std::numeric_limits<double>::is_iec559 ?
                                                                        // std::numeric_limits<double>::infinity() :
                                                                        // std::numeric_limits<double>::max();
    }
  };

  // Set external client infos
  double res = conf.get_val<double>("osd_mclock_scheduler_client_res");
  double lim = conf.get_val<double>("osd_mclock_scheduler_client_lim");
  uint64_t wgt = conf.get_val<uint64_t>("osd_mclock_scheduler_client_wgt");
  default_external_client_info.update(get_res(res), wgt, get_lim(lim));
}

// order parameters -- min, "normal", max
ClientInfo(double _reservation, double _weight, double _limit) 
{
    update(_reservation, _weight, _limit);
}

inline void update(double _reservation, double _weight, double _limit) 
{
       reservation = _reservation;
       weight = _weight;
       limit = _limit;
       reservation_inv = (0.0 == reservation) ? 0.0 : 1.0 / reservation;
       weight_inv = (0.0 == weight) ? 0.0 : 1.0 / weight;
       limit_inv = (0.0 == limit) ? 0.0 : 1.0 / limit; 
}
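
As a minimal illustration (not dmClock code) of how these config values become ClientInfo parameters, assume the settings from the description (client_res = 1.0, client_lim = 1.0, client_wgt = 2) and a hypothetical per-shard capacity of 20 MiB/s:

#include <cstdio>

int main() {
  const double capacity_per_shard = 20.0 * 1024 * 1024;  // bytes/s (assumed, as in the example)
  const double res_frac = 1.0, lim_frac = 1.0, wgt = 2.0;

  const double reservation = res_frac * capacity_per_shard;  // get_res(): 0 would mean "no reservation"
  const double limit       = lim_frac * capacity_per_shard;  // get_lim(): 0 would mean "no limit" (infinity)

  // ClientInfo::update() keeps the inverses; tag_calc() multiplies them by the cost in bytes
  const double reservation_inv = (0.0 == reservation) ? 0.0 : 1.0 / reservation;
  const double weight_inv      = (0.0 == wgt)         ? 0.0 : 1.0 / wgt;
  const double limit_inv       = (0.0 == limit)       ? 0.0 : 1.0 / limit;

  // reservation = limit = 20971520 B/s, reservation_inv = limit_inv ~= 4.77e-08 s per cost-byte
  std::printf("res_inv=%.3e s/B, wgt_inv=%.3f, lim_inv=%.3e s/B\n",
              reservation_inv, weight_inv, limit_inv);
  return 0;
}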

When an IO is added to the mClock queue, its tag is initialized as follows:

// data_mtx must be held by caller
RequestTag initial_tag(DelayedTagCalc delayed, ClientRec& client, const ReqParams& params, Time time, Cost cost) 
{
    RequestTag tag(0, 0, 0, time, 0, 0, cost);

    // only calculate a tag if the request is going straight to the front
    if (!client.has_request()) {
      const ClientInfo* client_info = get_cli_info(client);
      assert(client_info);
      tag = RequestTag(client.get_req_tag(), *client_info, params, time, cost, anticipation_timeout);

      // copy tag to previous tag for client
      client.update_req_tag(tag, tick);
    }
    return tag;
}

// inline crimson::dmclock::RequestTag::RequestTag(double _res, double _prop, double _lim, crimson::dmclock::Time _arrival, 
//                                                 uint32_t _delta = 0U, uint32_t _rho = 0U, crimson::dmclock::Cost _cost = 1U)
 RequestTag(const double _res, const double _prop, const double _lim,
            const Time _arrival,
            const uint32_t _delta = 0,
            const uint32_t _rho = 0,
            const Cost _cost = 1u) :
    reservation(_res),    //0
    proportion(_prop),    //0
    limit(_lim),          //0
    delta(_delta),        //0
    rho(_rho),            //0
    cost(_cost),          // non-zero
    ready(false),         //false
    arrival(_arrival)     // non-zero
{
    assert(cost > 0);
    assert(reservation < max_tag || proportion < max_tag);
}

RequestTag(const RequestTag& prev_tag,
           const ClientInfo& client,
           const uint32_t _delta,
           const uint32_t _rho,
           const Time time,
           const Cost _cost = 1u,
           const double anticipation_timeout = 0.0) :
    delta(_delta),
    rho(_rho),
    cost(_cost),
    ready(false),
    arrival(time)
{
    assert(cost > 0);
    Time max_time = time;
    if (time - anticipation_timeout < prev_tag.arrival)
      max_time -= anticipation_timeout;

    reservation = tag_calc(max_time, prev_tag.reservation, client.reservation_inv, rho, true, cost);
    proportion =  tag_calc(max_time, prev_tag.proportion,  client.weight_inv,    delta, true, cost);
    limit =       tag_calc(max_time, prev_tag.limit,       client.limit_inv,     delta, false, cost);

    assert(reservation < max_tag || proportion < max_tag);
}

static double tag_calc(const Time time,
                 const double prev,
                 const double increment,
                 const uint32_t dist_req_val,
                 const bool extreme_is_high,
                 const Cost cost) 
{
    if (0.0 == increment) {
      return extreme_is_high ? max_tag : min_tag;
    } else {
      // insure 64-bit arithmetic before conversion to double
      double tag_increment = increment * (uint64_t(dist_req_val) + cost);
      return std::max(time, prev + tag_increment);
    }
}
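
To connect this back to the example in the description, here is a minimal standalone sketch (not dmClock code) of the reservation-tag spacing produced by tag_calc(), assuming a per-shard capacity of 20 MiB/s, client_res = 1.0, rho = 0, and the example's ~204.8 KiB osd_bandwidth_cost_per_io added to a 200 KiB write:

#include <cstdint>
#include <cstdio>

int main() {
  const double capacity_per_shard = 20.0 * 1024 * 1024;               // bytes/s (assumed)
  const double reservation_inv    = 1.0 / (1.0 * capacity_per_shard); // client_res = 1.0

  const uint64_t item_cost   = 200 * 1024;   // 200 KiB write
  const uint64_t cost_per_io = 209715;       // ~204.8 KiB, as in the example

  // tag_calc(): tag_increment = increment * (dist_req_val + cost); rho assumed 0 here
  const double spacing_s = reservation_inv * (item_cost + cost_per_io);       // ~0.01977 s between tags

  const double ios_per_shard = 1.0 / spacing_s;                               // ~50.6 IO/s per shard
  const double mib_per_shard = ios_per_shard * item_cost / (1024.0 * 1024.0); // ~9.9 MiB/s per shard

  std::printf("tag spacing ~%.2f ms -> ~%.1f IO/s, ~%.1f MiB/s per shard\n",
              spacing_s * 1000.0, ios_per_shard, mib_per_shard);
  return 0;
}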

Actions #3

Updated by jianwei zhang 8 months ago

Please help me check whether there are any errors in the cost calculation process above.
If there are none, please discuss whether it is necessary to add osd_bandwidth_cost_per_io.

Actions #4

Updated by jianwei zhang 8 months ago

jianwei zhang wrote:

One is not to add osd_bandwidth_cost_per_io cost:

One is to add osd_bandwidth_cost_per_io cost:

One is not to add osd_bandwidth_cost_per_io cost:

One is to add osd_bandwidth_cost_per_io cost:

Actions #5

Updated by jianwei zhang 8 months ago

Another question:
how should osd_mclock_max_sequential_bandwidth_hdd and osd_mclock_max_capacity_iops_hdd be measured?

The community recommends using:
ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]

The code logic of osd bench is as follows (see the sketch after this list):
1. Prewrite NUM_OBJS objects of size OBJ_SIZE
2. begin_time
3. For each BYTES_PER_WRITE chunk of TOTAL_BYTES, randomly pick one of the pre-written objects
   and a random offset within it, and write to it, until TOTAL_BYTES have been written
4. end_time
5. elapsed = end_time - start_time
6. bw = TOTAL_BYTES / elapsed
7. iops = bw / BYTES_PER_WRITE
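
A minimal sketch (not OSD code) of steps 6 and 7, plugging in the numbers from the `ceph tell osd.0 bench 10737418240 1048576 1048576 10240` run shown later in this issue:

#include <cstdint>
#include <cstdio>

int main() {
  const int64_t count   = 10737418240LL;  // TOTAL_BYTES (10 GiB)
  const int64_t bsize   = 1048576LL;      // BYTES_PER_WRITE (1 MiB)
  const double  elapsed = 43.103751181;   // seconds, from the bench output below

  const double bw   = count / elapsed;    // bytes_per_sec, ~249106352 (~237 MiB/s)
  const double iops = bw / bsize;         // ~237.6

  std::printf("bw=%.1f bytes/s, iops=%.1f\n", bw, iops);
  return 0;
}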

Question:
If the store is bluestore, it does not overwrite the original object in place;
instead, newly allocated disk space is used for the write.
In this case, the measured IOPS deviate greatly from true random-write IOPS.

With bluestore_throttle_deferred_bytes = 0 and bluestore_prefer_deferred_size_hdd = 0,
an HDD can show 10000 IOPS.

- name: osd_mclock_max_sequential_bandwidth_hdd
  type: size
  level: basic
  desc: The maximum sequential bandwidth in bytes/second of the OSD (for rotational media)
  long_desc: This option specifies the maximum sequential bandwidth to consider
             for an OSD whose underlying device type is rotational media. This is
             considered by the mclock scheduler to derive the cost factor to be used in
             QoS calculations. Only considered for osd_op_queue = mclock_scheduler
  fmt_desc: The maximum sequential bandwidth in bytes/second to consider for the
            OSD (for rotational media)
  default: 150_M
  flags:
  - runtime
- name: osd_mclock_max_capacity_iops_hdd
  type: float
  level: basic
  desc: Max random write IOPS capacity (at 4KiB block size) to consider per OSD (for rotational media)
  long_desc: This option specifies the max OSD random write IOPS capacity per
             OSD. Contributes in QoS calculations when enabling a dmclock profile. Only
             considered for osd_op_queue = mclock_scheduler
  fmt_desc: Max random write IOPS capacity (at 4 KiB block size) to consider per
            OSD (for rotational media)
  default: 315
  flags:
  - runtime

https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/

  else if (prefix == "bench") {
    // default count 1G, size 4MB
    int64_t count = cmd_getval_or<int64_t>(cmdmap, "count", 1LL << 30);
    int64_t bsize = cmd_getval_or<int64_t>(cmdmap, "size", 4LL << 20);
    int64_t osize = cmd_getval_or<int64_t>(cmdmap, "object_size", 0);
    int64_t onum = cmd_getval_or<int64_t>(cmdmap, "object_num", 0);
    double elapsed = 0.0;

    ret = run_osd_bench_test(count, bsize, osize, onum, &elapsed, ss);
    if (ret != 0) {
      goto out;
    }

    double rate = count / elapsed;
    double iops = rate / bsize;
    f->open_object_section("osd_bench_results");
    f->dump_int("bytes_written", count);
    f->dump_int("blocksize", bsize);
    f->dump_float("elapsed_sec", elapsed);
    f->dump_float("bytes_per_sec", rate);
    f->dump_float("iops", iops);
    f->close_section();
  }

int OSD::run_osd_bench_test(
  int64_t count,
  int64_t bsize,
  int64_t osize,
  int64_t onum,
  double *elapsed,
  ostream &ss)
{
  int ret = 0;
  ... ...
  if (osize && onum) {
    bufferlist bl;
    bufferptr bp(osize);
    memset(bp.c_str(), 'a', bp.length());
    bl.push_back(std::move(bp));
    bl.rebuild_page_aligned();
    for (int i=0; i<onum; ++i) {
      char nm[30];
      snprintf(nm, sizeof(nm), "disk_bw_test_%d", i);
      object_t oid(nm);
      hobject_t soid(sobject_t(oid, 0));
      ObjectStore::Transaction t;
      t.write(coll_t(), ghobject_t(soid), 0, osize, bl);
      store->queue_transaction(service.meta_ch, std::move(t), nullptr);
      cleanupt.remove(coll_t(), ghobject_t(soid));
    }
  }
  ... ...

  bufferlist bl;
  utime_t start = ceph_clock_now();
  for (int64_t pos = 0; pos < count; pos += bsize) {
    char nm[34];
    unsigned offset = 0;
    bufferptr bp(bsize);
    memset(bp.c_str(), rand() & 0xff, bp.length());
    bl.push_back(std::move(bp));
    bl.rebuild_page_aligned();
    if (onum && osize) {
      snprintf(nm, sizeof(nm), "disk_bw_test_%d", (int)(rand() % onum));
      offset = rand() % (osize / bsize) * bsize;
    } else {
      snprintf(nm, sizeof(nm), "disk_bw_test_%lld", (long long)pos);
    }
    object_t oid(nm);
    hobject_t soid(sobject_t(oid, 0));
    ObjectStore::Transaction t;
    t.write(coll_t::meta(), ghobject_t(soid), offset, bsize, bl);
    store->queue_transaction(service.meta_ch, std::move(t), nullptr);
    if (!onum || !osize) {
      cleanupt.remove(coll_t::meta(), ghobject_t(soid));
    }
    bl.clear();
  }

  {
    C_SaferCond waiter;
    if (!service.meta_ch->flush_commit(&waiter)) {
      waiter.wait();
    }
  }
  utime_t end = ceph_clock_now();
  *elapsed = end - start;
 ... ...
 return ret;
}

Actions #6

Updated by jianwei zhang 8 months ago

Actions #7

Updated by jianwei zhang 8 months ago

Actions #9

Updated by jianwei zhang 8 months ago

osd/scheduler/mClockScheduler: Use same profile and client ids for all clients to ensure allocated QoS limit consumption.
https://github.com/ceph/ceph/pull/52809

Hi sseshasa,
How did you test the OSD IOPS and bandwidth capacity?

Actions #10

Updated by jianwei zhang 8 months ago

I have pushed a modified version, please review.
The core idea of the cost calculation is to take the larger of the item_cost and osd_bandwidth_cost_per_io (a sketch of the idea follows the link):
https://github.com/ceph/ceph/pull/53417
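
A minimal sketch of that idea (assumed for illustration; not necessarily the exact code in the PR), where the per-io base cost acts as a floor instead of an unconditional add-on:

#include <algorithm>
#include <cstdint>

// Hypothetical variant of calc_scaled_cost(): charge max(item_cost, cost_per_io)
// rather than item_cost + cost_per_io.
uint32_t calc_scaled_cost_sketch(int item_cost, double osd_bandwidth_cost_per_io)
{
  auto cost = static_cast<uint32_t>(
    std::max<int>(1, item_cost));                 // ensure cost is non-zero and positive
  auto cost_per_io = static_cast<uint32_t>(osd_bandwidth_cost_per_io);

  // Small IOs are charged the per-io base cost; large IOs are charged their own size.
  return std::max(cost, cost_per_io);
}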

Actions #11

Updated by Neha Ojha 8 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #12

Updated by Sridhar Seshasayee 8 months ago

Responses to your questions.

Q: how to test and get osd_mclock_max_sequential_bandwidth_hdd and osd_mclock_max_capacity_iops_hdd ?

osd_mclock_max_capacity_iops_hdd is determined during OSD boot up by running an OSD bench test using
4 KiB writes. Although we write to random offsets within an object, the results do vary sometimes.
This could be due to drive specific settings and/or optimizations. Due to these deviations,
osd_mclock_iops_capacity_threshold_hdd was introduced to fallback to saner settings. These options
are configurable. If the default settings do not accurately represent the capability of the device, then
it's recommended to run benchmark tests using other tools (fio for e.g.) and then set the OSD IOPS
capacity. We do log cluster warnings if the threshold values are exceeded so that further steps can be
taken by the user. At this point, the OSD bench is the only tool we can run during OSD boot-up until
another alternative can be identified.

For osd_mclock_max_sequential_bandwidth_hdd (default: 150 MiB/s), the thought is that this is a
reasonable generic setting to use. We currently do not measure this. But this can be changed to
reflect the actual capability of the device by measuring using Fio or other tools.

Q: in https://github.com/ceph/ceph/pull/52809 How did you test OSD IOPS and bandwidth Capacity?

In our test environment, the OSD bench reported IOPS at 4 KiB randw is close to the actual
capability of the device (~375 IOPS). Tests with Fio too reported close to the IOPS value shown
in the graph. For the test, the custom mClock profile was enabled and
osd_mclock_scheduler_client_lim was set to 30% of the OSD's IOPS capacity. With these settings,
5 Rados Bench instances were started and the graph shows the average IOPS reported by each Rados
Bench instance.

Thoughts About Your Proposed Fix

The osd_bandwidth_cost_per_io is currently calculated using the IOPS capacity at 4 KiB
block size. This represents the base cost per IO. For progressively larger IO sizes, the
idea is that the cost should be increased appropriately. This is the reason for adding
the item cost to the base cost_per_io parameter in calc_scaled_cost(). But this approach
as you have noted results in lower than expected IOPS for an item whose cost is lower
than the cost_per_io parameter.

Therefore, your proposed fix to pass only the cost_per_io in the tag calculation and
passing the item cost only if it's greater than the cost_per_io seems good to me.

However, I would also like to hear thoughts from Sam Just on this proposed change.

Actions #13

Updated by jianwei zhang 8 months ago

Sridhar Seshasayee wrote:

Responses to your questions.

Q: how to test and get osd_mclock_max_sequential_bandwidth_hdd and osd_mclock_max_capacity_iops_hdd ?

osd_mclock_max_capacity_iops_hdd is determined during OSD boot up by running an OSD bench test using
4 KiB writes. Although we write to random offsets within an object, the results do vary sometimes.
This could be due to drive specific settings and/or optimizations. Due to these deviations,
osd_mclock_iops_capacity_threshold_hdd was introduced to fallback to saner settings. These options
are configurable. If the default settings do not accurately represent the capability of the device, then
it's recommended to run benchmark tests using other tools (fio for e.g.) and then set the OSD IOPS
capacity. We do log cluster warnings if the threshold values are exceeded so that further steps can be
taken by the user. At this point, the OSD bench is the only tool we can run during OSD boot-up until
another alternative can be identified.

For osd_mclock_max_sequential_bandwidth_hdd (default: 150 MiB/s), the thought is that this is a
reasonable generic setting to use. We currently do not measure this. But this can be changed to
reflect the actual capability of the device by measuring using Fio or other tools.

Q: in https://github.com/ceph/ceph/pull/52809 How did you test OSD IOPS and bandwidth Capacity?

In our test environment, the OSD bench reported IOPS at 4 KiB randw is close to the actual
capability of the device (~375 IOPS). Tests with Fio too reported close to the IOPS value shown
in the graph. For the test, the custom mClock profile was enabled and
osd_mclock_scheduler_client_lim was set to 30% of the OSD's IOPS capacity. With these settings,
5 Rados Bench instances were started and the graph shows the average IOPS reported by each Rados
Bench instance.

Thoughts About Your Proposed Fix

The osd_bandwidth_cost_per_io is currently calculated using the IOPS capacity at 4 KiB
block size. This represents the base cost per IO. For progressively larger IO sizes, the
idea is that the cost should be increased appropriately. This is the reason for adding
the item cost to the base cost_per_io parameter in calc_scaled_cost(). But this approach
as you have noted results in lower than expected IOPS for an item whose cost is lower
than the cost_per_io parameter.

Therefore, your proposed fix to pass only the cost_per_io in the tag calculation and
passing the item cost only if it's greater than the cost_per_io seems good to me.

However, I would also like to hear thoughts from Sam Just on this proposed change.

Thanks for your reply.

The difference between the fio and osd bench results is still too big:

fio (buffer=200KB, direct=1, libaio) can achieve 200MB/s of bandwidth,
while osd bench reaches at most 100MB/s.

Actions #14

Updated by jianwei zhang 8 months ago

We plan to use osd bench or rados bench to test the OSD's bandwidth,
and fio to test the random IOPS of the HDD.

Actions #15

Updated by jianwei zhang 8 months ago

hi Sridhar Seshasayee,

I am still quite confused about the cost calculation and the latency implied by the tags.

For HDD,

The problem scenario is as follows:
Preconditions:
bandwidth=100MiB/s
IOPS = 100

1. IO buffer size = 1MiB
2. cost = 1 / 100 = 0.01s = 10ms
These 10 ms are a reference value for the cost of transmission on disk
3. This IO will wait in the mclock queue for 10 ms and then be scheduled
4. This IO actually takes about 10 ms to execute on the disk.

Confused points:
In fact, the IO delay took a total of 20 ms (mclock queue wait + run on disk),
IO latency almost doubled
If we want to limit recovery_limit to 100 MB/s, what we actually get may be 50MB/s, which is lower than expected
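
A minimal sketch reproducing the arithmetic behind this concern (assuming one IO in flight at a time):

#include <cstdio>

int main() {
  const double cost_ms  = 1.0 / 100.0 * 1000.0;  // 1 MiB at 100 IOPS / 100 MiB/s -> 10 ms mclock spacing
  const double disk_ms  = 10.0;                  // actual time the 1 MiB write spends on the disk
  const double total_ms = cost_ms + disk_ms;     // ~20 ms end-to-end per IO

  const double effective_mib_s = 1.0 / (total_ms / 1000.0);  // ~50 MiB/s instead of 100 MiB/s

  std::printf("queue wait %.0f ms + disk %.0f ms = %.0f ms -> ~%.0f MiB/s\n",
              cost_ms, disk_ms, total_ms, effective_mib_s);
  return 0;
}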

mClock's underlying assumption:
IOs can be scheduled fairly and evenly along the timeline.
For example, at 100 IOPS it is expected to dispatch one IO every 10 ms.
If the hardware is CPU, memory, and NVMe, all high-speed devices, the execution time of an IO on them can be ignored, so the IO latency is essentially the mClock queueing delay.

Back to low-speed devices such as HDDs:
The time tags are based on cost, and the cost of an IO is derived from bandwidth/IOPS.
Since the device is slow, the time an IO actually spends executing on the disk cannot be ignored.

How do you think about this?

Actions #16

Updated by Sridhar Seshasayee 8 months ago

jianwei zhang wrote:

For HDD,

The problem scenario is as follows:
Preconditions:
bandwidth=100MiB/s
IOPS = 100

1. IO buffer size = 1MiB
2. cost = 1 / 100 = 0.01s = 10ms
These 10 ms are a reference value for the cost of transmission on disk
3. This IO will wait in the mclock queue for 10 ms and then be scheduled
4. This IO actually takes about 10 ms to execute on the disk.

Confused points:
In fact, the IO delay took a total of 20 ms (mclock queue wait + run on disk),
IO latency almost doubled
If we want to limit recovery_limit to 100 MB/s, what we actually get may be 50MB/s, which is lower than expected

In addition to the op queue which is managed by mClock, items get transferred to the
operation sequencer at the objectstore layer. Once mClock dequeues an item from the
op queue, it no longer has control. The time an op spends in the operation sequencer
must also be factored in the latency calculation.

The above is also mentioned in this section:
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#caveats

A subset of the options that influence the items in the operation sequencer are:
bluestore_throttle_bytes and bluestore_throttle_deferred_bytes.

To figure out if the above is contributing to the latency, the options may be
tuned to ensure items spend as little time as possible in the operation sequencer.
One way to tune this is mentioned here:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#benchmarking-test-steps-using-osd-bench

You can use the IO tool of your choice with block size 1 MiB. The idea is that
for each iteration you set the bluestore throttle options and measure the
throughput and compare it with the baseline (measured with default bluestore throttles).
The throttle values are incremented in each iteration until the throughput matches
the baseline. At this point the throttles can be considered as optimal.

The steps can be easily automated and the optimal bluestore throttle options determined.
The idea with the above exercise is to figure out if operation sequencer is the cause of
the additional latency you are observing.

For HDDs, I expect the throttle values to be on the higher side.

How do you think about this?

Let me investigate this a bit from my side as well and get back to you.

Actions #17

Updated by jianwei zhang 8 months ago

Sridhar Seshasayee wrote:

jianwei zhang wrote:

For HDD,

The problem scenario is as follows:
Preconditions:
bandwidth=100MiB/s
IOPS = 100

1. IO buffer size = 1MiB
2. cost = 1 / 100 = 0.01s = 10ms
These 10 ms are a reference value for the cost of transmission on disk
3. This IO will wait in the mclock queue for 10 ms and then be scheduled
4. This IO actually takes about 10 ms to execute on the disk.

Confused points:
In fact, the IO delay took a total of 20 ms (mclock queue wait + run on disk),
IO latency almost doubled
If we want to limit recovery_limit to 100 MB/s, what we actually get may be 50MB/s, which is lower than expected

In addition to the op queue which is managed by mClock, items get transferred to the
operation sequencer at the objectstore layer. Once mClock dequeues an item from the
op queue, it no longer has control. The time an op spends in the operation sequencer
must also be factored in the latency calculation.

The above is also mentioned in this section:
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#caveats

A subset of the options that influence the items in the operation sequencer are:
bluestore_throttle_bytes and bluestore_throttle_deferred_bytes.

To figure out if the above is contributing to the latency, the options may be
tuned to ensure items spend as little time as possible in the operation sequencer.
One way to tune this is mentioned here:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#benchmarking-test-steps-using-osd-bench

You can use the IO tool of your choice with block size 1 MiB. The idea is that
for each iteration you set the bluestore throttle options and measure the
throughput and compare it with the baseline (measured with default bluestore throttles).
The throttle values are incremented in each iteration until the throughput matches
the baseline. At this point the throttles can be considered as optimal.

The steps can be easily automated and the optimal bluestore throttle options determined.
The idea with the above exercise is to figure out if operation sequencer is the cause of
the additional latency you are observing.

For HDDs, I expect the throttle values to be on the higher side.

How do you think about this?

Let me investigate this a bit from my side as well and get back to you.

Applied this patch: https://github.com/ceph/ceph/pull/53417
bluestore_throttle_bytes = 0
bluestore_throttle_deferred_bytes = 0

test-0 : osd bench
     * bsize=1M
     * IOPS=237
     * BW=237M
test-1 : client_limit = 1.0 / client_res = 0.5 / client_wgt = 60 / iops=200 / bw=200M 
     * bsize = 1M 
     * BW = 160M / 200M = 80% 
test-2 : client_lim = 0 / client_res = 0.5 / client_wgt = 60 / iops=200 / bw=200M 
     * bsize = 1M 
     * 235M / 200M = 117.5%
test-3 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M 
     * bsize = 1M  
     * 190M / 240M = 79%
test-4 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M  
     * bsize = 8K
     * BW = 1.6M
     * IOPS = 1.6M * 1024 / 8K = 204 IOPS ==> 204 / 240 = 85%
test-5 :  client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=100 / bw=100M 
     * bsize = 1M 
     * BW = 84M / 100M = 84%

test-0 : osd bench ==> bsize=1M, IOPS=237 BW=237M

osd_bench_duration = 300
osd_bench_large_size_max_throughput = 104857600
osd_bench_max_block_size = 67108864
osd_bench_small_size_max_iops = 100

# ceph tell osd.0 cache drop

# ceph tell osd.0 bench 10737418240 1048576 1048576 10240
{
    "bytes_written": 10737418240,
    "blocksize": 1048576,
    "elapsed_sec": 43.103751181,
    "bytes_per_sec": 249106352.59821704,
    "iops": 237.56633052655891
}

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 09:44:43 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  904.00     0.00     2.12     4.81     0.10    0.11    0.00    0.11   0.11  10.00
sdf               0.00  2667.00    0.00  908.00     0.00   227.00   512.00   140.87  154.84    0.00  154.84   1.10 100.00

09/21/2023 09:44:44 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  949.00     0.00     2.23     4.82     0.09    0.10    0.00    0.10   0.10   9.30
sdf               0.00  2913.00    0.00  948.00     0.00   237.00   512.00   142.02  149.23    0.00  149.23   1.05 100.00

09/21/2023 09:44:45 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  968.00     0.00     2.27     4.80     0.10    0.10    0.00    0.10   0.10   9.80
sdf               0.00  2904.00    0.00  964.00     0.00   241.00   512.00   143.00  148.77    0.00  148.77   1.04 100.00

09/21/2023 09:44:46 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  952.00     0.00     2.24     4.82     0.10    0.11    0.00    0.11   0.11  10.50
sdf               0.00  2784.00    0.00  952.00     0.00   238.00   512.00   143.32  149.99    0.00  149.99   1.05 100.00

09/21/2023 09:44:47 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  900.00     0.00     2.12     4.82     0.09    0.10    0.00    0.10   0.10   8.60
sdf               0.00  2784.00    0.00  900.00     0.00   225.00   512.00   143.46  158.80    0.00  158.80   1.11 100.00

test-1: client_limit = 1.0 / client_res = 0.5 / client_wgt = 60 / iops=200 / bw=200M > bsize=1M > 160M / 200M = 80%

ceph cluster:
# ceph -s
  cluster:
    id:     0348ad4a-7f88-4cfe-b49f-b3bd80856b79
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 7m)
    mgr: x(active, since 6m)
    osd: 1 osds: 1 up (since 6m), 1 in (since 6h)
         flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub

  data:
    pools:   1 pools, 128 pgs
    objects: 7.76k objects, 7.6 GiB
    usage:   186 GiB used, 9.1 TiB / 9.3 TiB avail
    pgs:     128 active+clean

# ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA    OMAP     META    AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME         
-1         9.27039         -  9.3 TiB  189 GiB  10 GiB    1 KiB  52 MiB  9.1 TiB  1.99  1.00    -          root default      
-3         9.27039         -  9.3 TiB  189 GiB  10 GiB    1 KiB  52 MiB  9.1 TiB  1.99  1.00    -              host zjw-q-dev
 0    hdd  9.27039   1.00000  9.3 TiB  189 GiB  10 GiB    1 KiB  52 MiB  9.1 TiB  1.99  1.00  128      up          osd.0     
                       TOTAL  9.3 TiB  189 GiB  10 GiB  1.1 KiB  52 MiB  9.1 TiB  1.99                                       
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    9.3 TiB  9.1 TiB  189 GiB   189 GiB       1.99
TOTAL  9.3 TiB  9.1 TiB  189 GiB   189 GiB       1.99

--- POOLS ---
POOL       ID  PGS  STORED  OBJECTS    USED  %USED  MAX AVAIL
test-pool   1  128  10 GiB   10.22k  10 GiB   0.11    8.6 TiB

# ceph osd pool ls detail
pool 1 'test-pool' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 26 flags hashpspool stripe_width 0 application rgw

# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default" 
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "osd" 
            },
            {
                "op": "emit" 
            }
        ]
    }
]

Preconditions:
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
    "bluestore_throttle_bytes": "0",                          //unlimited
    "bluestore_throttle_cost_per_io": "0",
    "bluestore_throttle_cost_per_io_hdd": "670000",
    "bluestore_throttle_cost_per_io_ssd": "4000",
    "bluestore_throttle_deferred_bytes": "0",                 //unlimited
    "bluestore_throttle_trace_rate": "0.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "200.000000",         //200 IOPS
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "209715200",   //200_M
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.200000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.300000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "1.000000",            //100% --> 200_M
    "osd_mclock_scheduler_client_res": "0.500000",
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true" 

# cat test-bench-write-1M.sh 
> writelog

for i in {1..1} ; do
    name1=$(echo $RANDOM)
    name2=$(echo $RANDOM)
    echo "test-bench-write-$name1-$name2" 
    nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool  -b 1048576 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done

# rados bench output. ==> 160M / 200M = 80% 
2023-09-21T22:02:59.047725+0800 min lat: 0.00659005 max lat: 0.340699 avg lat: 0.0613749 lat p50: 0.0499426 lat p90: 0.131342 lat p99: 0.226897 lat p999: 0.249803 lat p100: 0.340699
2023-09-21T22:02:59.047725+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T22:02:59.047725+0800    40      10      6514      6504   162.571       156   0.0380342   0.0613749
2023-09-21T22:03:00.047956+0800    41      10      6682      6672   162.703       168   0.0173561   0.0613827
2023-09-21T22:03:01.048102+0800    42      10      6850      6840   162.828       168   0.0370583   0.0613393
2023-09-21T22:03:02.048281+0800    43      10      7027      7017   163.157       177   0.0119378   0.0612306
2023-09-21T22:03:03.048453+0800    44      10      7187      7177   163.085       160   0.0123712   0.0612312
2023-09-21T22:03:04.048619+0800    45      10      7354      7344   163.171       167    0.131077   0.0612135
2023-09-21T22:03:05.048763+0800    46      10      7514      7504   163.102       160   0.0284153   0.0612553
2023-09-21T22:03:06.048931+0800    47      10      7640      7630   162.312       126   0.0126964   0.0614655
2023-09-21T22:03:07.049076+0800    48      10      7786      7776   161.972       146    0.124256   0.0616731
2023-09-21T22:03:08.049236+0800    49      10      7953      7943   162.074       167   0.0389563   0.0616579
2023-09-21T22:03:09.049372+0800    50      10      8102      8092   161.812       149   0.0578777   0.0617027
2023-09-21T22:03:10.049539+0800    51      10      8267      8257   161.874       165   0.0878024   0.0617294
2023-09-21T22:03:11.049708+0800    52      10      8436      8426    162.01       169   0.0568071   0.0616733
2023-09-21T22:03:12.049884+0800    53      10      8598      8588    162.01       162   0.0551456   0.0616418
2023-09-21T22:03:13.050060+0800    54      10      8752      8742   161.861       154   0.0323046   0.0617172
2023-09-21T22:03:14.050238+0800    55      10      8909      8899   161.772       157   0.0292404   0.0617236
2023-09-21T22:03:15.050418+0800    56      10      9070      9060   161.758       161   0.0476941   0.0617766
2023-09-21T22:03:16.050595+0800    57      10      9223      9213   161.603       153    0.072164    0.061812
2023-09-21T22:03:17.050767+0800    58      10      9390      9380   161.696       167    0.141416   0.0617799
2023-09-21T22:03:18.050900+0800    59      10      9552      9542   161.701       162   0.0166232   0.0617668

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:02:55 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  656.00     0.00     1.54     4.79     0.07    0.11    0.00    0.11   0.11   6.90
sdf               0.00  2064.00    0.00  685.00     0.00   171.25   512.00    13.54   19.60    0.00   19.60   1.40  96.20

09/21/2023 10:02:56 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  652.00     0.00     1.51     4.74     0.06    0.10    0.00    0.10   0.10   6.20
sdf               0.00  1956.00    0.00  655.00     0.00   163.75   512.00     7.11   11.07    0.00   11.07   1.42  93.20

09/21/2023 10:02:57 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  673.00     0.00     1.56     4.75     0.07    0.11    0.00    0.11   0.11   7.50
sdf               0.00  2040.00    0.00  678.00     0.00   169.50   512.00     7.72   11.36    0.00   11.36   1.40  94.70

09/21/2023 10:02:58 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  700.00     0.00     1.64     4.81     0.08    0.11    0.00    0.11   0.11   7.70
sdf               0.00  2088.00    0.00  703.00     0.00   175.75   512.00    10.00   14.28    0.00   14.28   1.37  96.30

test-2: client_lim = 0 / client_res = 0.5 / client_wgt = 60 / iops=200 / bw=200M > bsize=1M > 235M / 200M = 117.5%

# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
    "bluestore_throttle_bytes": "0",
    "bluestore_throttle_cost_per_io": "0",
    "bluestore_throttle_cost_per_io_hdd": "670000",
    "bluestore_throttle_cost_per_io_ssd": "4000",
    "bluestore_throttle_deferred_bytes": "0",
    "bluestore_throttle_trace_rate": "0.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "200.000000",           // 200 iops
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "209715200",     // 200_M
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.200000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.300000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "0.000000",              //0 unlimited
    "osd_mclock_scheduler_client_res": "0.500000",
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true",

# cat test-bench-write-1M.sh 
> writelog

for i in {1..1} ; do
    name1=$(echo $RANDOM)
    name2=$(echo $RANDOM)
    echo "test-bench-write-$name1-$name2" 
    nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool  -b 1048576 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done

# rados bench output. ==> 235M / 200M = 117.5%
2023-09-21T22:00:09.629327+0800 min lat: 0.0080338 max lat: 1.44649 avg lat: 0.0423585 lat p50: 0.0406062 lat p90: 0.0475816 lat p99: 0.0622597 lat p999: 1.0747 lat p100: 1.44649
2023-09-21T22:00:09.629327+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T22:00:09.629327+0800    20      10      4726      4716   235.758       225   0.0437467   0.0423585
2023-09-21T22:00:10.629573+0800    21      10      4954      4944   235.386       228   0.0422658   0.0424362
2023-09-21T22:00:11.629745+0800    22      10      5196      5186   235.685       242   0.0122676   0.0416933
2023-09-21T22:00:12.629926+0800    23      10      5433      5423    235.74       237   0.0409134   0.0423776
2023-09-21T22:00:13.630107+0800    24      10      5660      5650   235.374       227    0.042527   0.0424412
2023-09-21T22:00:14.630281+0800    25      10      5893      5883   235.278       233   0.0415217   0.0424618
2023-09-21T22:00:15.630466+0800    26      10      6139      6129   235.688       246   0.0416429   0.0423908
2023-09-21T22:00:16.630638+0800    27      10      6383      6373   235.995       244   0.0302444   0.0423147
2023-09-21T22:00:17.630801+0800    28      10      6616      6606   235.886       233   0.0264553   0.0423603
2023-09-21T22:00:18.630967+0800    29      10      6842      6832   235.544       226   0.0239288    0.042045
2023-09-21T22:00:19.631136+0800    30      10      7079      7069   235.591       237   0.0171055     0.04134
2023-09-21T22:00:20.631330+0800    31      10      7320      7310   235.764       241    0.042195   0.0423773
2023-09-21T22:00:21.631554+0800    32      10      7558      7548   235.833       238   0.0437976   0.0423641
2023-09-21T22:00:22.631735+0800    33      10      7773      7763     235.2       215    0.041803    0.042481
2023-09-21T22:00:23.631900+0800    34      10      8014      8004    235.37       241   0.0424048   0.0424514
2023-09-21T22:00:24.632100+0800    35      10      8244      8234   235.215       230   0.0440427    0.042478
2023-09-21T22:00:25.632269+0800    36      10      8485      8475   235.374       241   0.0495573   0.0424507
2023-09-21T22:00:26.632443+0800    37      10      8717      8707   235.282       232   0.0474286   0.0424687
2023-09-21T22:00:27.632604+0800    38      10      8949      8939   235.195       232    0.041791   0.0424856
2023-09-21T22:00:28.632757+0800    39      10      9194      9184   235.445       245   0.0405403   0.0424448

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:00:06 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    1.00  972.00     0.01     2.27     4.79     0.09    0.10    0.00    0.10   0.10   9.40
sdf               0.00  2904.00    0.00  972.00     0.00   243.00   512.00    34.37   35.43    0.00   35.43   1.03 100.10

09/21/2023 10:00:07 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  988.00     0.00     2.30     4.77     0.10    0.10    0.00    0.10   0.10   9.80
sdf               0.00  2964.00    0.00  988.00     0.00   247.00   512.00    34.42   34.79    0.00   34.79   1.01 100.00

09/21/2023 10:00:08 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  952.00     0.00     2.23     4.79     0.10    0.11    0.00    0.11   0.11  10.00
sdf               0.00  2868.00    0.00  956.00     0.00   239.00   512.00    34.60   36.30    0.00   36.30   1.04  99.90

09/21/2023 10:00:09 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  904.00     0.00     2.11     4.78     0.09    0.10    0.00    0.10   0.10   9.20
sdf               0.00  2712.00    0.00  900.00     0.00   225.00   512.00    34.99   38.67    0.00   38.67   1.11 100.10

09/21/2023 10:00:10 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  893.00     0.00     2.08     4.77     0.09    0.10    0.00    0.10   0.10   8.90
sdf               0.00  2712.00    0.00  908.00     0.00   227.00   512.00    33.09   36.49    0.00   36.49   1.10 100.00

test-3 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M > bsize=1M > 190M / 240M = 79%


# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
    "bluestore_throttle_bytes": "0",                         //unlimited
    "bluestore_throttle_cost_per_io": "0", 
    "bluestore_throttle_cost_per_io_hdd": "670000",
    "bluestore_throttle_cost_per_io_ssd": "4000",
    "bluestore_throttle_deferred_bytes": "0",                //unlimited
    "bluestore_throttle_trace_rate": "0.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "240.000000",        //240 IOPS
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "251658240",  //240_M
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.200000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.300000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "1.000000",
    "osd_mclock_scheduler_client_res": "0.500000",
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true",

# rados bench output --> 190M / 240M = 79%
2023-09-21T22:16:55.688680+0800 min lat: 0.00692427 max lat: 0.280151 avg lat: 0.0521959 lat p50: 0.0444432 lat p90: 0.102672 lat p99: 0.157707 lat p999: 0.243541 lat p100: 0.280151
2023-09-21T22:16:55.688680+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T22:16:55.688680+0800    40      10      7670      7660   191.468       193   0.0258948   0.0521959
2023-09-21T22:16:56.688877+0800    41      10      7848      7838   191.139       178   0.0298819   0.0522755
2023-09-21T22:16:57.689070+0800    42      10      8045      8035   191.277       197   0.0209257   0.0522216
2023-09-21T22:16:58.689211+0800    43      10      8238      8228   191.317       193    0.107852   0.0522338
2023-09-21T22:16:59.689373+0800    44      10      8429      8419   191.309       191    0.103653   0.0522236
2023-09-21T22:17:00.689522+0800    45      10      8615      8605    191.19       186   0.0454725   0.0522409
2023-09-21T22:17:01.689676+0800    46      10      8817      8807   191.425       202   0.0469294   0.0521794
2023-09-21T22:17:02.689815+0800    47      10      8999      8989   191.224       182   0.0383939   0.0522514
2023-09-21T22:17:03.689984+0800    48      10      9198      9188   191.385       199   0.0326067   0.0522025
2023-09-21T22:17:04.690152+0800    49      10      9395      9385   191.499       197   0.0737887   0.0521808
2023-09-21T22:17:05.690299+0800    50      10      9595      9585   191.668       200   0.0132583   0.0521397
2023-09-21T22:17:06.690476+0800    51      10      9792      9782   191.772       197   0.0432396   0.0521141
2023-09-21T22:17:07.690624+0800    52      10      9964      9954   191.391       172   0.0432432   0.0522076
2023-09-21T22:17:08.690746+0800    53      10     10158     10148    191.44       194   0.0164307   0.0521984
2023-09-21T22:17:09.690859+0800    54      10     10364     10354   191.709       206  0.00817016   0.0521235
2023-09-21T22:17:10.690969+0800    55      10     10568     10558   191.932       204   0.0361452   0.0520669
2023-09-21T22:17:11.691108+0800    56      10     10742     10732   191.612       174   0.0464498   0.0521503
2023-09-21T22:17:12.691265+0800    57      10     10931     10921   191.565       189   0.0620719   0.0521665
2023-09-21T22:17:13.691406+0800    58      10     11115     11105   191.435       184   0.0567793   0.0522109
2023-09-21T22:17:14.691570+0800    59      10     11310     11300   191.494       195   0.0529321   0.0521737

# cat test-bench-write-1M.sh 
> writelog

for i in {1..1} ; do
    name1=$(echo $RANDOM)
    name2=$(echo $RANDOM)
    echo "test-bench-write-$name1-$name2" 
    nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool  -b 1048576 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done

# iostat -xmt 1 -d /dev/sdf /dev/sdb 
09/21/2023 10:17:03 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  800.00     0.00     1.87     4.79     0.08    0.11    0.00    0.11   0.10   8.40
sdf               0.00  2364.00    0.00  804.00     0.00   201.00   512.00    12.27   15.72    0.00   15.72   1.22  98.40

09/21/2023 10:17:04 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  797.00     0.00     1.86     4.78     0.08    0.10    0.00    0.10   0.09   7.50
sdf               0.00  2364.00    0.00  793.00     0.00   198.25   512.00     9.10   11.55    0.00   11.55   1.22  96.80

09/21/2023 10:17:05 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  772.00     0.00     1.79     4.76     0.08    0.10    0.00    0.10   0.10   7.70
sdf               0.00  2376.00    0.00  779.00     0.00   194.75   512.00    13.60   17.24    0.00   17.24   1.26  98.50

09/21/2023 10:17:06 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  792.00     0.00     1.84     4.77     0.08    0.10    0.00    0.10   0.10   7.70
sdf               0.00  2424.00    0.00  809.00     0.00   202.25   512.00    13.76   17.00    0.00   17.00   1.21  98.20

test-4 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M > bsize=8K> 1.6M * 1024 / 8K = 204 IOPS ==> 204 / 240 = 85%

# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
    "bluestore_throttle_bytes": "0",                         //unlimited
    "bluestore_throttle_cost_per_io": "0",
    "bluestore_throttle_cost_per_io_hdd": "670000",
    "bluestore_throttle_cost_per_io_ssd": "4000",
    "bluestore_throttle_deferred_bytes": "0",                //unlimited
    "bluestore_throttle_trace_rate": "0.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000", 
    "osd_mclock_max_capacity_iops_hdd": "240.000000",         //240 IOPS
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "251658240",   //240M
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.200000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.300000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "1.000000",
    "osd_mclock_scheduler_client_res": "0.500000",
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true",

# cat test-bench-write-8K.sh 
> writelog

for i in {1..1} ; do
    name1=$(echo $RANDOM)
    name2=$(echo $RANDOM)
    echo "test-bench-write-$name1-$name2" 
    nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool -b 8192 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done

# rados bench output ==> 1.6M * 1024 / 8K = 204 IOPS ==> 204 / 240 = 85%
2023-09-21T22:22:22.328084+0800 min lat: 0.00096772 max lat: 0.205785 avg lat: 0.048328 lat p50: 0.0399051 lat p90: 0.106316 lat p99: 0.178427 lat p999: 0.205785 lat p100: 0.205785
2023-09-21T22:22:22.328084+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T22:22:22.328084+0800    60      10     12416     12406    1.6151   1.63281   0.0623918    0.048328
2023-09-21T22:22:23.328281+0800    61      10     12616     12606   1.61423    1.5625    0.019324   0.0483596
2023-09-21T22:22:24.328438+0800    62      10     12817     12807   1.61352   1.57031    0.127102   0.0483708
2023-09-21T22:22:25.328597+0800    63      10     13008     12998   1.61159   1.49219   0.0204207   0.0484451
2023-09-21T22:22:26.328779+0800    64      10     13228     13218   1.61326   1.71875  0.00126544   0.0483676
2023-09-21T22:22:27.328941+0800    65      10     13446     13436   1.61464   1.70312  0.00127368   0.0483457
2023-09-21T22:22:28.329112+0800    66      10     13660     13650    1.6155   1.67188   0.0207621   0.0483041
2023-09-21T22:22:29.329278+0800    67      10     13866     13856   1.61541   1.60938  0.00123517   0.0483221
2023-09-21T22:22:30.329441+0800    68      10     14044     14034    1.6121   1.39062    0.041497   0.0484098
2023-09-21T22:22:31.329607+0800    69      10     14248     14238   1.61183   1.59375   0.0417745   0.0484319
2023-09-21T22:22:32.329779+0800    70      10     14439     14429   1.61012   1.49219   0.0218883   0.0484935
2023-09-21T22:22:33.329948+0800    71      10     14639     14629   1.60944    1.5625   0.0210867   0.0485107
2023-09-21T22:22:34.330113+0800    72      10     14856     14846   1.61063   1.69531   0.0415444    0.048472
2023-09-21T22:22:35.330280+0800    73      10     15057     15047   1.61007   1.57031   0.0871025   0.0484789
2023-09-21T22:22:36.330444+0800    74      10     15271     15261   1.61091   1.67188   0.0630531   0.0484612
2023-09-21T22:22:37.330584+0800    75      10     15476     15466   1.61078   1.60156   0.0637163   0.0484811
2023-09-21T22:22:38.330745+0800    76      10     15661     15651    1.6086   1.44531   0.0160527   0.0485229
2023-09-21T22:22:39.330905+0800    77      10     15861     15851     1.608    1.5625    0.105571   0.0485428
2023-09-21T22:22:40.331072+0800    78      10     16053     16043   1.60661       1.5   0.0014667   0.0485908
2023-09-21T22:22:41.331241+0800    79      10     16271     16261   1.60783   1.70312  0.00116457   0.0485708

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:22:14 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  852.00     0.00     3.63     8.73     0.09    0.11    0.00    0.11   0.11   9.40
sdf               0.00     0.00    0.00  216.00     0.00     1.69    16.00     0.08    0.37    0.00    0.37   0.24   5.10

09/21/2023 10:22:14 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  788.00     0.00     3.38     8.79     0.07    0.10    0.00    0.10   0.10   7.50
sdf               0.00     0.00    0.00  202.00     0.00     1.58    16.00     0.01    0.06    0.00    0.06   0.06   1.30

09/21/2023 10:22:15 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  801.00     0.00     3.51     8.98     0.08    0.10    0.00    0.10   0.10   8.00
sdf               0.00     0.00    0.00  214.00     0.00     1.67    16.00     0.01    0.07    0.00    0.07   0.07   1.50

test-5 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=100 / bw=100M, bsize=1M ==> 84M / 100M = 84%

# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
    "bluestore_throttle_bytes": "0",                         // unlimited
    "bluestore_throttle_cost_per_io": "0",
    "bluestore_throttle_cost_per_io_hdd": "670000",
    "bluestore_throttle_cost_per_io_ssd": "4000", 
    "bluestore_throttle_deferred_bytes": "0",                // unlimited
    "bluestore_throttle_trace_rate": "0.000000",  
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "100.000000",        // 100 iops
    "osd_mclock_max_capacity_iops_ssd": "21500.000000", 
    "osd_mclock_max_sequential_bandwidth_hdd": "104857600",  // 100_M
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.200000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.300000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "1.000000",
    "osd_mclock_scheduler_client_res": "0.500000",
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true",

# cat test-bench-write-1M.sh 
> writelog

for i in {1..1} ; do
    name1=$(echo $RANDOM)
    name2=$(echo $RANDOM)
    echo "test-bench-write-$name1-$name2" 
    nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool  -b 1048576 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done

# rados bench output ==> 84M / 100M = 84%
2023-09-21T22:11:07.590247+0800 min lat: 0.00665102 max lat: 0.506691 avg lat: 0.117194 lat p50: 0.0938236 lat p90: 0.263444 lat p99: 0.46757 lat p999: 0.506691 lat p100: 0.506691
2023-09-21T22:11:07.590247+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T22:11:07.590247+0800    60      10      5118      5108   85.1189        89   0.0576973    0.117194
2023-09-21T22:11:08.590451+0800    61      10      5185      5175   84.8217        67    0.041496    0.117657
2023-09-21T22:11:09.590623+0800    62      10      5260      5250   84.6631        75   0.0879328    0.117885
2023-09-21T22:11:10.590789+0800    63      10      5346      5336    84.684        86   0.0369602    0.117808
2023-09-21T22:11:11.590975+0800    64      10      5434      5424   84.7356        88    0.139532    0.117881
2023-09-21T22:11:12.591107+0800    65      10      5523      5513    84.801        89    0.055929    0.117786
2023-09-21T22:11:13.591272+0800    66      10      5618      5608   84.9553        95   0.0881378    0.117566
2023-09-21T22:11:14.591439+0800    67      10      5707      5697   85.0155        89    0.134108    0.117414
2023-09-21T22:11:15.591563+0800    68      10      5785      5775   84.9122        78    0.391141     0.11746
2023-09-21T22:11:16.591743+0800    69      10      5855      5845   84.6959        70   0.0336723     0.11791
2023-09-21T22:11:17.591912+0800    70      10      5945      5935   84.7714        90    0.164689    0.117806
2023-09-21T22:11:18.592058+0800    71      10      6031      6021   84.7885        86    0.204223    0.117755
2023-09-21T22:11:19.592226+0800    72      10      6111      6101   84.7218        80   0.0878715    0.117879
2023-09-21T22:11:20.592402+0800    73      10      6190      6180   84.6433        79   0.0121704    0.117972
2023-09-21T22:11:21.592573+0800    74      10      6270      6260   84.5803        80   0.0120983    0.117998
2023-09-21T22:11:22.592716+0800    75      10      6356      6346   84.5991        86   0.0696619    0.117987
2023-09-21T22:11:23.592886+0800    76      10      6433      6423   84.4989        77    0.124536    0.118201
2023-09-21T22:11:24.593056+0800    77      10      6508      6498   84.3754        75    0.360682    0.118245
2023-09-21T22:11:25.593214+0800    78      10      6590      6580   84.3448        82     0.22629    0.118426
2023-09-21T22:11:26.593396+0800    79      10      6674      6664   84.3402        84   0.0130458    0.118399

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:10:59 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  356.00     0.00     0.83     4.79     0.03    0.10    0.00    0.10   0.10   3.40
sdf               0.00  1080.00    0.00  353.00     0.00    88.25   512.00     2.93    8.28    0.00    8.28   1.82  64.40

09/21/2023 10:11:00 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    1.00  352.00     0.01     0.82     4.83     0.04    0.11    1.00    0.11   0.11   4.00
sdf               0.00  1056.00    0.00  359.00     0.00    89.75   512.00     3.36    9.38    0.00    9.38   1.83  65.70

09/21/2023 10:11:01 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  276.00     0.00     0.64     4.78     0.03    0.09    0.00    0.09   0.09   2.60
sdf               0.00   828.00    0.00  273.00     0.00    68.25   512.00     2.00    7.37    0.00    7.37   2.07  56.50

09/21/2023 10:11:02 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  316.00     0.00     0.73     4.73     0.04    0.11    0.00    0.11   0.11   3.60
sdf               0.00   960.00    0.00  319.00     0.00    79.75   512.00     2.62    8.14    0.00    8.14   2.01  64.10

Actions #18

Updated by jianwei zhang 8 months ago

test-6 : client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M
  • bsize=200K
  • BW = 39M / 240M = 16.25%
# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
    "bluestore_throttle_bytes": "0",                            //unlimited
    "bluestore_throttle_cost_per_io": "0",
    "bluestore_throttle_cost_per_io_hdd": "670000", 
    "bluestore_throttle_cost_per_io_ssd": "4000",
    "bluestore_throttle_deferred_bytes": "0",                   //unlimited
    "bluestore_throttle_trace_rate": "0.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "240.000000",           //240IOPS
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "251658240",     //240M
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.200000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.300000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "1.000000",
    "osd_mclock_scheduler_client_res": "0.500000",
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true",

# cat test-bench-write-200K.sh 
> writelog

for i in {1..1} ; do
    name1=$(echo $RANDOM)
    name2=$(echo $RANDOM)
    echo "test-bench-write-$name1-$name2" 
    nohup rados -c ./ceph.conf bench 600 write --no-cleanup -t 10 -p test-pool -b 204800 --show-time --run-name "test-bench-write-$name1-$name2" >> writelog 2>&1 &
done

# rados bench output ==> 39 MB/s
2023-09-21T22:40:30.038698+0800 min lat: 0.00208069 max lat: 0.28836 avg lat: 0.0499285 lat p50: 0.0399047 lat p90: 0.10739 lat p99: 0.178679 lat p999: 0.246243 lat p100: 0.28836
2023-09-21T22:40:30.038698+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T22:40:30.038698+0800   100      10     20029     20019   39.0932   37.1094   0.0365631   0.0499285
2023-09-21T22:40:31.038925+0800   101      10     20222     20212   39.0792   37.6953   0.0972425   0.0499445
2023-09-21T22:40:32.039082+0800   102      10     20415     20405   39.0656   37.6953   0.0497988   0.0499663
2023-09-21T22:40:33.039240+0800   103      10     20617     20607   39.0693   39.4531    0.010729   0.0499684
2023-09-21T22:40:34.039408+0800   104      10     20808     20798   39.0523   37.3047   0.0417104   0.0499912
2023-09-21T22:40:35.039584+0800   105      10     21005     20995   39.0467   38.4766   0.0305145   0.0499838
2023-09-21T22:40:36.039737+0800   106      10     21193     21183   39.0247   36.7188   0.0196327   0.0500246
2023-09-21T22:40:37.039912+0800   107      10     21380     21370   39.0013   36.5234    0.164901   0.0500521
2023-09-21T22:40:38.040083+0800   108      10     21579     21569        39   38.8672    0.154728   0.0500468
2023-09-21T22:40:39.040256+0800   109      10     21760     21750   38.9665   35.3516    0.124795   0.0500964
2023-09-21T22:40:40.040425+0800   110      10     21949     21939   38.9478   36.9141   0.0408142   0.0501265
2023-09-21T22:40:41.040597+0800   111      10     22152     22142    38.954   39.6484   0.0419623   0.0501202
2023-09-21T22:40:42.040772+0800   112      10     22353     22343   38.9567   39.2578    0.009937   0.0501097
2023-09-21T22:40:43.040942+0800   113      10     22554     22544   38.9593   39.2578  0.00886829   0.0500964
2023-09-21T22:40:44.041090+0800   114      10     22759     22749   38.9687   40.0391   0.0634586    0.050102
2023-09-21T22:40:45.041214+0800   115      10     22961     22951   38.9728   39.4531   0.0403335   0.0500961
2023-09-21T22:40:46.041394+0800   116      10     23160     23150   38.9719   38.8672   0.0608101   0.0500976
2023-09-21T22:40:47.041570+0800   117      10     23371     23361    38.991   41.2109    0.137616   0.0500709
2023-09-21T22:40:48.041740+0800   118      10     23585     23575   39.0147   41.7969   0.0301397   0.0500334
2023-09-21T22:40:49.041916+0800   119      10     23790     23780   39.0232   40.0391   0.0495194    0.050023

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 10:42:01 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  840.00     0.00     1.92     4.69     0.09    0.11    0.00    0.11   0.11   8.90
sdf               0.00   627.00    0.00  210.00     0.00    41.02   400.00     1.32    6.26    0.00    6.26   3.53  74.10

09/21/2023 10:42:02 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  748.00     0.00     1.73     4.73     0.08    0.11    0.00    0.11   0.11   8.10
sdf               0.00   561.00    0.00  187.00     0.00    36.52   400.00     1.11    5.93    0.00    5.93   3.78  70.60

09/21/2023 10:42:03 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  832.00     0.00     1.91     4.71     0.07    0.09    0.00    0.09   0.09   7.10
sdf               0.00   624.00    0.00  208.00     0.00    40.62   400.00     1.07    5.16    0.00    5.16   2.76  57.50

Actions #19

Updated by jianwei zhang 8 months ago

cost = max(1M, 200K) = 1M      // osd_bandwidth_cost_per_io (240M / 240 = 1M) dominates the 200K item cost
bw = 240M

mclock_queue_delay = 1M / 240M/s = 0.0041 s

200K_on_disk_lat = (200 / 1024)M / 240M/s = 0.0008 s

all_lat = 0.0041 + 0.0008 = 0.0049 s

iops = 1 / 0.0049 = 204

bw = 204 * 200K = 204 * 200 / 1024 = 39 MB/s
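
A minimal sketch of this back-of-the-envelope model (my own illustration for this note, not code from the OSD; the max() approximation of the scaled cost and the variable names are assumptions taken from the reasoning above):

#include <algorithm>
#include <cstdio>

// Rough model of the 200K-write case: the scaled cost is dominated by
// osd_bandwidth_cost_per_io (approximated here with max()), and each request
// additionally pays its real transfer time on the disk.
int main() {
  const double bw_capacity   = 240.0 * 1024 * 1024;   // osd_mclock_max_sequential_bandwidth_hdd
  const double iops_capacity = 240.0;                  // osd_mclock_max_capacity_iops_hdd
  const double cost_per_io   = bw_capacity / iops_capacity;           // 1 MiB per IO

  const double io_size = 200.0 * 1024;                 // 200K rados bench writes

  const double scaled_cost        = std::max(cost_per_io, io_size);   // ~1 MiB
  const double mclock_queue_delay = scaled_cost / bw_capacity;        // ~0.0042 s
  const double on_disk_lat        = io_size / bw_capacity;            // ~0.0008 s

  const double iops = 1.0 / (mclock_queue_delay + on_disk_lat);       // ~201
  const double bw   = iops * io_size / (1024 * 1024);                 // ~39 MB/s

  std::printf("predicted: %.0f iops, %.1f MB/s\n", iops, bw);
  return 0;
}

Compiled and run, this prints roughly 201 iops / 39 MB/s, close to the ~204 IOPS and 39 MB/s observed in test-6.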

Actions #20

Updated by jianwei zhang 8 months ago

test-7: client_lim = 1 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M

* write : buffer=1M, bw=192MB/s, iops=192
* randread : buffer=1M, bw=108MB/s (45%), iops=108
* seqread : buffer=1M, bw=126MB/s (52.5%), iops=126
  * mclock_queue_delay = 1M / 240M/s = 0.0041 s
  * 1M_on_disk_lat = 1M / 240M/s = 0.0041 s
  * iops = 1 / (0.0041 * 2) = 121 ==> 121 * 1M = 121 MB/s

1. lat1 = queuing delay calculated by mclock based on bytes (the scaled cost)
2. lat2 = disk seek time + transfer time, which cannot be ignored
3. Even if limit = 1 (i.e. the full 240 MB/s of disk bandwidth may be used), there is still a loss in read and write bandwidth, especially read bandwidth; a rough estimate of this effect is sketched below.
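
The same two-latency estimate can be turned into a small helper and applied to both block sizes (again only a sketch mirroring the arithmetic in this comment; the function name and parameters are mine, not scheduler internals):

#include <algorithm>
#include <cstdio>

// lat1: mclock queuing delay derived from the scaled cost (>= 1 MiB here).
// lat2: the request's own transfer time on the disk.
// Returns the bandwidth (MB/s) implied by issuing requests back-to-back,
// each paying lat1 + lat2.
static double estimate_bw_mb(double io_bytes,
                             double bw_capacity_bytes,   // 240 MiB/s
                             double iops_capacity) {     // 240
  const double cost_per_io = bw_capacity_bytes / iops_capacity;            // 1 MiB
  const double lat1 = std::max(cost_per_io, io_bytes) / bw_capacity_bytes;
  const double lat2 = io_bytes / bw_capacity_bytes;
  return (1.0 / (lat1 + lat2)) * io_bytes / (1024 * 1024);
}

int main() {
  const double bw = 240.0 * 1024 * 1024;
  std::printf("1M  : %.0f MB/s\n", estimate_bw_mb(1048576.0, bw, 240.0)); // ~120, vs 126-127 MB/s observed for seq read
  std::printf("200K: %.0f MB/s\n", estimate_bw_mb(204800.0,  bw, 240.0)); // ~39, matching test-6
  return 0;
}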


# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
    "bluestore_throttle_bytes": "0",
    "bluestore_throttle_cost_per_io": "0",
    "bluestore_throttle_cost_per_io_hdd": "670000",
    "bluestore_throttle_cost_per_io_ssd": "4000",
    "bluestore_throttle_deferred_bytes": "0",
    "bluestore_throttle_trace_rate": "0.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "240.000000",
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "251658240",
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.200000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.300000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "1.000000",
    "osd_mclock_scheduler_client_res": "0.500000",
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true",

# cat test-bench-randread-1M.sh 
> readlog
> writelog

ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time > writelog 2>&1

ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 rand -t 10 -p test-pool --show-time > readlog 2>&1

ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 seq -t 10 -p test-pool --show-time > seq-readlog 2>&1

# rados -c ./ceph.conf bench 300 write --no-cleanup -t 10 -p test-pool -b 1048576
2023-09-21T23:17:32.018512+0800 min lat: 0.00677637 max lat: 0.345545 avg lat: 0.0520317 lat p50: 0.0441953 lat p90: 0.103738 lat p99: 0.158788 lat p999: 0.240164 lat p100: 0.345545
2023-09-21T23:17:32.018512+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T23:17:32.018512+0800   180      10     34594     34584   192.105       188    0.143394   0.0520317
2023-09-21T23:17:33.018704+0800   181      10     34778     34768    192.06       184    0.102056   0.0520474
2023-09-21T23:17:34.018820+0800   182      10     34966     34956   192.038       188   0.0288972   0.0520533
2023-09-21T23:17:35.018992+0800   183      10     35171     35161   192.108       205   0.0199566   0.0520353
2023-09-21T23:17:36.019171+0800   184      10     35349     35339   192.031       178   0.0525519   0.0520571
2023-09-21T23:17:37.019338+0800   185      10     35538     35528   192.015       189   0.0338643   0.0520639
2023-09-21T23:17:38.019502+0800   186      10     35740     35730   192.068       202   0.0236666   0.0520508
2023-09-21T23:17:39.019640+0800   187      10     35931     35921   192.062       191   0.0177018   0.0520489
2023-09-21T23:17:40.019789+0800   188      10     36123     36113   192.062       192   0.0333076   0.0520442
2023-09-21T23:17:41.019926+0800   189      10     36326     36316    192.12       203   0.0302826   0.0520373
2023-09-21T23:17:42.020063+0800   190      10     36521     36511   192.135       195   0.0300076     0.05203
2023-09-21T23:17:43.020209+0800   191      10     36711     36701   192.123       190   0.0613027   0.0520309
2023-09-21T23:17:44.020331+0800   192      10     36903     36893   192.123       192   0.0208707    0.052036
2023-09-21T23:17:45.020486+0800   193      10     37097     37087   192.132       194   0.0458238   0.0520295
2023-09-21T23:17:46.020604+0800   194      10     37284     37274   192.106       187   0.0207243   0.0520379
2023-09-21T23:17:47.020814+0800   195      10     37474     37464   192.095       190  0.00812529   0.0520349
2023-09-21T23:17:48.020959+0800   196      10     37677     37667    192.15       203     0.02463   0.0520271
2023-09-21T23:17:49.021133+0800   197      10     37875     37865    192.18       198   0.0600757   0.0520202
2023-09-21T23:17:50.021214+0800   198      10     38047     38037   192.078       172   0.0397751   0.0520491
2023-09-21T23:17:51.021411+0800   199      10     38241     38231   192.087       194   0.0580198   0.0520454

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 11:17:56 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  760.00     0.00     1.78     4.79     0.08    0.10    0.00    0.10   0.10   7.90
sdf               0.00  2184.00    0.00  757.00     0.00   189.25   512.00    11.65   16.39    0.00   16.39   1.24  93.90

09/21/2023 11:17:57 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    1.00  725.00     0.01     1.68     4.77     0.07    0.09    0.00    0.09   0.09   6.50
sdf               0.00  2256.00    0.00  734.00     0.00   183.50   512.00     9.09   12.08    0.00   12.08   1.30  95.70

09/21/2023 11:17:58 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    1.00  728.00     0.01     1.69     4.77     0.08    0.11    0.00    0.11   0.11   7.70
sdf               0.00  2142.00    0.00  727.00     0.00   181.75   512.00    10.13   14.18    0.00   14.18   1.29  94.10

09/21/2023 11:17:59 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  820.00     0.00     1.92     4.80     0.09    0.10    0.00    0.10   0.10   8.60
sdf               0.00  2502.00    0.00  825.00     0.00   206.25   512.00    12.20   14.79    0.00   14.79   1.20  98.70

# rados -c ./ceph.conf bench 300 rand -t 10 -p test-pool --show-time
2023-09-21T23:21:14.801963+0800 min lat: 0.000950983 max lat: 0.329227 avg lat: 0.0918224 lat p50: 0.079395 lat p90: 0.188812 lat p99: 0.2499 lat p999: 0.329227 lat p100: 0.329227
2023-09-21T23:21:14.801963+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T23:21:14.801963+0800   100      10     10870     10860   108.582       112   0.0475709   0.0918224
2023-09-21T23:21:15.802203+0800   101      10     10982     10972   108.616       112   0.0714889   0.0918054
2023-09-21T23:21:16.802396+0800   102      10     11088     11078    108.59       106     0.19222   0.0918347
2023-09-21T23:21:17.802598+0800   103      10     11197     11187   108.594       109   0.0498854   0.0918089
2023-09-21T23:21:18.802744+0800   104      10     11299     11289    108.53       102    0.148587   0.0918716
2023-09-21T23:21:19.802888+0800   105      10     11412     11402   108.573       113    0.139777    0.091857
2023-09-21T23:21:20.803039+0800   106      10     11520     11510   108.567       108    0.080589   0.0918555
2023-09-21T23:21:21.803180+0800   107      10     11626     11616   108.543       106   0.0585344   0.0918476
2023-09-21T23:21:22.803323+0800   108      10     11733     11723   108.528       107   0.0725719   0.0918789
2023-09-21T23:21:23.803399+0800   109      10     11841     11831   108.524       108   0.0258373   0.0918775
2023-09-21T23:21:24.803523+0800   110      10     11949     11939   108.519       108   0.0499517   0.0918971
2023-09-21T23:21:25.803697+0800   111      10     12060     12050   108.541       111   0.0370205   0.0918196
2023-09-21T23:21:26.803851+0800   112      10     12169     12159   108.545       109   0.0773861   0.0918556
2023-09-21T23:21:27.803986+0800   113      10     12279     12269   108.558       110   0.0938997   0.0918722
2023-09-21T23:21:28.804130+0800   114      10     12387     12377   108.553       108   0.0322604    0.091867
2023-09-21T23:21:29.804282+0800   115      10     12495     12485   108.548       108    0.176533   0.0918611
2023-09-21T23:21:30.804605+0800   116       9     12605     12596   108.568       111    0.163263   0.0918711
2023-09-21T23:21:31.804745+0800   117      10     12715     12705   108.572       109    0.111872   0.0918562
2023-09-21T23:21:32.804906+0800   118      10     12823     12813   108.567       108    0.180895   0.0918464
2023-09-21T23:21:33.805088+0800   119      10     12936     12926   108.604       113   0.0160037   0.0918286

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/21/2023 11:21:36 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1278.00     0.00  526.00    0.00   114.13     0.00   444.35    16.03   31.38   31.38    0.00   1.90  99.90

09/21/2023 11:21:37 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1225.00     0.00  515.00    0.00   108.88     0.00   432.96    16.40   31.86   31.86    0.00   1.94 100.00

09/21/2023 11:21:38 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1221.00     0.00  515.00    0.00   107.81     0.00   428.74    15.82   30.66   30.66    0.00   1.94 100.10

# rados -c ./ceph.conf bench 300 seq -t 10 -p test-pool --show-time
2023-09-21T23:28:31.738419+0800 min lat: 0.00553556 max lat: 0.32825 avg lat: 0.0785171 lat p50: 0.0663674 lat p90: 0.154047 lat p99: 0.244373 lat p999: 0.32825 lat p100: 0.32825
2023-09-21T23:28:31.738419+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-21T23:28:31.738419+0800    40      10      5086      5076   126.878       123    0.114363   0.0785171
2023-09-21T23:28:32.738621+0800    41      10      5218      5208   127.003       132   0.0976044   0.0784495
2023-09-21T23:28:33.738798+0800    42      10      5345      5335   127.002       127   0.0842019   0.0784338
2023-09-21T23:28:34.738972+0800    43      10      5476      5466   127.095       131     0.20592   0.0783696
2023-09-21T23:28:35.739144+0800    44      10      5612      5602   127.296       136   0.0140386   0.0782273
2023-09-21T23:28:36.739313+0800    45      10      5742      5732   127.356       130     0.20805   0.0781625
2023-09-21T23:28:37.739507+0800    46      10      5860      5850   127.152       118   0.0301462   0.0782675
2023-09-21T23:28:38.739688+0800    47      10      5986      5976   127.127       126    0.177597   0.0783592
2023-09-21T23:28:39.739859+0800    48      10      6112      6102   127.103       126    0.100173   0.0784004
2023-09-21T23:28:40.740046+0800    49      10      6243      6233   127.182       131    0.063912    0.078341
2023-09-21T23:28:41.740213+0800    50      10      6380      6370   127.378       137    0.012598    0.078227
2023-09-21T23:28:42.740375+0800    51      10      6514      6504   127.508       134     0.20194   0.0781202
2023-09-21T23:28:43.740552+0800    52      10      6649      6639   127.651       135     0.13733   0.0780518
2023-09-21T23:28:44.740710+0800    53      10      6775      6765    127.62       126    0.106213   0.0780625
2023-09-21T23:28:45.740873+0800    54      10      6891      6881   127.404       116    0.114592   0.0782213
2023-09-21T23:28:46.741043+0800    55      10      7012      7002   127.287       121   0.0441335   0.0782742
2023-09-21T23:28:47.741225+0800    56      10      7136      7126   127.228       124   0.0276655   0.0783007
2023-09-21T23:28:48.741405+0800    57      10      7267      7257   127.294       131   0.0299355   0.0782478
2023-09-21T23:28:49.741577+0800    58      10      7396      7386   127.323       129    0.192391   0.0782472
2023-09-21T23:28:50.741762+0800    59      10      7525      7515   127.351       129   0.0426686    0.078246

# iostat -xmt 1 -d /dev/sdb /dev/sdf
09/21/2023 11:29:01 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1403.00     0.00  626.00    0.00   126.31     0.00   413.24    14.60   23.03   23.03    0.00   1.60 100.00

09/21/2023 11:29:02 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1426.00     0.00  647.00    0.00   130.50     0.00   413.08    15.45   24.34   24.34    0.00   1.55 100.00

09/21/2023 11:29:03 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1431.00     0.00  642.00    0.00   128.94     0.00   411.31    15.26   23.68   23.68    0.00   1.56 100.00

Actions #21

Updated by jianwei zhang 8 months ago

test-8 : client_lim = 0 / client_res = 0.5 / client_wgt = 60 / iops=240 / bw=240M

* write : buffer=1M, bw=238MB/s, iops=238
* randread : buffer=1M, bw=106MB/s, iops=106
* seqread : buffer=1M, bw=127MB/s (52.9%), iops=127


# ceph daemon osd.0 config show | grep -e osd_mclock -e bluestore_throttle
    "bluestore_throttle_bytes": "0",                       //unlimited
    "bluestore_throttle_cost_per_io": "0",
    "bluestore_throttle_cost_per_io_hdd": "670000",
    "bluestore_throttle_cost_per_io_ssd": "4000",
    "bluestore_throttle_deferred_bytes": "0",              //unlimited
    "bluestore_throttle_trace_rate": "0.000000",
    "osd_mclock_force_run_benchmark_on_init": "false",
    "osd_mclock_iops_capacity_threshold_hdd": "500.000000",
    "osd_mclock_iops_capacity_threshold_ssd": "80000.000000",
    "osd_mclock_max_capacity_iops_hdd": "240.000000",      // iops=240
    "osd_mclock_max_capacity_iops_ssd": "21500.000000",
    "osd_mclock_max_sequential_bandwidth_hdd": "251658240", //bw=240M
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
    "osd_mclock_override_recovery_settings": "false",
    "osd_mclock_profile": "custom",
    "osd_mclock_scheduler_anticipation_timeout": "0.000000",
    "osd_mclock_scheduler_background_best_effort_lim": "0.100000",
    "osd_mclock_scheduler_background_best_effort_res": "0.200000",
    "osd_mclock_scheduler_background_best_effort_wgt": "20",
    "osd_mclock_scheduler_background_recovery_lim": "0.500000",
    "osd_mclock_scheduler_background_recovery_res": "0.300000",
    "osd_mclock_scheduler_background_recovery_wgt": "20",
    "osd_mclock_scheduler_client_lim": "0.000000",      //unlimited
    "osd_mclock_scheduler_client_res": "0.500000", 
    "osd_mclock_scheduler_client_wgt": "60",
    "osd_mclock_skip_benchmark": "true",

# cat test-bench-read-1M.sh 
#> readlog
#> writelog

ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time > writelog 2>&1

ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 rand -t 10 -p test-pool --show-time > readlog 2>&1

ceph tell osd.0 cache drop
rados -c ./ceph.conf bench 300 seq -t 10 -p test-pool --show-time > seq-readlog 2>&1

# rados -c ./ceph.conf bench 300 write --no-cleanup -t 10 -p test-pool -b 1048576 --show-time > writelog 2>&1
2023-09-22T15:32:17.676876+0800 min lat: 0.00747302 max lat: 2.55199 avg lat: 0.0420207 lat p50: 0.0395603 lat p90: 0.0472786 lat p99: 0.0521989 lat p999: 1.4363 lat p100: 2.55199
2023-09-22T15:32:17.676876+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-22T15:32:17.676876+0800    60       9     14283     14274   237.862       244   0.0407354   0.0420207
2023-09-22T15:32:18.677013+0800    61      10     14519     14509   237.814       235   0.0468292   0.0420267
2023-09-22T15:32:19.677158+0800    62      10     14751     14741    237.72       232   0.0412701   0.0420435
2023-09-22T15:32:20.677295+0800    63      10     14997     14987   237.851       246   0.0402811    0.042023
2023-09-22T15:32:21.677467+0800    64      10     15243     15233   237.978       246   0.0432071   0.0419993
2023-09-22T15:32:22.677597+0800    65      10     15479     15469   237.947       236   0.0465882   0.0420044
2023-09-22T15:32:23.677787+0800    66      10     15710     15700   237.841       231   0.0412817   0.0420231
2023-09-22T15:32:24.677899+0800    67      10     15952     15942   237.903       242   0.0417247   0.0420128
2023-09-22T15:32:25.678044+0800    68      10     16199     16189   238.036       247   0.0422215     0.04199
2023-09-22T15:32:26.678153+0800    69      10     16435     16425   238.006       236   0.0411166   0.0419948
2023-09-22T15:32:27.678284+0800    70      10     16663     16653   237.863       228   0.0432414   0.0420201
2023-09-22T15:32:28.678437+0800    71      10     16903     16893   237.892       240   0.0414538   0.0420163
2023-09-22T15:32:29.678612+0800    72      10     17151     17141   238.032       248   0.0402458   0.0419915
2023-09-22T15:32:30.678786+0800    73      10     17394     17384   238.099       243   0.0406311   0.0419796
2023-09-22T15:32:31.678931+0800    74      10     17623     17613   237.976       229   0.0422751   0.0420012
2023-09-22T15:32:32.679100+0800    75      10     17863     17853   238.002       240   0.0407005   0.0419974
2023-09-22T15:32:33.679242+0800    76      10     18107     18097   238.081       244   0.0391934   0.0419851
2023-09-22T15:32:34.679354+0800    77      10     18351     18341   238.157       244   0.0407713   0.0419697
2023-09-22T15:32:35.679497+0800    78      10     18588     18578   238.142       237    0.042393   0.0419723
2023-09-22T15:32:36.679640+0800    79      10     18817     18807   238.026       229   0.0424886   0.0419926

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/22/2023 03:33:41 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  956.00     0.00     2.23     4.77     0.10    0.10    0.00    0.10   0.10   9.80
sdf               0.00  2868.00    0.00  956.00     0.00   239.00   512.00    33.66   35.21    0.00   35.21   1.05 100.00

09/22/2023 03:33:42 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  982.00     0.00     2.29     4.77     0.10    0.10    0.00    0.10   0.10   9.70
sdf               0.00  2952.00    0.00  984.00     0.00   246.00   512.00    33.45   34.08    0.00   34.08   1.02 100.00

09/22/2023 03:33:43 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  983.00     0.00     2.29     4.78     0.10    0.10    0.00    0.10   0.10   9.80
sdf               0.00  2940.00    0.00  980.00     0.00   245.00   512.00    33.37   33.98    0.00   33.98   1.02 100.00

09/22/2023 03:33:44 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  960.00     0.00     2.25     4.80     0.10    0.10    0.00    0.10   0.10   9.80
sdf               0.00  2880.00    0.00  960.00     0.00   240.00   512.00    33.55   34.96    0.00   34.96   1.04 100.00

# rados -c ./ceph.conf bench 300 rand -t 10 -p test-pool --show-time > readlog 2>&1
2023-09-22T15:37:59.274959+0800 min lat: 0.00140976 max lat: 0.365673 avg lat: 0.0932798 lat p50: 0.0796965 lat p90: 0.194479 lat p99: 0.296663 lat p999: 0.362666 lat p100: 0.365673
2023-09-22T15:37:59.274959+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-22T15:37:59.274959+0800   100      10     10705     10695   106.931       106    0.140561   0.0932798
2023-09-22T15:38:00.275178+0800   101      10     10807     10797   106.882       102   0.0374205   0.0933135
2023-09-22T15:38:01.275309+0800   102      10     10908     10898   106.824       101    0.113367   0.0933688
2023-09-22T15:38:02.275474+0800   103      10     11016     11006   106.836       108    0.187362   0.0933498
2023-09-22T15:38:03.275637+0800   104      10     11118     11108   106.789       102    0.086711   0.0934097
2023-09-22T15:38:04.275834+0800   105      10     11223     11213   106.772       105     0.22285   0.0933999
2023-09-22T15:38:05.275989+0800   106      10     11327     11317   106.745       104     0.03781   0.0934276
2023-09-22T15:38:06.276172+0800   107      10     11437     11427   106.776       110    0.143004   0.0934115
2023-09-22T15:38:07.276310+0800   108      10     11543     11533   106.768       106   0.0481894   0.0934059
2023-09-22T15:38:08.276455+0800   109      10     11651     11641   106.779       108    0.087842   0.0934107
2023-09-22T15:38:09.276657+0800   110      10     11758     11748   106.781       107   0.0620523   0.0934106
2023-09-22T15:38:10.276819+0800   111      10     11866     11856   106.792       108    0.059065   0.0934023
2023-09-22T15:38:11.276980+0800   112      10     11975     11965   106.812       109   0.0368395    0.093396
2023-09-22T15:38:12.277138+0800   113      10     12087     12077   106.857       112   0.0540547   0.0933415
2023-09-22T15:38:13.277300+0800   114      10     12195     12185   106.867       108   0.0432903   0.0933042
2023-09-22T15:38:14.277465+0800   115      10     12298     12288   106.834       103   0.0805061   0.0933639
2023-09-22T15:38:15.277632+0800   116      10     12406     12396   106.843       108   0.0938381   0.0933551
2023-09-22T15:38:16.277800+0800   117      10     12516     12506    106.87       110   0.0220266   0.0933227
2023-09-22T15:38:17.277964+0800   118      10     12624     12614    106.88       108   0.0594453   0.0933304
2023-09-22T15:38:18.278136+0800   119      10     12730     12720   106.872       106    0.161019   0.0933295

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/22/2023 03:38:39 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1211.00     0.00  547.00    0.00   109.00     0.00   408.10    16.74   30.15   30.15    0.00   1.83 100.10

09/22/2023 03:38:40 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1145.00     0.00  512.00    0.00   105.00     0.00   420.00    15.84   31.92   31.92    0.00   1.95 100.00

09/22/2023 03:38:41 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1188.00     0.00  522.00    0.00   106.75     0.00   418.82    16.19   30.71   30.71    0.00   1.92 100.00

# rados -c ./ceph.conf bench 300 seq -t 10 -p test-pool --show-time > seq-readlog 2>&1
2023-09-22T15:43:41.147441+0800 min lat: 0.00518757 max lat: 0.365872 avg lat: 0.0782036 lat p50: 0.0608217 lat p90: 0.174392 lat p99: 0.253574 lat p999: 0.358357 lat p100: 0.365872
2023-09-22T15:43:41.147441+0800   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
2023-09-22T15:43:41.147441+0800   140      10     17862     17852   127.491       124   0.0459142   0.0782036
2023-09-22T15:43:42.147654+0800   141      10     17990     17980   127.495       128   0.0692941   0.0782011
2023-09-22T15:43:43.147826+0800   142      10     18120     18110   127.512       130   0.0235289   0.0781802
2023-09-22T15:43:44.148011+0800   143      10     18249     18239   127.523       129   0.0882931   0.0781882
2023-09-22T15:43:45.148191+0800   144      10     18378     18368   127.533       129   0.0375746   0.0781872
2023-09-22T15:43:46.148362+0800   145      10     18502     18492   127.508       124   0.0173108   0.0781833
2023-09-22T15:43:47.148554+0800   146      10     18631     18621   127.518       129    0.218181   0.0781818
2023-09-22T15:43:48.148746+0800   147      10     18761     18751   127.535       130   0.0258726   0.0781515
2023-09-22T15:43:49.148908+0800   148      10     18896     18886   127.585       135    0.104194     0.07815
2023-09-22T15:43:50.149076+0800   149      10     19021     19011   127.568       125   0.0986657    0.078166
2023-09-22T15:43:51.149213+0800   150      10     19151     19141   127.584       130    0.139945   0.0781536
2023-09-22T15:43:52.149362+0800   151      10     19278     19268    127.58       127    0.116396    0.078159
2023-09-22T15:43:53.149531+0800   152      10     19412     19402   127.622       134   0.0435113   0.0781177
2023-09-22T15:43:54.149691+0800   153      10     19540     19530   127.624       128    0.208505   0.0781201
2023-09-22T15:43:55.149859+0800   154      10     19659     19649   127.568       119   0.0754471   0.0781465
2023-09-22T15:43:56.149997+0800   155      10     19795     19785   127.622       136   0.0374644   0.0781163
2023-09-22T15:43:57.150164+0800   156      10     19919     19909   127.599       124    0.189175   0.0781295
2023-09-22T15:43:58.150328+0800   157      10     20055     20045   127.652       136    0.236116    0.078095
2023-09-22T15:43:59.150477+0800   158      10     20194     20184   127.724       139   0.0241363   0.0780588
2023-09-22T15:44:00.150636+0800   159      10     20328     20318   127.763       134   0.0271561   0.0780272

# iostat -xmt 1 -d /dev/sdf /dev/sdb
09/22/2023 03:44:20 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1508.00     0.00  640.00    0.00   134.94     0.00   431.80    14.70   22.97   22.97    0.00   1.56  99.90

09/22/2023 03:44:21 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1327.00     0.00  581.00    0.00   120.00     0.00   422.99    15.30   26.18   26.18    0.00   1.72 100.00

09/22/2023 03:44:22 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf            1551.00     0.00  662.00    0.00   136.44     0.00   422.09    15.50   23.64   23.64    0.00   1.51 100.00

Actions #22

Updated by Radoslaw Zarzynski 8 months ago

Bump this up for next bug scrub.

Actions #23

Updated by Laura Flores 8 months ago

  • Status changed from New to Pending Backport
  • Backport set to quincy,reef
Actions #24

Updated by Laura Flores 8 months ago

  • Copied to Backport #63125: reef: osd: Is it necessary to unconditionally increase osd_bandwidth_cost_per_io in mClockScheduler::calc_scaled_cost? added
Actions #25

Updated by Laura Flores 8 months ago

  • Copied to Backport #63126: quincy: osd: Is it necessary to unconditionally increase osd_bandwidth_cost_per_io in mClockScheduler::calc_scaled_cost? added
Actions #26

Updated by Laura Flores 8 months ago

  • Tags set to backport_processed
Actions #27

Updated by Laura Flores 8 months ago

  • Pull request ID set to 53417
Actions #28

Updated by Ilya Dryomov 5 months ago

  • Status changed from Pending Backport to Resolved
  • Target version deleted (v18.2.0)