Bug #53593

open

RBD cloned image is slow in 4k write with "waiting for rw locks"

Added by Cuicui Zhao over 2 years ago. Updated 5 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Performance/Resource Usage
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy,reef
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[Observed Poor Performance]
On an RBD image, we found that 4k write IOPS is much lower than expected.
I understand there is an op sequencer in ObjectStore, so 4k sequential write IOPS is expected to be worse than 4k random write.
But in my performance test, the results differed a lot between a cloned RBD image and an RBD image without a parent.
And after I flattened the cloned RBD image, performance was much better than before.
My concern is that this may also block ops on the same PG and affect whole-cluster performance.

With "ceph daemon osd.id perf dump", I saw plenty of confusing stats in "osd" with "op_rw", while I expected "op_w" or "op_r" as our application should not require rw. And "op_latency" was much bigger than expected.
And ops experienced "waiting for rw locks", thus took a longer time to complete.
A demanstration is as follows. =======================================
$ ceph daemon osd.id perf dump
...
"osd": {
"op_wip": 0,
"op": 103630,
"op_in_bytes": 424282624,
"op_out_bytes": 0,
"op_latency": {
"avgcount": 103630,
"sum": 15927.179444258,
"avgtime": 0.153692747 <===bigger than expected.
},
...
"op_rw": 103579,
"op_rw_in_bytes": 424282624,
"op_rw_out_bytes": 0,
"op_rw_latency": {
"avgcount": 103579,
"sum": 15927.174100753,
"avgtime": 0.153768371
},

// osd_op is stat + write.
ceph daemon osd.id dump_historic_ops | grep write
"description": "osd_op(client.204166.0:954770 4.298 4:19642c7c:::rbd_data.2f65e994b93ec.0000000000002ca8:head [stat, object_size 4194304 write_size 4194304,write 4186112~4096] snapc 0=[] ondisk+write+known_if_redirected e196)",
...

ceph daemon osd.id dump_historic_ops | grep "waiting for rw"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
...
=======================================

[Root Cause]
With "waiting for rw locks", I located PrimaryLogPG::get_rw_locks() in PrimaryLogPG::do_op(). I think it should have allow concurrent write on the same object with write lock, until I noticed it's description, "[stat,..., write]".
The op "stat" will trigger "may_read()", and "[stat + write]" will result in excl lock in RWState::RWEXCL and thus suppress concurrency. The ops on the same 4MiB object will be done sequencially.
So the next question is why librbd sends a "stat" together with this write op. After some debugging, I found it in AbstractObjectWriteRequest<I>::write_object() in src/librbd/io/ObjectRequest.cc.
It seems the stat op is meant to give better performance in the copyup case.
=======================================
template <typename I>
void AbstractObjectWriteRequest<I>::write_object() {
  I *image_ctx = this->m_ictx;
  ldout(image_ctx->cct, 20) << dendl;

  neorados::WriteOp write_op;
  if (m_copyup_enabled) {
    if (m_guarding_migration_write) {
      auto snap_seq = (this->m_io_context->write_snap_context() ?
          this->m_io_context->write_snap_context()->first : 0);
      ldout(image_ctx->cct, 20) << "guarding write: snap_seq=" << snap_seq
                                << dendl;
      cls_client::assert_snapc_seq(
          &write_op, snap_seq, cls::rbd::ASSERT_SNAPC_SEQ_LE_SNAPSET_SEQ);
    } else {
      ldout(image_ctx->cct, 20) << "guarding write" << dendl;
      write_op.assert_exists(); // <============ A stat op is added here.
    }
  }
  add_write_hint(&write_op);
  add_write_ops(&write_op);
  ceph_assert(write_op.size() != 0);
  image_ctx->rados_api.execute(
      {data_object_name(this->m_ictx, this->m_object_no)},
      *this->m_io_context, std::move(write_op),
      librbd::asio::util::get_callback_adapter(
          [this](int r) { handle_write_object(r); }), nullptr,
      (this->m_trace.valid() ? this->m_trace.get_info() : nullptr));
}
=======================================
commit 4240846f5cf1525952ac42c0d6eddf9dfa459ce4
Author: Sage Weil <>
Date:   Mon Oct 22 17:57:08 2012 -0700

librbd: use assert_exists() to simplify copyup check
Previously we would explicitly STAT the object to see if it exists before
sending the write to the OSD. Instead, send the write optimistically, and
assert that the object already exists. This avoids an extra round trip in
the optimistic/common case, and makes the existence check in the initial
first-write case more expensive because we send the data payload along.
Signed-off-by: Sage Weil <>
=======================================
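
For context on the commit above, the optimistic pattern it describes can be illustrated with plain librados: attach assert_exists() to the write so the OSD fails it with -ENOENT when the object does not exist yet, which is the point where librbd would fall back to reading the parent and issuing a copyup. This is only a sketch under assumed names (the "rbd" pool, the object name, and the client id are placeholders), not librbd's actual code path.

#include <rados/librados.hpp>
#include <cerrno>
#include <iostream>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");              // hypothetical client id
  cluster.conf_read_file(nullptr);    // default ceph.conf search path
  if (cluster.connect() < 0) return 1;

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("rbd", ioctx) < 0) return 1;  // hypothetical pool

  // Optimistic guarded write: assert_exists() plays the role of the "stat"
  // guard -- the write only succeeds if the child object already exists.
  librados::ObjectWriteOperation op;
  op.assert_exists();
  librados::bufferlist bl;
  bl.append(std::string(4096, 'x'));
  op.write(0, bl);

  int r = ioctx.operate("rbd_data.deadbeef.0000000000000000", &op);  // hypothetical object
  if (r == -ENOENT) {
    // Object not written yet on the clone: librbd would now read the parent
    // extent and issue a copyup before retrying the write.
    std::cout << "object absent, copyup path would run" << std::endl;
  }
  return 0;
}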

[How to Reproduce]
Clone a disk from a snapshot.
$ rbd snap create PoolName/AParentImage@AParentImageSnapshot
$ rbd snap protect PoolName/AParentImage@AParentImageSnapshot
$ rbd clone PoolName/AParentImage@AParentImageSnapshot PoolName/ClonedImage
$ rbd info PoolName/ClonedImage | grep parent
parent: PoolName/AParentImage@AParentImageSnapshot

Create a disk without a parent.
$ rbd create PoolName/ImageWithoutParent --size XXX

Run fio on the two images in turn, substituting your pool name and image name. You may also use "-rw=write" to get the same result.
$ fio -direct=1 -iodepth=128 -rw=write:4k -ioengine=rbd -bs=4k -numjobs=1 -number_ios=1 -size=1G -pool=PoolName -rbdname=ClonedImage -clientname=admin -group_reporting -name=Rand_Write_Testing

[Suggestion]
I think it is both possible and necessary to optimize sequential write performance on the same object for cloned RBD images, as shown above. It's not a corner case.
A new explicit API like "write exist" is proposed for this scenario, to replace the "stat" op, which is in fact a READ and suppresses concurrency. This would require upgrading both the client and the OSD.
A workaround in the OSD for this particular case would also be welcome, but it may require more careful checking in OpInfo::set_from_op().
For now I have to flatten the RBD image to mitigate the performance issue reported in the field.
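
To make the suggested workaround concrete, here is a hypothetical sketch of the classification idea, again using invented names rather than the real OpInfo::set_from_op() code: if the only read-type sub-op in a compound is a bare existence guard such as "stat", the compound could be classified as write-only and take the shared write lock instead of the exclusive one.

#include <cassert>
#include <string>
#include <vector>

// Hypothetical sub-op descriptor and lock modes; names are invented for
// illustration and do not match the real Ceph types.
enum class LockMode { Read, Write, Excl };

struct SubOp {
  std::string name;   // e.g. "stat", "write", "read"
  bool is_read;
  bool is_write;
};

// Proposed classification: a "stat" that merely guards a write in the same
// compound op does not force the exclusive lock.
LockMode classify(const std::vector<SubOp>& ops) {
  bool reads = false, writes = false, guard_only_reads = true;
  for (const auto& op : ops) {
    if (op.is_read) {
      reads = true;
      if (op.name != "stat") guard_only_reads = false;  // a real read payload
    }
    if (op.is_write) writes = true;
  }
  if (reads && writes) {
    return guard_only_reads ? LockMode::Write : LockMode::Excl;
  }
  return writes ? LockMode::Write : LockMode::Read;
}

int main() {
  // "[stat, write]" from a cloned image would now share the write lock...
  assert(classify({{"stat", true, false}, {"write", false, true}}) == LockMode::Write);
  // ...while a genuine read + write compound still takes the exclusive lock.
  assert(classify({{"read", true, false}, {"write", false, true}}) == LockMode::Excl);
  return 0;
}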


Related issues 3 (0 open, 3 closed)

Copied to RADOS - Backport #63600: pacific: RBD cloned image is slow in 4k write with "waiting for rw locks" (Resolved, Ilya Dryomov)
Copied to RADOS - Backport #63601: quincy: RBD cloned image is slow in 4k write with "waiting for rw locks" (Resolved, Ilya Dryomov)
Copied to RADOS - Backport #63602: reef: RBD cloned image is slow in 4k write with "waiting for rw locks" (Resolved, Ilya Dryomov)
Actions #1

Updated by Cuicui Zhao over 2 years ago

Actions #2

Updated by Neha Ojha over 2 years ago

  • Project changed from Ceph to rbd
Actions #3

Updated by Ilya Dryomov 5 months ago

  • Status changed from New to Pending Backport
  • Pull request ID set to 45852

The fix landed in main, but post reef freeze. I think we need to backport this at least to reef, and probably older releases too.

Actions #4

Updated by Ilya Dryomov 5 months ago

  • Project changed from rbd to RADOS
  • Subject changed from librbd: cloned image is slow in 4k write with "waiting for rw locks" to RBD cloned image is slow in 4k write with "waiting for rw locks"
  • Category set to Performance/Resource Usage
  • Assignee set to Ilya Dryomov
  • Backport set to pacific,quincy,reef
Actions #5

Updated by Backport Bot 5 months ago

  • Tags set to backport_processed
  • Backport deleted (pacific,quincy,reef)
Actions #6

Updated by Ilya Dryomov 5 months ago

  • Tags deleted (backport_processed)
  • Backport set to pacific,quincy,reef
Actions #7

Updated by Backport Bot 5 months ago

  • Copied to Backport #63600: pacific: RBD cloned image is slow in 4k write with "waiting for rw locks" added
Actions #8

Updated by Backport Bot 5 months ago

  • Copied to Backport #63601: quincy: RBD cloned image is slow in 4k write with "waiting for rw locks" added
Actions #9

Updated by Backport Bot 5 months ago

  • Copied to Backport #63602: reef: RBD cloned image is slow in 4k write with "waiting for rw locks" added
Actions #10

Updated by Backport Bot 5 months ago

  • Tags set to backport_processed