Bug #53593

open

RBD cloned image is slow in 4k write with "waiting for rw locks"

Added by Cuicui Zhao over 2 years ago. Updated 5 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Performance/Resource Usage
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy,reef
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[Observed Poor Performance]
On an RBD image, we found that 4k write IOPS is much lower than expected.
I understand there is an op sequencer in ObjectStore, so 4k sequential write IOPS is expected to be worse than 4k random write.
But in my performance test, the results differed a lot between a cloned RBD image and an RBD image without a parent.
And after I flattened the cloned RBD image, performance was much better than before.
My concern is that this may also block ops on the same PG and affect whole-cluster performance.

With "ceph daemon osd.id perf dump", I saw plenty of confusing stats in "osd" with "op_rw", while I expected "op_w" or "op_r" as our application should not require rw. And "op_latency" was much bigger than expected.
And ops experienced "waiting for rw locks", thus took a longer time to complete.
A demanstration is as follows. =======================================
$ ceph daemon osd.id perf dump
...
"osd": {
"op_wip": 0,
"op": 103630,
"op_in_bytes": 424282624,
"op_out_bytes": 0,
"op_latency": {
"avgcount": 103630,
"sum": 15927.179444258,
"avgtime": 0.153692747 <===bigger than expected.
},
...
"op_rw": 103579,
"op_rw_in_bytes": 424282624,
"op_rw_out_bytes": 0,
"op_rw_latency": {
"avgcount": 103579,
"sum": 15927.174100753,
"avgtime": 0.153768371
},

// osd_op is stat + write.
ceph daemon osd.id dump_historic_ops | grep write
"description": "osd_op(client.204166.0:954770 4.298 4:19642c7c:::rbd_data.2f65e994b93ec.0000000000002ca8:head [stat, object_size 4194304 write_size 4194304,write 4186112~4096] snapc 0=[] ondisk+write+known_if_redirected e196)",
...

ceph daemon osd.id dump_historic_ops | grep "waiting for rw"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
"event": "waiting for rw locks"
...
=======================================

[Root Cause]
With "waiting for rw locks", I located PrimaryLogPG::get_rw_locks() in PrimaryLogPG::do_op(). I think it should have allow concurrent write on the same object with write lock, until I noticed it's description, "[stat,..., write]".
The op "stat" will trigger "may_read()", and "[stat + write]" will result in excl lock in RWState::RWEXCL and thus suppress concurrency. The ops on the same 4MiB object will be done sequencially.
So the next question is why librbd sends a "stat" together with this write op. After some debugging, I found it in AbstractObjectWriteRequest<I>::write_object() in src/librbd/io/ObjectRequest.cc.
It seems the stat op is meant to give better performance in the copyup case.
=======================================
template <typename I>
void AbstractObjectWriteRequest<I>::write_object() {
  I *image_ctx = this->m_ictx;
  ldout(image_ctx->cct, 20) << dendl;

  neorados::WriteOp write_op;
  if (m_copyup_enabled) {
    if (m_guarding_migration_write) {
      auto snap_seq = (this->m_io_context->write_snap_context() ?
          this->m_io_context->write_snap_context()->first : 0);
      ldout(image_ctx->cct, 20) << "guarding write: snap_seq=" << snap_seq
                                << dendl;
      cls_client::assert_snapc_seq(
          &write_op, snap_seq, cls::rbd::ASSERT_SNAPC_SEQ_LE_SNAPSET_SEQ);
    } else {
      ldout(image_ctx->cct, 20) << "guarding write" << dendl;
      write_op.assert_exists(); // <============ A stat op is added here.
    }
  }
  add_write_hint(&write_op);
  add_write_ops(&write_op);
  ceph_assert(write_op.size() != 0);
  image_ctx->rados_api.execute(
      {data_object_name(this->m_ictx, this->m_object_no)},
      *this->m_io_context, std::move(write_op),
      librbd::asio::util::get_callback_adapter(
          [this](int r) { handle_write_object(r); }), nullptr,
      (this->m_trace.valid() ? this->m_trace.get_info() : nullptr));
}
=======================================
commit 4240846f5cf1525952ac42c0d6eddf9dfa459ce4
Author: Sage Weil <>
Date:   Mon Oct 22 17:57:08 2012 -0700

librbd: use assert_exists() to simplify copyup check
Previously we would explicitly STAT the object to see if it exists before
sending the write to the OSD. Instead, send the write optimistically, and
assert that the object already exists. This avoids an extra round trip in
the optimistic/common case, and makes the existence check in the initial
first-write case more expensive because we send the data payload along.
Signed-off-by: Sage Weil <>
=======================================
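
For context on the commit above, the optimistic pattern it describes can be illustrated with plain librados: attach assert_exists() to the write so the OSD fails it with -ENOENT when the object does not exist yet, which is the point where librbd would fall back to reading the parent and issuing a copyup. This is only a sketch under assumed names (the "rbd" pool, the object name, and the client id are placeholders), not librbd's actual code path.

#include <rados/librados.hpp>
#include <cerrno>
#include <iostream>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");              // hypothetical client id
  cluster.conf_read_file(nullptr);    // default ceph.conf search path
  if (cluster.connect() < 0) return 1;

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("rbd", ioctx) < 0) return 1;  // hypothetical pool

  // Optimistic guarded write: assert_exists() plays the role of the "stat"
  // guard -- the write only succeeds if the child object already exists.
  librados::ObjectWriteOperation op;
  op.assert_exists();
  librados::bufferlist bl;
  bl.append(std::string(4096, 'x'));
  op.write(0, bl);

  int r = ioctx.operate("rbd_data.deadbeef.0000000000000000", &op);  // hypothetical object
  if (r == -ENOENT) {
    // Object not written yet on the clone: librbd would now read the parent
    // extent and issue a copyup before retrying the write.
    std::cout << "object absent, copyup path would run" << std::endl;
  }
  return 0;
}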

[How to Reproduce]
Clone a disk from a snapshot.
$ rbd snap create PoolName/AParentImage@AParentImageSnapshot
$ rbd snap protect PoolName/AParentImage@AParentImageSnapshot
$ rbd clone PoolName/AParentImage@AParentImageSnapshot PoolName/ClonedImage
$ rbd info PoolName/ClonedImage | grep parent
parent: PoolName/AParentImage@AParentImageSnapshot

Create a disk without a parent.
$ rbd create PoolName/ImageWithoutParent --size XXX

Run fio on the two images in turn, substituting your pool name and image name. You may also use "-rw=write" to get the same result.
$ fio -direct=1 -iodepth=128 -rw=write:4k -ioengine=rbd -bs=4k -numjobs=1 -number_ios=1 -size=1G -pool=PoolName -rbdname=ClonedImage -clientname=admin -group_reporting -name=Rand_Write_Testing

[Suggestion]
I think it is both possible and necessary to optimize sequential write performance on the same object for cloned RBD images, as shown above. It's not a corner case.
A new explicit API like "write exist" is proposed for this scenario, to replace the "stat" op, which is in fact a READ and suppresses concurrency. This would require upgrading both the client and the OSD.
A workaround in the OSD for this particular case would also be welcome, but it may require more careful checking in OpInfo::set_from_op().
For now I have to flatten the RBD image to mitigate the performance issue reported in the field.
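
To make the suggested workaround concrete, here is a hypothetical sketch of the classification idea, again using invented names rather than the real OpInfo::set_from_op() code: if the only read-type sub-op in a compound is a bare existence guard such as "stat", the compound could be classified as write-only and take the shared write lock instead of the exclusive one.

#include <cassert>
#include <string>
#include <vector>

// Hypothetical sub-op descriptor and lock modes; names are invented for
// illustration and do not match the real Ceph types.
enum class LockMode { Read, Write, Excl };

struct SubOp {
  std::string name;   // e.g. "stat", "write", "read"
  bool is_read;
  bool is_write;
};

// Proposed classification: a "stat" that merely guards a write in the same
// compound op does not force the exclusive lock.
LockMode classify(const std::vector<SubOp>& ops) {
  bool reads = false, writes = false, guard_only_reads = true;
  for (const auto& op : ops) {
    if (op.is_read) {
      reads = true;
      if (op.name != "stat") guard_only_reads = false;  // a real read payload
    }
    if (op.is_write) writes = true;
  }
  if (reads && writes) {
    return guard_only_reads ? LockMode::Write : LockMode::Excl;
  }
  return writes ? LockMode::Write : LockMode::Read;
}

int main() {
  // "[stat, write]" from a cloned image would now share the write lock...
  assert(classify({{"stat", true, false}, {"write", false, true}}) == LockMode::Write);
  // ...while a genuine read + write compound still takes the exclusive lock.
  assert(classify({{"read", true, false}, {"write", false, true}}) == LockMode::Excl);
  return 0;
}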


Related issues 3 (0 open, 3 closed)

Copied to RADOS - Backport #63600: pacific: RBD cloned image is slow in 4k write with "waiting for rw locks" (Resolved, Ilya Dryomov)
Copied to RADOS - Backport #63601: quincy: RBD cloned image is slow in 4k write with "waiting for rw locks" (Resolved, Ilya Dryomov)
Copied to RADOS - Backport #63602: reef: RBD cloned image is slow in 4k write with "waiting for rw locks" (Resolved, Ilya Dryomov)
Actions #1

Updated by Cuicui Zhao over 2 years ago

Actions #2

Updated by Neha Ojha over 2 years ago

  • Project changed from Ceph to rbd
Actions #3

Updated by Ilya Dryomov 5 months ago

  • Status changed from New to Pending Backport
  • Pull request ID set to 45852

The fix landed in main, but post reef freeze. I think we need to backport this at least to reef, and probably older releases too.

Actions #4

Updated by Ilya Dryomov 5 months ago

  • Project changed from rbd to RADOS
  • Subject changed from librbd: cloned image is slow in 4k write with "waiting for rw locks" to RBD cloned image is slow in 4k write with "waiting for rw locks"
  • Category set to Performance/Resource Usage
  • Assignee set to Ilya Dryomov
  • Backport set to pacific,quincy,reef
Actions #5

Updated by Backport Bot 5 months ago

  • Tags set to backport_processed
  • Backport deleted (pacific,quincy,reef)
Actions #6

Updated by Ilya Dryomov 5 months ago

  • Tags deleted (backport_processed)
  • Backport set to pacific,quincy,reef
Actions #7

Updated by Backport Bot 5 months ago

  • Copied to Backport #63600: pacific: RBD cloned image is slow in 4k write with "waiting for rw locks" added
Actions #8

Updated by Backport Bot 5 months ago

  • Copied to Backport #63601: quincy: RBD cloned image is slow in 4k write with "waiting for rw locks" added
Actions #9

Updated by Backport Bot 5 months ago

  • Copied to Backport #63602: reef: RBD cloned image is slow in 4k write with "waiting for rw locks" added
Actions #10

Updated by Backport Bot 5 months ago

  • Tags set to backport_processed