Bug #15353: librbd: disable optimizations that result in pipelining guarded writes mixed with non-guarded writes - rbd - Ceph

Bug #15353

closed

librbd: disable optimizations that result in pipelining guarded writes mixed with non-guarded writes

Added by Josh Durgin about 8 years ago. Updated almost 8 years ago.

Status: Rejected
Priority: Normal
Source: Q/A
Severity: 3 - minor

Description

These cause ordering issues, resulting in an osd crash as seen here:

http://qa-proxy.ceph.com/teuthology/teuthology-2016-03-30_12:01:01-rbd-jewel-distro-basic-smithi/97339/remote/smithi024/log/ceph-osd.4.log.gz

This is complex to fix in rados, and doesn't give rbd that much benefit, so let's disable it for now.

Related rados issue: http://tracker.ceph.com/issues/14468

History

Updated by Jason Dillaman about 8 years ago

Status changed from New to In Progress
Assignee set to Jason Dillaman

Updated by Jason Dillaman about 8 years ago

@Josh Jones: I think I am missing something. The only times we drop the guard are during copy-ups.

osd_op_reply(9109 rbd_data.100e71ea1109.00000000000000a6 [stat,write 588800~1032192] v0'0 uv576 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fb31a36d8c0 con 0x7fb31b555380
osd_op_reply(9110 rbd_data.100e71ea1109.00000000000000a6 [stat,write 1620992~1032192] v0'0 uv576 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fb31bb29080 con 0x7fb31b555380
osd_op_reply(9111 rbd_data.100e71ea1109.00000000000000a6 [stat,write 2653184~1032192] v0'0 uv576 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fb31a36cb00 con 0x7fb31b555380
osd_op_reply(9112 rbd_data.100e71ea1109.00000000000000a6 [stat,write 3685376~64512] v0'0 uv576 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fb31bb28dc0 con 0x7fb31b555380
osd_op_reply(9113 rbd_data.100e71ea1109.00000000000000a6 [write 588800~1032192] v8'577 uv577 ack = 0) v7 -- ?+0 0x7fb3192662c0 con 0x7fb31b555380

In the example above, 9109 was the initial (guarded) write. The parent extent must have been zeroed because the copy-up op 9113 doesn't include the exec call.
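To make the guarded/non-guarded distinction in the replies above easier to see, here is a minimal sketch that pulls the transaction id, op list, and result code out of each `osd_op_reply` line. The regular expression is inferred only from the five lines quoted in this comment, and treating a leading `stat` op as "the guard" is an assumption based on the explanation above; `parse_reply` is a hypothetical helper, not part of any Ceph tool.

```python
import re

# Pattern inferred from the osd_op_reply lines quoted in this ticket;
# other Ceph versions may format these lines differently.
REPLY_RE = re.compile(
    r"osd_op_reply\((?P<tid>\d+) (?P<obj>\S+) \[(?P<ops>[^\]]+)\]"
    r".*? (?:ondisk|ack) = (?P<result>-?\d+)"
)


def parse_reply(line):
    """Return a dict describing one reply line, or None if it does not match."""
    m = REPLY_RE.search(line)
    if m is None:
        return None
    # "stat,write 588800~1032192" -> ["stat", "write"]
    op_names = [op.split()[0] for op in m.group("ops").split(",")]
    return {
        "tid": int(m.group("tid")),
        "object": m.group("obj"),
        "ops": op_names,
        # Assumption: a leading stat op is the existence guard.
        "guarded": op_names[0] == "stat",
        "result": int(m.group("result")),
    }
```

Run over the log excerpt, this would report 9109 through 9112 as guarded writes that returned -2 (ENOENT) and 9113 as an unguarded write that returned 0.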

We added the VM clone testing with Jewel, which might be why we are seeing this more often. We can add additional tracking to stall pipelined copy-ups which would help in the single-client case.

Updated by Josh Durgin about 8 years ago

It seems like it's more probable to have the flat write overlap with guarded writes now that the object map tells us we don't need to copyup anything, and we go directly to the flat write.

There also seem to be a couple of ways around this: 1) stall writes until all guarded ops on the same object are complete, or 2) continue sending the stat guard after we know the object exists.

1) is more complex code-wise, but lets us still do plain writes after an initial penalty (which may be slow due to copyup anyway) and puts the extra cost on the client, so it seems like the better option to me.
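Option 1) could look roughly like the following client-side sketch. It is a toy model in Python rather than librbd code: `GuardedWriteGate`, its method names, and the `sent` list are all hypothetical, and it only models the one case discussed here, a plain write arriving while guarded writes to the same object are still in flight.

```python
from collections import defaultdict, deque


class GuardedWriteGate:
    """Toy per-object gate: plain writes wait for in-flight guarded writes."""

    def __init__(self):
        self._guarded_in_flight = defaultdict(int)  # object -> count
        self._stalled = defaultdict(deque)          # object -> queued tids
        self.sent = []                              # dispatch order, for inspection

    def submit(self, obj, tid, guarded):
        """Dispatch the write, or stall it. Returns True if dispatched now."""
        if not guarded and self._guarded_in_flight[obj] > 0:
            # A plain write must not be pipelined behind guarded writes.
            self._stalled[obj].append(tid)
            return False
        if guarded:
            self._guarded_in_flight[obj] += 1
        self.sent.append((tid, guarded))
        return True

    def complete_guarded(self, obj):
        """Call when a guarded write to obj has been acknowledged."""
        self._guarded_in_flight[obj] -= 1
        if self._guarded_in_flight[obj] == 0:
            # Last guarded op finished: release stalled plain writes in order.
            while self._stalled[obj]:
                self.sent.append((self._stalled[obj].popleft(), False))
```

The extra bookkeeping lives entirely on the client, which matches the point above that this option puts the cost there instead of on the OSDs.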

Updated by Josh Durgin about 8 years ago

A fortuitous message to ceph-users makes a good point - we could also drop the guarded write entirely when the object map is in use, since we already know whether the object exists. Was there some reason that didn't work?

Updated by Jason Dillaman about 8 years ago

@Josh Jones: You will hit this issue even with the object map disabled when you have multiple in-flight writes to a cloned image's object. We don't currently use the object map to determine whether we should do guarded writes. If the object already exists (as per the object map), removing the guard doesn't save anything on the client side (though it does save a few cycles on the OSDs), so I never worried about it.

The optimization that we do make (starting w/ Jewel) is that we can skip right ahead to reading from the parent if we know the object doesn't exist in the clone -- saving a guard check that we know will fail. We can remove this optimization but the problem will remain.
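As a rough illustration of the decision described in the two paragraphs above, the sketch below picks the first step for a write to a cloned image's object. The function name and the returned step labels are hypothetical and chosen only for this example; the real librbd request state machine has more states than this.

```python
def first_write_step(object_map_enabled, exists_in_object_map):
    """Pick the first step for a write to a clone's object (toy model).

    Assumption: the only object-map shortcut is the Jewel one described
    above, i.e. skipping a guard that is known to fail.
    """
    if object_map_enabled and not exists_in_object_map:
        # The guard would fail with ENOENT, so go straight to the parent read.
        return "read_parent"
    # Object exists, or we cannot tell: the guard is still sent.
    return "guarded_write"
```

Note that the guard is kept even when the object map says the object exists, which is the behavior described above.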

Updated by Jason Dillaman about 8 years ago

@Josh Jones: My assumption was that dropping the guard if we know the object exists doesn't really save us (the client) much of anything. The guard should pass and the write proceeds -- a few more bytes in the op request and some additional compute time on the OSD. If we yanked the guard when we think the object exists, we would have to track in-flight ops to the same object so that you can't inject a new write between the time an old write updated the object map and the time it started the copy-up.

Updated by Josh Durgin about 8 years ago

Since this wasn't caused by the recent optimizations, and we haven't seen any reports of it in the wild, I'm wondering if we should punt on this for jewel.

Updated by Jason Dillaman about 8 years ago

Status changed from In Progress to New
Assignee deleted (Jason Dillaman)
Priority changed from Urgent to Normal

@Josh Jones: OK, I'm more than happy to avoid changing the IO path right before the Jewel release. :-)

Updated by Jason Dillaman almost 8 years ago

Status changed from New to Need More Info

@Josh Jones: do the recent PG log changes make this ticket obsolete?

Updated by Josh Durgin almost 8 years ago

Status changed from Need More Info to Rejected

Yes, now that we store write errors in the pg log this shouldn't be an issue.
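One way to read this, sketched below as a toy model, is that a resent guarded write now gets back its original error instead of being re-executed after a later write has created the object. This is an interpretation, not something spelled out in the ticket: `ToyPG`, its methods, and the `store_errors` switch are hypothetical and do not reflect the actual OSD or PG log implementation.

```python
ENOENT = 2


class ToyPG:
    """Toy placement group that remembers results by request id."""

    def __init__(self, store_errors):
        self.store_errors = store_errors
        self.objects = set()
        self._results = {}  # reqid -> result code

    def guarded_write(self, reqid, obj):
        """stat + write: fails with -ENOENT if the object does not exist."""
        if reqid in self._results:
            return self._results[reqid]  # duplicate: replay the logged result
        result = 0 if obj in self.objects else -ENOENT
        if result == 0 or self.store_errors:
            self._results[reqid] = result
        return result

    def plain_write(self, reqid, obj):
        """Unguarded write: creates the object if needed."""
        if reqid in self._results:
            return self._results[reqid]
        self.objects.add(obj)
        self._results[reqid] = 0
        return 0


def resend_after_copyup(store_errors):
    """Guarded write fails, copy-up creates the object, guarded write is resent."""
    pg = ToyPG(store_errors)
    first = pg.guarded_write(9109, "obj")
    pg.plain_write(9113, "obj")
    resent = pg.guarded_write(9109, "obj")
    return first, resent
```

In this model, without stored errors the resent op 9109 succeeds after 9113 and is applied out of order; with stored errors it returns the same -ENOENT as the first attempt.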
