Bug #15353
closed
- Status changed from New to In Progress
- Assignee set to Jason Dillaman
@Josh Jones: I think I am missing something. The only times we drop the guard are during copy-ups.
osd_op_reply(9109 rbd_data.100e71ea1109.00000000000000a6 [stat,write 588800~1032192] v0'0 uv576 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fb31a36d8c0 con 0x7fb31b555380
osd_op_reply(9110 rbd_data.100e71ea1109.00000000000000a6 [stat,write 1620992~1032192] v0'0 uv576 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fb31bb29080 con 0x7fb31b555380
osd_op_reply(9111 rbd_data.100e71ea1109.00000000000000a6 [stat,write 2653184~1032192] v0'0 uv576 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fb31a36cb00 con 0x7fb31b555380
osd_op_reply(9112 rbd_data.100e71ea1109.00000000000000a6 [stat,write 3685376~64512] v0'0 uv576 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fb31bb28dc0 con 0x7fb31b555380
osd_op_reply(9113 rbd_data.100e71ea1109.00000000000000a6 [write 588800~1032192] v8'577 uv577 ack = 0) v7 -- ?+0 0x7fb3192662c0 con 0x7fb31b555380
In the example above, 9109 was the initial (guarded) write. The parent extent must have been zeroed because the copy-up op 9113 doesn't include the exec call.
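For anyone reading along, here is a minimal sketch that pulls the transaction id, guard flag, and result code out of reply lines like the ones quoted above. The regular expression and the name `parse_reply` are illustrative only and are fitted to these specific lines; this is not a general Ceph log parser. The two sample lines are copied from the log above, trimmed after `v7`.

```python
import re

# Fitted to the osd_op_reply lines quoted in this ticket (an assumption,
# not a documented log format).
REPLY_RE = re.compile(
    r"osd_op_reply\((?P<tid>\d+) (?P<oid>\S+) \[(?P<ops>[^\]]*)\] "
    r".*? (?:ondisk|ack) = (?P<result>-?\d+)"
)


def parse_reply(line):
    """Return tid, object name, guard flag, and result code, or None if no match."""
    m = REPLY_RE.search(line)
    if m is None:
        return None
    # Each op looks like "stat" or "write 588800~1032192"; keep only the op name.
    ops = [op.split()[0] for op in m.group("ops").split(",") if op.strip()]
    return {
        "tid": int(m.group("tid")),
        "oid": m.group("oid"),
        "guarded": "stat" in ops,  # a leading stat is the existence guard
        "result": int(m.group("result")),
    }


lines = [
    "osd_op_reply(9109 rbd_data.100e71ea1109.00000000000000a6 "
    "[stat,write 588800~1032192] v0'0 uv576 ondisk = -2 "
    "((2) No such file or directory)) v7",
    "osd_op_reply(9113 rbd_data.100e71ea1109.00000000000000a6 "
    "[write 588800~1032192] v8'577 uv577 ack = 0) v7",
]

for line in lines:
    print(parse_reply(line))
```

Run over the full log, this separates the guarded writes that returned -2 from the unguarded write that succeeded.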
We added the VM clone testing with Jewel, which might be why we are seeing this more often. We can add additional tracking to stall pipelined copy-ups which would help in the single-client case.
It seems like it's more probable to have the flat write overlap with guarded writes now that the object map tells us we don't need to copyup anything, and we go directly to the flat write.
There also seem to be a couple of ways around this: 1) stall writes until all guarded ops on the same object are complete, or 2) continue sending the stat guard after we know the object exists.
1) is more complex code-wise, but lets us still do plain writes after an initial penalty (which may be slow due to copyup anyway) and puts the extra cost on the client, so it seems like the better option to me.
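A rough sketch of what option 1) could look like on the client side, assuming a simple per-object counter of in-flight guarded ops. The class `ObjectWriteGate` and its methods are hypothetical names for illustration; this is not librbd code and it ignores locking and error handling.

```python
from collections import defaultdict, deque


class ObjectWriteGate:
    """Hypothetical gate: plain writes to an object wait for its guarded ops."""

    def __init__(self):
        self._guarded_in_flight = defaultdict(int)  # object name -> pending guarded ops
        self._stalled = defaultdict(deque)          # object name -> queued plain writes

    def start_guarded(self, oid):
        """Record that a guarded (stat + write) op was sent for oid."""
        self._guarded_in_flight[oid] += 1

    def submit_plain(self, oid, send):
        """Call send() now if no guarded op is pending on oid, else queue it.

        Returns True if the write was sent immediately.
        """
        if self._guarded_in_flight.get(oid, 0) > 0:
            self._stalled[oid].append(send)
            return False
        send()
        return True

    def finish_guarded(self, oid):
        """Record a guarded op's completion; release stalled writes at zero."""
        if self._guarded_in_flight.get(oid, 0) == 0:
            raise ValueError("no guarded op in flight for " + oid)
        self._guarded_in_flight[oid] -= 1
        if self._guarded_in_flight[oid] == 0:
            del self._guarded_in_flight[oid]
            queued = self._stalled.pop(oid, deque())
            while queued:
                queued.popleft()()  # send stalled writes in submission order


gate = ObjectWriteGate()
sent = []
oid = "rbd_data.100e71ea1109.00000000000000a6"
gate.start_guarded(oid)
gate.submit_plain(oid, lambda: sent.append("plain write"))  # stalled behind the guarded op
gate.finish_guarded(oid)                                    # guarded op done, write released
```

The extra bookkeeping lives entirely on the client, which matches the trade-off described above: more code, but plain writes remain possible once the guarded ops have drained.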
A fortuitous message to ceph-users makes a good point - when the object map is in use, we could also drop the guarded write entirely, since we already know when the object exists. Was there some reason that didn't work?
@Josh Jones: You will hit this issue with object map disabled when you have multiple in-flight writes to a cloned image's object. We don't currently use the object map to determine if we should do guarded writes. If the object already exists (as per the object map), removing the guard doesn't save anything on the client side (though it saves a few cycles on the OSDs), so I never worried about it.
The optimization that we do make (starting w/ Jewel) is that we can skip right ahead to reading from the parent if we know the object doesn't exist in the clone -- saving a guard check that we know will fail. We can remove this optimization but the problem will remain.
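As a simplified model of the behaviour described in the last two comments (an illustration only, with hypothetical names, not the actual librbd state machine): the object map is consulted only to skip a guard that is known to fail, never to drop a guard that is expected to pass.

```python
from enum import Enum


class FirstStep(Enum):
    GUARDED_WRITE = "guarded write (stat + write)"
    READ_PARENT = "read from parent for copy-up"


def first_step(object_map_enabled, object_map_says_exists):
    """Pick the first step for a write to an object in a cloned image.

    Hypothetical model: with the object map enabled and the object known
    to be absent from the clone, skip the guard and read from the parent.
    In every other case, send the guarded write.
    """
    if object_map_enabled and not object_map_says_exists:
        return FirstStep.READ_PARENT
    return FirstStep.GUARDED_WRITE


for enabled, exists in [(False, False), (True, False), (True, True)]:
    print(enabled, exists, first_step(enabled, exists).value)
```

Note that the guard is still sent when the object map reports the object as present, which is the case discussed in the previous comment.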
@Josh Jones: my assumption was that dropping the guard if we know the object exists doesn't really save us (the client) much of anything. The guard should pass and the write proceeds -- a few more bytes in the op request and some additional compute time on the OSD. If we yanked the guard when we think the object exists, we would have to track in-flight ops to the same object so that you can't inject a new write between the time an old write updated the object map and started the copy up.
Since this wasn't caused by the recent optimizations, and we haven't seen any reports of it in the wild, I'm wondering if we should punt on this for jewel.
- Status changed from In Progress to New
- Assignee deleted (Jason Dillaman)
- Priority changed from Urgent to Normal
@Josh Jones: OK, I'm more than happy to avoid changing the IO path right before the Jewel release. :-)
- Status changed from New to Need More Info
@Josh Jones: do the recent PG log changes make this ticket obsolete?
- Status changed from Need More Info to Rejected
Yes, now that we store write errors in the pg log this shouldn't be an issue.