Project

General

Profile

Bug #19252

read/write inconsistent

Added by Xuehan Xu about 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Recently, in our test, we found a strange phenomenon: a READ req from client A that arrived later than a WRITE req from client B is finished ealier than that WRITE req.

The logs are as follows(we did a little modification to the level of some logs to 1 in order to get some insights of OSDs while avoiding too much logs filling up the disk):

2017-03-07 18:58:27.439107 7f80ae9fc700 1 2017-03-07 18:58:27.439109 7f2dbefe5700 1 -- 10.208.129.17:6800/1653080 <== client.5234411 10.208.129.31:0/1029364 99819 ==== osd_op(client.5234411.0:1312595 rbd_data.4ff53c1322a29c.0000000000000cad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3981312~12288] 4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) v5 ==== 291+0+12288 (84537638 0 1181937587) 0x7f80113c5e80 con 0x7f80988c7800-- 10.208.140.17:6802/1653275 --> 10.208.140.34:6802/545067 -- osd_repop_reply(client.5278878.0:1311414 4.19a ondisk, result = 0) v1 -- ?+0 0x7f2ce94bf800 con 0x7f2d62169100
2017-03-07 18:58:27.439130 7f80e33ff700 1 osd.14 15531 dequeue_op 0x7f7fe2827f00 prio 63 cost 4096 latency 0.019538 osd_op(client.5234411.0:1312144 rbd_data.4ff53c1322a29c.0000000000000cad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2719744~4096] 4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) v5 pg pg[4.ac( v 15531'8325461 (15531'8322451,15531'8325461] local-les=11579 n=176 ec=1384 les/c 11579/11579 11578/11578/11551) [14,6,15] r=0 lpr=11578 luod=15531'8325460 lua=15531'8325460 crt=15531'8325459 lcod 15531'8325459 mlcod 15531'8325459 active+clean]
2017-03-07 18:58:27.439176 7f80c3fff700 1 osd.14 15531 dequeue_op 0x7f7fe281d100 prio 63 cost 8192 latency 0.019477 osd_op(client.5234411.0:1312145 rbd_data.4ff53c1322a29c.0000000000000cad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3112960~8192] 4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) v5 pg pg[4.ac( v 15531'8325461 (15531'8322451,15531'8325461] local-les=11579 n=176 ec=1384 les/c 11579/11579 11578/11578/11551) [14,6,15] r=0 lpr=11578 luod=15531'8325460 lua=15531'8325460 crt=15531'8325459 lcod 15531'8325459 mlcod 15531'8325459 active+clean]
2017-03-07 18:58:27.439191 7f80ae9fc700 1 -- 10.208.129.17:6800/1653080 <== client.5234411 10.208.129.31:0/1029364 99820 ==== osd_op(client.5234411.0:1312596 rbd_data.4ff53c1322a29c.0000000000000cad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1990656~4096] 4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) v5 ==== 291+0+4096 (2656609427 0 1238378996) 0x7f80113c6100 con 0x7f80988c7800
2017-03-07 18:58:27.439230 7f80e33ff700 1 osd.14 15531 dequeue_op 0x7f7fe281d200 prio 63 cost 4096 latency 0.019387 osd_op(client.5234411.0:1312148 rbd_data.4ff53c1322a29c.0000000000000cad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3104768~4096] 4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) v5 pg pg[4.ac( v 15531'8325461 (15531'8322451,15531'8325461] local-les=11579 n=176 ec=1384 les/c 11579/11579 11578/11578/11551) [14,6,15] r=0 lpr=11578 luod=15531'8325460 lua=15531'8325460 crt=15531'8325459 lcod 15531'8325459 mlcod 15531'8325459 active+clean]
2017-03-07 18:58:27.439258 7f80ae9fc700 1 -- 10.208.129.17:6800/1653080 <== client.5234411 10.208.129.31:0/1029364 99821 ==== osd_op(client.5234411.0:1312603 rbd_data.4ff53c1322a29c.0000000000000cad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1384448~4096] 4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) v5 ==== 291+0+4096 (1049117786 0 1515194573) 0x7f80113c6380 con 0x7f80988c7800 .
.
.
.
.
.
2017-03-07 18:59:55.022570 7f80a3cdf700 1 -- 10.208.129.17:6800/1653080 <== client.5239664 10.208.129.12:0/1005804 302 ==== osd_op(client.5239664.0:5646 rbd_data.24e35147272369.0000000000000b0b@99 [sparse-read 0~4194304] 4.e3bddcca ack+read+localize_reads+known_if_redirected e15533) v5 ==== 199+0+0 (32967759 0 0) 0x7f7fc165ef80 con 0x7f8098810c80
2017-03-07 18:59:55.026579 7f2d4b7ff700 1 -- 10.208.129.17:6802/1653275 --> 10.208.129.12:0/1006309 -- osd_op_reply(6751 rbd_data.4ff52432edaf14.0000000000000d99 [sparse-read 0~4194304 [fadvise_sequential+fadvise_nocache]] v0'0 uv8885110 ondisk = 0) v6 -- ?+0 0x7f2c62c18c00 con 0x7f2cfb237300
2017-03-07 18:59:55.030936 7f80b31fb700 1 -- 10.208.129.17:6800/1653080 <== client.5290815 10.208.129.12:0/1005958 330 ==== osd_op(client.5290815.0:6476 rbd_data.4ff53c1322a29c.0000000000000cad@2d4c [sparse-read 0~4194304 [fadvise_sequential+fadvise_nocache]] 4.395b22ac ack+read+known_if_redirected e15533) v5 ==== 199+0+0 (1930673694 0 0) 0x7f7fbc287800 con 0x7f809882c280
2017-03-07 18:59:55.032485 7f2d1af49700 1 -- 10.208.129.17:6802/1653275 <== client.5290821 10.208.129.12:0/1006079 427 ==== osd_op(client.5290821.0:6249 rbd_data.24e35147272369.0000000000000c38@99 [sparse-read 0~4194304] 4.20749aa8 ack+read+localize_reads+known_if_redirected e15533) v5 ==== 199+0+0 (1792808058 0 0) 0x7f2d54165600 con 0x7f2cfb236d00
2017-03-07 18:59:55.032592 7f2d49fff700 1 osd.12 15533 dequeue_op 0x7f2d54108100 prio 63 cost 0 latency 0.000137 osd_op(client.5290821.0:6249 rbd_data.24e35147272369.0000000000000c38@99 [sparse-read 0~4194304] 4.20749aa8 ack+read+localize_reads+known_if_redirected e15533) v5 pg pg[4.a8( v 15533'11075160 (15531'11072094,15533'11075160] local-les=11582 n=235 ec=1384 les/c 11582/11582 11581/11581/11549) [12,7,5] r=0 lpr=11581 lua=15533'11075153 crt=15533'11075157 lcod 15533'11075159 mlcod 15533'11075157 active+clean]
2017-03-07 18:59:55.032752 7f2d49fff700 1 -- 10.208.129.17:6802/1653275 --> 10.208.129.12:0/1006079 -- osd_op_reply(6249 rbd_data.24e35147272369.0000000000000c38 [sparse-read 0~0] v0'0 uv1078 ondisk = 0) v6 -- ?+0 0x7f2c5cb29500 con 0x7f2cfb236d00
2017-03-07 18:59:55.034696 7ff0055de700 1 -- 10.208.140.17:6804/1653469 <== osd.10 10.208.140.16:6800/3059838 73519837 ==== osd_repop(client.5244972.0:86462 4.1ab 8aa851ab/rbd_data.5002fb59dea2e9.0000000000001849/head//4 v 15533'9265107) v1 ==== 929+0+4935 (3993174113 0 2812931730) 0x7fef41b3f400 con 0x7fefc5077d80 .
.
.
.
.
2017-03-07 19:00:20.333216 7f80d43ff700 1 osd.14 15533 dequeue_op 0x7f8075712900 prio 63 cost 4096 latency 90.895706 osd_op(client.5279770.0:1304103 rbd_data.4ff5c3390a4ab8.0000000000000cae [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2842624~4096] 6.80a65a5a snapc 1311=[1311] ack+ondisk+write+known_if_redirected e15533) v5 pg pg[6.5a( v 15533'5522134 (15531'5519127,15533'5522134] local-les=11561 n=248 ec=1678 les/c 11561/11561 11560/11560/11551) [14,2,15] r=0 lpr=11560 lua=15533'5522133 crt=15533'5522131 lcod 15533'5522133 mlcod 15533'5522133 active+clean]
2017-03-07 19:00:20.333332 7f8143bf9700 1 filestore(/home/ceph/software/ceph/var/lib/ceph/osd/ceph-14) write 4.fb_head/a7a6d8fb/rbd_data.4ff2f62f19768d.000000000000111e/head//4 950272~4096 = 4096
2017-03-07 19:00:20.333389 7f80e33ff700 1 -- 10.208.129.17:6800/1653080 --> 10.208.129.12:0/1005958 -- osd_op_reply(6476 rbd_data.4ff53c1322a29c.0000000000000cad [sparse-read 0~4194304 [fadvise_sequential+fadvise_nocache]] v0'0 uv8325748 ondisk = 0) v6 -- ?+0 0x7f7fac48c800 con 0x7f809882c280
2017-03-07 19:00:20.333440 7f80d43ff700 1 -- 10.208.140.17:6800/1653080 --> 10.208.140.34:6804/545263 -- osd_repop(client.5279770.0:1304103 6.5a 80a65a5a/rbd_data.4ff5c3390a4ab8.0000000000000cae/head//6 v 15533'5522135) v1 -- ?+4882 0x7f8068377600 con 0x7f80e7c51180
2017-03-07 19:00:20.333437 7f80c3fff700 1 osd.14 15533 dequeue_op 0x7f813949f100 prio 63 cost 12288 latency 112.894373 osd_op(client.5234411.0:1312595 rbd_data.4ff53c1322a29c.0000000000000cad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3981312~12288] 4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) v5 pg pg[4.ac( v 15533'8325756 (15531'8322751,15533'8325756] local-les=11579 n=198 ec=1384 les/c 11579/11579 11578/11578/11551) [14,6,15] r=0 lpr=11578 lua=15533'8325748 crt=15533'8325754 lcod 15533'8325755 mlcod 15533'8325755 active+clean].
.
.
.
2017-03-07 19:00:20.335061 7f81413f4700 1 -- 10.208.129.17:6800/1653080 --> 10.208.129.31:0/1029364 -- osd_op_reply(1312595 rbd_data.4ff53c1322a29c.0000000000000cad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3981312~12288] v15533'8325757 uv8325757 ondisk = 0) v6 -- ?+0 0x7f7fac8ed100 con 0x7f80988c7800 .
.
.
2017-03-07 19:00:25.201989 7f8143bf9700 1 filestore(/home/ceph/software/ceph/var/lib/ceph/osd/ceph-14) write 6.5a_head/80a65a5a/rbd_data.4ff5c3390a4ab8.0000000000000cae/head//6 2842624~4096 = 4096
2017-03-07 19:00:25.202221 7f80b63ff700 1 -- 10.208.140.17:6800/1653080 --> 10.208.140.34:6800/544873 -- osd_repop(client.5279704.0:1273591 4.1fb bf5d9bfb/rbd_data.4ff52432edaf14.0000000000000cae/head//4 v 15533'9631363) v1 -- ?+5383 0x7f80169e0c00 con 0x7f80e3851300
2017-03-07 19:00:25.202260 7f80b63ff700 1 -- 10.208.140.17:6800/1653080 --> 10.208.140.35:6804/1477738 -- osd_repop(client.5279704.0:1273591 4.1fb bf5d9bfb/rbd_data.4ff52432edaf14.0000000000000cae/head//4 v 15533'9631363) v1 -- ?+5383 0x7f80169e1200 con 0x7f80d584c600
2017-03-07 19:00:25.202321 7f81423f6700 1 filestore(/home/ceph/software/ceph/var/lib/ceph/osd/ceph-14) write 4.ac_head/395b22ac/rbd_data.4ff53c1322a29c.0000000000000cad/head//4 3981312~12288 = 12288
2017-03-07 19:00:25.202412 7f80b63ff700 1 filestore(/home/ceph/software/ceph/var/lib/ceph/osd/ceph-14) queue_transactions (writeahead) 1571508734 0x7f7fa343cb00
2017-03-07 19:00:25.202414 7f8143bf9700 1 filestore(/home/ceph/software/ceph/var/lib/ceph/osd/ceph-14) write 4.15c_head/674fc75c/rbd_data.5001d22aa53889.0000000000001849/head//4 700416~4096 = 4096

As shown above, WRITE req with tid 1312595 arrived at 18:58:27.439107 and READ req with tid 6476 arrived at 18:59:55.030936, however, the latter finished at 19:00:20:333389 while the former finished commit at 19:00:20.335061 and filestore write at 19:00:25.202321. And in these logs, we found that between the start and finish of each req, there was a lot of "dequeue_op" of that req. We read the source code, it seems that this is due to "RWState", is that correct?

And also, it seems that OSD won't distinguish reqs from different clients, so is it possible that io reqs from the same client also finish in a different order than that they were created in? Could this affect the read/write consistency? For instance, that a read can't acquire the data that were written by the same client just before it.

History

#1 Updated by Xuehan Xu about 3 years ago

I read the source code, it seems that, for writes, in ReplicatedPG::do_op method, the thread in OSD_op_tp calls ReplicatedPG::get_rw_lock method which tries to get RWState::RWWRITE. If it fails, the op will be put into obc->rwstate.waiters queue and be requeued when repop finishes, however, the OSD_op_tp's thread doesn't wait for repop and tries to get the next OP. Can this be the cause?

#2 Updated by Greg Farnum about 3 years ago

  • Status changed from New to Closed
  • Priority changed from Urgent to Normal

Any storage system is perfectly free to reorder simultaneous requests -- defined as those whose submit-reply time overlaps. So you submitted write W, then submitted read R, then got a response to R before W. That's allowed, and preventing it is actually impossible in general. In the specific case you've outlined, we could try to prevent it, but doing so is pretty ludicrously expensive and, since the "reorder" can happen anyway, doesn't provide any benefit.
So we don't try. :)

That said, obviously we do provide strict ordering across write boundaries: a read submitted after a write completed will always see the results of that write.

#3 Updated by Greg Farnum about 3 years ago

I do notice that your write takes a loooong time to complete, which may be why this popped up for you. You might want to look into why that happened, but there's not enough logging here to tell.

#4 Updated by Sage Weil almost 3 years ago

Oh, I see. Writes are ordered with respect to other writes, and reads to other reads, but reads and writes are allowed to reorder relative to each other. If you want write ordering for a read, add the RWORDERED flag to the read op.

Also available in: Atom PDF