<h1>Ceph rgw - Bug #21560: rgw: put cors operation returns 500 unknown error (ops are ECANCELED)</h1>
<p><a class="external" href="https://tracker.ceph.com/issues/21560">https://tracker.ceph.com/issues/21560</a></p>
<hr /><p><strong>Updated by Casey Bodley</strong> (cbodley@redhat.com) on 2017-09-26T18:24:42Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=99885">journal #99885</a>)</p>
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li></ul>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-10-02T20:06:28Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100135">journal #100135</a>)</p>
<p>Is this reproducible? In our case it looks to be intermittent (racy), but it certainly exists for certain users/buckets.</p>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-10-02T20:33:43Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100136">journal #100136</a>)</p>
<p>Interestingly, I am seeing the same error while suspending/unsuspending a user. Below is a relevant excerpt of the log for the call, made via the admin ops API.</p>
<pre><code class="text syntaxhl"><span class="CodeRay">2017-10-02 20:27:20.637145 7fa716b45700 0 NOTICE: put_bucket_info on bucket=dotestone7 returned err=-125, skipping bucket
2017-10-02 20:27:20.637202 7fa716b45700 1 -- 10.116.45.136:0/778345655 --> 10.116.45.140:6812/7493 -- osd_op(client.198745.0:31812000 96.9a6796cb meta.log.2daacbce-73f9-4628-875f-2a287e41b66b.47 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e10992) v7 -- ?+0 0x7fab1003de00 con 0x7facc808c1c0
2017-10-02 20:27:20.638614 7facd05f6700 1 -- 10.116.45.136:0/778345655 <== osd.22 10.116.45.140:6812/7493 16746 ==== osd_op_reply(31812000 meta.log.2daacbce-73f9-4628-875f-2a287e41b66b.47 [call] v10992'277264 uv861070 ondisk = 0) v7 ==== 168+0+0 (3411247474 0 0) 0x7fac6c033a10 con 0x7facc808c1c0
2017-10-02 20:27:20.638728 7fa716b45700 1 -- 10.116.45.136:0/778345655 --> 10.116.45.139:6802/162720 -- osd_op(client.198745.0:31812001 94.35fa86fb .bucket.meta.dotestone8:d80e1915-c1a1-4342-968c-36af146cae93.76010.3699 [delete,create 0~0,call version.check_conds,call version.set,writefull 0~344,setxattr user.rgw.acl (133)] snapc 0=[] ondisk+write+known_if_redirected e10992) v7 -- ?+0 0x7fab1003de00 con 0x7facb8044cd0
2017-10-02 20:27:20.639314 7fab806f7700 1 -- 10.116.45.136:0/778345655 <== osd.15 10.116.45.139:6802/162720 898577 ==== osd_op_reply(31812001 .bucket.meta.dotestone8:d80e1915-c1a1-4342-968c-36af146cae93.76010.3699 [delete,create 0~0,call,call,writefull 0~344,setxattr (133)] v0'0 uv0 ondisk = -125 ((125) Operation canceled)) v7 ==== 401+0+0 (3179511437 0 0) 0x7faca8001ed0 con 0x7facb8044cd0
2017-10-02 20:27:20.639436 7fa716b45700 1 -- 10.116.45.136:0/778345655 --> 10.116.45.140:6812/7493 -- osd_op(client.198745.0:31812002 96.9a6796cb meta.log.2daacbce-73f9-4628-875f-2a287e41b66b.47 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e10992) v7 -- ?+0 0x7fab1003de00 con 0x7facc808c1c0
2017-10-02 20:27:20.640887 7facd05f6700 1 -- 10.116.45.136:0/778345655 <== osd.22 10.116.45.140:6812/7493 16747 ==== osd_op_reply(31812002 meta.log.2daacbce-73f9-4628-875f-2a287e41b66b.47 [call] v10992'277265 uv861071 ondisk = 0) v7 ==== 168+0+0 (1043085568 0 0) 0x7fac6c033a10 con 0x7facc808c1c0
2017-10-02 20:27:20.640948 7fa716b45700 0 NOTICE: put_bucket_info on bucket=dotestone8 returned err=-125, skipping bucket
2017-10-02 20:27:20.641013 7fa716b45700 1 -- 10.116.45.136:0/778345655 --> 10.116.45.15:6810/70265 -- osd_op(client.198745.0:31812003 96.81d5914c meta.log.2daacbce-73f9-4628-875f-2a287e41b66b.31 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e10992) v7 -- ?+0 0x7fab1003de00 con 0x7facc80642d0
2017-10-02 20:27:20.642564 7fac484f5700 1 -- 10.116.45.136:0/778345655 <== osd.5 10.116.45.15:6810/70265 2492992 ==== osd_op_reply(31812003 meta.log.2daacbce-73f9-4628-875f-2a287e41b66b.31 [call] v10992'641300 uv1979958 ondisk = 0) v7 ==== 168+0+0 (2128419504 0 0) 0x7fabac032d00 con 0x7facc80642d0
2017-10-02 20:27:20.642727 7fa716b45700 1 -- 10.116.45.136:0/778345655 --> 10.116.45.16:6802/66031 -- osd_op(client.198745.0:31812004 94.1459fce4 .bucket.meta.dotestone9:d80e1915-c1a1-4342-968c-36af146cae93.76010.3690 [delete,create 0~0,call version.check_conds,call version.set,writefull 0~344,setxattr user.rgw.acl (133)] snapc 0=[] ondisk+write+known_if_redirected e10992) v7 -- ?+0 0x7fab1003de00 con 0x7fac4401ced0
2017-10-02 20:27:20.643278 7fac307e8700 1 -- 10.116.45.136:0/778345655 <== osd.8 10.116.45.16:6802/66031 758271 ==== osd_op_reply(31812004 .bucket.meta.dotestone9:d80e1915-c1a1-4342-968c-36af146cae93.76010.3690 [delete,create 0~0,call,call,writefull 0~344,setxattr (133)] v0'0 uv0 ondisk = -125 ((125) Operation canceled)) v7 ==== 401+0+0 (1249275997 0 0) 0x7fac44038db0 con 0x7fac4401ced0
2017-10-02 20:27:20.643381 7fa716b45700 1 -- 10.116.45.136:0/778345655 --> 10.116.45.15:6810/70265 -- osd_op(client.198745.0:31812005 96.81d5914c meta.log.2daacbce-73f9-4628-875f-2a287e41b66b.31 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e10992) v7 -- ?+0 0x7fab1003de00 con 0x7facc80642d0
2017-10-02 20:27:20.645118 7fac484f5700 1 -- 10.116.45.136:0/778345655 <== osd.5 10.116.45.15:6810/70265 2492993 ==== osd_op_reply(31812005 meta.log.2daacbce-73f9-4628-875f-2a287e41b66b.31 [call] v10992'641301 uv1979959 ondisk = 0) v7 ==== 168+0+0 (2342691522 0 0) 0x7fabac04fe30 con 0x7facc80642d0
</span></code></pre>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-10-02T20:38:33Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100137">journal #100137</a>)</p>
<p>The way I am working around this at present is by restarting the RGW daemons, after which the issue becomes quite hard to reproduce for a while. It does come back, however, after a considerable amount of time has passed.</p>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-10-03T20:26:06Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100224">journal #100224</a>)</p>
<p>This issue (for both CORS and user suspension) seems to happen when a read/write to the user object touches cached data that is somehow stale. The behavior is completely different on the other RGW nodes versus the node that has the stale cache: calls through all other RGWs succeed. I am not sure why the watch doesn't get triggered through to all RGWs, or why a specific cached object becomes stale.</p>
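<p>For context: RGW keeps a per-daemon metadata cache and distributes invalidations to peer gateways via RADOS watch/notify; if one daemon misses a notify, only that daemon keeps serving the stale entry, which matches the per-node behavior described above. A toy Python model of that failure mode (the names and values here are hypothetical, illustrative only, not Ceph code):</p>

```python
class Gateway:
    """Toy model of an RGW daemon with a local metadata cache."""
    def __init__(self, name):
        self.name = name
        self.cache = {}

    def invalidate(self, key):
        # What a received watch/notify callback would do.
        self.cache.pop(key, None)

def read(gw, store, key):
    """Serve from the local cache if present, else fall back to RADOS."""
    return gw.cache.get(key, store[key])

store = {"bucket.meta.dotest414": "v1"}          # authoritative copy in RADOS
gateways = [Gateway("rgw1"), Gateway("rgw2"), Gateway("rgw3")]
for gw in gateways:                               # all daemons warm their caches
    gw.cache["bucket.meta.dotest414"] = store["bucket.meta.dotest414"]

# A metadata write lands; the invalidation notify reaches rgw1 and rgw2
# but, for whatever reason, never reaches rgw3.
store["bucket.meta.dotest414"] = "v2"
for gw in gateways[:2]:
    gw.invalidate("bucket.meta.dotest414")

assert read(gateways[0], store, "bucket.meta.dotest414") == "v2"  # fresh
assert read(gateways[2], store, "bucket.meta.dotest414") == "v1"  # stale
```

<p>In this model the stale daemon keeps answering from its cache until it is restarted, while every other daemon already serves the new value.</p>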
<p>The following sequence keeps repeating for every single bucket the user has on the broken RGW node:</p>
<pre>
2017-10-03 20:02:52.005929 7f76a4a49700 20 enabling bucket name=dotest414
2017-10-03 20:02:52.005955 7f76a4a49700 1 -- 10.74.46.141:0/2622364622 --> 10.74.46.142:6800/29123 -- osd_op(client.98183.0:23831916 76.ea319066 meta.log.6eba1c17-6fee-41d9-81f7-04c95893905f.53 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e8993) v7 -- ?+0 0x7f7f3c0c3c60 con 0x7f7fbc0077a0
2017-10-03 20:02:52.006588 7f76a4a49700 1 -- 10.74.46.141:0/2622364622 --> 10.74.46.12:6806/25887 -- osd_op(client.98183.0:23831918 75.d4bd8791 .bucket.meta.dotest414:76bcb3c5-9013-4817-8e0b-9ec86fa9a561.36392.1025 [delete,create 0~0,call version.check_conds,call version.set,writefull 0~343,setxattr user.rgw.acl (133)] snapc 0=[] ondisk+write+known_if_redirected e8993) v7 -- ?+0 0x7f7f3c0c3c60 con 0x7f8034090e80
2017-10-03 20:02:52.006817 7f76a4a49700 1 -- 10.74.46.141:0/2622364622 --> 10.74.46.142:6800/29123 -- osd_op(client.98183.0:23831919 76.ea319066 meta.log.6eba1c17-6fee-41d9-81f7-04c95893905f.53 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e8993) v7 -- ?+0 0x7f7f3c07c070 con 0x7f7fbc0077a0
2017-10-03 20:02:52.007431 7f76a4a49700 0 NOTICE: put_bucket_info on bucket=dotest414 returned err=-125, skipping bucket
</pre>
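<p>For anyone following along, err=-125 (ECANCELED) is what the cls_version object class returns when the version precondition on the bucket-metadata write fails: the gateway conditions the write on the object version it last read (often from its cache), and the OSD rejects the write if the on-disk version has since moved on. A minimal Python sketch of that compare-and-set behavior, illustrative only (not the actual Ceph implementation):</p>

```python
ECANCELED = 125  # errno 125; shows up in the rgw log as err=-125

class VersionedObject:
    """Toy model of a RADOS object guarded by cls_version."""
    def __init__(self):
        self.ver = 0
        self.data = None

    def conditional_write(self, expected_ver, data):
        """Apply the write only if the caller's version matches (check_conds)."""
        if expected_ver != self.ver:
            return -ECANCELED     # the "failed condition check" case
        self.ver += 1             # set_version
        self.data = data
        return 0

obj = VersionedObject()
cached_ver = obj.ver              # this gateway reads version 0 into its cache

# Another gateway updates the bucket metadata first.
assert obj.conditional_write(obj.ver, "write from gateway B") == 0

# Our write, still conditioned on the stale cached version, is rejected,
# just like put_bucket_info returning err=-125 above.
assert obj.conditional_write(cached_ver, "write from gateway A") == -ECANCELED

# Re-reading the current version and retrying succeeds, which is why
# restarting the daemon (and so dropping its cache) clears the symptom.
assert obj.conditional_write(obj.ver, "retry from gateway A") == 0
```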
<p>The RGW node is, however, still able to serve all other requests.</p>
<hr /><p><strong>Updated by Casey Bodley</strong> (cbodley@redhat.com) on 2017-10-03T21:00:45Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100233">journal #100233</a>)</p>
<ul><li><strong>Assignee</strong> set to <i>Casey Bodley</i></li></ul>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-10-03T21:27:20Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100234">journal #100234</a>)</p>
<p>Another small update, after raising debug_objclass to 20. The reads/sets for both the success and the failure cases for the same bucket, "dotest4179", are tracked below:</p>
<pre>
Success:
2017-10-03 21:19:37.837206 7f1d9a752700 20 <cls> cls/version/cls_version.cc:202: cls_version: read_version _0RVXE-BtO0ggUryf-qRYGqe:25
2017-10-03 21:19:37.837216 7f1d9a752700 20 <cls> cls/version/cls_version.cc:39: cls_version: set_version _0RVXE-BtO0ggUryf-qRYGqe:26
2017-10-03 21:19:37.837254 7f1d9a752700 1 -- 169.254.1.34:6800/29123 --> 169.254.1.31:6806/57349 -- osd_repop(client.119201.0:135785057 75.5 75:a2d2e69c:::.bucket.meta.dotest4179%3a76bcb3c5-9013-4817-8e0b-9ec86fa9a561.36392.1362:head v 8993'16592) v1 -- ?+1579 0x7f1df55e6400 con 0x7f1dc6362f00
2017-10-03 21:19:37.837277 7f1d9a752700 1 -- 169.254.1.34:6800/29123 --> 169.254.1.33:6810/27653 -- osd_repop(client.119201.0:135785057 75.5 75:a2d2e69c:::.bucket.meta.dotest4179%3a76bcb3c5-9013-4817-8e0b-9ec86fa9a561.36392.1362:head v 8993'16592) v1 -- ?+1579 0x7f1dcae70c00 con 0x7f1dc6363200
2017-10-03 21:19:37.837555 7f1d80116700 1 -- 169.254.1.34:6800/29123 <== osd.3 169.254.1.31:6806/57349 8988657 ==== osd_repop_reply(client.119201.0:135785057 75.5) v1 ==== 83+0+0 (1860090718 0 0) 0x7f1dd565c300 con 0x7f1dc6362f00
2017-10-03 21:19:37.837567 7f1d80c21700 1 -- 169.254.1.34:6800/29123 <== osd.17 169.254.1.33:6810/27653 9055085 ==== osd_repop_reply(client.119201.0:135785057 75.5) v1 ==== 83+0+0 (2447246805 0 0) 0x7f1de096c780 con 0x7f1dc6363200
2017-10-03 21:19:37.837602 7f1d9cf57700 1 -- 10.74.46.142:6800/29123 --> 10.74.45.11:0/2203932309 -- osd_op_reply(135785057 .bucket.meta.dotest4179:76bcb3c5-9013-4817-8e0b-9ec86fa9a561.36392.1362 [delete,create 0~0,call version.check_conds,call version.set,writefull 0~344,setxattr user.rgw.acl (133)] v8993'16592 uv16592 ondisk = 0) v7 -- ?+0 0x7f1de9a98840 con 0x7f1de97e5600
Failure:
2017-10-03 21:21:14.438808 7f1d9a752700 20 <cls> cls/version/cls_version.cc:202: cls_version: read_version _0RVXE-BtO0ggUryf-qRYGqe:26
2017-10-03 21:21:14.438815 7f1d9a752700 20 <cls> cls/version/cls_version.cc:205: cls_version: failed condition check
2017-10-03 21:21:14.438820 7f1d9a752700 1 -- 10.74.46.142:6800/29123 --> 10.74.46.11:0/2867213628 -- osd_op_reply(26145514 .bucket.meta.dotest4179:76bcb3c5-9013-4817-8e0b-9ec86fa9a561.36392.1362 [delete,create 0~0,call version.check_conds,call version.set,writefull 0~344,setxattr user.rgw.acl (133)] v0'0 uv0 ondisk = -125 ((125) Operation canceled)) v7 -- ?+0 0x7f1dd3f80000 con 0x7f1dd5ca8a80
</pre>
<hr /><p><strong>Updated by Casey Bodley</strong> (cbodley@redhat.com) on 2017-10-04T17:16:20Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100270">journal #100270</a>)</p>
<p>Vaibhav Bhembre wrote:</p>
<blockquote>
<p>Another small update after upping debug_objclass=20. The reads/sets for both success and failure are tracked below for the same bucket: "dotest4179".</p>
</blockquote>
<p>Thanks. It's clear from that output that these two operations are initiated by different gateways. The first reply is sent to 10.74.45.11, and the second to 10.74.46.11. Are you able to track that first successful response back to its radosgw log? If debug-ms is enabled there, you could search for 'osd_op_reply(135785057'. If not, can you at least verify that the log doesn't contain "ERROR: failed to distribute cache"?</p>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-10-05T18:10:28Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100315">journal #100315</a>)</p>
<p>Oddly, I couldn't find the `osd_op_reply(135785057` response (interestingly, I could track it back for other requests). However, I did find a lot of "ERROR: failed to distribute cache" messages for various buckets, user.uid, and user.keys.</p>
<p>I couldn't find one for ".rgw.data.root:.bucket.meta.dotest4179:76bcb3c5-9013-4817-8e0b-9ec86fa9a561.36392.1362" specifically, but I only have logs for the past 7 days at this point, and it could have crept in before then.</p>
<p>Is this helpful?</p>
<hr /><p><strong>Updated by Matt Benjamin</strong> (mbenjamin@redhat.com) on 2017-10-12T17:39:01Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=100686">journal #100686</a>)</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul>
<hr /><p><strong>Updated by Matt Benjamin</strong> (mbenjamin@redhat.com) on 2017-10-19T17:46:09Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=101025">journal #101025</a>)</p>
<p>A group of RGW cache consistency issues is being investigated in 'project november'.</p>
<hr /><p><strong>Updated by Orit Wasserman</strong> (owasserm@redhat.com) on 2017-10-19T17:46:55Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=101026">journal #101026</a>)</p>
<ul><li><strong>Assignee</strong> changed from <i>Casey Bodley</i> to <i>J. Eric Ivancich</i></li></ul>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-10-30T20:51:39Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=101620">journal #101620</a>)</p>
<p>Hi, are there any other logs or information I can provide here? We are still hitting this issue for CORS updates and user suspension.</p>
<hr /><p><strong>Updated by Yehuda Sadeh</strong> (yehuda@redhat.com) on 2017-11-02T17:43:33Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=101751">journal #101751</a>)</p>
<p>@cbodley @vdb I'm not completely convinced that it's a cache coherency issue. It could still be, but it could also be that we don't retry appropriately when we identify a racing write.</p>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-11-10T15:02:59Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=102160">journal #102160</a>)</p>
<p>That does make sense, Yehuda! Meanwhile, please let me know if there's any other information I can pull to expedite this.</p>
<hr /><p><strong>Updated by Matt Benjamin</strong> (mbenjamin@redhat.com) on 2017-11-30T19:00:57Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=103091">journal #103091</a>)</p>
<p>This may be addressed by a cache consistency bugfix with a similar symptom:<br /><a class="external" href="https://github.com/ceph/ceph/pull/18954">https://github.com/ceph/ceph/pull/18954</a></p>
<hr /><p><strong>Updated by Matt Benjamin</strong> (mbenjamin@redhat.com) on 2017-11-30T19:01:30Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=103092">journal #103092</a>)</p>
<p>(That change has merged to master and is pending backport.)</p>
<hr /><p><strong>Updated by Vaibhav Bhembre</strong> (vaibhav@digitalocean.com) on 2017-11-30T21:30:22Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=103107">journal #103107</a>)</p>
<p>This is awesome! Would it be part of 12.2.2 by any chance?</p>
<hr /><p><strong>Updated by Matt Benjamin</strong> (mbenjamin@redhat.com) on 2017-11-30T21:36:30Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=103108">journal #103108</a>)</p>
<p>Unfortunately not 12.2.2, but master has the change. It should reach 12.2.3.</p>
<p>Matt</p>
<hr /><p><strong>Updated by Matt Benjamin</strong> (mbenjamin@redhat.com) on 2018-01-04T18:43:17Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=104249">journal #104249</a>)</p>
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>luminous, jewel</i></li></ul>
<hr /><p><strong>Updated by Nathan Cutler</strong> (ncutler@suse.cz) on 2018-01-05T12:19:17Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=104327">journal #104327</a>)</p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/22588">Backport #22588</a>: luminous: rgw: put cors operation returns 500 unknown error (ops are ECANCELED)</i> added</li></ul>
<hr /><p><strong>Updated by Nathan Cutler</strong> (ncutler@suse.cz) on 2018-01-05T12:19:21Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=104329">journal #104329</a>)</p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/22589">Backport #22589</a>: jewel: rgw: put cors operation returns 500 unknown error (ops are ECANCELED)</i> added</li></ul>
<hr /><p><strong>Updated by J. Eric Ivancich</strong> (ivancich@redhat.com) on 2018-01-05T19:32:19Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=104414">journal #104414</a>)</p>
<p>The luminous backport is currently a PR waiting to be merged: <a class="external" href="https://github.com/ceph/ceph/pull/19788">https://github.com/ceph/ceph/pull/19788</a></p>
<p>The jewel backport is currently a PR waiting to be merged: <a class="external" href="https://github.com/ceph/ceph/pull/19769">https://github.com/ceph/ceph/pull/19769</a></p>
<hr /><p><strong>Updated by Nathan Cutler</strong> (ncutler@suse.cz) on 2018-01-09T13:30:19Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=104596">journal #104596</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-3 priority-4 priority-default closed" href="/issues/22517">Bug #22517</a>: Cache never becoming consistent after failed updates</i> added</li></ul>
<hr /><p><strong>Updated by Nathan Cutler</strong> (ncutler@suse.cz) on 2018-02-02T21:16:26Z (<a href="https://tracker.ceph.com/issues/21560?journal_id=106639">journal #106639</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>