Ceph - Bug #18533: two instances of omap_digest mismatch
https://tracker.ceph.com/issues/18533

Updated by Dan Mick (dmick@redhat.com) on 2017-01-14T04:35:52Z:
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/84207/diff?detail_id=81233">diff</a>)</li></ul>

Updated by Brad Hubbard (bhubbard@redhat.com) on 2017-01-16T01:11:34Z:
<ul></ul><p>Dan, could this be a duplicate of <a class="external" href="http://tracker.ceph.com/issues/17177">http://tracker.ceph.com/issues/17177</a>? What does the deep scrub output look like?</p>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-17T00:50:27Z:
<ul></ul><p>It is claimed that the version we are running (46f4285) should have the fix (73a1b45) for 17177.</p>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-17T02:01:46Z:
<ul></ul><p>There's another instance, pg 1.15, object 604.00000000:</p>
<pre>
rados -c /home/dmick/lrc/ceph.conf list-inconsistent-obj 1.15
{
"epoch": 769827,
"inconsistents": [
{
"object": {
"name": "604.00000000",
"nspace": "",
"locator": "",
"snap": "head",
"version": 9207591
},
"errors": [
"omap_digest_mismatch"
],
"union_shard_errors": [],
"selected_object_info": "1:a93a17c2:::604.00000000:head(772361'9207591 mds.0.95185:12666245 dirty|omap|data_digest s 0 uv 9207591 dd ffffffff alloc_hint [0 0 0])",
"shards": [
{
"osd": 33,
"errors": [],
"size": 0,
"omap_digest": "0x93abd7d2",
"data_digest": "0xffffffff"
},
{
"osd": 62,
"errors": [],
"size": 0,
"omap_digest": "0x93abd7d2",
"data_digest": "0xffffffff"
},
{
"osd": 77,
"errors": [],
"size": 0,
"omap_digest": "0x53fcb579",
"data_digest": "0xffffffff"
},
{
"osd": 120,
"errors": [],
"size": 0,
"omap_digest": "0x53fcb579",
"data_digest": "0xffffffff"
}
]
}
]
}
</pre>

Updated by David Zafman (dzafman@redhat.com) on 2017-01-17T02:04:19Z:
<ul></ul><p>We should look and see if a case of 17177 was missed in that fix.</p>
<p>The checksum is based on this set of keys:</p>
<pre>
$ sudo rados -p metadata listomapkeys 607.00000000
100130cf96b_head
100130cf971_head
100130cf9e0_head
100130cf9ec_head
100130cfcd3_head
100130cfcdd_head
100130cff88_head
100130cffab_head
100130cffb6_head
100130d0ac3_head
10014d73f2e_head
10014d744fb_head
1001ee36f99_head
1001ee36f9a_head
1001ee36f9b_head
1001f5289aa_head
1001f530b2f_head
1001f530b32_head
1001f530d8a_head
1001f530d9b_head
1001f5401da_head
1001f54ecbd_head
1001f5532c8_head
1001f5532c9_head
1001f553353_head
1001f5533aa_head
1001f5533cd_head
1001f5612d3_head
1001f5612d9_head
1001fb0d36f_head
1001fb0d3e1_head
1001fb0d3e3_head
10020346a54_head
</pre>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-17T02:15:59Z:
<ul></ul><p>pg 1.15, object 604.00000000 has 9251 omap entries (!)</p>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-19T23:25:04Z:
<ul></ul><p>The 1.3c object has disappeared.</p>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-24T01:31:48Z:
<ul></ul><p>The last scrub data disappeared. The state today on 1.15 604.00000000 is:</p>
<pre>
{
"epoch": 772952,
"inconsistents": [
{
"object": {
"name": "604.00000000",
"nspace": "",
"locator": "",
"snap": "head",
"version": 9313347
},
"errors": [
"omap_digest_mismatch"
],
"union_shard_errors": [],
"selected_object_info": "1:a93a17c2:::604.00000000:head(772953'9313347 mds.0.95440:17792072 dirty|omap|data_digest s 0 uv 9313347 dd ffffffff alloc_hint [0 0 0])",
"shards": [
{
"osd": 33,
"errors": [],
"size": 0,
"omap_digest": "0x627e933d",
"data_digest": "0xffffffff"
},
{
"osd": 62,
"errors": [],
"size": 0,
"omap_digest": "0x627e933d",
"data_digest": "0xffffffff"
},
{
"osd": 77,
"errors": [],
"size": 0,
"omap_digest": "0x627e933d",
"data_digest": "0xffffffff"
},
{
"osd": 120,
"errors": [],
"size": 0,
"omap_digest": "0xd9223271",
"data_digest": "0xffffffff"
}
]
}
]
}
</pre>
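Scrub reports like the one above can be triaged mechanically. Below is a small sketch, not part of any Ceph tooling; it just parses the `list-inconsistent-obj` JSON shown in these updates (the report literal is abbreviated from the output above) and groups shards by omap_digest so the divergent OSD stands out:

```python
import json
from collections import defaultdict

# Group the shards of each inconsistent object by omap_digest.
# Field names follow the list-inconsistent-obj output above.
report = json.loads("""
{"inconsistents": [
  {"object": {"name": "604.00000000"},
   "errors": ["omap_digest_mismatch"],
   "shards": [
     {"osd": 33,  "omap_digest": "0x627e933d"},
     {"osd": 62,  "omap_digest": "0x627e933d"},
     {"osd": 77,  "omap_digest": "0x627e933d"},
     {"osd": 120, "omap_digest": "0xd9223271"}]}]}
""")

for item in report["inconsistents"]:
    by_digest = defaultdict(list)
    for shard in item["shards"]:
        by_digest[shard["omap_digest"]].append(shard["osd"])
    majority = max(by_digest.values(), key=len)
    for digest, osds in by_digest.items():
        tag = "majority" if osds is majority else "divergent"
        print("%s %s: osds %s (%s)" % (item["object"]["name"], digest, osds, tag))
```

In this snapshot osd.120 is the divergent shard. Note that on a 2-vs-2 split (as in the later 2017-02-07 scrub of 100011cf577.00000000), a simple count majority no longer identifies the correct copy.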
<p>Primary reports (through rados listomapkeys) 7379 keys.</p>
<p>Replica (osd.77) reports 11 keys, which are a strict subset of the 7379:</p>
<pre>
100130cda86_head
100130d4598_head
1001f05f4fe_head
1001f05f763_head
1001f05f764_head
1001f05f765_head
1001fb02621_head
1001fb02622_head
1001fb02623_head
1001fb069b7_head
1002065dcfd_head
</pre>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-24T01:32:42Z:
<ul><li><strong>Assignee</strong> set to <i>David Zafman</i></li></ul>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-24T03:24:33Z:
<ul></ul><p>I think c66e466d4ed76cd7a063b9b982ba455150ef1f14 was brought up as a possibly-related issue.</p>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-24T05:23:20Z:
<ul></ul><p>Possibly of interest, possibly garbage:</p>
<p>The omap header for 604.00000000, obtained with rados getomaphdr and dumped with ceph-dencoder, shows corruption much like the corrupted rstats in <a class="external" href="http://tracker.ceph.com/issues/18532">http://tracker.ceph.com/issues/18532</a>:</p>
<pre>
{
"version": 294525375,
"snap_purged_thru": 0,
"fragstat": {
"version": 120,
"mtime": "2017-01-23 09:59:12.999262",
"num_files": 18446744073709542630,
"num_subdirs": 18446744073709551378
},
"accounted_fragstat": {
"version": 120,
"mtime": "2017-01-23 09:59:12.999262",
"num_files": 18446744073709543400,
"num_subdirs": 18446744073709551381
},
"rstat": {
"version": 7399,
"rbytes": 18446744067936750485,
"rfiles": 18446744073709542630,
"rsubdirs": 18446744073709551378,
"rsnaprealms": 0,
"rctime": "2017-01-23 09:59:12.999262"
},
"accounted_rstat": {
"version": 7399,
"rbytes": 18446744067983294228,
"rfiles": 18446744073709542853,
"rsubdirs": 18446744073709551381,
"rsnaprealms": 0,
"rctime": "2017-01-23 09:59:12.999262"
}
}
</pre>
<p>603.00000000's omaphdr, using the same methodology, looks saner:</p>
<pre>
{
"version": 294624821,
"snap_purged_thru": 0,
"fragstat": {
"version": 119,
"mtime": "2017-01-23 13:43:49.738195",
"num_files": 5,
"num_subdirs": 17
},
"accounted_fragstat": {
"version": 119,
"mtime": "2017-01-23 13:43:49.738195",
"num_files": 6,
"num_subdirs": 17
},
"rstat": {
"version": 7267,
"rbytes": 13210238,
"rfiles": 5,
"rsubdirs": 17,
"rsnaprealms": 0,
"rctime": "2017-01-23 13:43:49.738195"
},
"accounted_rstat": {
"version": 7267,
"rbytes": 13222526,
"rfiles": 6,
"rsubdirs": 17,
"rsnaprealms": 0,
"rctime": "2017-01-23 13:43:49.738195"
}
}
</pre>

Updated by Brad Hubbard (bhubbard@redhat.com) on 2017-01-24T05:36:09Z:
<ul></ul><p>Given std::numeric_limits&lt;int64_t&gt;::max() = 9223372036854775807, 18446744067936750485 seems too large a value to me.</p>

Updated by Brad Hubbard (bhubbard@redhat.com) on 2017-01-24T07:52:31Z:
<ul></ul><p>So these are signed values dumped out as unsigned values.</p>
<p>The relevant dump code:</p>
<pre>
void nest_info_t::dump(Formatter *f) const
{
  f->dump_unsigned("version", version);
  f->dump_unsigned("rbytes", rbytes);
  f->dump_unsigned("rfiles", rfiles);
  f->dump_unsigned("rsubdirs", rsubdirs);
  f->dump_unsigned("rsnaprealms", rsnaprealms);
  f->dump_stream("rctime") &lt;&lt; rctime;
}

void JSONFormatter::dump_unsigned(const char *name, uint64_t u)
{
  print_name(name);
  m_ss &lt;&lt; u;
}
</pre>
<p>For example (from geordi on IRC):</p>
<pre>
badone | geordi: { uint64_t sixfour = -238; cout &lt;&lt; sixfour &lt;&lt; endl; }
geordi | 18446744073709551378
</pre>
<p>So these look like negative values, some considerably negative as well.</p>
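The reinterpretation is mechanical. A quick sketch (plain Python, values taken from the dumps above) that maps the dumped unsigned values back to the int64_t they came from:

```python
# Undo the dump_unsigned() reinterpretation: values at or above 2**63
# were negative int64_t's printed through a uint64_t (two's complement).
def to_signed64(u):
    return u - 2**64 if u >= 2**63 else u

print(to_signed64(18446744073709551378))  # geordi's example: -238
print(to_signed64(18446744067983294228))  # accounted_rstat rbytes: -5726257388
print(to_signed64(18446744073709542630))  # fragstat num_files: -8986
```

Note that num_subdirs = 18446744073709551378 in the corrupt header is exactly -238, the same value as the geordi example.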
<pre>
$ echo 18446744067983294228-18446744073709551616|bc -iq
18446744067983294228-18446744073709551616
-5726257388
</pre>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-24T19:19:46Z:
<ul></ul><p>Yes, they're negative, but the point is that the bad rstats are also hugely negative (and, given that they're counts, they ought to be unsigned anyway).</p>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-25T03:34:30Z:
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/84712/diff?detail_id=81733">diff</a>)</li></ul>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-25T21:30:43Z:
<ul></ul><p>OK, I don't know how to repair this damage. Gonna need advice.</p>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-25T21:30:57Z:
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li></ul>

Updated by Dan Mick (dmick@redhat.com) on 2017-01-28T04:41:42Z:
<ul></ul><ul>
<li>Removed the object from the primary with ceph-objectstore-tool</li>
<li>Got all the omap keys/vals and the omap header with c-o-t from a replica</li>
<li>Got all the xattrs on the filestore file by hand with getfattr</li>
<li>Touched a file on the primary in the current/ dir to create the object (no such op in c-o-t? or is import that operation?)</li>
<li>Restored xattrs on the file on the primary with setfattr, which allowed c-o-t to recognize it again</li>
<li>Restored the omap header and key/vals with c-o-t on the primary</li>
<li>Re-deep-scrubbed the pg; the error is gone!</li>
</ul>
<p>I look forward to the tool for automating this, which I hear is coming shortly (<a class="external" href="https://github.com/ceph/ceph/pull/9203">https://github.com/ceph/ceph/pull/9203</a>).</p>

Updated by Dan Mick (dmick@redhat.com) on 2017-02-01T22:46:48Z:
<ul></ul><p>Another instance today:</p>
<pre>
{
"epoch": 772925,
"inconsistents": [
{
"object": {
"name": "100011cf577.00000000",
"nspace": "",
"locator": "",
"snap": "head",
"version": 9314695
},
"errors": [
"omap_digest_mismatch"
],
"union_shard_errors": [],
"selected_object_info": "1:a7f0f16e:::100011cf577.00000000:head(773004'9314695 mds.0.95616:2509263 dirty|omap|data_digest s 0 uv 9314695 dd ffffffff alloc_hint [0 0 0])",
"shards": [
{
"osd": 7,
"errors": [],
"size": 0,
"omap_digest": "0x618ae52f",
"data_digest": "0xffffffff"
},
{
"osd": 47,
"errors": [],
"size": 0,
"omap_digest": "0x618ae52f",
"data_digest": "0xffffffff"
},
{
"osd": 60,
"errors": [],
"size": 0,
"omap_digest": "0x618ae52f",
"data_digest": "0xffffffff"
},
{
"osd": 72,
"errors": [],
"size": 0,
"omap_digest": "0x6ba4c015",
"data_digest": "0xffffffff"
}
]
}
]
}
</pre>
<p>That object is the /teuthology-archive directory, which had 16274 omap keys, and the omap header said it had 16268 dirs + 6 files, which matched. All three replicas were out of sync with the primary. It's not known which version of the omap data is correct yet.</p>
<p>Simultaneously, mds damage was noted:</p>
<pre>
[{"damage_type":"dir_frag","id":650990821,"ino":1099619790351,"frag":"*"}]
</pre>
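The bookkeeping here can be sanity-checked with trivial arithmetic (values as reported in this update):

```python
# The omap header claimed 16268 subdirs and 6 files, which should
# account for the 16274 omap keys observed on the directory object.
assert 16268 + 6 == 16274

# The dir_frag damage entry reports ino 1099619790351,
# i.e. 0x10006726e0f in the hex form used for the omap key names.
assert int("10006726e0f", 16) == 1099619790351

print("counts and inode translation are consistent")
```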
<p>1099619790351 is 0x10006726e0f, and that inode does not appear to be listed in the omap keys for the object. If it is present on the replicas, that might suggest that the primary was updated for a deletion but the replicas were not.</p>

Updated by David Zafman (dzafman@redhat.com) on 2017-02-01T22:56:14Z:
<ul></ul><p>Using the keys from osd.7 and comparing with the primary of the running cluster, these 22 omap keys are missing from the replicas which have matching omap_digest:</p>
<pre>
teuthology-2014-12-10_17:15:01-upgrade:dumpling-firefly-x:stress-split-next-distro-basic-multi_head
teuthology-2014-12-10_17:15:01-upgrade:giant-giant-distro-basic-vps_head
teuthology-2014-12-10_17:18:01-upgrade:firefly-x-next-distro-basic-vps_head
teuthology-2014-12-10_17:25:02-upgrade:dumpling-firefly-x:stress-split-next-distro-basic-vps_head
teuthology-2014-12-10_18:10:03-upgrade:dumpling-firefly-x:parallel-giant-distro-basic-multi_head
teuthology-2014-12-10_18:13:01-upgrade:firefly-x-giant-distro-basic-multi_head
teuthology-2014-12-10_18:15:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi_head
teuthology-2014-12-10_19:13:03-upgrade:dumpling-x-firefly-distro-basic-vps_head
teuthology-2014-12-10_23:00:03-rbd-master-testing-basic-multi_head
teuthology-2014-12-10_23:02:01-rgw-master-testing-basic-multi_head
teuthology-2014-12-10_23:04:05-fs-master-testing-basic-multi_head
teuthology-2014-12-10_23:06:01-krbd-master-testing-basic-multi_head
teuthology-2014-12-10_23:08:01-kcephfs-master-testing-basic-multi_head
teuthology-2014-12-10_23:10:02-knfs-master-testing-basic-multi_head
teuthology-2014-12-10_23:12:01-hadoop-master-testing-basic-multi_head
teuthology-2014-12-10_23:14:01-samba-master-testing-basic-multi_head
teuthology-2014-12-10_23:16:01-rest-master-testing-basic-multi_head
teuthology-2014-12-10_23:18:01-multimds-master-testing-basic-multi_head
teuthology-2014-12-10_23:20:02-multi-version-giant-distro-basic-multi_head
teuthology-2014-12-10_23:20:02-multi-version-master-distro-basic-multi_head
teuthology-2014-12-11_01:10:03-ceph-deploy-firefly-distro-basic-multi_head
teuthology-2014-12-11_02:35:03-smoke-master-distro-basic-multi_head
</pre>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-01T23:56:24Z:
<ul></ul><p>Whatever happened, happened in the last few days.</p>
<p>samuelj@mira049:~$ ( for i in {7..1}; do sudo zcat /var/log/ceph/ceph.log.$i.gz; done; sudo cat /var/log/ceph/ceph.log ) | grep ' 1.25 deep-scrub '<br />2017-01-28 06:19:18.117347 osd.72 172.21.4.140:6812/1162 1582 : cluster [INF] 1.25 deep-scrub starts<br />2017-01-28 06:25:03.841868 osd.72 172.21.4.140:6812/1162 1583 : cluster [INF] 1.25 deep-scrub ok<br />2017-02-01 10:17:43.639140 osd.72 172.21.4.140:6812/1162 2466 : cluster [INF] 1.25 deep-scrub starts<br />2017-02-01 10:22:56.239724 osd.72 172.21.4.140:6812/1162 2467 : cluster [ERR] 1.25 deep-scrub 1 errors</p> Ceph - Bug #18533: two instances of omap_digest mismatchhttps://tracker.ceph.com/issues/18533?journal_id=854992017-02-02T00:00:02ZSamuel Justsjust@redhat.com
<ul></ul>
<pre>
samuelj@mira049:~$ ( for i in {7..1}; do sudo zcat /var/log/ceph/ceph.log.$i.gz; done; sudo cat /var/log/ceph/ceph.log ) | grep ' osd\.72 \| osd\.7 \| osd\.60 \| osd\.47 ' | grep -v scrub | grep -v 'slow request'
2017-02-01 22:35:22.762750 mon.0 172.21.4.136:6789/0 1840408 : cluster [INF] osd.7 marked itself down
2017-02-01 22:36:18.409617 mon.0 172.21.4.136:6789/0 1840471 : cluster [INF] osd.7 172.21.5.114:6820/29826 boot
2017-02-01 22:37:28.508304 mon.0 172.21.4.136:6789/0 1840550 : cluster [INF] osd.7 marked itself down
2017-02-01 22:57:08.729956 mon.0 172.21.4.136:6789/0 1841715 : cluster [INF] osd.7 172.21.5.114:6820/30583 boot
</pre>
<p>Except for today (presumably for the C-O-T checks), none of those osds went down.</p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-02T00:01:10Z:
<ul></ul><p>I suggest grabbing a copy of the leveldb instances from the primary and a replica and examining the actual keys in the store; perhaps that will yield some kind of smoking gun.</p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-02T00:03:40Z:
<ul></ul>
<pre>
ubuntu@mira049:~$ ( for i in {7..1}; do sudo zcat /var/log/ceph/ceph.log.$i.gz; done; sudo cat /var/log/ceph/ceph.log ) | grep ' 1\.25 '
2017-01-26 08:33:59.919704 osd.72 172.21.4.140:6812/1162 1311 : cluster [INF] 1.25 scrub starts
2017-01-26 08:37:09.218916 osd.72 172.21.4.140:6812/1162 1312 : cluster [INF] 1.25 scrub ok
2017-01-28 06:19:18.117347 osd.72 172.21.4.140:6812/1162 1582 : cluster [INF] 1.25 deep-scrub starts
2017-01-28 06:25:03.841868 osd.72 172.21.4.140:6812/1162 1583 : cluster [INF] 1.25 deep-scrub ok
2017-01-29 10:54:29.025508 osd.72 172.21.4.140:6812/1162 1850 : cluster [INF] 1.25 scrub starts
2017-01-29 10:58:47.110057 osd.72 172.21.4.140:6812/1162 1851 : cluster [INF] 1.25 scrub ok
2017-01-31 00:13:49.329845 osd.72 172.21.4.140:6812/1162 2139 : cluster [INF] 1.25 scrub starts
2017-01-31 00:17:16.258921 osd.72 172.21.4.140:6812/1162 2140 : cluster [INF] 1.25 scrub ok
2017-02-01 10:17:43.639140 osd.72 172.21.4.140:6812/1162 2466 : cluster [INF] 1.25 deep-scrub starts
2017-02-01 10:22:56.239724 osd.72 172.21.4.140:6812/1162 2467 : cluster [ERR] 1.25 deep-scrub 1 errors
</pre>
<p>1.25 doesn't seem to have backfilled either.</p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-02T00:45:52Z:
<ul></ul><p>I have copied the omap dirs for osds 72 (mira019:~samuelj/omap-osd-72), 7 (mira049:~samuelj/omap-osd-7), and 60 (mira120:~samuelj/omap-osd-60).</p>

Updated by Josh Durgin on 2017-02-07T23:23:13Z:
<ul></ul><p>Here's the output from a deep-scrub on 2/7:</p>
<pre>
{
"epoch": 780872,
"inconsistents": [
{
"object": {
"name": "100011cf577.00000000",
"nspace": "",
"locator": "",
"snap": "head",
"version": 9408654
},
"errors": [
"omap_digest_mismatch"
],
"union_shard_errors": [],
"selected_object_info": "1:a7f0f16e:::100011cf577.00000000:head(780877'9408654 mds.0.96709:7255604 dirty|omap|data_digest s 0 uv 9408654 dd ffffffff alloc_hint [0 0 0])",
"shards": [
{
"osd": 7,
"errors": [],
"size": 0,
"omap_digest": "0xe1c0a6ac",
"data_digest": "0xffffffff"
},
{
"osd": 47,
"errors": [],
"size": 0,
"omap_digest": "0xf463e691",
"data_digest": "0xffffffff"
},
{
"osd": 60,
"errors": [],
"size": 0,
"omap_digest": "0xf463e691",
"data_digest": "0xffffffff"
},
{
"osd": 72,
"errors": [],
"size": 0,
"omap_digest": "0xe1c0a6ac",
"data_digest": "0xffffffff"
}
]
}
]
}
</pre>
<p>Inspecting the leveldbs further, we've found an invariant violation: the complete_region on the nodes with extra entries has overlapping ranges (the ranges are stored as [start, end) key/value pairs).</p>

Updated by David Zafman (dzafman@redhat.com) on 2017-02-08T19:53:54Z:
<ul></ul><p>Corrupt complete mapping found on pg 1.25 primary osd.72 for oid 100011cf577.00000000:</p>
<p><a class="external" href="http://pastebin.com/19W78B6U">http://pastebin.com/19W78B6U</a></p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-08T20:28:01Z:
<ul></ul><p>From IRC (join/quit noise trimmed):</p>
<pre>
&lt;davidzlap&gt; sjust: 100011cf577.00000000
&lt;davidzlap&gt; sjust: I meant http://pastebin.com/19W78B6U
&lt;sjust&gt; davidzlap: can you characterize the overlaps?
&lt;sjust&gt; how many are partial overlaps -- like [1, 5) with [4, 10) -- vs contains -- [1, 10) with [5, 7) -- ?
&lt;sjust&gt; do our problem keys fall exclusively into one of these complete regions?
&lt;sjust&gt; or different ones
&lt;sjust&gt; if so, what do they have in common?
&lt;sjust&gt; joshd davidzlap: good news, I just pushed a unit test which causes an incorrect result with a point query
&lt;sjust&gt; annoyingly, doesn't work for an iterator
&lt;sjust&gt; that is, the iterator returns the right value in this case
&lt;sjust&gt; full contains trip up a point query, but not an iterator; trying to find a case which trips up the iterator logic
&lt;davidzlap&gt; sjust: I don't think it matters if the complete mapping is corrupt
&lt;sjust&gt; davidzlap: just found one case which does let us turn a corrupt complete mapping into an incorrect point query result
&lt;sjust&gt; still trying to find one which would trip up an iterator scan
&lt;sjust&gt; davidz: and I just pushed a comment to my wip-18533 explaining how a partial overlap will trip up an iterator scan
&lt;sjust&gt; now we just need a way to engineer a partial overlap
&lt;sjust&gt; davidz: my mechanism requires that if there is a pair of complete regions like [a, e) and [c, f), e could be erroneously returned
&lt;sjust&gt; davidz: so do our phantom keys show up as the end of any complete regions?
&lt;sjust&gt; particularly as the end of a complete region with a subsequent overlapping one?
</pre>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-08T20:28:53Z:
<ul></ul><p>Debugging branch: <a class="external" href="https://github.com/athanatos/ceph/tree/wip-18533">https://github.com/athanatos/ceph/tree/wip-18533</a></p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-08T23:30:41Z:
<ul></ul><p>wip-18533 above now has a unit test which causes the iterator to return a deleted value.</p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-08T23:33:43Z:
<ul></ul><p>I'm pretty comfortable pinning the cluster trouble on that one, assuming the extra keys and the overlapping complete values we have match.</p>
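The failure mode is easiest to see in miniature. Below is an illustrative model only, not the actual DBObjectMap code: "complete" regions are half-open [start, end) ranges of parent-map keys that must be hidden, and the modeled scan resumes at a region's end without re-checking that key against other regions, which is only correct while the regions stay disjoint:

```python
def scan(parent_keys, complete):
    """Model iterator over a stale parent map. 'complete' is a list of
    [start, end) regions whose keys must be hidden. The modeled flaw:
    after skipping a region we emit the resume key without checking
    whether it falls inside a different, overlapping region."""
    keys = sorted(parent_keys)
    out, i = [], 0
    while i < len(keys):
        k = keys[i]
        region = next(((s, e) for s, e in complete if s <= k < e), None)
        if region is None:
            out.append(k)
            i += 1
        else:
            _, end = region
            while i < len(keys) and keys[i] < end:
                i += 1                     # skip the covered span
            if i < len(keys):
                out.append(keys[i])        # resume key emitted unchecked
                i += 1
    return out

# Disjoint, non-adjacent regions (the invariant holds): correct result.
print(scan(["b", "e", "g"], [("a", "e"), ("f", "h")]))  # -> ['e']

# Overlapping regions [a, e) and [c, f): "e" lies inside [c, f) and is
# a deleted key, yet it is erroneously returned.
print(scan(["b", "e", "g"], [("a", "e"), ("c", "f")]))  # -> ['e', 'g']
```

A point query that consults only a single nearby region has an analogous problem; the eventual fix in this ticket sidestepped the issue by no longer relying on the complete mapping in rm_keys().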
<p>David, can you confirm that for each of the extra keys, there is a pair of complete entries [a, &lt;key&gt;) and [c, f) where &lt;key&gt; is the erroneously present key and c &lt; &lt;key&gt; and f &gt; &lt;key&gt;?</p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-09T00:13:15Z:
<ul></ul><p>David: Can you add the list of keys which are present on that node but shouldn't be?</p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-09T00:28:54Z:
<ul></ul><p>If the entries David added a few days ago are the right ones, then the above bug doesn't explain what's happening in the cluster.</p>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-09T01:50:57Z:
<ul></ul><p>Never mind; the bug can produce a more general set of errors than I had realized. See the more recent updates to the TestIterateBug18533 unit test in my branch. Still need to come up with a modification for the fuzzer that would let it produce this kind of error.</p>

Updated by David Zafman (dzafman@redhat.com) on 2017-02-09T03:39:41Z:
<ul></ul><p>This is the output of one of Sam's now-failing tests, run with my complete-mapping checking code, which also dumps the complete mapping when the error is found:</p>
<pre>
Bad complete for #-1:a8ba2560:::foo2:head#
Complete mapping:
0000000013 -> 0000000098
0000000015 -> 0000000056
ceph_test_object_map: /home/dzafman/ceph/src/test/ObjectMap/test_object_map.cc:630: virtual void ObjectMapTest::TearDown(): Assertion `db->check(std::cerr) == 0' failed.
</pre>
<p>Answer to the question of whether any of the extra keys are the end of a complete mapping entry: NO. None of the keys below are in the complete mapping (<a class="external" href="http://pastebin.com/19W78B6U">http://pastebin.com/19W78B6U</a>):</p>
<pre>
teuthology-2014-12-10_17:15:01-upgrade:dumpling-firefly-x:stress-split-next-distro-basic-multi_head
teuthology-2014-12-10_17:15:01-upgrade:giant-giant-distro-basic-vps_head
teuthology-2014-12-10_17:18:01-upgrade:firefly-x-next-distro-basic-vps_head
teuthology-2014-12-10_17:25:02-upgrade:dumpling-firefly-x:stress-split-next-distro-basic-vps_head
teuthology-2014-12-10_18:10:03-upgrade:dumpling-firefly-x:parallel-giant-distro-basic-multi_head
teuthology-2014-12-10_18:13:01-upgrade:firefly-x-giant-distro-basic-multi_head
teuthology-2014-12-10_18:15:01-upgrade:dumpling-firefly-x:stress-split-giant-distro-basic-multi_head
teuthology-2014-12-10_19:13:03-upgrade:dumpling-x-firefly-distro-basic-vps_head
teuthology-2014-12-10_23:00:03-rbd-master-testing-basic-multi_head
teuthology-2014-12-10_23:02:01-rgw-master-testing-basic-multi_head
teuthology-2014-12-10_23:04:05-fs-master-testing-basic-multi_head
teuthology-2014-12-10_23:06:01-krbd-master-testing-basic-multi_head
teuthology-2014-12-10_23:08:01-kcephfs-master-testing-basic-multi_head
teuthology-2014-12-10_23:10:02-knfs-master-testing-basic-multi_head
teuthology-2014-12-10_23:12:01-hadoop-master-testing-basic-multi_head
teuthology-2014-12-10_23:14:01-samba-master-testing-basic-multi_head
teuthology-2014-12-10_23:16:01-rest-master-testing-basic-multi_head
teuthology-2014-12-10_23:18:01-multimds-master-testing-basic-multi_head
teuthology-2014-12-10_23:20:02-multi-version-giant-distro-basic-multi_head
teuthology-2014-12-10_23:20:02-multi-version-master-distro-basic-multi_head
teuthology-2014-12-11_01:10:03-ceph-deploy-firefly-distro-basic-multi_head
teuthology-2014-12-11_02:35:03-smoke-master-distro-basic-multi_head
</pre>

Updated by Samuel Just (sjust@redhat.com) on 2017-02-09T18:51:57Z:
<ul></ul><p>wip-18533 is now cleaned up and has two specific unit tests and a fuzzer which reproduce invalid iterator results.</p>

Updated by David Zafman (dzafman@redhat.com) on 2017-03-10T03:10:37Z:
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>17</i></li></ul>

Updated by David Zafman (dzafman@redhat.com) on 2017-03-22T18:22:00Z:
<ul></ul><p>rm_keys() has now been simplified by copying to clone() and no longer using the complete mapping:</p>
<p>Master pull request</p>
<p><a class="external" href="https://github.com/ceph/ceph/pull/13423">https://github.com/ceph/ceph/pull/13423</a></p>
<p>Kraken pull request, being used for testing on the Large Rados Cluster:</p>
<p><a class="external" href="https://github.com/ceph/ceph/pull/14024">https://github.com/ceph/ceph/pull/14024</a></p>

Updated by David Zafman (dzafman@redhat.com) on 2017-03-27T22:16:42Z:
<ul><li><strong>Status</strong> changed from <i>17</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>kraken</i></li></ul>

Updated by David Zafman (dzafman@redhat.com) on 2017-03-27T22:29:40Z:
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-5 priority-high3 closed" href="/issues/19391">Backport #19391</a>: kraken: two instances of omap_digest mismatch</i> added</li></ul>

Updated by David Zafman (dzafman@redhat.com) on 2017-03-28T22:26:20Z:
<ul><li><strong>Backport</strong> changed from <i>kraken</i> to <i>jewel, kraken</i></li></ul>

Updated by David Zafman (dzafman@redhat.com) on 2017-03-28T22:26:41Z:
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/19404">Backport #19404</a>: jewel: core: two instances of omap_digest mismatch</i> added</li></ul>

Updated by Nathan Cutler (ncutler@suse.cz) on 2017-05-03T08:33:04Z:
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>