https://tracker.ceph.com/https://tracker.ceph.com/favicon.ico2017-05-24T10:26:03ZCeph RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=914812017-05-24T10:26:03ZJohn Sprayjcspray@gmail.com
<ul></ul><p>More instances from last night's master:<br />- <a class="external" href="http://pulpito.ceph.com/jspray-2017-05-23_22:31:39-fs-master-distro-basic-smithi/1222341">http://pulpito.ceph.com/jspray-2017-05-23_22:31:39-fs-master-distro-basic-smithi/1222341</a></p>
<ul>
<li>fs/snaps/{begin.yaml clusters/fixed-2-ucephfs.yaml mount/fuse.yaml objectstore/filestore-xfs.yaml overrides/{debug.yaml frag_enable.yaml whitelist_wrongly_marked_down.yaml} tasks/snaptests.yaml}</li>
</ul>
<p>- <a class="external" href="http://pulpito.ceph.com/jspray-2017-05-23_22:31:39-fs-master-distro-basic-smithi/1222394">http://pulpito.ceph.com/jspray-2017-05-23_22:31:39-fs-master-distro-basic-smithi/1222394</a></p>
<ul>
<li>fs/thrash/{begin.yaml ceph-thrash/default.yaml clusters/mds-1active-1standby.yaml mount/fuse.yaml msgr-failures/osd-mds-delay.yaml objectstore/filestore-xfs.yaml overrides/{debug.yaml frag_enable.yaml whitelist_wrongly_marked_down.yaml} tasks/cfuse_workunit_snaptests.yaml}</li>
</ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=916172017-05-24T23:13:08ZJosh Durgin
<ul><li><strong>Assignee</strong> set to <i>Kefu Chai</i></li></ul><p>Kefu, could you take a look at this one? Not sure if it's related to recent denc changes, or perhaps <a class="external" href="https://github.com/ceph/ceph/pull/15092">https://github.com/ceph/ceph/pull/15092</a></p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=930492017-06-17T03:52:45ZKefu Chaitchaikov@gmail.com
<ul></ul><p>John, sorry. i missed this. will take a look at it next monday.</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=930972017-06-17T13:50:12ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Project</strong> changed from <i>Ceph</i> to <i>RADOS</i></li><li><strong>Category</strong> deleted (<del><i>OSD</i></del>)</li><li><strong>Component(RADOS)</strong> <i>OSD</i> added</li></ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=931742017-06-19T09:41:52ZKefu Chaitchaikov@gmail.com
<ul></ul><p>rerunning at <a class="external" href="http://pulpito.ceph.com/kchai-2017-06-19_09:40:27-fs-master---basic-smithi/">http://pulpito.ceph.com/kchai-2017-06-19_09:40:27-fs-master---basic-smithi/</a></p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=931762017-06-19T10:48:29ZKefu Chaitchaikov@gmail.com
<ul></ul><p>all passed modulo a valgrind error in ceph-mds, see /a/kchai-2017-06-19_09:40:27-fs-master---basic-smithi/1300881/remote/smithi113/log/valgrind/mds.a.log.gz, tracked at <a class="issue tracker-1 status-3 priority-4 priority-default closed" title="Bug: mem leak in Journaler::_issue_read() in ceph-mds (Resolved)" href="https://tracker.ceph.com/issues/20338">#20338</a></p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=931792017-06-19T10:54:27ZJohn Sprayjcspray@gmail.com
<ul></ul><p>Unless there was a patch, I wouldn't be too sure this is fixed -- it was an intermittent failure.</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=931832017-06-19T11:55:57ZKefu Chaitchaikov@gmail.com
<ul></ul><pre>
2017-05-16T10:15:48.418 INFO:tasks.ceph.osd.1.smithi158.stderr: 2: (std::enable_if<denc_traits<osd_reqid_t, void>::supported&&(!denc_traits<osd_reqid_t, void>::featured), void>::type decode<osd_reqid_t, denc_traits<osd_reqid_t, void> >(osd_reqid_t&, ceph::buffer::list::iterator&)+0x198) [0x55c29ab7f0b8]
</pre>
<p>77fbfc29d15e3b9b05f7434bea8792077a18aa42 does not contain the denc changes i introduced in 77fbfc29d15e3b9b05f7434bea8792077a18aa42, so the crash is not caused by my change. but i don't think my change would fix it either, as osd_reqid_t does have <code>traits::need_contiguous = true</code>.</p>
<p>also, the payload can never be discontiguous. see <code>AsyncConnection::process()</code>, <code>case STATE_OPEN_MESSAGE_READ_FRONT:</code></p>
<pre>
<code class="cpp syntaxhl"><span class="CodeRay"> {
<span class="comment">// read front</span>
<span class="predefined-type">unsigned</span> front_len = current_header.front_len;
<span class="keyword">if</span> (front_len) {
<span class="keyword">if</span> (!front.length())
front.push_back(buffer::create(front_len));
r = read_until(front_len, front.c_str());
<span class="keyword">if</span> (r < <span class="integer">0</span>) {
ldout(async_msgr->cct, <span class="integer">1</span>) << __func__ << <span class="string"><span class="delimiter">"</span><span class="content"> read message front failed</span><span class="delimiter">"</span></span> << dendl;
<span class="keyword">goto</span> fail;
} <span class="keyword">else</span> <span class="keyword">if</span> (r > <span class="integer">0</span>) {
<span class="keyword">break</span>;
}
ldout(async_msgr->cct, <span class="integer">20</span>) << __func__ << <span class="string"><span class="delimiter">"</span><span class="content"> got front </span><span class="delimiter">"</span></span> << front.length() << dendl;
}
state = STATE_OPEN_MESSAGE_READ_MIDDLE;
}
</span></code><br /></pre>
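<p>To make the contiguity argument concrete, here is a minimal standalone model of the pattern in the excerpt: the front segment is allocated once at its full length and then filled in place across partial reads, so it never becomes a multi-segment buffer. <code>ChunkedSource</code> and <code>FrontReader</code> are hypothetical names for illustration only, not Ceph classes, and the return convention (0 when complete, positive when more input is needed) is only modelled on how the excerpt treats <code>r &gt; 0</code>.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a socket that delivers data in small pieces.
struct ChunkedSource {
  std::vector<char> data;
  std::size_t pos = 0;
  std::size_t max_chunk = 4;

  // Copies at most max_chunk bytes into dst and returns the number copied.
  std::size_t read_some(char* dst, std::size_t want) {
    std::size_t n = std::min({want, max_chunk, data.size() - pos});
    std::copy_n(data.begin() + pos, n, dst);
    pos += n;
    return n;
  }
};

// Hypothetical model of the "front" segment: one allocation, filled in place.
struct FrontReader {
  std::vector<char> front;
  std::size_t filled = 0;

  // Returns 0 once front_len bytes are present, or the number of bytes
  // still missing when the caller has to come back later.
  std::size_t read_until(ChunkedSource& src, std::size_t front_len) {
    if (front.empty())
      front.resize(front_len);  // single allocation of the full length
    assert(filled <= front_len);
    filled += src.read_some(front.data() + filled, front_len - filled);
    return front_len - filled;
  }
};
```

<p>Calling <code>read_until()</code> in a loop until it returns 0 leaves <code>front.data()</code> pointing at the same single allocation throughout, which is the property the comment above relies on.</p>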
<p>so i am not sure how this could happen.</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=938572017-06-28T00:51:31ZPatrick Donnellypdonnell@redhat.com
<ul></ul><p>Here's another one:</p>
<p>/a/pdonnell-2017-06-27_19:50:40-fs-wip-pdonnell-20170627---basic-smithi/1333648</p>
<p>fs/snaps/{begin.yaml clusters/fixed-2-ucephfs.yaml mount/fuse.yaml objectstore/bluestore.yaml overrides/{debug.yaml frag_enable.yaml whitelist_wrongly_marked_down.yaml} tasks/snaptests.yaml}</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=939582017-06-28T16:05:26ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Priority</strong> changed from <i>High</i> to <i>Urgent</i></li></ul><p>Kefu, any new updates or should this be unassigned from you?</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=940242017-06-29T10:07:30ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Fix Under Review</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/16008">https://github.com/ceph/ceph/pull/16008</a></p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=940332017-06-29T14:53:30ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Backport</strong> set to <i>jewl, kraken</i></li></ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=940462017-06-29T17:45:00ZSage Weilsage@newdream.net
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>7</i></li></ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=940762017-06-29T20:10:29ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Backport</strong> changed from <i>jewl, kraken</i> to <i>jewel, kraken</i></li></ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=941402017-06-30T21:24:55ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Status</strong> changed from <i>7</i> to <i>12</i></li></ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=947862017-07-10T03:01:41ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Assignee</strong> deleted (<del><i>Kefu Chai</i></del>)</li></ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=947872017-07-10T03:02:15ZKefu Chaitchaikov@gmail.com
<ul></ul><p>i will look at this issue again later on if no progress has been made before then.</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=950202017-07-12T17:57:42ZPatrick Donnellypdonnell@redhat.com
<ul></ul><p>Another one: /ceph/teuthology-archive/pdonnell-2017-07-07_20:24:01-fs-wip-pdonnell-20170706-distro-basic-smithi/1372305/teuthology.log</p>
<p>fs/snaps/{begin.yaml clusters/fixed-2-ucephfs.yaml mount/fuse.yaml objectstore/filestore-xfs.yaml overrides/{debug.yaml frag_enable.yaml whitelist_wrongly_marked_down.yaml} tasks/snaptests.yaml}</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=952122017-07-17T11:30:50ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/95212/diff?detail_id=92391">diff</a>)</li></ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=952132017-07-17T11:41:03ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Assignee</strong> set to <i>Kefu Chai</i></li></ul> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=952422017-07-18T05:44:04ZKefu Chaitchaikov@gmail.com
<ul></ul><p>i am able to reproduce this issue using qa/workunits/fs/snaps/untar_snap_rm.sh. but not always...</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=952462017-07-18T09:07:04ZKefu Chaitchaikov@gmail.com
<ul></ul><p>i found that the header.version of the MOSDRepOpReply message being decoded was 1. but i am using a vstart cluster for testing, so all OSDs are luminous. hence the header.version should be 2. i think that's why the reqid failed to decode.</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=953012017-07-18T20:14:48ZGreg Farnumgfarnum@redhat.com
<ul></ul><p>We set it to 1 if the MOSDRepOpReply is encoded with features that do not contain SERVER_LUMINOUS.</p>
<p>...which I think a connection from a client won't have? Not sure if that's related or not.</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=953752017-07-19T11:15:38ZKefu Chaitchaikov@gmail.com
<ul></ul><p>MOSDRepOpReply is always sent by an OSD.</p>
<p>core dump from osd.1<br /><pre>
(gdb) f 13
#13 0x000055ca58c13634 in decode_message (cct=0x55ca62c50d20, crcflags=3, header=..., footer=..., front=..., middle=..., data=..., conn=0x55ca63e1f800)
at /var/ceph/ceph/src/msg/Message.cc:839
(gdb) p header
$20 = (ceph_msg_header &) @0x55ca63e20b58: {seq = {v = 1}, tid = {v = 5074}, type = {v = 113}, priority = {v = 196}, version = {v = 1}, front_len = {v = 111}, middle_len = {
v = 0}, data_len = {v = 0}, data_off = {v = 0}, src = {type = 4 '\004', num = {v = 2}}, compat_version = {v = 1}, reserved = {v = 0}, crc = {v = 1040714380}}
(gdb) p p.p
$57 = {_raw = 0x55ca6563ea30, _off = 0, _len = 111}
(gdb) p *(ceph::buffer::raw*)0x55ca6563ea30
$3 = {_vptr.raw = 0x55ca59d37308 <vtable for ceph::buffer::raw_combined+16>, data = 0x55ca6563e9c0 "\025", len = 111, nref = {<std::__atomic_base<unsigned int>> = {
static _S_alignment = 4, _M_i = 2}, <No data fields>}, mempool = 10, crc_spinlock = {<std::__atomic_flag_base> = {_M_i = false}, <No data fields>},
crc_map = std::map with 1 element = {[{first = 0, second = 111}] = {first = 0, second = 2758781822}}}
(gdb) p (uint32_t[2])(*0x55ca6563e9c0)
$61 = {21, 16}
</pre></p>
<p>log from osd.2.log (the address of osd.1 is 127.0.0.1:6805):<br /><pre>
2017-07-19 14:55:07.735281 7fd565533700 1 -- 127.0.0.1:6814/1050987 --> 127.0.0.1:6805/50779 -- osd_repop_reply(client.4120.0:8880 1.0 e21/16 ack, result = 0) v2 -- 0x5607cc551c00 con 0
</pre></p>
<p>in this case, assuming the header is not corrupted, it was sent from osd.2 per the "src" field.</p>
<p>assuming "header.version>=2", if we decode the payload, the epoch and min_epoch are 21 and 16 respectively, which matches the previous messages sent from osd.2.</p>
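<p>The gdb cast <code>(uint32_t[2])</code> in the dump amounts to reading two little-endian 32-bit integers from the first eight bytes of the payload. A minimal sketch of that read follows; <code>read_le32</code> and <code>peek_epochs</code> are hypothetical helper names, and the assumption that a v2 payload begins with the map epoch followed by min_epoch is taken from the observation above rather than from the encoder source.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Reads one little-endian 32-bit value starting at p[0].
inline std::uint32_t read_le32(const unsigned char* p) {
  return std::uint32_t(p[0]) | std::uint32_t(p[1]) << 8 |
         std::uint32_t(p[2]) << 16 | std::uint32_t(p[3]) << 24;
}

// Hypothetical helper: returns {epoch, min_epoch} from the start of a
// payload, assuming the v2 layout begins with these two 32-bit fields.
inline std::pair<std::uint32_t, std::uint32_t>
peek_epochs(const unsigned char* payload, std::size_t len) {
  assert(len >= 8);  // need at least two 32-bit fields
  return {read_le32(payload), read_le32(payload + 4)};
}
```

<p>Fed the bytes <code>15 00 00 00 10 00 00 00</code>, which is what the <code>"\025"</code> at the start of the raw buffer and the <code>{21, 16}</code> cast suggest, this yields 21 and 16.</p>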
<p>if i dump the first 49 bytes of the header (everything except the trailing crc) and run them through ceph_crc32c(), the returned digest is 1040714380, which matches header.crc.</p>
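<p>For anyone who wants to repeat the header check outside gdb, here is a rough sketch. The field widths are assumptions chosen to match the fields gdb prints and the 49-byte figure above; <code>crc32c_update</code> and <code>header_crc</code> are hypothetical names, and both the zero seed and the missing final XOR are assumptions about how <code>ceph_crc32c()</code> is invoked for the header, so verify them against the Ceph source before trusting a mismatch.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Assumed packed field widths, in the order gdb prints them:
// seq, tid, type, priority, version, front_len, middle_len, data_len,
// data_off, src {type, num}, compat_version, reserved, crc.
constexpr std::size_t kHeaderSize =
    8 + 8 + 2 + 2 + 2 + 4 + 4 + 4 + 2 + (1 + 8) + 2 + 2 + 4;
constexpr std::size_t kCrcCoveredSize = kHeaderSize - 4;  // minus trailing crc

static_assert(kHeaderSize == 53, "packed header size under the assumed widths");
static_assert(kCrcCoveredSize == 49, "matches the 49 bytes mentioned above");

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
// The seed is used directly as the running register and no final XOR is
// applied; the caller decides on seeding and inversion.
inline std::uint32_t crc32c_update(std::uint32_t crc,
                                   const unsigned char* data,
                                   std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return crc;
}

// Hypothetical wrapper: digest over the header bytes that precede the crc
// field, seeded with 0 (an assumption, see the paragraph above).
inline std::uint32_t header_crc(const unsigned char* header) {
  return crc32c_update(0u, header, kCrcCoveredSize);
}
```

<p>The bitwise loop is slow but easy to audit; with the conventional 0xFFFFFFFF seed and final inversion it reproduces the standard CRC-32C check value for the string "123456789", which is a quick way to confirm the polynomial before applying it to the dumped header bytes.</p>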
<pre>
$ grep 'osd_repop_reply' osd.*.log | grep -w v1
</pre>
<p>also, the grep above returns nothing: no OSD log records ever sending or receiving a v1 osd_repop_reply.</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=953812017-07-19T11:55:21ZKefu Chaitchaikov@gmail.com
<ul></ul><p>occasionally, i see <br /><pre>
2017-07-19 09:31:31.611176 7fc8d016d700 10 osd.1 15 OSD::ms_get_authorizer type=osd
2017-07-19 09:31:31.611362 7fc8cf16b700 -1 failed to decode message of type 70 v3: buffer::malformed_input: void osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 1 < struct_compat
2017-07-19 09:31:31.611381 7fc8d016d700 10 osd.1 15 new session (outgoing) 0x5643b88e0000 con=0x5643ba960800 addr=127.0.0.1:6826/31872
2017-07-19 09:31:31.611455 7fc
</pre></p>
<p>in jenkins' make check output. type 70 is MSG_OSD_PING. not sure if this is related.</p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=954052017-07-19T13:28:40ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Fix Under Review</i></li><li><strong>Backport</strong> deleted (<del><i>jewel, kraken</i></del>)</li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/16421">https://github.com/ceph/ceph/pull/16421</a></p> RADOS - Bug #19939: OSD crash in MOSDRepOpReply::decode_payloadhttps://tracker.ceph.com/issues/19939?journal_id=955402017-07-20T14:54:15ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Resolved</i></li></ul>