<p><strong>RADOS - Bug #21660: Kraken client crash after upgrading cluster from Kraken to Luminous</strong><br /><a class="external" href="https://tracker.ceph.com/issues/21660">https://tracker.ceph.com/issues/21660</a></p>
<p><strong>Updated by Sarah Brofeldt on 2017-10-03 14:37 UTC</strong></p>
<p>I managed to get some debug symbols working.</p>
<pre><code class="text syntaxhl"><span class="CodeRay">(gdb) run
Starting program: /usr/bin/rbd ls
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe3532700 (LWP 27503)]
[New Thread 0x7fffe25a0700 (LWP 27504)]
[New Thread 0x7fffe0be3700 (LWP 27505)]
[New Thread 0x7fffe03e2700 (LWP 27506)]
[New Thread 0x7fffdfbe1700 (LWP 27507)]
[New Thread 0x7fffdf3e0700 (LWP 27508)]
[New Thread 0x7fffdebdf700 (LWP 27509)]
[New Thread 0x7fffde3de700 (LWP 27510)]
[New Thread 0x7fffddbdd700 (LWP 27511)]
[New Thread 0x7fffdd3dc700 (LWP 27512)]
[New Thread 0x7fffdcbdb700 (LWP 27513)]
[New Thread 0x7fffcffff700 (LWP 27514)]
terminate called after throwing an instance of '[New Thread 0x7fffcf7fe700 (LWP 27515)]
ceph::buffer::end_of_buffer'
what(): buffer::end_of_buffer
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffdebdf700 (LWP 27509)]
0x00007fffe3f9a067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007fffe3f9a067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fffe3f9b448 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fffe4887b3d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fffe4885bb6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fffe4885c01 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fffe4885e19 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fffe6e40932 in ceph::buffer::ptr::iterator::get_pos_add (this=<optimized out>, this=<optimized out>, n=4)
at /build/ceph-11.2.1/src/include/buffer.h:196
#7 0x00007fffe6e5a650 in get_pos_add (this=<optimized out>, this=<optimized out>, n=4)
at /usr/include/c++/4.9/bits/vector.tcc:550
#8 decode (f=0, p=<synthetic pointer>, o=<optimized out>) at /build/ceph-11.2.1/src/include/denc.h:261
#9 denc<unsigned int, denc_traits<unsigned int> > (features=0, p=<synthetic pointer>, o=<optimized out>)
at /build/ceph-11.2.1/src/include/denc.h:533
#10 decode (f=0, p=<synthetic pointer>, s=std::vector of length 196608, capacity 196608 = {...})
at /build/ceph-11.2.1/src/include/denc.h:834
#11 decode<std::vector<unsigned int, std::allocator<unsigned int> >, denc_traits<std::vector<unsigned int, std::allocator<unsigned int> >, void> > (o=std::vector of length 196608, capacity 196608 = {...}, p=...)
at /build/ceph-11.2.1/src/include/denc.h:1324
#12 0x00007fffe6e5477d in OSDMap::decode (this=this@entry=0x55555df5db50, bl=...)
at /build/ceph-11.2.1/src/osd/OSDMap.cc:2162
#13 0x00007fffe6e5590e in OSDMap::decode (this=0x55555df5db50, bl=...) at /build/ceph-11.2.1/src/osd/OSDMap.cc:2009
#14 0x00007fffe6d331e1 in Objecter::handle_osd_map (this=this@entry=0x55555df5d540, m=m@entry=0x7fffd0002950)
at /build/ceph-11.2.1/src/osdc/Objecter.cc:1242
#15 0x00007fffe6d33bf7 in Objecter::ms_dispatch (this=0x55555df5d540, m=0x7fffd0002950)
at /build/ceph-11.2.1/src/osdc/Objecter.cc:1005
#16 0x00007fffe6f9f99a in ms_deliver_dispatch (m=0x7fffd0002950, this=0x55555ded0930)
at /build/ceph-11.2.1/src/msg/Messenger.h:593
#17 DispatchQueue::entry (this=0x55555ded0a80) at /build/ceph-11.2.1/src/msg/DispatchQueue.cc:197
---Type <return> to continue, or q <return> to quit---
#18 0x00007fffe6e1bedd in DispatchQueue::DispatchThread::entry (this=<optimized out>)
at /build/ceph-11.2.1/src/msg/DispatchQueue.h:102
#19 0x00007fffe61bc064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007fffe404d62d in clone () from /lib/x86_64-linux-gnu/libc.so.6
</code></pre>
<p>Does this look similar to this older, unresolved issue?</p>
<p><a class="external" href="https://www.spinics.net/lists/ceph-devel/msg37489.html">https://www.spinics.net/lists/ceph-devel/msg37489.html</a></p>
<p>rbd on the crashing client reports 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7) with --version.</p>
<p><strong>Updated by Alfredo Deza (adeza@redhat.com) on 2017-10-04 11:59 UTC</strong></p>
<ul><li><strong>Project</strong> changed from <i>Ceph</i> to <i>rbd</i></li><li><strong>Severity</strong> changed from <i>3 - minor</i> to <i>2 - major</i></li></ul>
<p><strong>Updated by Jason Dillaman (dillaman@redhat.com) on 2017-10-04 13:10 UTC</strong></p>
<ul><li><strong>Project</strong> changed from <i>rbd</i> to <i>RADOS</i></li></ul>
<p>Crash in the messenger layer of librados.</p>
<p><strong>Updated by Sage Weil (sage@newdream.net) on 2017-10-04 22:19 UTC</strong></p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Need More Info</i></li><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li></ul><p>Do you still have the core file? I would be very interested in seeing the epoch for the OSDMap that was being decoded.</p>
<p>"f 12" and then "p epoch" ought to show it?</p>
<p>And then grab that osdmap from the cluster "ceph osd getmap NNN -o NNN" and share it with us with "ceph-post-file NNN".</p>
<p>Thanks!</p>
<p><strong>Updated by Sarah Brofeldt on 2017-10-05 14:50 UTC</strong></p>
<p>I wasn't clever enough to save the core file initially, so I've reproduced the issue on a reinstall of Kraken after upgrading to Luminous.</p>
<p>I'm not sure this is exactly usable, because the error seems to have changed slightly. Should I open a new issue?</p>
<pre><code class="text syntaxhl"><span class="CodeRay">(gdb) r
Starting program: /usr/bin/rbd ls
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe3532700 (LWP 29574)]
[New Thread 0x7fffe25a0700 (LWP 29575)]
[New Thread 0x7fffe0be3700 (LWP 29576)]
[New Thread 0x7fffe03e2700 (LWP 29577)]
[New Thread 0x7fffdfbe1700 (LWP 29578)]
[New Thread 0x7fffdf3e0700 (LWP 29579)]
[New Thread 0x7fffdebdf700 (LWP 29580)]
[New Thread 0x7fffde3de700 (LWP 29581)]
[New Thread 0x7fffddbdd700 (LWP 29582)]
[New Thread 0x7fffdd3dc700 (LWP 29583)]
[New Thread 0x7fffdcbdb700 (LWP 29584)]
[New Thread 0x7fffcffff700 (LWP 29585)]
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
what(): buffer::malformed_input: entity_addr_t marker != 1
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffdebdf700 (LWP 29580)]
0x00007fffe3f9a067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007fffe3f9a067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fffe3f9b448 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fffe4887b3d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fffe4885bb6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fffe4885c01 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fffe4885e19 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fffe6dc62d0 in entity_addr_t::decode (this=0x7fffc8010090, bl=...) at /build/ceph-11.2.1/src/msg/msg_types.h:450
#7 0x00007fffe6e5a8a2 in decode (p=..., c=...) at /build/ceph-11.2.1/src/msg/msg_types.h:466
#8 decode<entity_addr_t, std::allocator<std::shared_ptr<entity_addr_t> > > (v=std::vector of length 3, capacity 3 = {...},
p=...) at /build/ceph-11.2.1/src/include/encoding.h:622
#9 0x00007fffe6e54789 in OSDMap::decode (this=this@entry=0x55555df5db60, bl=...)
at /build/ceph-11.2.1/src/osd/OSDMap.cc:2163
#10 0x00007fffe6e5590e in OSDMap::decode (this=0x55555df5db60, bl=...) at /build/ceph-11.2.1/src/osd/OSDMap.cc:2009
#11 0x00007fffe6d331e1 in Objecter::handle_osd_map (this=this@entry=0x55555df5d550, m=m@entry=0x7fffd0002770)
at /build/ceph-11.2.1/src/osdc/Objecter.cc:1242
#12 0x00007fffe6d33bf7 in Objecter::ms_dispatch (this=0x55555df5d550, m=0x7fffd0002770)
at /build/ceph-11.2.1/src/osdc/Objecter.cc:1005
#13 0x00007fffe6f9f99a in ms_deliver_dispatch (m=0x7fffd0002770, this=0x55555ded0940)
at /build/ceph-11.2.1/src/msg/Messenger.h:593
#14 DispatchQueue::entry (this=0x55555ded0a90) at /build/ceph-11.2.1/src/msg/DispatchQueue.cc:197
#15 0x00007fffe6e1bedd in DispatchQueue::DispatchThread::entry (this=<optimized out>)
at /build/ceph-11.2.1/src/msg/DispatchQueue.h:102
#16 0x00007fffe61bc064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#17 0x00007fffe404d62d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 9
#9 0x00007fffe6e54789 in OSDMap::decode (this=this@entry=0x55555df5db60, bl=...)
at /build/ceph-11.2.1/src/osd/OSDMap.cc:2163
2163 /build/ceph-11.2.1/src/osd/OSDMap.cc: No such file or directory.
(gdb) p epoch
$1 = 308
</code></pre>
<p><strong>Updated by Sage Weil (sage@newdream.net) on 2017-10-05 16:08 UTC</strong></p>
<p>Hi Sarah,<br />Can you run 'ceph osd getmap 308 -o 308' and then 'ceph-post-file 308'?</p>
<p><strong>Updated by Sarah Brofeldt on 2017-10-05 16:47 UTC</strong></p>
<p>ceph-post-file id fc655d9b-16cd-4342-bf4b-689a3c0d2891, generated on a Luminous client.</p>
<p>On the Kraken client, this results in:</p>
<pre><code class="text syntaxhl"><span class="CodeRay">terminate called after throwing an instance of 'ceph::buffer::malformed_input'
what(): buffer::malformed_input: entity_addr_t marker != 1
</code></pre>
<p><strong>Updated by Sage Weil (sage@newdream.net) on 2017-10-05 20:27 UTC</strong></p>
<ul><li><strong>Status</strong> changed from <i>Need More Info</i> to <i>Fix Under Review</i></li><li><strong>Priority</strong> changed from <i>High</i> to <i>Immediate</i></li></ul><pre>
00000000 08 07 b9 18 00 00 06 01 d7 12 00 00 b0 32 98 d0 |.............2..|
</pre>
<p>The first inner section is struct_v 6, compat_v 1, but kraken only understands struct_v 3. MOSDMap in luminous isn't triggering a re-encode for kraken clients that lack SERVER_LUMINOUS but have all of the other mentioned feature bits.</p>
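<p>To make that concrete, here is a standalone C++ sketch of the idea behind the fix: the sender should choose the encoding version of the client-usable OSDMap section from the peer's advertised feature bits and fall back to the legacy struct_v 3 layout when the peer lacks SERVER_LUMINOUS. The feature constant, bit value, and helper name below are made up for illustration; this is not the actual MOSDMap/OSDMap code.</p>
<pre><code class="cpp">// Standalone illustration (not Ceph source): choosing the OSDMap wire format
// from the peer's feature bits. Names and bit values here are hypothetical.
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for the CEPH_FEATURE_SERVER_LUMINOUS feature bit.
constexpr uint64_t FEATURE_SERVER_LUMINOUS = 1ull << 21;

// Encode only the versioned header of the "client usable" section.
// A kraken decoder understands struct_v <= 3, so a peer without the
// luminous feature bit must be given the legacy layout instead.
std::vector<uint8_t> encode_osdmap_client_section(uint64_t peer_features) {
  std::vector<uint8_t> bl;
  const uint8_t struct_v =
      (peer_features & FEATURE_SERVER_LUMINOUS) ? 6 : 3;  // new vs. legacy layout
  const uint8_t struct_compat = 1;
  bl.push_back(struct_v);
  bl.push_back(struct_compat);
  // ... the remaining fields would follow in whichever layout struct_v selects ...
  return bl;
}

int main() {
  // Without the feature-aware fallback, a kraken client receives struct_v 6
  // data, walks it with its struct_v 3 expectations, and throws end_of_buffer
  // or "entity_addr_t marker != 1", exactly as in the backtraces above.
  auto for_luminous = encode_osdmap_client_section(FEATURE_SERVER_LUMINOUS);
  auto for_kraken   = encode_osdmap_client_section(0);
  std::cout << "luminous peer gets struct_v " << int(for_luminous[0]) << "\n"
            << "kraken peer gets struct_v " << int(for_kraken[0]) << "\n";
  return 0;
}
</code></pre>
<p>The actual fix (the pull request below) addresses the missing MOSDMap re-encode trigger described above; the sketch only illustrates why the encoding choice has to key off the peer's feature bits rather than off what the cluster itself supports.</p>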
<p><a class="external" href="https://github.com/ceph/ceph/pull/18134">https://github.com/ceph/ceph/pull/18134</a></p> RADOS - Bug #21660: Kraken client crash after upgrading cluster from Kraken to Luminoushttps://tracker.ceph.com/issues/21660?journal_id=1003352017-10-05T22:30:21ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>luminous</i></li></ul>
<p><strong>Updated by Greg Farnum (gfarnum@redhat.com) on 2017-10-05 22:31 UTC</strong></p>
<ul><li><strong>Category</strong> set to <i>Correctness/Safety</i></li><li><strong>Component(RADOS)</strong> <i>OSD</i> added</li></ul>
<p><strong>Updated by Sage Weil (sage@newdream.net) on 2017-10-05 22:33 UTC</strong></p>
<ul></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/18140">https://github.com/ceph/ceph/pull/18140</a> backport</p> RADOS - Bug #21660: Kraken client crash after upgrading cluster from Kraken to Luminoushttps://tracker.ceph.com/issues/21660?journal_id=1003532017-10-06T03:18:18ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a href="https://tracker.ceph.com/issues/21692">Backport #21692</a>: luminous: Kraken client crash after upgrading cluster from Kraken to Luminous</i> added</li></ul>
<p><strong>Updated by Sage Weil (sage@newdream.net) on 2017-10-06 12:39 UTC</strong></p>
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>
<p><strong>Updated by Sage Weil (sage@newdream.net) on 2017-10-06 12:39 UTC</strong></p>
<p>Sarah, the fix is in the current luminous branch now. Once it builds (~1 hour), you can install the packages from <a class="external" href="https://shaman.ceph.com/builds/ceph/luminous">https://shaman.ceph.com/builds/ceph/luminous</a> and this issue should go away. Or, wait for 12.2.2.</p>
<p><strong>Updated by Sarah Brofeldt on 2017-10-06 13:17 UTC</strong></p>
<p>Much appreciated!</p>
<p><strong>Updated by Nathan Cutler (ncutler@suse.cz) on 2017-10-06 17:15 UTC</strong></p>
<p>@Yuri, @Sage - I guess the upgrade/kraken-x suite did not catch this because it does not do "/usr/bin/rbd ls"?</p>
<p><strong>Updated by Yuri Weinstein (yweinste@redhat.com) on 2017-10-06 19:36 UTC</strong></p>
<p>@sage, is this just a matter of executing the "/usr/bin/rbd ls" line at some point in a test? I'd be happy to add this. Please confirm.</p>
<p>NOTE: what about jewel-x and others?</p>
<p>From IRC<br /><pre>
(12:40:55 PM) sage: for 21660: just need some random client (rbd cli works) to run *after* the upgrade completes and 'ceph osd require-osd-release luminous' is run. the normal kraken-x upgrade doesn't test that.
(12:41:06 PM) sage: the client suite would/should, but it's not working or complete
</pre></p>