Project

General

Profile

Bug #21660

Kraken client crash after upgrading cluster from Kraken to Luminous

Added by Sarah Brofeldt 2 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
-
Category:
Correctness/Safety
Target version:
-
Start date:
10/03/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
luminous
Needs Doc:
No
Component(RADOS):
OSD

Description

I'm having some trouble making the debug symbols work, (I installed ceph-common-dbg, librbd1-dbg and librados2-dbg to no avail) but I've tried to make rbd as verbose as possible.

Luminous clients are able to connect just fine, Kraken clients crash. The error is similar for commands like ceph df or ceph-fuse [...] so they appear to all be affected by the same issue.

It's entirely possible that I messed up something in the upgrade path, but the error message is rather indecipherable, so I can't tell what I might have done wrong. Debugging hints are appreciated as well.

stack - rbd --debug-ms=20 --debug-rados=20 --debug-rbd=20 ls (154 KB) Sarah Brofeldt, 10/03/2017 05:39 AM


Related issues

Related to ceph-qa-suite - Feature #21713: Add option to test for clients upgrade ... Resolved 10/06/2017
Copied to RADOS - Backport #21692: luminous: Kraken client crash after upgrading cluster from Kraken to Luminous Resolved

History

#1 Updated by Sarah Brofeldt 2 months ago

I managed to get some debug symbols working.

(gdb) run
Starting program: /usr/bin/rbd ls
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe3532700 (LWP 27503)]
[New Thread 0x7fffe25a0700 (LWP 27504)]
[New Thread 0x7fffe0be3700 (LWP 27505)]
[New Thread 0x7fffe03e2700 (LWP 27506)]
[New Thread 0x7fffdfbe1700 (LWP 27507)]
[New Thread 0x7fffdf3e0700 (LWP 27508)]
[New Thread 0x7fffdebdf700 (LWP 27509)]
[New Thread 0x7fffde3de700 (LWP 27510)]
[New Thread 0x7fffddbdd700 (LWP 27511)]
[New Thread 0x7fffdd3dc700 (LWP 27512)]
[New Thread 0x7fffdcbdb700 (LWP 27513)]
[New Thread 0x7fffcffff700 (LWP 27514)]
terminate called after throwing an instance of '[New Thread 0x7fffcf7fe700 (LWP 27515)]
ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffdebdf700 (LWP 27509)]
0x00007fffe3f9a067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fffe3f9a067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffe3f9b448 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fffe4887b3d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffe4885bb6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffe4885c01 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fffe4885e19 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffe6e40932 in ceph::buffer::ptr::iterator::get_pos_add (this=<optimized out>, this=<optimized out>, n=4)
    at /build/ceph-11.2.1/src/include/buffer.h:196
#7  0x00007fffe6e5a650 in get_pos_add (this=<optimized out>, this=<optimized out>, n=4)
    at /usr/include/c++/4.9/bits/vector.tcc:550
#8  decode (f=0, p=<synthetic pointer>, o=<optimized out>) at /build/ceph-11.2.1/src/include/denc.h:261
#9  denc<unsigned int, denc_traits<unsigned int> > (features=0, p=<synthetic pointer>, o=<optimized out>)
    at /build/ceph-11.2.1/src/include/denc.h:533
#10 decode (f=0, p=<synthetic pointer>, s=std::vector of length 196608, capacity 196608 = {...})
    at /build/ceph-11.2.1/src/include/denc.h:834
#11 decode<std::vector<unsigned int, std::allocator<unsigned int> >, denc_traits<std::vector<unsigned int, std::allocator<unsigned int> >, void> > (o=std::vector of length 196608, capacity 196608 = {...}, p=...)
    at /build/ceph-11.2.1/src/include/denc.h:1324
#12 0x00007fffe6e5477d in OSDMap::decode (this=this@entry=0x55555df5db50, bl=...)
    at /build/ceph-11.2.1/src/osd/OSDMap.cc:2162
#13 0x00007fffe6e5590e in OSDMap::decode (this=0x55555df5db50, bl=...) at /build/ceph-11.2.1/src/osd/OSDMap.cc:2009
#14 0x00007fffe6d331e1 in Objecter::handle_osd_map (this=this@entry=0x55555df5d540, m=m@entry=0x7fffd0002950)
    at /build/ceph-11.2.1/src/osdc/Objecter.cc:1242
#15 0x00007fffe6d33bf7 in Objecter::ms_dispatch (this=0x55555df5d540, m=0x7fffd0002950)
    at /build/ceph-11.2.1/src/osdc/Objecter.cc:1005
#16 0x00007fffe6f9f99a in ms_deliver_dispatch (m=0x7fffd0002950, this=0x55555ded0930)
    at /build/ceph-11.2.1/src/msg/Messenger.h:593
#17 DispatchQueue::entry (this=0x55555ded0a80) at /build/ceph-11.2.1/src/msg/DispatchQueue.cc:197
---Type <return> to continue, or q <return> to quit---
#18 0x00007fffe6e1bedd in DispatchQueue::DispatchThread::entry (this=<optimized out>)
    at /build/ceph-11.2.1/src/msg/DispatchQueue.h:102
#19 0x00007fffe61bc064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007fffe404d62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Does this look similar to this older, unresolved issue?

https://www.spinics.net/lists/ceph-devel/msg37489.html

rbd on the crashing client reports 11.2.1 e0354f9d3b1eea1d75a7dd487ba8098311be38a7 with --version

#2 Updated by Alfredo Deza 2 months ago

  • Project changed from Ceph to rbd
  • Severity changed from 3 - minor to 2 - major

#3 Updated by Jason Dillaman 2 months ago

  • Project changed from rbd to RADOS

Crash in the messenger layer of librados.

#4 Updated by Sage Weil 2 months ago

  • Status changed from New to Need More Info
  • Priority changed from Normal to High

Do you still have the core file? I would be very interested in seeing the epoch for the OSDMap that was being decoded.

"f 12" and then "p epoch" ought to show it?

And then grab that osdmap from the cluster "ceph osd getmap NNN -o NNN" and share it with us with "ceph-post-file NNN".

Thanks!

#5 Updated by Sarah Brofeldt 2 months ago

I wasn't clever enough to save the core file initially, so I've reproduced the issue on a reinstall of Kraken after upgrading to Luminous.

I'm not sure this is exactly usable, because the error seems to have changed slightly. Should I open a new issue?

(gdb) r
Starting program: /usr/bin/rbd ls
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe3532700 (LWP 29574)]
[New Thread 0x7fffe25a0700 (LWP 29575)]
[New Thread 0x7fffe0be3700 (LWP 29576)]
[New Thread 0x7fffe03e2700 (LWP 29577)]
[New Thread 0x7fffdfbe1700 (LWP 29578)]
[New Thread 0x7fffdf3e0700 (LWP 29579)]
[New Thread 0x7fffdebdf700 (LWP 29580)]
[New Thread 0x7fffde3de700 (LWP 29581)]
[New Thread 0x7fffddbdd700 (LWP 29582)]
[New Thread 0x7fffdd3dc700 (LWP 29583)]
[New Thread 0x7fffdcbdb700 (LWP 29584)]
[New Thread 0x7fffcffff700 (LWP 29585)]
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
  what():  buffer::malformed_input: entity_addr_t marker != 1

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffdebdf700 (LWP 29580)]
0x00007fffe3f9a067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fffe3f9a067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffe3f9b448 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fffe4887b3d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffe4885bb6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffe4885c01 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fffe4885e19 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffe6dc62d0 in entity_addr_t::decode (this=0x7fffc8010090, bl=...) at /build/ceph-11.2.1/src/msg/msg_types.h:450
#7  0x00007fffe6e5a8a2 in decode (p=..., c=...) at /build/ceph-11.2.1/src/msg/msg_types.h:466
#8  decode<entity_addr_t, std::allocator<std::shared_ptr<entity_addr_t> > > (v=std::vector of length 3, capacity 3 = {...},
    p=...) at /build/ceph-11.2.1/src/include/encoding.h:622
#9  0x00007fffe6e54789 in OSDMap::decode (this=this@entry=0x55555df5db60, bl=...)
    at /build/ceph-11.2.1/src/osd/OSDMap.cc:2163
#10 0x00007fffe6e5590e in OSDMap::decode (this=0x55555df5db60, bl=...) at /build/ceph-11.2.1/src/osd/OSDMap.cc:2009
#11 0x00007fffe6d331e1 in Objecter::handle_osd_map (this=this@entry=0x55555df5d550, m=m@entry=0x7fffd0002770)
    at /build/ceph-11.2.1/src/osdc/Objecter.cc:1242
#12 0x00007fffe6d33bf7 in Objecter::ms_dispatch (this=0x55555df5d550, m=0x7fffd0002770)
    at /build/ceph-11.2.1/src/osdc/Objecter.cc:1005
#13 0x00007fffe6f9f99a in ms_deliver_dispatch (m=0x7fffd0002770, this=0x55555ded0940)
    at /build/ceph-11.2.1/src/msg/Messenger.h:593
#14 DispatchQueue::entry (this=0x55555ded0a90) at /build/ceph-11.2.1/src/msg/DispatchQueue.cc:197
#15 0x00007fffe6e1bedd in DispatchQueue::DispatchThread::entry (this=<optimized out>)
    at /build/ceph-11.2.1/src/msg/DispatchQueue.h:102
#16 0x00007fffe61bc064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#17 0x00007fffe404d62d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 9
#9  0x00007fffe6e54789 in OSDMap::decode (this=this@entry=0x55555df5db60, bl=...)
    at /build/ceph-11.2.1/src/osd/OSDMap.cc:2163
2163    /build/ceph-11.2.1/src/osd/OSDMap.cc: No such file or directory.
(gdb) p epoch
$1 = 308

#6 Updated by Sage Weil 2 months ago

Hi Sarah,
Can you 'ceph osd getmap 308 -o 308' and 'ceph-post-file 308'?

#7 Updated by Sarah Brofeldt 2 months ago

fc655d9b-16cd-4342-bf4b-689a3c0d2891 generated on a Luminous client.

On the Kraken client, this results in:

terminate called after throwing an instance of 'ceph::buffer::malformed_input'
  what():  buffer::malformed_input: entity_addr_t marker != 1

#8 Updated by Sage Weil 2 months ago

  • Status changed from Need More Info to Need Review
  • Priority changed from High to Immediate
00000000  08 07 b9 18 00 00 06 01  d7 12 00 00 b0 32 98 d0  |.............2..|

the first inner section is struct_v 6 compat_v 1,b ut kraken only understands struct_v 3. MOSDMap in luminous isn't triggering a reencode for kraken clients that lack SERVER_LUMINOUS but have all of the other mentioned feature bits.

https://github.com/ceph/ceph/pull/18134

#9 Updated by Greg Farnum 2 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to luminous

#10 Updated by Greg Farnum 2 months ago

  • Category set to Correctness/Safety
  • Component(RADOS) OSD added

#12 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #21692: luminous: Kraken client crash after upgrading cluster from Kraken to Luminous added

#13 Updated by Sage Weil 2 months ago

  • Status changed from Pending Backport to Resolved

#14 Updated by Sage Weil 2 months ago

Sarah, the fix is in the current luminous branch now. Once it builds (~1 hrs), you can install the packages from https://shaman.ceph.com/builds/ceph/luminous and this issue should go away. Or, wait for 12.2.2.

#15 Updated by Sarah Brofeldt 2 months ago

Much appreciated!

#16 Updated by Nathan Cutler 2 months ago

@Yuri, @Sage - I guess the upgrade/kraken-x suite did not catch this because it does not do "/usr/bin/rbd ls" ?

#17 Updated by Yuri Weinstein 2 months ago

@sage is this just a matter to execute "/usr/bin/rbd ls" line at some point of a tests? I'd be happy to add this. Pls confirm.

NOTE: what about jewel-x and others ?

From IRC

(12:40:55 PM) sage: for 21660: just need some random client (rbd cli works) to run *after* the upgrae completes and 'ceph osd require-osd-relase luminous' is run.  the normal kraken-x upgrade doesn't test that.
(12:41:06 PM) sage: the client suite would/should, but it's not working or complete

#18 Updated by Yuri Weinstein 2 months ago

  • Related to Feature #21713: Add option to test for clients upgrade ... added

Also available in: Atom PDF