Bug #16771

mon crash in MDSMonitor::prepare_beacon on ARM

Added by stephane beuret over 7 years ago. Updated over 7 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph 10.2.2
Ubuntu 16.10
in Docker version 1.11.1, build 5604cbe
on arch armhf (Raspberry Pi running Hypriot)

   -27> 2016-07-21 20:55:40.828826 73594d00  1 -- 172.31.100.2:6789/0 <== osd.2 192.168.100.152:6800/382 1 ==== auth(proto 0 26 bytes epoch 3) v1 ==== 56+0+0 (3518702297 0 0) 0x5bf35c20 con 0x5cc801c0
   -26> 2016-07-21 20:55:40.828972 73594d00  5 -- op tracker -- seq: 1126, time: 2016-07-21 20:55:40.828971, event: mon:_ms_dispatch, op: auth(proto 0 26 bytes epoch 3)
   -25> 2016-07-21 20:55:40.829030 73594d00  5 -- op tracker -- seq: 1126, time: 2016-07-21 20:55:40.829029, event: mon:dispatch_op, op: auth(proto 0 26 bytes epoch 3)
   -24> 2016-07-21 20:55:40.829067 73594d00  5 -- op tracker -- seq: 1126, time: 2016-07-21 20:55:40.829065, event: psvc:dispatch, op: auth(proto 0 26 bytes epoch 3)
   -23> 2016-07-21 20:55:40.829103 73594d00  5 mon.ceph2@1(leader).paxos(paxos updating c 1..316) is_readable = 1 - now=2016-07-21 20:55:40.829121 lease_expire=2016-07-21 20:55:45.748494 has v0 lc 316
   -22> 2016-07-21 20:55:40.829459 73594d00  5 -- op tracker -- seq: 1126, time: 2016-07-21 20:55:40.829457, event: send_reply, op: auth(proto 0 26 bytes epoch 3)
   -21> 2016-07-21 20:55:40.829497 73594d00  2 mon.ceph2@1(leader) e3 send_reply 0x5bdb5600 0x5bf34080 auth_reply(proto 2 0 (0) Success) v1
   -20> 2016-07-21 20:55:40.829549 73594d00  1 -- 172.31.100.2:6789/0 --> 192.168.100.152:6800/382 -- auth_reply(proto 2 0 (0) Success) v1 -- ?+0 0x5bf34080 con 0x5cc801c0
   -19> 2016-07-21 20:55:40.829630 73594d00  5 -- op tracker -- seq: 1126, time: 2016-07-21 20:55:40.829629, event: reply: send, op: auth(proto 0 26 bytes epoch 3)
   -18> 2016-07-21 20:55:40.829671 73594d00  5 -- op tracker -- seq: 1126, time: 2016-07-21 20:55:40.829669, event: done, op: auth(proto 0 26 bytes epoch 3)
   -17> 2016-07-21 20:55:40.829748 73594d00  1 -- 172.31.100.2:6789/0 <== osd.3 192.168.100.152:6802/383 3 ==== mon_subscribe({monmap=4+,osd_pg_creates=0+}) v2 ==== 50+0+0 (3172493646 0 0) 0x5bda3760 con 0x5cc7fe00
   -16> 2016-07-21 20:55:40.829826 73594d00  5 -- op tracker -- seq: 1127, time: 2016-07-21 20:55:40.829824, event: mon:_ms_dispatch, op: mon_subscribe({monmap=4+,osd_pg_creates=0+})
   -15> 2016-07-21 20:55:40.829859 73594d00  5 -- op tracker -- seq: 1127, time: 2016-07-21 20:55:40.829858, event: mon:dispatch_op, op: mon_subscribe({monmap=4+,osd_pg_creates=0+})
   -14> 2016-07-21 20:55:40.830032 73594d00  5 -- op tracker -- seq: 1127, time: 2016-07-21 20:55:40.830031, event: done, op: mon_subscribe({monmap=4+,osd_pg_creates=0+})
   -13> 2016-07-21 20:55:40.830109 73594d00  1 -- 172.31.100.2:6789/0 <== mds.? 192.168.100.151:6804/185 3 ==== mon_subscribe({mdsmap=3+,monmap=4+}) v2 ==== 42+0+0 (3086761029 0 0) 0x5bda5de0 con 0x5cc7ff40
   -12> 2016-07-21 20:55:40.830735 73594d00  5 -- op tracker -- seq: 1128, time: 2016-07-21 20:55:40.830734, event: mon:_ms_dispatch, op: mon_subscribe({mdsmap=3+,monmap=4+})
   -11> 2016-07-21 20:55:40.830809 73594d00  5 -- op tracker -- seq: 1128, time: 2016-07-21 20:55:40.830808, event: mon:dispatch_op, op: mon_subscribe({mdsmap=3+,monmap=4+})
   -10> 2016-07-21 20:55:40.831251 73594d00  5 -- op tracker -- seq: 1128, time: 2016-07-21 20:55:40.831250, event: done, op: mon_subscribe({mdsmap=3+,monmap=4+})
    -9> 2016-07-21 20:55:40.831339 73594d00  1 -- 172.31.100.2:6789/0 <== mds.? 192.168.100.151:6804/185 4 ==== mdsbeacon(14283/mds-ceph1 up:boot seq 5 v2) v7 ==== 768+0+0 (1025980531 0 0) 0x5bfc6080 con 0x5cc7ff40
    -8> 2016-07-21 20:55:40.831416 73594d00  5 -- op tracker -- seq: 1129, time: 2016-07-21 20:55:40.831415, event: mon:_ms_dispatch, op: mdsbeacon(14283/mds-ceph1 up:boot seq 5 v2)
    -7> 2016-07-21 20:55:40.831459 73594d00  5 -- op tracker -- seq: 1129, time: 2016-07-21 20:55:40.831457, event: mon:dispatch_op, op: mdsbeacon(14283/mds-ceph1 up:boot seq 5 v2)
    -6> 2016-07-21 20:55:40.831571 73594d00  5 -- op tracker -- seq: 1129, time: 2016-07-21 20:55:40.831570, event: psvc:dispatch, op: mdsbeacon(14283/mds-ceph1 up:boot seq 5 v2)
    -5> 2016-07-21 20:55:40.831805 73594d00  5 mon.ceph2@1(leader).paxos(paxos updating c 1..316) is_readable = 1 - now=2016-07-21 20:55:40.831829 lease_expire=2016-07-21 20:55:45.748494 has v0 lc 316
    -4> 2016-07-21 20:55:40.831871 73594d00  5 -- op tracker -- seq: 1129, time: 2016-07-21 20:55:40.831870, event: mdsmap:preprocess_query, op: mdsbeacon(14283/mds-ceph1 up:boot seq 5 v2)
    -3> 2016-07-21 20:55:40.831913 73594d00  5 -- op tracker -- seq: 1129, time: 2016-07-21 20:55:40.831912, event: mdsmap:preprocess_beacon, op: mdsbeacon(14283/mds-ceph1 up:boot seq 5 v2)
    -2> 2016-07-21 20:55:40.831974 73594d00  5 -- op tracker -- seq: 1129, time: 2016-07-21 20:55:40.831972, event: mdsmap:prepare_update, op: mdsbeacon(14283/mds-ceph1 up:boot seq 5 v2)
    -1> 2016-07-21 20:55:40.832012 73594d00  5 -- op tracker -- seq: 1129, time: 2016-07-21 20:55:40.832011, event: mdsmap:prepare_beacon, op: mdsbeacon(14283/mds-ceph1 up:boot seq 5 v2)
     0> 2016-07-21 20:55:40.849882 73594d00 -1 *** Caught signal (Segmentation fault) **
 in thread 73594d00 thread_name:ms_dispatch

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x3ebf7a) [0x54eb0f7a]
 2: (()+0x25250) [0x76a04250]
 3: (std::_Rb_tree_iterator<std::pair<mds_gid_t const, unsigned int> > std::_Rb_tree<mds_gid_t, std::pair<mds_gid_t const, unsigned int>, std::_Select1st<std::pair<mds_gid_t const, unsigned int> >, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, unsigned int> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<mds_gid_t const, unsigned int> >, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&)+0x2f) [0x54d9ade0]
 4: (FSMap::insert(MDSMap::mds_info_t const&)+0x11f) [0x54edf068]
 5: (MDSMonitor::prepare_beacon(std::shared_ptr<MonOpRequest>)+0xadd) [0x54d8f6f6]
 6: (MDSMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x133) [0x54d93370]
 7: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0x749) [0x54d3e392]
 8: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0x28f) [0x54d15470]
 9: (Monitor::_ms_dispatch(Message*)+0x391) [0x54d15eae]
 10: (Monitor::ms_dispatch(Message*)+0x19) [0x54d2d0b2]
 11: (DispatchQueue::entry()+0x9cd) [0x5500cb32]
 12: (DispatchQueue::DispatchThread::entry()+0x7) [0x54f60554]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file
--- end dump of recent events ---
reraise_fatal: default handler for signal 11 didn't terminate the process?

History

#1 Updated by John Spray over 7 years ago

Please provide more information.

What series of commands did you run that caused the crash?

Were there already MDS daemons or a filesystem configured before you started your MDS daemon?

Is this happening every time you start the monitor now?

#2 Updated by John Spray over 7 years ago

Also, has this ever worked before for you? I don't know that we've ever done any cephfs testing at all on ARM builds.

#3 Updated by stephane beuret over 7 years ago

The ceph mon crashes when I launch:
/usr/bin/ceph-mds --cluster ceph -d -i mds-ceph1 --setuser ceph --s
It's the first MDS daemon.
Yes, it happens every time. I'm trying to debug it.
And no, this has never worked for me, but since Ceph packages are available for yakkety, I am trying to build a Ceph cluster on a Raspberry Pi.

#4 Updated by stephane beuret over 7 years ago

Sorry, the full command is: /usr/bin/ceph-mds --cluster ceph -d -i mds-ceph1 --setuser ceph --setgroup ceph

#5 Updated by stephane beuret over 7 years ago

Output of ceph -s:
2016-07-22 19:08:42.844997 6c700470 0 -- :/1260326585 >> 192.168.100.151:6789/0 pipe(0x6c405b30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x6c400ce8).fault

#6 Updated by stephane beuret over 7 years ago

I don't know if it's useful, but when I launch the command with debug level 10 I get this log:

2016-07-22 23:18:43.438228 74052f70 5 mds.mds-ceph1 handle_mds_map epoch 2 from mon.0
2016-07-22 23:18:43.438313 74052f70 10 mds.mds-ceph1 my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}
2016-07-22 23:18:43.438642 74052f70 10 mds.mds-ceph1 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2016-07-22 23:18:43.438679 74052f70 10 mds.mds-ceph1 map says i am 192.168.100.151:6804/507 mds.-1.-1 state ???
2016-07-22 23:18:43.438705 74052f70 10 mds.mds-ceph1 handle_mds_map: handling map in rankless mode
2016-07-22 23:18:43.439142 74052f70 10 mds.mds-ceph1 not in map yet
2016-07-22 23:18:43.441456 76f3e000 10 mds.beacon.mds-ceph1 _send up:boot seq 1
2016-07-22 23:18:44.588711 74052f70 0 monclient: hunting for new mon

#7 Updated by John Spray over 7 years ago

  • Subject changed from running mds crash mon to mon crash in MDSMonitor::prepare_beacon on ARM
  • Component(FS) MDSMonitor added
  • Component(FS) deleted (MDS)

#8 Updated by John Spray over 7 years ago

Hmm, all we can tell from the backtrace is that a data structure got corrupted at some stage. The function where you're crashing is not doing anything controversial itself.

You could try running your monitor inside valgrind and/or gdb to see if it can pick up at an earlier stage when this is going wrong?

#9 Updated by stephane beuret over 7 years ago

I must admit that I am having trouble setting this up. I don't know gdb well enough, and since my ceph-mon runs in a container, as does ceph-mds, it is not simple to do. Some tips would be very helpful.

#10 Updated by stephane beuret over 7 years ago

root@ceph1:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
ceph 1 0 0 18:51 ? 00:00:01 /usr/bin/ceph-mon --cluster ceph -d -i ceph1 --public-addr 192.168.100.151:6789 --setuser ceph --setgro
root 182 0 0 18:51 ? 00:00:00 bash
root 307 182 0 18:56 ? 00:00:00 ps -ef
root@ceph1:/# gdb /usr/bin/ceph-mon -p 1
GNU gdb (Ubuntu 7.11.1-0ubuntu1) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/ceph-mon...(no debugging symbols found)...done.
Attaching to program: /usr/bin/ceph-mon, process 1
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
(gdb)
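
The gdb message above points at /proc/sys/kernel/yama/ptrace_scope. As a minimal sketch, the following Python snippet reads that setting and prints what each value means according to the Linux kernel's Yama documentation. The helper names are hypothetical, and whether this setting is the actual cause here is an assumption: inside a container, attaching can also fail because the container was started without the CAP_SYS_PTRACE capability (which Docker can grant with `--cap-add=SYS_PTRACE`).

```python
from pathlib import Path

# Meanings of the Yama ptrace_scope values, as described in the
# Linux kernel Yama documentation.
PTRACE_SCOPES = {
    0: "classic: a process may attach to other processes with the same uid",
    1: "restricted: only a parent (or a declared ptracer) may attach",
    2: "admin-only: attaching requires CAP_SYS_PTRACE",
    3: "disabled: no process may attach until the next reboot",
}


def describe_ptrace_scope(value: int) -> str:
    """Return a human-readable description of a ptrace_scope value."""
    return PTRACE_SCOPES.get(value, f"unknown value {value}")


def read_ptrace_scope(path: str = "/proc/sys/kernel/yama/ptrace_scope"):
    """Read the current setting, or return None if Yama is not available."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


if __name__ == "__main__":
    scope = read_ptrace_scope()
    if scope is None:
        print("Yama ptrace_scope not present on this system")
    else:
        print(f"ptrace_scope = {scope}: {describe_ptrace_scope(scope)}")
```

If the value is 0 and gdb still cannot attach from inside the container, the missing capability is the more likely explanation, and running the monitor outside the container is a reasonable workaround.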

#11 Updated by stephane beuret over 7 years ago

So I tried running ceph outside of Docker to run gdb on ceph-mon, but I don't know what I am supposed to look for.
$ gdb /usr/bin/ceph-mon -p 5366
GNU gdb (Ubuntu 7.11.1-0ubuntu1) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/ceph-mon...(no debugging symbols found)...done.
Attaching to program: /usr/bin/ceph-mon, process 5366
[New LWP 5368]
[New LWP 5369]
[New LWP 5370]
[New LWP 5372]
[New LWP 5373]
[New LWP 5374]
[New LWP 5375]
[New LWP 5376]
[New LWP 5377]
[New LWP 5378]
[New LWP 5379]
[New LWP 5380]
[New LWP 5381]
[New LWP 5382]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46 ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb)

#12 Updated by Loïc Minier over 7 years ago

Hi,

I can confirm this issue on armhf; I was initially having it with 10.1.2-0ubuntu1 from xenial on a scaleway C1 instance, but it persists after updating to 10.2.3-0ubuntu0.16.04.2; here's the tail of gdb backtrace:

Thread 8 "ms_dispatch" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb365dd00 (LWP 12748)]
0x7f824d40 in std::_Rb_tree_iterator<std::pair<mds_gid_t const, unsigned int> > std::_Rb_tree<mds_gid_t, std::pair<mds_gid_t const, unsigned int>, std::_Select1st<std::pair<mds_gid_t const, unsigned int> >, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, unsigned int> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<mds_gid_t const, unsigned int> >, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&) ()
(gdb) bt
#0 0x7f824d40 in std::_Rb_tree_iterator<std::pair<mds_gid_t const, unsigned int> > std::_Rb_tree<mds_gid_t, std::pair<mds_gid_t const, unsigned int>, std::_Select1st<std::pair<mds_gid_t const, unsigned int> >, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, unsigned int> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<mds_gid_t const, unsigned int> >, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&) ()
#1 0x7f969268 in FSMap::insert(MDSMap::mds_info_t const&) ()
#2 0x7f8195de in MDSMonitor::prepare_beacon(std::shared_ptr<MonOpRequest>) ()
#3 0x7f81d2d4 in MDSMonitor::prepare_update(std::shared_ptr<MonOpRequest>) ()
#4 0x7f7c795a in PaxosService::dispatch(std::shared_ptr<MonOpRequest>) ()
#5 0x7f79ef78 in Monitor::dispatch_op(std::shared_ptr<MonOpRequest>) ()
#6 0x7f79f9b6 in Monitor::_ms_dispatch(Message*) ()
#7 0x7f7b667a in Monitor::ms_dispatch(Message*) ()
#8 0x7fa96eaa in DispatchQueue::entry() ()
#9 0x7f9ea3f4 in DispatchQueue::DispatchThread::entry() ()
#10 0xb6ec15b4 in start_thread (arg=0x0) at pthread_create.c:335
#11 0xb6b1faac in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Cheers,
- Loïc Minier

#13 Updated by Loïc Minier over 7 years ago

w/ debugging symbols:

Thread 9 "ms_dispatch" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb2e5dd00 (LWP 13820)]
std::pair<mds_gid_t const, unsigned int>::pair<mds_gid_t const&, 0u>(std::tuple<mds_gid_t const&>&, std::tuple<>&, std::_Index_tuple<0u>, std::_Index_tuple<>) (__tuple2=<synthetic pointer>, __tuple1=..., this=<optimized out>) at /usr/include/c++/5/tuple:1172
1172    /usr/include/c++/5/tuple: No such file or directory.
(gdb) bt full
#0  std::pair<mds_gid_t const, unsigned int>::pair<mds_gid_t const&, 0u>(std::tuple<mds_gid_t const&>&, std::tuple<>&, std::_Index_tuple<0u>, std::_Index_tuple<>) (__tuple2=<synthetic pointer>, __tuple1=..., this=<optimized out>) at /usr/include/c++/5/tuple:1172
No locals.
#1  std::pair<mds_gid_t const, unsigned int>::pair<mds_gid_t const&>(std::piecewise_construct_t, std::tuple<mds_gid_t const&>, std::tuple<>) (__second=..., __first=..., this=<optimized out>) at /usr/include/c++/5/tuple:1161
No locals.
#2  __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<mds_gid_t const, unsigned int> > >::construct<std::pair<mds_gid_t const, unsigned int>, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::pair<mds_gid_t const, unsigned int>*, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&) (__p=<optimized out>, this=<optimized out>)
    at /usr/include/c++/5/ext/new_allocator.h:120
No locals.
#3  std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<mds_gid_t const, unsigned int> > > >::construct<std::pair<mds_gid_t const, unsigned int>, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::allocator<std::_Rb_tree_node<std::pair<mds_gid_t const, unsigned int> > >&, std::pair<mds_gid_t const, unsigned int>*, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&) (__p=<optimized out>, __a=...) at /usr/include/c++/5/bits/alloc_traits.h:530
No locals.
#4  std::_Rb_tree<mds_gid_t, std::pair<mds_gid_t const, unsigned int>, std::_Select1st<std::pair<mds_gid_t const, unsigned int> >, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, unsigned int> > >::_M_construct_node<std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::_Rb_tree_node<std::pair<mds_gid_t const, unsigned int> >*, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&) (__node=<optimized out>, this=<optimized out>)
    at /usr/include/c++/5/bits/stl_tree.h:529
No locals.
#5  std::_Rb_tree<mds_gid_t, std::pair<mds_gid_t const, unsigned int>, std::_Select1st<std::pair<mds_gid_t const, unsigned int> >, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, unsigned int> > >::_M_create_node<std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&) (
    this=0x854305dc) at /usr/include/c++/5/bits/stl_tree.h:546
        __tmp = <optimized out>
#6  std::_Rb_tree<mds_gid_t, std::pair<mds_gid_t const, unsigned int>, std::_Select1st<std::pair<mds_gid_t const, unsigned int> >, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, unsigned int> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<mds_gid_t const, unsigned int> >, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&) (this=this@entry=0x854305dc, __pos=...)
    at /usr/include/c++/5/bits/stl_tree.h:2170
        __z = <optimized out>
#7  0x7f969268 in std::map<mds_gid_t, unsigned int, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, unsigned int> > >::operator[] (__k=..., this=0x854305dc) at /usr/include/c++/5/bits/stl_map.h:483
        __i = <optimized out>
#8  FSMap::insert (this=this@entry=0x85430518, new_info=...) at mds/FSMap.cc:794
No locals.
#9  0x7f8195de in MDSMonitor::prepare_beacon (this=this@entry=0x85430340, op=std::shared_ptr (count 5, weak 0) 0x8541da40)
    at mon/MDSMonitor.cc:533
        new_info = {
          global_id = {<boost::totally_ordered1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::less_than_comparable1<mds_gid_t, boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > > >> = {<boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds
_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::less_than_comparable2<mds_gid_t, unsigned long long, boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, t = 4123}, name = "ceph-node-1", rank = -1, inc = 0, 
          state = MDSMap::STATE_STANDBY, state_seq = 1039, addr = {type = 0, nonce = 27783, {addr = {ss_family = 2, 
                __ss_padding = "\032\224\n\001\002k", '\000' <repeats 115 times>, __ss_align = 0}, addr4 = {sin_family = 2, 
                sin_port = 37914, sin_addr = {s_addr = 1795293450}, sin_zero = "\000\000\000\000\000\000\000"}, addr6 = {
                sin6_family = 2, sin6_port = 37914, sin6_flowinfo = 1795293450, sin6_addr = {__in6_u = {
                    __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, 
                sin6_scope_id = 0}}}, laggy_since = {tv = {tv_sec = 0, tv_nsec = 0}}, standby_for_rank = -1, standby_for_name = "", 
          standby_for_fscid = -1, standby_replay = false, export_targets = std::set with 0 elements, mds_features = 576460752032874495}
        info = <optimized out>
        __func__ = "prepare_beacon" 
        m = 0x85689980
        addr = {type = 0, nonce = 27783, {addr = {ss_family = 2, __ss_padding = "\032\224\n\001\002k", '\000' <repeats 115 times>, 
              __ss_align = 0}, addr4 = {sin_family = 2, sin_port = 37914, sin_addr = {s_addr = 1795293450}, 
              sin_zero = "\000\000\000\000\000\000\000"}, addr6 = {sin6_family = 2, sin6_port = 37914, sin6_flowinfo = 1795293450, 
              sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {
                    0, 0, 0, 0}}}, sin6_scope_id = 0}}}
        gid = {<boost::totally_ordered1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::less_than_comparable1<mds_gid_t, boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > > >> = {<boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::less_than_comparable2<mds_gid_t, unsigned long long, boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, t = 4123}
        __PRETTY_FUNCTION__ = "bool MDSMonitor::prepare_beacon(MonOpRequestRef)" 
#10 0x7f81d2d4 in MDSMonitor::prepare_update (this=0x85430340, op=std::shared_ptr (count 5, weak 0) 0x8541da40)
    at mon/MDSMonitor.cc:469
        __func__ = "prepare_update" 
        m = 0x85689980
        __PRETTY_FUNCTION__ = "virtual bool MDSMonitor::prepare_update(MonOpRequestRef)" 
#11 0x7f7c795a in PaxosService::dispatch (this=this@entry=0x85430340, op=std::shared_ptr (count 5, weak 0) 0x8541da40)
    at mon/PaxosService.cc:96
        __PRETTY_FUNCTION__ = "bool PaxosService::dispatch(MonOpRequestRef)" 
        m = <optimized out>
#12 0x7f79ef78 in Monitor::dispatch_op (this=this@entry=0x855b5b00, op=std::shared_ptr (count 5, weak 0) 0x8541da40)
    at mon/Monitor.cc:3605
        __PRETTY_FUNCTION__ = "void Monitor::dispatch_op(MonOpRequestRef)" 
        dealt_with = true
        __func__ = "dispatch_op" 
#13 0x7f79f9b6 in Monitor::_ms_dispatch (this=this@entry=0x855b5b00, m=m@entry=0x85689980) at mon/Monitor.cc:3532
        op = std::shared_ptr (count 5, weak 0) 0x8541da40
        s = <optimized out>
        __func__ = "_ms_dispatch" 
#14 0x7f7b667a in Monitor::ms_dispatch (this=0x855b5b00, m=0x85689980) at mon/Monitor.h:905
No locals.
#15 0x7fa96eaa in Messenger::ms_deliver_dispatch (m=0x85689980, this=0x855bcb00) at ./msg/Messenger.h:584
        p = 
#16 DispatchQueue::entry (this=0x855bcc80) at msg/simple/DispatchQueue.cc:185
        msize = 737
        m = 0x85689980
        qitem = {type = -1, con = {px = 0x0}, m = {px = 0x85689980}}
        __PRETTY_FUNCTION__ = "void DispatchQueue::entry()" 
#17 0x7f9ea3f4 in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/simple/DispatchQueue.h:103
No locals.
#18 0xb6ec15b4 in start_thread (arg=0x0) at pthread_create.c:335
        pd = 0x0
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {487614551, 421112522, -1293558528, -1090527456, 0, -1293560048, -1090527456, 
                -1224806400, 0 <repeats 56 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread" 
#19 0xb6b1faac in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
No locals.
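
The `addr` value in frame #9 is printed as raw integers (`sin_port = 37914`, `s_addr = 1795293450`), which are hard to read. A minimal sketch of decoding them, assuming gdb printed the fields as a little-endian host sees them (armhf is little-endian) while the underlying bytes are in network byte order; the function name is hypothetical:

```python
import ipaddress
import struct


def decode_sockaddr_in(sin_port: int, s_addr: int):
    """Decode sockaddr_in fields as printed by gdb on a little-endian host.

    Both fields are stored in network (big-endian) byte order, so the
    integers gdb shows must be turned back into their in-memory bytes
    before they can be interpreted.
    """
    # Recover the bytes as they sit in memory on the little-endian host.
    port_bytes = struct.pack("<H", sin_port)
    addr_bytes = struct.pack("<I", s_addr)
    # Interpret those bytes in network byte order.
    port = struct.unpack(">H", port_bytes)[0]
    ip = str(ipaddress.IPv4Address(addr_bytes))
    return ip, port


if __name__ == "__main__":
    ip, port = decode_sockaddr_in(37914, 1795293450)
    print(f"{ip}:{port}")
```

For the values in this backtrace the result is 10.1.2.107:6804, which agrees with the bytes shown in `__ss_padding` ("\032\224\n\001\002k") and with `nonce = 27783` being a separate field. The beacon's address itself therefore looks well formed.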

#14 Updated by Loïc Minier over 7 years ago

Valgrind:

==14440== Command: /usr/bin/ceph-mon -f --cluster ceph --id ceph-node-1 --setuser ceph --setgroup ceph
==14440== 
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/bin/ceph-mon:
--14440-- Ignoring non-Dwarf2/3/4 block in .debug_info
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/bin/ceph-mon:
--14440-- Ignoring non-Dwarf2/3/4 block in .debug_info
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/bin/ceph-mon:
--14440-- Last block truncated in .debug_info; ignoring
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/bin/ceph-mon:
--14440-- parse_CU_Header: is neither DWARF2 nor DWARF3 nor DWARF4
==14440== brk segment overflow in thread #1: can't grow to 0x5fb5000
==14440== Invalid write of size 4
==14440==    at 0x6429B52: ??? (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==14440==  Address 0xbd9673b8 is on thread 1's stack
==14440==  16 bytes below stack pointer
==14440== 
==14440== Conditional jump or move depends on uninitialised value(s)
==14440==    at 0x642B2FE: __udivmoddi4 (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==14440== 
==14440== Use of uninitialised value of size 4
==14440==    at 0x642B300: __udivmoddi4 (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==14440== 
==14440== Use of uninitialised value of size 4
==14440==    at 0x6429B66: ??? (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==14440== 
==14440== Invalid read of size 4
==14440==    at 0x6008C3A: ??? (in /usr/lib/libtcmalloc.so.4.2.6)
==14440==    by 0x6008DC3: GetStackTrace(void**, int, int) (in /usr/lib/libtcmalloc.so.4.2.6)
==14440==    by 0x5FFE5E1: tcmalloc::PageHeap::GrowHeap(unsigned int) (in /usr/lib/libtcmalloc.so.4.2.6)
==14440==  Address 0xbd9673a4 is on thread 1's stack
==14440==  4 bytes below stack pointer
==14440== 
disInstr(thumb): unhandled instruction: 0xDEFF 0xF8D4
==14440== Invalid write of size 4
==14440==    at 0x61DE74C: ??? (in /usr/lib/arm-linux-gnueabihf/libnspr4.so)
==14440==  Address 0xbd9660d8 is on thread 1's stack
==14440==  16 bytes below stack pointer
==14440== 
==14440== Conditional jump or move depends on uninitialised value(s)
==14440==    at 0x61DE85C: ??? (in /usr/lib/arm-linux-gnueabihf/libnspr4.so)
==14440== 
==14440== Use of uninitialised value of size 4
==14440==    at 0x61DE85E: ??? (in /usr/lib/arm-linux-gnueabihf/libnspr4.so)
==14440== 
==14440== Use of uninitialised value of size 4
==14440==    at 0x61DE758: ??? (in /usr/lib/arm-linux-gnueabihf/libnspr4.so)
==14440== 
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_jerasure.so:
--14440-- Ignoring non-Dwarf2/3/4 block in .debug_info
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_jerasure.so:
--14440-- Last block truncated in .debug_info; ignoring
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_jerasure.so:
--14440-- parse_CU_Header: is neither DWARF2 nor DWARF3 nor DWARF4
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_jerasure_generic.so:
--14440-- Ignoring non-Dwarf2/3/4 block in .debug_info
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_jerasure_generic.so:
--14440-- Last block truncated in .debug_info; ignoring
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_jerasure_generic.so:
--14440-- parse_CU_Header: is neither DWARF2 nor DWARF3 nor DWARF4
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_lrc.so:
--14440-- Ignoring non-Dwarf2/3/4 block in .debug_info
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_lrc.so:
--14440-- Last block truncated in .debug_info; ignoring
--14440-- WARNING: Serious error when reading debug info
--14440-- WARNING: Serious error when reading debug info
--14440-- When reading debug info from /usr/lib/arm-linux-gnueabihf/ceph/erasure-code/libec_lrc.so:
--14440-- parse_CU_Header: is neither DWARF2 nor DWARF3 nor DWARF4
==14440== Invalid write of size 4
==14440==    at 0x6429BE8: ??? (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==14440==  Address 0xbd965fa8 is on thread 1's stack
==14440==  16 bytes below stack pointer
==14440== 
==14440== Use of uninitialised value of size 4
==14440==    at 0x6429BF4: ??? (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==14440== 
starting mon.ceph-node-1 rank 0 at 10.1.2.107:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-node-1 fsid b024eaec-8c3b-4b79-be8f-444ab881ad2d
==14440== Thread 9 ms_dispatch:
==14440== Use of uninitialised value of size 4
==14440==    at 0x3D7D40: std::_Rb_tree_iterator<std::pair<mds_gid_t const, unsigned int> > std::_Rb_tree<mds_gid_t, std::pair<mds_gid_t const, unsigned int>, std::_Select1st<std::pair<mds_gid_t const, unsigned int> >, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, unsigned int> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<mds_gid_t const, unsigned int> >, std::piecewise_construct_t const&, std::tuple<mds_gid_t const&>&&, std::tuple<>&&) (in /usr/bin/ceph-mon)
==14440==    by 0x51C267: FSMap::insert(MDSMap::mds_info_t const&) (in /usr/bin/ceph-mon)
==14440==    by 0x3CC5DD: MDSMonitor::prepare_beacon(std::shared_ptr<MonOpRequest>) (in /usr/bin/ceph-mon)
==14440== 
==14440== Invalid read of size 4
==14440==    at 0x6313BEC: std::_Rb_tree_increment(std::_Rb_tree_node_base const*) (in /usr/lib/arm-linux-gnueabihf/libstdc++.so.6.0.21)
==14440==  Address 0x3a3a7069 is not stack'd, malloc'd or (recently) free'd
==14440== 
*** Caught signal (Segmentation fault) **
 in thread b55cd00 thread_name:ms_dispatch
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x3e6216) [0x4ee216]
 2: (()+0x25260) [0x6468260]
2016-12-02 19:03:24.192482 b55cd00 -1 *** Caught signal (Segmentation fault) **
 in thread b55cd00 thread_name:ms_dispatch

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x3e6216) [0x4ee216]
 2: (()+0x25260) [0x6468260]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2016-12-02 19:03:24.192482 b55cd00 -1 *** Caught signal (Segmentation fault) **
 in thread b55cd00 thread_name:ms_dispatch

 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x3e6216) [0x4ee216]
 2: (()+0x25260) [0x6468260]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

==14440== 
==14440== Process terminating with default action of signal 11 (SIGSEGV)
==14440==    at 0x60EA456: __libc_do_syscall (libc-do-syscall.S:47)
==14440==    by 0x60E95B5: raise (pt-raise.c:35)
==14440==    by 0x4EE30F: handle_fatal_signal(int) (in /usr/bin/ceph-mon)
==14440== 
==14440== HEAP SUMMARY:
==14440==     in use at exit: 0 bytes in 0 blocks
==14440==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==14440== 
==14440== All heap blocks were freed -- no leaks are possible
==14440== 
==14440== For counts of detected and suppressed errors, rerun with: -v
==14440== Use --track-origins=yes to see where uninitialised values come from
==14440== ERROR SUMMARY: 6019 errors from 13 contexts (suppressed: 12 from 10)
==14440== could not unlink /tmp/vgdb-pipe-from-vgdb-to-14440-by-root-on-???
==14440== could not unlink /tmp/vgdb-pipe-to-vgdb-from-14440-by-root-on-???
==14440== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-14440-by-root-on-???
Killed

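It seems to me that MDSMonitor::prepare_beacon is reading out of bounds while iterating over a container. The invalid read is in std::_Rb_tree_increment, so the container is a std::map or std::set rather than a plain array.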
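
A quick sanity check on the faulting address in the invalid read above (0x3a3a7069) is to see whether its bytes are printable ASCII, which usually suggests a pointer field was overwritten with string data. The sketch below is not from the report: bytes_as_ascii and check_reported_address are hypothetical helpers, and it assumes the little-endian byte order that armhf normally uses.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: decode a 32-bit value as four bytes in little-endian
// order, replacing non-printable bytes with '.'.
std::string bytes_as_ascii(std::uint32_t value) {
  std::string out;
  for (int i = 0; i < 4; ++i) {
    unsigned char c = static_cast<unsigned char>((value >> (8 * i)) & 0xff);
    out += (c >= 0x20 && c < 0x7f) ? static_cast<char>(c) : '.';
  }
  return out;
}

// Usage: the address valgrind reported for the invalid read.
void check_reported_address() {
  assert(bytes_as_ascii(0x3a3a7069u) == "ip::");
}
```

The four bytes decode to "ip::", which hints that a tree node pointer was overwritten with text. It does not say which code did the overwriting.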

#15 Updated by Loïc Minier over 7 years ago

I see MDSMonitor::prepare_beacon() proceed through the:

  if (state == MDSMap::STATE_BOOT) {

block fine, then segfault on the insert call in this block:
    // Add this daemon to the map
[...]
      pending_fsmap.insert(new_info);

These are pending_fsmap and new_info after the segfault:

(gdb) print pending_fsmap 
$7 = {epoch = 2, next_filesystem_id = 1, legacy_client_fscid = -1, compat = {
    compat = {mask = 1, names = std::map with 0 elements}, ro_compat = {
      mask = 1, names = std::map with 0 elements}, incompat = {mask = 383, 
      names = std::map with 7 elements = {[1] = "base v0.20", 
        [2] = "client writeable ranges", [3] = "default file layouts on dirs", 
        [4] = "dir inode in separate object", 
        [5] = "mds uses versioned encoding", 
        [6] = "dirfrag is stored in omap", [8] = "file layout v2"}}}, 
  enable_multiple = false, ever_enabled_multiple = false, 
  filesystems = std::map with 0 elements, 
  mds_roles = std::map with 1 elements = {
    [{<boost::totally_ordered1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::less_than_comparable1<mds_gid_t, boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > > >> = {<boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::less_than_comparable2<mds_gid_t, unsigned long long, boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, t = 4123}] = -1}, 
  standby_daemons = std::map with 1 elements = {
    [{<boost::totally_ordered1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::less_than_comparable1<mds_gid_t, boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > > >> = {<boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::less_than_comparable2<mds_gid_t, unsigned long long, boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, t = 4123}] = {
      global_id = {<boost::totally_ordered1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::less_than_comparable1<mds_gid_t, boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > > >> = {<boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::less_than_comparable2<mds_gid_t, unsigned long long, boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, t = 4123}, 
      name = "ceph-node-1", rank = -1, inc = 0, state = MDSMap::STATE_STANDBY, 
      state_seq = 1268, addr = {type = 0, nonce = 27783, {addr = {
            ss_family = 2, 
            __ss_padding = "\032\224\n\001\002k", '\000' <repeats 115 times>, 
            __ss_align = 0}, addr4 = {sin_family = 2, sin_port = 37914, 
            sin_addr = {s_addr = 1795293450}, 
            sin_zero = "\000\000\000\000\000\000\000"}, addr6 = {
            sin6_family = 2, sin6_port = 37914, sin6_flowinfo = 1795293450, 
            sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, 
                __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 
                  0, 0}}}, sin6_scope_id = 0}}}, laggy_since = {tv = {
          tv_sec = 0, tv_nsec = 0}}, standby_for_rank = -1, 
      standby_for_name = "", standby_for_fscid = -1, standby_replay = false, 
      export_targets = std::set with 0 elements, 
      mds_features = 576460752032874495}}, 
  standby_epochs = std::map with 0 elements}

(gdb) print new_info
$6 = {
  global_id = {<boost::totally_ordered1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::less_than_comparable1<mds_gid_t, boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > > >> = {<boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::totally_ordered2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::less_than_comparable2<mds_gid_t, unsigned long long, boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::equality_comparable2<mds_gid_t, unsigned long long, boost::detail::empty_base<mds_gid_t> >> = {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, t = 4123}, 
  name = "ceph-node-1", rank = -1, inc = 0, state = MDSMap::STATE_STANDBY, 
  state_seq = 1268, addr = {type = 0, nonce = 27783, {addr = {ss_family = 2, 
        __ss_padding = "\032\224\n\001\002k", '\000' <repeats 115 times>, 
        __ss_align = 0}, addr4 = {sin_family = 2, sin_port = 37914, 
        sin_addr = {s_addr = 1795293450}, 
        sin_zero = "\000\000\000\000\000\000\000"}, addr6 = {sin6_family = 2, 
        sin6_port = 37914, sin6_flowinfo = 1795293450, sin6_addr = {__in6_u = {
            __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 
              0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}}}, 
  laggy_since = {tv = {tv_sec = 0, tv_nsec = 0}}, standby_for_rank = -1, 
  standby_for_name = "", standby_for_fscid = -1, standby_replay = false, 
  export_targets = std::set with 0 elements, mds_features = 576460752032874495}

#16 Updated by Loïc Minier over 7 years ago

I'm a bit puzzled why the last insert fails:

src/mds/FSMap.cc:
void FSMap::insert(const MDSMap::mds_info_t &new_info)
{
  mds_roles[new_info.global_id] = FS_CLUSTER_ID_NONE;
  standby_daemons[new_info.global_id] = new_info;
  standby_epochs[new_info.global_id] = epoch;

src/mds/FSMap.h:
class FSMap {
protected:
  epoch_t epoch;
[...]
  std::map<mds_gid_t, MDSMap::mds_info_t> standby_daemons;
  std::map<mds_gid_t, epoch_t> standby_epochs;

especially since the map key type is the same in all three maps; would the ARM-specific behavior then come from the mapped type (epoch_t) when allocating new entries?

NB: src/include/types.h has typedef __u32 epoch_t

I'm afraid I'm hitting my BOOST/C++ limits here; help welcome :-)

#17 Updated by Loïc Minier over 7 years ago

BTW, as you might have guessed, the crash occurred for me when I added a metadata server to a fresh cluster. After that, ceph-mon won't launch anymore, presumably because it crashes when iterating over the MDSes on startup.

#18 Updated by John Spray over 7 years ago

Hmm, still nothing's jumping out at me.

It is noteworthy that mds_gid_t is a BOOST_STRONG_TYPEDEF (unlike other things there), whose implementation I don't think I've looked at.

If I were debugging this on an ARM box, I think my next step would be to put some extra code into prepare_beacon that tries constructing some std::maps with these types and inserting things into them, so that we can isolate exactly which type is not being defined properly on this architecture.
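
A minimal sketch of that kind of isolation test, assuming nothing about the real Ceph headers: gid_like is a hypothetical hand-written stand-in for mds_gid_t (a wrapper around a 64-bit integer, not the expansion of BOOST_STRONG_TYPEDEF), epoch_like mirrors the "typedef __u32 epoch_t" mentioned earlier, and map_roundtrip is an invented helper name. To exercise the actual types, the stand-ins would need to be replaced with mds_gid_t and epoch_t from the Ceph tree.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for mds_gid_t: wraps a 64-bit integer and provides
// the comparisons std::map and the check below need.
struct gid_like {
  std::uint64_t t;
  explicit gid_like(std::uint64_t v = 0) : t(v) {}
  bool operator<(const gid_like &rhs) const { return t < rhs.t; }
  bool operator==(const gid_like &rhs) const { return t == rhs.t; }
};

typedef std::uint32_t epoch_like;  // assumption: same width as epoch_t

// Insert n consecutive keys with operator[] (the call FSMap::insert uses),
// then walk the tree and confirm that size, ordering and values survived.
template <typename Key, typename Value>
bool map_roundtrip(std::uint64_t first_key, unsigned n, Value v) {
  std::map<Key, Value> m;
  for (unsigned i = 0; i < n; ++i)
    m[Key(first_key + i)] = v;
  if (m.size() != n)
    return false;
  std::uint64_t expect = first_key;
  for (typename std::map<Key, Value>::const_iterator it = m.begin();
       it != m.end(); ++it, ++expect) {
    if (!(it->first == Key(expect)) || it->second != v)
      return false;
  }
  return true;
}

// Usage: compare the wrapped key against a plain integer key.
void run_isolation_checks() {
  assert((map_roundtrip<gid_like, epoch_like>(4123, 8, 2u)));
  assert((map_roundtrip<std::uint64_t, epoch_like>(4123, 8, 2u)));
}
```

If the plain-integer variant passes on the ARM box while the variant using the real mds_gid_t fails, that would point at the strong typedef rather than at std::map or epoch_t.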
