Bug #214
closed
don't fail on assertion when mkcephfs is mis-used
Description
3 boxes, each with 1 mon, 1 mds, and 1 osd.
I wanted a clean base for further testing, so on each box I ran:
mkcephfs -c /etc/ceph/ceph.conf --mkbtrfs --clobber_old_data -k /etc/ceph/keyring.bin
It apparently worked fine.
After starting the cluster, every daemon runs, but there seems to be a problem with the osds joining in; ceph -w reports:
pg v2: 792 pgs: 792 creating; 0 KB data, 0 KB used, 0 KB / 0 KB avail
mds e5: 1/1/1 up {0=up:creating}, 2 up:standby
osd e1: 0 osds: 0 up, 0 in
mon e1: 3 mons at 172.16.20.9:6789/0 172.16.20.10:6789/0 172.167.20.11:6789/0
I then tried to restart osd0 by issuing:
/etc/init.d/ceph restart osd
That osd crashes after being restarted, and the two non-leader mons crash as well.
Stack trace for the osd:
#0 0x00007f9dd1983a75 in raise () from /lib/libc.so.6
#1 0x00007f9dd19875c0 in abort () from /lib/libc.so.6
#2 0x00007f9dd22388e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007f9dd2236d16 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007f9dd2236d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007f9dd2236e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00000000005b39f8 in ceph::__ceph_assert_fail (assertion=0x5e3440 "ceph_fsid_compare(&inc.fsid, &fsid) == 0", file=<value optimized out>, line=482, func=<value optimized out>) at common/assert.cc:30
#7 0x0000000000514228 in OSDMap::apply_incremental(OSDMap::Incremental&) ()
#8 0x00000000004dc3a1 in OSD::handle_osd_map (this=0x18cf6b0, m=<value optimized out>) at osd/OSD.cc:2175
#9 0x00000000004e7c20 in OSD::_dispatch (this=0x18cf6b0, m=0x7f9dc000ac70) at osd/OSD.cc:1837
#10 0x00000000004e8619 in OSD::ms_dispatch (this=0x18cf6b0, m=0x7f9dc000ac70) at osd/OSD.cc:1728
#11 0x0000000000460769 in Messenger::ms_deliver_dispatch (this=<value optimized out>) at msg/Messenger.h:97
#12 SimpleMessenger::dispatch_entry (this=<value optimized out>) at msg/SimpleMessenger.cc:332
#13 0x00000000004567cc in SimpleMessenger::DispatchThread::entry (this=0x18ccb30) at msg/SimpleMessenger.h:497
#14 0x0000000000469a4a in Thread::_entry_func (arg=0x7ab9) at ./common/Thread.h:39
#15 0x00007f9dd28169ca in start_thread () from /lib/libpthread.so.0
#16 0x00007f9dd1a366cd in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()
And for the monitors:
#0 0x00007fc350d7da75 in raise () from /lib/libc.so.6
#1 0x00007fc350d815c0 in abort () from /lib/libc.so.6
#2 0x00007fc3516328e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007fc351630d16 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007fc351630d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007fc351630e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x0000000000535ae8 in ceph::__ceph_assert_fail (assertion=0x56b668 "ceph_fsid_compare(&inc.fsid, &fsid) == 0", file=<value optimized out>, line=482, func=<value optimized out>) at common/assert.cc:30
#7 0x000000000049d5ef in OSDMap::apply_incremental(OSDMap::Incremental&) ()
#8 0x000000000048bbdc in OSDMonitor::update_from_paxos (this=0x1b79740) at mon/OSDMonitor.cc:97
#9 0x000000000046b261 in Monitor::_ms_dispatch (this=<value optimized out>, m=0x6038010) at mon/Monitor.cc:717
#10 0x0000000000476a0b in Monitor::ms_dispatch(Message*) ()
#11 0x0000000000450d39 in Messenger::ms_deliver_dispatch (this=<value optimized out>) at msg/Messenger.h:97
#12 SimpleMessenger::dispatch_entry (this=<value optimized out>) at msg/SimpleMessenger.cc:332
#13 0x000000000044793c in SimpleMessenger::DispatchThread::entry (this=0x1b70bf0) at msg/SimpleMessenger.h:497
#14 0x000000000045a02a in Thread::_entry_func (arg=0x4e9a) at ./common/Thread.h:39
#15 0x00007fc351c109ca in start_thread () from /lib/libpthread.so.0
#16 0x00007fc350e306cd in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()
So, as discussed with Sage on IRC, I mis-used mkcephfs:
<sage> ah. mkcephfs doesn't currently support running independently on different hosts.. it has to be run from one host with -a (--all-hosts)
<sage> otherwise the fsid/shared data won't match. we still need to make a mode that will allow it to be run in parallel
Maybe an error message instead of an assert would be a good thing?
Updated by Sage Weil almost 14 years ago
- Category set to OSD
- Assignee set to Greg Farnum
handle_osd_map should log an error and return if the fsid doesn't match
Updated by Greg Farnum almost 14 years ago
- Status changed from New to Resolved
The OSD will now log a warning and shut down on a bad fsid. (Map updates can only come from trusted sources, so a mismatched fsid is something that needs attention.)
Fixed in 9bbeec4745fa6f04835587654492fc371fcfdbeb.