Bug #1195: ceph mds crash on version upgrade
Status: Closed
Description
This may just not be something that's handled in Ceph yet, but while trying to upgrade from stable v0.28 to v0.29, I didn't recreate the file system, and when I restarted all the ceph processes, all the mdses crashed. Below are the logs for each of the mdses. This also seems to be reproducible: if I restart, the mdses eventually hit the same assertion. Let me know if there's more info I can provide.
2011-06-16 18:43:20.262730 7ffe509c9700 mds0.6 handle_mds_map i am now mds0.6
2011-06-16 18:43:20.441554 7ffe509c9700 mds0.6 handle_mds_map state change up:standby --> up:replay
2011-06-16 18:43:20.441608 7ffe509c9700 mds0.6 replay_start
2011-06-16 18:43:20.441640 7ffe509c9700 mds0.6 recovery set is
2011-06-16 18:43:20.441663 7ffe509c9700 mds0.6 need osdmap epoch 219, have 194
2011-06-16 18:43:20.441681 7ffe509c9700 mds0.6 waiting for osdmap 219 (which blacklists prior instance)
2011-06-16 18:43:20.441763 7ffe509c9700 mds0.cache handle_mds_failure mds0 : recovery peers are
2011-06-16 18:43:22.725516 7ffe509c9700 mds0.6 ms_handle_connect on 192.168.60.104:6810/16472
2011-06-16 18:43:22.725900 7ffe509c9700 mds0.6 ms_handle_connect on 192.168.60.134:6804/15406
2011-06-16 18:43:22.725984 7ffe509c9700 mds0.6 ms_handle_connect on 192.168.60.129:6804/4162
*** Caught signal (Aborted) ***
 in thread 0x7ffe509c9700
ceph version (commit:)
1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xa66ab7]
2: /usr/ceph/bin/cmds() [0xa9e73f]
3: (()+0xfc60) [0x7ffe53c83c60]
4: (gsignal()+0x35) [0x7ffe53172d05]
5: (abort()+0x186) [0x7ffe53176ab6]
6: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7ffe53a296dd]
7: (()+0xb9926) [0x7ffe53a27926]
8: (()+0xb9953) [0x7ffe53a27953]
9: (()+0xb9a5e) [0x7ffe53a27a5e]
10: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0xc4) [0x7d94e6]
11: (void decode_raw<unsigned long long>(unsigned long long&, ceph::buffer::list::iterator&)+0x25) [0x7e1c65]
12: (decode(unsigned long&, ceph::buffer::list::iterator&)+0x23) [0x7d9843]
13: (SessionMap::decode(ceph::buffer::list::iterator&)+0x55) [0x9f986b]
14: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x4c) [0x9f9064]
15: (C_SM_Load::finish(int)+0x2c) [0x9fa2a2]
16: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe0f) [0xa081a1]
17: (MDS::handle_core_message(Message*)+0x84b) [0x7d5779]
18: (MDS::_dispatch(Message*)+0x604) [0x7d6f10]
19: (MDS::ms_dispatch(Message*)+0x38) [0x7d4d82]
20: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0xa807a2]
21: (SimpleMessenger::dispatch_entry()+0x75c) [0xa6fd54]
22: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7acfea]
23: (Thread::_entry_func(void*)+0x23) [0x9fe9c5]
24: (()+0x6d8c) [0x7ffe53c7ad8c]
25: (clone()+0x6d) [0x7ffe5322504d]
2011-06-16 18:43:01.116466 7f907c636700 mds0.0 handle_mds_map state change down:dne --> up:standby-replay
2011-06-16 18:43:01.116493 7f907c636700 mds0.0 replay_start
2011-06-16 18:43:01.116521 7f907c636700 mds0.0 recovery set is
2011-06-16 18:43:01.116544 7f907c636700 mds0.0 need osdmap epoch 213, have 214
2011-06-16 18:43:01.117347 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.104:6810/16472
2011-06-16 18:43:01.118414 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.134:6804/15406
2011-06-16 18:43:01.118563 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.135:6812/15828
2011-06-16 18:43:01.118626 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.129:6804/4162
2011-06-16 18:43:01.119975 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.135:6800/15479
2011-06-16 18:43:01.121450 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.135:6803/15561
2011-06-16 18:43:01.121574 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.134:6813/17187
2011-06-16 18:43:01.123427 7f907c636700 mds0.cache creating system inode with ino:100
2011-06-16 18:43:01.123813 7f907c636700 mds0.cache creating system inode with ino:1
2011-06-16 18:43:01.125386 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.132:6812/11667
2011-06-16 18:43:01.125498 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.109:6810/7022
2011-06-16 18:43:01.131320 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.134:6807/16267
*** Caught signal (Aborted) ***
 in thread 0x7f907901c700
ceph version (commit:)
1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xa66ab7]
2: /usr/ceph/bin/cmds() [0xa9e73f]
3: (()+0xfc60) [0x7f907f8f0c60]
4: (gsignal()+0x35) [0x7f907eddfd05]
5: (abort()+0x186) [0x7f907ede3ab6]
6: (()+0x6cd7b) [0x7f907ee18d7b]
7: (()+0x78a8f) [0x7f907ee24a8f]
8: (cfree()+0x73) [0x7f907ee288e3]
9: (EMetaBlob::fullbit::update_inode(MDS*, CInode*)+0x1cc) [0x804376]
10: (EMetaBlob::replay(MDS*, LogSegment*)+0x1579) [0x80598f]
11: (EUpdate::replay(MDS*)+0x44) [0x8092e4]
12: (MDLog::_replay_thread()+0xd9a) [0x9fdf1c]
13: (MDLog::ReplayThread::entry()+0x1c) [0x7df302]
14: (Thread::_entry_func(void*)+0x23) [0x9fe9c5]
15: (()+0x6d8c) [0x7f907f8e7d8c]
16: (clone()+0x6d) [0x7f907ee9204d]
Files
Updated by Greg Farnum almost 13 years ago
Hmm, Ceph should be upgradable to newer versions. These backtraces don't look familiar, though, and I don't see anything immediate that would cause them. Could you get the backtraces out of gdb (with line numbers!)?
Also possibly useful would be a log with the debugging cranked up[1], since it looks like this is reproducible.
[1]: Probably:
debug ms = 1
debug mds = 20
Updated by Sage Weil almost 13 years ago
- Category set to 1
- Target version set to v0.30
Can you 'rados -p metadata get mds0_sessionmap /tmp/mds0_sessionmap' and attach? I'm curious what is in the object that won't decode.
Updated by Sam Lang almost 13 years ago
- File mds0_sessionmap added
Attached result of above command.
Updated by Sam Lang almost 13 years ago
- File mds.alpha.log added
Attached log from mds crash with suggested debugging enabled.
Updated by Sage Weil almost 13 years ago
Oh, you have multiple MDSes. Can you dump the same object for whichever one(s) crashed in
13: (SessionMap::decode(ceph::buffer::list::iterator&)+0x55) [0x9f986b]
? Thanks!
Updated by Sage Weil almost 13 years ago
- Target version changed from v0.30 to v0.31
Updated by Sage Weil almost 13 years ago
pushed fix commit:cc644b842261dbeefde804ed999061b8733a9190 to stable branch
Updated by Sage Weil almost 13 years ago
- Target version changed from v0.31 to v0.32
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
- Target version deleted (v0.32)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.