Bug #1195

closed

ceph mds crash on version upgrade

Added by Sam Lang almost 13 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Description

This may just not be something that's handled in Ceph yet, but while upgrading from stable v0.28 to v0.29, I didn't recreate the file system, and when I restarted all the ceph processes, all the mdses crashed. Below are the logs for each of the mdses. This also seems to be reproducible: if I try to restart, the mdses eventually hit the same assertion. Let me know if there's more info I can provide.

2011-06-16 18:43:20.262730 7ffe509c9700 mds0.6 handle_mds_map i am now mds0.6
2011-06-16 18:43:20.441554 7ffe509c9700 mds0.6 handle_mds_map state change up:standby --> up:replay
2011-06-16 18:43:20.441608 7ffe509c9700 mds0.6 replay_start
2011-06-16 18:43:20.441640 7ffe509c9700 mds0.6 recovery set is
2011-06-16 18:43:20.441663 7ffe509c9700 mds0.6 need osdmap epoch 219, have 194
2011-06-16 18:43:20.441681 7ffe509c9700 mds0.6 waiting for osdmap 219 (which blacklists prior instance)
2011-06-16 18:43:20.441763 7ffe509c9700 mds0.cache handle_mds_failure mds0 : recovery peers are
2011-06-16 18:43:22.725516 7ffe509c9700 mds0.6 ms_handle_connect on 192.168.60.104:6810/16472
2011-06-16 18:43:22.725900 7ffe509c9700 mds0.6 ms_handle_connect on 192.168.60.134:6804/15406
2011-06-16 18:43:22.725984 7ffe509c9700 mds0.6 ms_handle_connect on 192.168.60.129:6804/4162
*** Caught signal (Aborted) ***
 in thread 0x7ffe509c9700
 ceph version (commit:)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xa66ab7]
 2: /usr/ceph/bin/cmds() [0xa9e73f]
 3: (()+0xfc60) [0x7ffe53c83c60]
 4: (gsignal()+0x35) [0x7ffe53172d05]
 5: (abort()+0x186) [0x7ffe53176ab6]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7ffe53a296dd]
 7: (()+0xb9926) [0x7ffe53a27926]
 8: (()+0xb9953) [0x7ffe53a27953]
 9: (()+0xb9a5e) [0x7ffe53a27a5e]
 10: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0xc4) [0x7d94e6]
 11: (void decode_raw<unsigned long long>(unsigned long long&, ceph::buffer::list::iterator&)+0x25) [0x7e1c65]
 12: (decode(unsigned long&, ceph::buffer::list::iterator&)+0x23) [0x7d9843]
 13: (SessionMap::decode(ceph::buffer::list::iterator&)+0x55) [0x9f986b]
 14: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x4c) [0x9f9064]
 15: (C_SM_Load::finish(int)+0x2c) [0x9fa2a2]
 16: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe0f) [0xa081a1]
 17: (MDS::handle_core_message(Message*)+0x84b) [0x7d5779]
 18: (MDS::_dispatch(Message*)+0x604) [0x7d6f10]
 19: (MDS::ms_dispatch(Message*)+0x38) [0x7d4d82]
 20: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0xa807a2]
 21: (SimpleMessenger::dispatch_entry()+0x75c) [0xa6fd54]
 22: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7acfea]
 23: (Thread::_entry_func(void*)+0x23) [0x9fe9c5]
 24: (()+0x6d8c) [0x7ffe53c7ad8c]
 25: (clone()+0x6d) [0x7ffe5322504d]
2011-06-16 18:43:01.116350 7f907c636700 mds0.0 handle_mds_map i am now mds4248.0replaying mds0.0
2011-06-16 18:43:01.116466 7f907c636700 mds0.0 handle_mds_map state change down:dne --> up:standby-replay
2011-06-16 18:43:01.116493 7f907c636700 mds0.0 replay_start
2011-06-16 18:43:01.116521 7f907c636700 mds0.0 recovery set is
2011-06-16 18:43:01.116544 7f907c636700 mds0.0 need osdmap epoch 213, have 214
2011-06-16 18:43:01.117347 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.104:6810/16472
2011-06-16 18:43:01.118414 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.134:6804/15406
2011-06-16 18:43:01.118563 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.135:6812/15828
2011-06-16 18:43:01.118626 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.129:6804/4162
2011-06-16 18:43:01.119975 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.135:6800/15479
2011-06-16 18:43:01.121450 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.135:6803/15561
2011-06-16 18:43:01.121574 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.134:6813/17187
2011-06-16 18:43:01.123427 7f907c636700 mds0.cache creating system inode with ino:100
2011-06-16 18:43:01.123813 7f907c636700 mds0.cache creating system inode with ino:1
2011-06-16 18:43:01.125386 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.132:6812/11667
2011-06-16 18:43:01.125498 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.109:6810/7022
2011-06-16 18:43:01.131320 7f907c636700 mds0.0 ms_handle_connect on 192.168.60.134:6807/16267
*** Caught signal (Aborted) ***
 in thread 0x7f907901c700
 ceph version (commit:)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xa66ab7]
 2: /usr/ceph/bin/cmds() [0xa9e73f]
 3: (()+0xfc60) [0x7f907f8f0c60]
 4: (gsignal()+0x35) [0x7f907eddfd05]
 5: (abort()+0x186) [0x7f907ede3ab6]
 6: (()+0x6cd7b) [0x7f907ee18d7b]
 7: (()+0x78a8f) [0x7f907ee24a8f]
 8: (cfree()+0x73) [0x7f907ee288e3]
 9: (EMetaBlob::fullbit::update_inode(MDS*, CInode*)+0x1cc) [0x804376]
 10: (EMetaBlob::replay(MDS*, LogSegment*)+0x1579) [0x80598f]
 11: (EUpdate::replay(MDS*)+0x44) [0x8092e4]
 12: (MDLog::_replay_thread()+0xd9a) [0x9fdf1c]
 13: (MDLog::ReplayThread::entry()+0x1c) [0x7df302]
 14: (Thread::_entry_func(void*)+0x23) [0x9fe9c5]
 15: (()+0x6d8c) [0x7f907f8e7d8c]
 16: (clone()+0x6d) [0x7f907ee9204d]
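
The first trace aborts inside __gnu_cxx::__verbose_terminate_handler, i.e. an uncaught C++ exception: SessionMap::decode runs off the end of the stored object while pulling a 64-bit value through the bufferlist iterator (frames 10-13). As a rough illustration only (a hand-rolled sketch, not Ceph's code; buflist_iter, decode_raw, and the 4-byte "old" object are made-up stand-ins), the failure mode is:

// Minimal sketch (NOT Ceph's actual code) of frames 10-13 above:
// decode_raw() asks the iterator for sizeof(T) bytes; if the stored
// object is shorter than the new format expects, copy() throws, the
// exception reaches std::terminate, and abort() produces this trace.
#include <cstring>
#include <stdexcept>

struct buflist_iter {          // stand-in for ceph::buffer::list::iterator
  const char *data;
  unsigned len, off;
  void copy(unsigned n, char *dst) {
    if (off + n > len)
      throw std::runtime_error("end of buffer");  // uncaught -> abort()
    std::memcpy(dst, data + off, n);
    off += n;
  }
};

template<typename T>
void decode_raw(T &v, buflist_iter &p) {
  p.copy(sizeof(v), reinterpret_cast<char*>(&v));
}

int main() {
  char old_object[4] = {1, 0, 0, 0};        // hypothetical 4-byte old encoding
  buflist_iter p{old_object, sizeof(old_object), 0};
  unsigned long long v;                      // new decoder wants 8 bytes
  decode_raw(v, p);                          // throws; nothing catches it
}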

Files

mds0_sessionmap (17 Bytes) - Sam Lang, 06/17/2011 09:30 AM
mds.alpha.log (66.8 KB) - Sam Lang, 06/17/2011 09:38 AM
Actions #1

Updated by Greg Farnum almost 13 years ago

Hmm, Ceph should be upgradable to newer versions. These backtraces don't look familiar though, and I don't see anything immediate that would cause them. Could you get the backtraces out of gdb (with line numbers!)?

Also possibly useful would be a log with the debugging cranked up [1], since it looks like this is reproducible.

[1]: Probably

debug ms = 1
debug mds = 20
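
In case it's useful, those typically go in the [mds] section of ceph.conf on the MDS hosts before restarting the daemons, along these lines:

[mds]
        debug ms = 1
        debug mds = 20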

Actions #2

Updated by Sage Weil almost 13 years ago

  • Category set to 1
  • Target version set to v0.30

Can you 'rados -p metadata get mds0_sessionmap /tmp/mds0_sessionmap' and attach? I'm curious what is in the object that won't decode.

Actions #3

Updated by Sam Lang almost 13 years ago

Attached result of above command.

Actions #4

Updated by Sam Lang almost 13 years ago

Attached log from mds crash with suggested debugging enabled.

Actions #5

Updated by Sage Weil almost 13 years ago

Oh, you have multiple MDSs... can you dump the same object for whichever one(s) crashed in

13: (SessionMap::decode(ceph::buffer::list::iterator&)+0x55) [0x9f986b]

? Thanks!
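
(Assuming the per-rank mds<N>_sessionmap object naming that the attached mds0_sessionmap suggests, that would be along the lines of

rados -p metadata get mds1_sessionmap /tmp/mds1_sessionmap

for a crash on mds1.)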

Actions #6

Updated by Sage Weil almost 13 years ago

  • Target version changed from v0.30 to v0.31
Actions #7

Updated by Sage Weil almost 13 years ago

pushed fix commit:cc644b842261dbeefde804ed999061b8733a9190 to stable branch
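
A general note for anyone who lands here: the usual shape of this kind of fix in Ceph's encode/decode paths is a leading struct version byte, so a newer decoder can recognize an older on-disk encoding instead of misreading it and running off the end of the buffer. Whether commit cc644b842261dbeefde804ed999061b8733a9190 does exactly this is not shown here; the sketch below is only the generic pattern, with Reader and both field layouts invented for illustration.

// Generic sketch of a versioned decode (illustrative only; the real
// change is in the commit above and is not reproduced here).
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

struct Reader {                 // toy stand-in for a bufferlist iterator
  const uint8_t *p;
  template<typename T> T get() {
    T v{};
    std::memcpy(&v, p, sizeof(v));
    p += sizeof(v);
    return v;
  }
};

// The encoder writes struct_v first, so every later decoder can branch
// on it before touching any layout-dependent fields.
void sessionmap_decode(Reader &r) {
  uint8_t struct_v = r.get<uint8_t>();
  if (struct_v < 2) {
    uint32_t version = r.get<uint32_t>();   // hypothetical old layout
    std::cout << "old format, table version " << version << "\n";
  } else {
    uint64_t version = r.get<uint64_t>();   // hypothetical new layout
    std::cout << "new format, table version " << version << "\n";
  }
}

int main() {
  // An "old" object: struct_v=1 followed by a 32-bit version field.
  std::vector<uint8_t> old_obj = {1, 42, 0, 0, 0};
  Reader r{old_obj.data()};
  sessionmap_decode(r);   // prints: old format, table version 42
}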

Actions #8

Updated by Sage Weil almost 13 years ago

  • Target version changed from v0.31 to v0.32
Actions #9

Updated by Sage Weil almost 13 years ago

  • Status changed from New to Resolved

closing this out

Actions #10

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.32)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
