Bug #1366: mds segfault (closed)

Added by Sam Lang over 12 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

I have 4 MDSs running in the following setup:

[mds.alpha]
host = 192.168.101.12

[mds.bravo]
host = 192.168.101.13

[mds.charlie]
host = 192.168.101.14
mds standby replay = true
mds standby for name = alpha

[mds.delta]
host = 192.168.101.15
mds standby replay = true
mds standby for name = bravo

After running for a while, I see a segfault on mds.charlie. I think some of this may be due to network connections getting reset on my system (which I'm still trying to figure out), but it looks like ceph handles these resets up to a point. Here's the end of the charlie log. Let me know if more info/debugging is needed.

2011-08-05 09:15:21.013070 7f79fbbd2700 mds0.objecter FULL, paused modify 0x929b480 tid 75279
2011-08-05 09:15:21.013118 7f79fbbd2700 mds0.objecter FULL, paused modify 0x929b000 tid 75280
2011-08-05 09:15:21.117920 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.114:6824/31082
2011-08-05 09:15:21.118549 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.114:6824/31082
2011-08-05 09:15:42.321481 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.114:6830/31258
2011-08-05 09:15:42.322121 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.114:6830/31258
2011-08-05 09:16:07.017921 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.113:6804/19439
2011-08-05 09:16:07.018804 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.113:6804/19439
2011-08-05 09:16:11.077979 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.11:6829/8222
2011-08-05 09:16:11.078865 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.11:6829/8222
2011-08-05 09:16:36.679306 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.113:6816/19485
2011-08-05 09:16:36.680255 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.113:6816/19485
2011-08-05 09:17:22.481311 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.11:6832/9489
2011-08-05 09:17:22.482177 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.11:6832/9489
2011-08-05 09:17:47.747949 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.113:6825/19531
2011-08-05 09:17:47.748818 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.113:6825/19531
2011-08-05 09:18:13.458029 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.114:6805/30973
2011-08-05 09:18:13.458654 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.114:6805/30973
*** Caught signal (Segmentation fault) **
    in thread 0x7f79fccd5700
    ceph version (commit:)
    1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xac0379]
    2: /usr/ceph/bin/cmds() [0xb383d3]
    3: (()+0xfc60) [0x7f79ffb93c60]
    4: (CInode::pop_and_dirty_projected_inode(LogSegment*)+0x199) [0x9eb1ed]
    5: (Mutation::pop_and_dirty_projected_inodes()+0x50) [0x897fee]
    6: (Mutation::apply()+0x1b) [0x8980e9]
    7: (C_MDS_mknod_finish::finish(int)+0x134) [0x89b446]
    8: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x1b5) [0x7f5311]
    9: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x528) [0xa85416]
    10: (Journaler::C_Flush::finish(int)+0x32) [0xa8c224]
    11: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x12b4) [0xa477dc]
    12: (MDS::handle_core_message(Message*)+0x936) [0x7ef8ce]
    13: (MDS::_dispatch(Message*)+0x6ac) [0x7f1400]
    14: (MDS::ms_dispatch(Message*)+0x38) [0x7eedbe]
    15: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0xadfe2a]
    16: (SimpleMessenger::dispatch_entry()+0x810) [0xac9e7c]
    17: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7c5468]
    18: (Thread::_entry_func(void*)+0x23) [0xa8f141]
    19: (()+0x6d8c) [0x7f79ffb8ad8c]
    20: (clone()+0x6d) [0x7f79fe7d804d]

mds config:

Actions #1

Updated by Sage Weil over 12 years ago

  • Target version set to v0.34
Actions #3

Updated by Sage Weil over 12 years ago

  • Status changed from New to 4
  • Assignee set to Sage Weil

Do you have a core for this? Which commit were you running?

My guess is that it's related to the full -> not full transition in Objecter.
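
If the core turns out to be unavailable, the bracketed addresses in the backtrace can still be resolved against the matching unstripped /usr/ceph/bin/cmds binary. A minimal sketch, assuming that binary from the same build is still around and is a non-PIE executable (so the addresses map directly into it); the address list is copied from frames 4-9 of the trace above:

#!/usr/bin/env python
# Minimal sketch: resolve backtrace addresses to function and file:line with
# binutils addr2line. Assumes the unstripped /usr/ceph/bin/cmds binary from
# the same build is available (an assumption, not part of the report).
import subprocess

BINARY = "/usr/ceph/bin/cmds"
# Addresses copied from frames 4-9 of the backtrace above.
ADDRS = ["0x9eb1ed", "0x897fee", "0x8980e9", "0x89b446", "0x7f5311", "0xa85416"]

out = subprocess.check_output(["addr2line", "-e", BINARY, "-f", "-C"] + ADDRS)
print(out.decode())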

Actions #4

Updated by Sam Lang over 12 years ago

Sorry, there probably was a core file, but it's gone now. This was with ceph stable commit dcaca3e358f7f42c7c826d0b67f3a087903d9ab9.

Anything I can do to reproduce it?

Actions #5

Updated by Sage Weil over 12 years ago

Hmm, can you describe the workload?

Were you doing the "write a bunch of data, then add OSDs" type of test? (That would explain why it was full and then not full...)

Actions #6

Updated by Sam Lang over 12 years ago

IIRC, I was testing out metadata operations: create 1000 directories, then create 1000 files in each directory, all done serially. I didn't add an OSD at any point, no.
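
For reference, a minimal sketch of that workload, assuming a CephFS mount at /mnt/ceph (the mount point and file names are placeholders; the original test may have used a different tool):

#!/usr/bin/env python
# Rough reproduction of the reported workload: 1000 directories with 1000
# files each, created serially. /mnt/ceph is an assumed CephFS mount point.
import os

MOUNT = "/mnt/ceph"
NUM_DIRS = 1000
FILES_PER_DIR = 1000

for d in range(NUM_DIRS):
    dirpath = os.path.join(MOUNT, "dir.%04d" % d)
    os.mkdir(dirpath)
    for f in range(FILES_PER_DIR):
        # Create empty files; the report does not say whether data was written.
        with open(os.path.join(dirpath, "file.%04d" % f), "w"):
            pass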

Actions #7

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.34 to v0.35
Actions #8

Updated by Sage Weil over 12 years ago

  • Status changed from 4 to Can't reproduce
Actions #9

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.35)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
