Bug #1366: mds segfault (closed)

Added by Sam Lang over 12 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

I have 4 MDSs running in the following setup:

[mds.alpha]
host = 192.168.101.12

[mds.bravo]
host = 192.168.101.13

[mds.charlie]
host = 192.168.101.14
mds standby replay = true
mds standby for name = alpha

[mds.delta]
host = 192.168.101.15
mds standby replay = true
mds standby for name = bravo

After running for a while, I see a segfault on mds.charlie. I think some of this may be due to network connections getting reset on my system (which I'm still trying to figure out), but it looks like ceph handles these resets up to a point. Here's the end of the charlie log. Let me know if more info/debugging is needed.

2011-08-05 09:15:21.013070 7f79fbbd2700 mds0.objecter FULL, paused modify 0x929b480 tid 75279
2011-08-05 09:15:21.013118 7f79fbbd2700 mds0.objecter FULL, paused modify 0x929b000 tid 75280
2011-08-05 09:15:21.117920 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.114:6824/31082
2011-08-05 09:15:21.118549 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.114:6824/31082
2011-08-05 09:15:42.321481 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.114:6830/31258
2011-08-05 09:15:42.322121 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.114:6830/31258
2011-08-05 09:16:07.017921 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.113:6804/19439
2011-08-05 09:16:07.018804 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.113:6804/19439
2011-08-05 09:16:11.077979 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.11:6829/8222
2011-08-05 09:16:11.078865 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.11:6829/8222
2011-08-05 09:16:36.679306 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.113:6816/19485
2011-08-05 09:16:36.680255 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.113:6816/19485
2011-08-05 09:17:22.481311 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.11:6832/9489
2011-08-05 09:17:22.482177 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.11:6832/9489
2011-08-05 09:17:47.747949 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.113:6825/19531
2011-08-05 09:17:47.748818 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.113:6825/19531
2011-08-05 09:18:13.458029 7f79fccd5700 mds0.3 ms_handle_reset on 192.168.101.114:6805/30973
2011-08-05 09:18:13.458654 7f79fccd5700 mds0.3 ms_handle_connect on 192.168.101.114:6805/30973
*** Caught signal (Segmentation fault) **
    in thread 0x7f79fccd5700
    ceph version (commit:)
    1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xac0379]
    2: /usr/ceph/bin/cmds() [0xb383d3]
    3: (()+0xfc60) [0x7f79ffb93c60]
    4: (CInode::pop_and_dirty_projected_inode(LogSegment*)+0x199) [0x9eb1ed]
    5: (Mutation::pop_and_dirty_projected_inodes()+0x50) [0x897fee]
    6: (Mutation::apply()+0x1b) [0x8980e9]
    7: (C_MDS_mknod_finish::finish(int)+0x134) [0x89b446]
    8: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x1b5) [0x7f5311]
    9: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x528) [0xa85416]
    10: (Journaler::C_Flush::finish(int)+0x32) [0xa8c224]
    11: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x12b4) [0xa477dc]
    12: (MDS::handle_core_message(Message*)+0x936) [0x7ef8ce]
    13: (MDS::_dispatch(Message*)+0x6ac) [0x7f1400]
    14: (MDS::ms_dispatch(Message*)+0x38) [0x7eedbe]
    15: (Messenger::ms_deliver_dispatch(Message*)+0x70) [0xadfe2a]
    16: (SimpleMessenger::dispatch_entry()+0x810) [0xac9e7c]
    17: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7c5468]
    18: (Thread::_entry_func(void*)+0x23) [0xa8f141]
    19: (()+0x6d8c) [0x7f79ffb8ad8c]
    20: (clone()+0x6d) [0x7f79fe7d804d]

mds config:

Actions #1

Updated by Sage Weil over 12 years ago

  • Target version set to v0.34
Actions #3

Updated by Sage Weil over 12 years ago

  • Status changed from New to 4
  • Assignee set to Sage Weil

Do you have a core for this? Which commit were you running?

My guess is that it's related to the full -> not full transition in Objecter.
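
If the core turns out to be unavailable, the bracketed addresses in the backtrace can still be resolved against the matching unstripped /usr/ceph/bin/cmds binary. A minimal sketch, assuming that binary from the same build is still around and is a non-PIE executable (so the addresses map directly into it); the address list is copied from frames 4-9 of the trace above:

#!/usr/bin/env python
# Minimal sketch: resolve backtrace addresses to function and file:line with
# binutils addr2line. Assumes the unstripped /usr/ceph/bin/cmds binary from
# the same build is available (an assumption, not part of the report).
import subprocess

BINARY = "/usr/ceph/bin/cmds"
# Addresses copied from frames 4-9 of the backtrace above.
ADDRS = ["0x9eb1ed", "0x897fee", "0x8980e9", "0x89b446", "0x7f5311", "0xa85416"]

out = subprocess.check_output(["addr2line", "-e", BINARY, "-f", "-C"] + ADDRS)
print(out.decode())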

Actions #4

Updated by Sam Lang over 12 years ago

Sorry, there probably was a core file, but it's gone now. This was with ceph stable commit dcaca3e358f7f42c7c826d0b67f3a087903d9ab9.

Anything I can do to reproduce it?

Actions #5

Updated by Sage Weil over 12 years ago

Hmm, can you describe the workload?

Were you doing the "write a bunch of data, then add OSDs" type of test? (That would explain why it was full and then not full...)

Actions #6

Updated by Sam Lang over 12 years ago

IIRC, I was testing out metadata operations: create 1000 directories, then create 1000 files in each directory, all done serially. I didn't add an OSD at any point, no.
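
For reference, a minimal sketch of that workload, assuming a CephFS mount at /mnt/ceph (the mount point and file names are placeholders; the original test may have used a different tool):

#!/usr/bin/env python
# Rough reproduction of the reported workload: 1000 directories with 1000
# files each, created serially. /mnt/ceph is an assumed CephFS mount point.
import os

MOUNT = "/mnt/ceph"
NUM_DIRS = 1000
FILES_PER_DIR = 1000

for d in range(NUM_DIRS):
    dirpath = os.path.join(MOUNT, "dir.%04d" % d)
    os.mkdir(dirpath)
    for f in range(FILES_PER_DIR):
        # Create empty files; the report does not say whether data was written.
        with open(os.path.join(dirpath, "file.%04d" % f), "w"):
            pass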

Actions #7

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.34 to v0.35
Actions #8

Updated by Sage Weil over 12 years ago

  • Status changed from 4 to Can't reproduce
Actions #9

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.35)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
