Bug #172

closed

OSD and MDS crash on rm -r

Added by Wido den Hollander almost 14 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm still using my test script which unpacks the kernel source and then removes it again with a few steps in between.

Right now the copy back from the snapshot goes fine, but afterwards the rm of the original kernel files fails.
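For reference, a self-contained sketch of the workflow's shape, using plain directories instead of a real Ceph mount (the actual steps are in the attached ceph_client_test_script.sh; every path and name below is an assumption, and on a real mount the snapshot would be taken with `mkdir .snap/<name>`):

```shell
#!/bin/sh
# Sketch of the unpack / snapshot / copy-back / remove loop.
# Plain directories stand in for a mounted Ceph filesystem.
set -e
WORK=$(mktemp -d)
cd "$WORK"

# stand-in for unpacking the kernel source
mkdir -p src/fs src/mm
echo 'struct inode;' > src/fs/inode.c

# stand-in for taking the snapshot (Ceph: mkdir .snap/before-rm)
cp -a src snapshot

# copy back from the snapshot
cp -a snapshot restored

# remove the original tree -- the step that triggered the MDS crash
rm -r src

RESULT=failed
test ! -d src && test -f restored/fs/inode.c && RESULT=ok
echo "$RESULT"
```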

Attached you will find various files with straces and logs in them; I'll try to outline the scenario:

The script runs fine and the copy back from the snapshot succeeds, except for the messages in "ceph_client_script_log.txt".

Why can't those files be found? That seems like a different bug to me.

Well, the stalling cp bug seems to have been fixed, but while removing the files afterwards the MDS crashes first; 15 seconds later, all 5 OSDs go down as well.

In the MDS and OSD straces I've added a stat of /core; there you can see the MDS core dump is older than the OSD's, so the MDS crashed first.
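That mtime comparison can be sketched as follows (GNU stat/touch assumed; the real check was against /core on the MDS and OSD hosts, so the temp files and timestamps here are fabricated purely for illustration):

```shell
#!/bin/sh
# Order two core dumps by modification time to see which daemon died first.
MDS_CORE=$(mktemp)
OSD_CORE=$(mktemp)
touch -d '2010-06-03 10:00:00' "$MDS_CORE"   # MDS core
touch -d '2010-06-03 10:00:15' "$OSD_CORE"   # OSD core, 15 s later

# stat -c %Y prints the mtime as seconds since the epoch (GNU stat)
ORDER=""
if [ "$(stat -c %Y "$MDS_CORE")" -lt "$(stat -c %Y "$OSD_CORE")" ]; then
    ORDER="mds first"
fi
echo "$ORDER"
```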

After gathering all this information I started my OSDs and the crashed MDS (192.168.6.206) again. While doing so, my cluster started to recover, but then mds0 (192.168.6.205) crashed. (Core and log are also attached as mds0_*.)

I'm doing a clean mkcephfs right now and running the same test again, expecting the same result, as it has already happened twice today.

My Environment:
  • Branch: unstable ( 7c0df0540700fe2816470f5cc2a2fc7a130e4456 )
  • OS: Ubuntu 10.04 (AMD64)
  • Kernel: 2.6.34

Files

mds1_crash_strace.txt (2.09 KB) mds1_crash_strace.txt Wido den Hollander, 06/03/2010 05:11 AM
osd_crash_log.txt (63.9 KB) osd_crash_log.txt Wido den Hollander, 06/03/2010 05:11 AM
osd_crash_trace.txt (9.66 KB) osd_crash_trace.txt Wido den Hollander, 06/03/2010 05:11 AM
mds0_crash_strace.txt (2.37 KB) mds0_crash_strace.txt Wido den Hollander, 06/03/2010 05:11 AM
ceph_client_test_script.sh (840 Bytes) ceph_client_test_script.sh Wido den Hollander, 06/03/2010 05:11 AM
ceph_client_script_log.txt (1.51 KB) ceph_client_script_log.txt Wido den Hollander, 06/03/2010 05:11 AM
ceph_client_ps.txt (632 Bytes) ceph_client_ps.txt Wido den Hollander, 06/03/2010 05:11 AM
ceph_client_kernel_log.txt (1.67 KB) ceph_client_kernel_log.txt Wido den Hollander, 06/03/2010 05:11 AM
mds0_crash_log.txt (20 KB) mds0_crash_log.txt Wido den Hollander, 06/03/2010 05:11 AM
mds1_crash_log.txt (26.9 KB) mds1_crash_log.txt Wido den Hollander, 06/03/2010 05:11 AM
mds0_crash_second_run_log.txt (26 KB) mds0_crash_second_run_log.txt MDS crash of the second run Wido den Hollander, 06/04/2010 06:22 AM
#1

Updated by Wido den Hollander almost 14 years ago

Today I ran the same test again, with almost the same result.

Before I ran the test I created a fresh fs with mkcephfs.

Result: the MDS crashed again, after which all 5 OSDs followed.

This time the MDS log gave some more information.

Attached you will find the log.

#2

Updated by Sage Weil almost 14 years ago

Wido den Hollander wrote:

Today I ran the same test again, with almost the same result.

Before I ran the test I created a fresh fs with mkcephfs.

Result: the MDS crashed again, after which all 5 OSDs followed.

This time the MDS log gave some more information.

Attached you will find the log.

Was this with the unstable code from yesterday? There were a zillion little fixes to random stuff. I'm still hitting problems (not done yet!) but it's succeeding much more frequently than before.

#3

Updated by Wido den Hollander almost 14 years ago

Sage Weil wrote:

Wido den Hollander wrote:

Today I ran the same test again, with almost the same result.

Before I ran the test I created a fresh fs with mkcephfs.

Result: the MDS crashed again, after which all 5 OSDs followed.

This time the MDS log gave some more information.

Attached you will find the log.

Was this with the unstable code from yesterday? There were a zillion little fixes to random stuff. I'm still hitting problems (not done yet!) but it's succeeding much more frequently than before.

This was with:

commit c4e6482d302aa288031ced6cd845d60ba655e5c8
Author: Sage Weil <>
Date: Thu Jun 3 17:32:39 2010 -0700

#4

Updated by Sage Weil almost 14 years ago

  • Status changed from New to Closed

Closing this one. The osd crash was a snap_trimmer bug fixed a few days ago.

Added a qa workunit that repeats this test.

#5

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
