Bug #11462: kernel: crash (when MDS died?) - CephFS - Ceph

Actions

Copy link

Bug #11462

closed

kernel: crash (when MDS died?)

Added by Greg Farnum about 9 years ago. Updated almost 8 years ago.

Status:

Rejected

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

kceph

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

http://pulpito.ceph.com/teuthology-2015-04-20_23:18:01-multimds-next-testing-basic-multi/857045/

[3]kdb> bt
Stack traceback for pid 1
0xffff88040cdb0000        1        0  1    3   R  0xffff88040cdb0618 *init
 ffff88040cdbbba8 0000000000000018 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 <#DB>  <<EOE>>  [<ffffffff81114032>] ? kgdb_panic_event+0x22/0x50
 [<ffffffff8107d2ad>] ? notifier_call_chain+0x4d/0x70
 [<ffffffff8107d400>] ? __atomic_notifier_call_chain+0x70/0xb0
 [<ffffffff8107d395>] ? __atomic_notifier_call_chain+0x5/0xb0
 [<ffffffff8107d456>] ? atomic_notifier_call_chain+0x16/0x20
 [<ffffffff817573f8>] ? panic+0xed/0x1fa
 [<ffffffff8105eb33>] ? do_exit+0xa43/0xb50
 [<ffffffff81766120>] ? _raw_spin_unlock_irq+0x30/0x40
 [<ffffffff8105ece1>] ? do_group_exit+0x51/0xc0
 [<ffffffff8106b91e>] ? get_signal+0x26e/0x760
 [<ffffffff81002503>] ? do_signal+0x33/0xab0
 [<ffffffff8175725b>] ? mm_fault_error+0x130/0x14c
 [<ffffffff81049584>] ? __do_page_fault+0x374/0x4a0
 [<ffffffff81767631>] ? retint_signal+0x11/0x90
 [<ffffffff81002ff8>] ? do_notify_resume+0x78/0xa0
 [<ffffffff81767666>] ? retint_signal+0x46/0x90

If you look at the teuthology log you'll see that one of the MDSes crashed, so I bet they're related.

2015-04-21T11:28:28.094 INFO:tasks.ceph.mds.g.burnupi18.stderr:mds/StrayManager.cc: 538: FAILED assert(!dn->state_test(CDentry::STATE_PURGING))
2015-04-21T11:28:28.094 INFO:tasks.ceph.mds.g.burnupi18.stderr: ceph version 0.94-912-g33bdae7 (33bdae7d62ddc1fd77b758e2dd2876a1353f5db6)
2015-04-21T11:28:28.094 INFO:tasks.ceph.mds.g.burnupi18.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x95c46b]
2015-04-21T11:28:28.095 INFO:tasks.ceph.mds.g.burnupi18.stderr: 2: (StrayManager::eval_stray(CDentry*, bool)+0xb56) [0x6fa996]
2015-04-21T11:28:28.095 INFO:tasks.ceph.mds.g.burnupi18.stderr: 3: (StrayManager::advance_delayed()+0xf6) [0x6face6]
2015-04-21T11:28:28.095 INFO:tasks.ceph.mds.g.burnupi18.stderr: 4: (MDCache::trim(int, int)+0x15d) [0x671b0d]
2015-04-21T11:28:28.095 INFO:tasks.ceph.mds.g.burnupi18.stderr: 5: (MDS::tick()+0xd0) [0x5a4e50]
2015-04-21T11:28:28.095 INFO:tasks.ceph.mds.g.burnupi18.stderr: 6: (MDSInternalContextBase::complete(int)+0x153) [0x7d7f73]
2015-04-21T11:28:28.095 INFO:tasks.ceph.mds.g.burnupi18.stderr: 7: (SafeTimer::timer_thread()+0xec) [0x94db0c]
2015-04-21T11:28:28.095 INFO:tasks.ceph.mds.g.burnupi18.stderr: 8: (SafeTimerThread::entry()+0xd) [0x94eaad]
2015-04-21T11:28:28.095 INFO:tasks.ceph.mds.g.burnupi18.stderr: 9: (()+0x8182) [0x7f2069445182]
2015-04-21T11:28:28.096 INFO:tasks.ceph.mds.g.burnupi18.stderr: 10: (clone()+0x6d) [0x7f2067bb538d]
2015-04-21T11:28:28.096 INFO:tasks.ceph.mds.g.burnupi18.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-04-21T11:28:28.096 INFO:tasks.ceph.mds.g.burnupi18.stderr:2015-04-21 11:28:28.061173 7f205f795700 -1 mds/StrayManager.cc: In function 'bool StrayManager::eval_stray(CDentry*, bool)' thread 7f205f795700 time 2015-04-21 11:28:27.842066
2015-04-21T11:28:28.096 INFO:tasks.ceph.mds.g.burnupi18.stderr:mds/StrayManager.cc: 538: FAILED assert(!dn->state_test(CDentry::STATE_PURGING))

Actions

Copy link

Updated by Greg Farnum about 9 years ago

We also saw it in http://pulpito.ceph.com/teuthology-2015-04-20_23:18:01-multimds-next-testing-basic-multi/857003/, same MDS crash, the kernel one looks similar:

[4]kdb> bt
Stack traceback for pid 1
0xffff88040cdb0000        1        0  1    4   R  0xffff88040cdb0618 *init
 ffff88040cdbbba8 0000000000000018 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 <#DB>  <<EOE>>  [<ffffffff81114032>] ? kgdb_panic_event+0x22/0x50
 [<ffffffff8107d2ad>] ? notifier_call_chain+0x4d/0x70
 [<ffffffff8107d400>] ? __atomic_notifier_call_chain+0x70/0xb0
 [<ffffffff8107d395>] ? __atomic_notifier_call_chain+0x5/0xb0
 [<ffffffff8107d456>] ? atomic_notifier_call_chain+0x16/0x20
 [<ffffffff817573f8>] ? panic+0xed/0x1fa
 [<ffffffff8105eb33>] ? do_exit+0xa43/0xb50
 [<ffffffff81766120>] ? _raw_spin_unlock_irq+0x30/0x40
 [<ffffffff8105ece1>] ? do_group_exit+0x51/0xc0
 [<ffffffff8106b91e>] ? get_signal+0x26e/0x760
 [<ffffffff81002503>] ? do_signal+0x33/0xab0
 [<ffffffff8175725b>] ? mm_fault_error+0x130/0x14c
 [<ffffffff81049584>] ? __do_page_fault+0x374/0x4a0
 [<ffffffff81767631>] ? retint_signal+0x11/0x90
 [<ffffffff81002ff8>] ? do_notify_resume+0x78/0xa0
 [<ffffffff81767666>] ? retint_signal+0x46/0x90

Actions

Copy link