Bug #14546

closed

mira033 kernel panic from MCE

Added by David Galloway about 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Low
Category:
Test Node
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

<0>[23897.753849] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 8: fe000ec00001009f
<0>[23897.762390] mce: [Hardware Error]: TSC 34504cd160cc ADDR 7cf25380 MISC e6a9590a00041283 
<0>[23897.770554] mce: [Hardware Error]: PROCESSOR 0:106e5 TIME 1453862482 SOCKET 0 APIC 0 microcode 3
<0>[23897.779355] mce: [Hardware Error]: Machine check: Processor context corrupt
<0>[23897.786325] Kernel panic - not syncing: Fatal Machine check
[dumpcommon]kdb>   -bt

Stack traceback for pid 27145
0xffff880165d20000    27145    25197  1    2   R  0xffff880165d204e8 *fn_anonymous
 ffff88043fc88d50 0000000000000000
Call Trace:
 <#DB>  <<EOE>>  <#MC>  [<ffffffff81102c79>] ? kgdb_panic_event+0x29/0x30
 [<ffffffff8173125c>] ? notifier_call_chain+0x4c/0x70
 [<ffffffff817312ba>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff8171da17>] ? panic+0xec/0x1d7
 [<ffffffff8171e51e>] ? printk+0x67/0x69
 [<ffffffff81036e5a>] ? mce_panic+0x1fa/0x210
 [<ffffffff81038ca4>] ? do_machine_check+0xaa4/0xab0
 [<ffffffff8172d43f>] ? machine_check+0x1f/0x30

 <<EOE>>
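
For anyone reading the MCE lines above by hand, here is a minimal sketch of how the raw values could be decoded. It is not taken from mcelog or the kernel. The bit positions reflect my reading of the architectural IA32_MCi_STATUS layout and the CPUID signature layout in the Intel SDM, and the function names (`decode_mci_status`, `decode_cpuid_signature`, `mce_time_utc`) are hypothetical helpers made up for this illustration.

```python
from datetime import datetime, timezone

# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF            # bits 15:0, MCA error code
    model_code = (status >> 16) & 0xFFFF  # bits 31:16, model-specific code
    # Assumed compound-code pattern for memory controller errors:
    # 000F 0000 1MMM CCCC (F is ignored by the mask).
    memory_controller = (mca_code & 0xEF80) == 0x0080
    return {
        "flags": flags,
        "mca_code": mca_code,
        "model_code": model_code,
        "memory_controller": memory_controller,
    }


def decode_cpuid_signature(sig):
    """Return (family, model, stepping) from the value after 'PROCESSOR 0:'."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if family in (6, 15):
        model |= ext_model << 4   # extended model only applies to these families
    if family == 15:
        family += ext_family
    return family, model, stepping


def mce_time_utc(epoch_seconds):
    """Convert the TIME field (seconds since the Unix epoch) to UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(decode_mci_status(0xFE000EC00001009F))
    print(decode_cpuid_signature(0x106E5))
    print(mce_time_utc(1453862482).isoformat())
```

Under those assumptions, the PCC flag comes out set for this status value, which lines up with the "Processor context corrupt" message, and the low 16 bits match the memory controller pattern. Treat that as a hint for where to look rather than a diagnosis.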

Files

mira033-kdb (91.9 KB), David Galloway, 01/28/2016 10:28 PM
Actions #1

Updated by Dan Mick about 8 years ago

  • Subject changed from mira033 kernel panic to mira033 kernel panic from MCE
Actions #2

Updated by David Galloway almost 8 years ago

  • Status changed from New to In Progress

Looked into this today. The system hung for quite some time waiting for the RAID firmware to load but eventually got past it.

I flashed the latest BIOS and RAID controller firmware (V1.49 to V1.52) available for the box.

After rebooting, the RAID firmware loaded in a reasonable amount of time and the system booted normally.

I've nuked and released the machine but will keep the ticket open for a bit in case more MCEs occur.

Actions #3

Updated by David Galloway almost 8 years ago

Tested the DIMMs and didn't find a bad one. If MCEs persist, I'll retire the machine.

Actions #4

Updated by David Galloway over 7 years ago

  • Status changed from In Progress to Resolved

Machine seems to be passing jobs without issue.
