Bug #14157

smithi004, smithi005, smithi007, smithi055 NVMe cards bad

Added by David Galloway over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Category:
Test Node
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Runs are failing because the NVMe device cannot be partitioned. Upon further inspection, the entire device appears to be missing from lspci output.

I've marked the systems down for investigation and verified all other smithi have the device present.

Actions #1

Updated by David Galloway over 8 years ago

Examples:
http://pulpito.ceph.com/teuthology-2015-12-21_19:00:01-rados-jewel-distro-basic-smithi/1582
http://pulpito.ceph.com/teuthology-2015-12-21_19:00:01-rados-jewel-distro-basic-smithi/1583

for num in {001..060}; do echo -e "smithi$num: $(ssh -q -o StrictHostKeyChecking=no smithi$num.front.sepia.ceph.com "sudo lspci | grep -i volatile")"; done
smithi001: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi002: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi003: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi004: 
smithi005: 
smithi006: 
smithi007: 
smithi008: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi009: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi010: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi011: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi012: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi013: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi014: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi015: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi016: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi017: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi018: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi019: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi020: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi021: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi022: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi023: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi024: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi025: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi026: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi027: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi028: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi029: 02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
smithi030: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi031: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi032: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi033: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi034: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi035: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi036: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi037: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi038: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi039: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi040: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi041: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi042: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi043: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi044: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi045: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi046: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi047: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi048: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi049: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi050: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi051: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi052: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi053: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi054: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi055: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi056: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi057: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi058: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi059: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
smithi060: 02:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
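
With 60 hosts in the listing, the blank entries are easy to miss by eye. A minimal sketch for pulling them out of the loop output follows; the function name missing_nvme_hosts is hypothetical, and it assumes the "host: lspci-match" line format printed by the loop above.

```shell
# Hypothetical helper: read "host: <lspci match>" lines on stdin and print
# only the hostnames whose match is empty (no NVMe controller reported).
missing_nvme_hosts() {
  awk '/^[^:]+:[[:space:]]*$/ {
    sub(/:[[:space:]]*$/, "")   # strip the trailing colon and any whitespace
    print
  }'
}

# Example (not run here): pipe the output of the loop above into the helper.
#   for num in {001..060}; do ...; done | missing_nvme_hosts
```

Run against the listing above, this would print smithi004 through smithi007, the four hosts with nothing after the colon.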

Actions #2

Updated by David Galloway over 8 years ago

dmidecode indicates the PCI slot is not populated. Will check on these next time I'm at the DC.
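
The exact dmidecode invocation isn't recorded here. As a rough sketch, one way to see slot population is to summarize the "Designation" and "Current Usage" fields from `dmidecode -t slot`. The helper name slot_usage is hypothetical, and the sketch assumes the usual "Field: value" layout of dmidecode's slot records.

```shell
# Hypothetical helper: read `dmidecode -t slot` output on stdin and print
# one "designation<TAB>usage" line per slot record.
slot_usage() {
  awk -F': ' '
    /^[[:space:]]*Designation:/   { name = $2 }             # remember the slot name
    /^[[:space:]]*Current Usage:/ { print name "\t" $2 }    # e.g. "In Use" or "Available"
  '
}

# Example (not run here):
#   sudo dmidecode -t slot | slot_usage
```

A slot reported as "Available" rather than "In Use" would match the unpopulated state described above.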

Actions #3

Updated by David Galloway over 8 years ago

  • Subject changed from smithi{004..007} missing NVMe devices to smithi004, smithi005, smithi007, smithi055 NVMe cards bad

Here are today's findings/testing.

I took smithi003 with a known working NVMe card and tested both PCI slots. Confirmed both PCI slots are good.

For each system with a missing/broken NVMe card, I tested its corresponding card in known good PCI slots in smithi003.

Cards from smithi{004,005,007,055} would not power on in smithi003.

For good measure, I tested the PCI slot in smithi003 again with a known good card to verify the riser and slot are functional.

What's got me stumped is smithi055 had a working card last week. I had used it initially to test the bad cards and now its original (previously good) card won't power on.

Will open support tickets with Supermicro/Intel now.

Actions #4

Updated by David Galloway about 8 years ago

The defective cards were approved for RMA and delivered to SuperMicro on 12JAN2016.
FedEx tracking: 782146800750

SuperMicro RMA# is RI160107081

I e-mailed SuperMicro this morning asking for an update, since I've heard nothing.

Actions #5

Updated by David Galloway about 8 years ago

Replacement NVMe cards arrived but were shipped with low-profile brackets. Have asked SuperMicro to expedite a shipment of regular PCI brackets ASAP.

Actions #6

Updated by David Galloway about 8 years ago

  • Status changed from In Progress to Resolved

David Galloway wrote:

Replacement NVMe cards arrived but were shipped with low-profile brackets. Have asked SuperMicro to expedite a shipment of regular PCI brackets ASAP.

Just kidding. I threw them away. And then unthrew them away.

All 4 cards installed and confirmed functional!

Smithis reimaged and released for testing.
