Bug #52011

smartctl 7.1 crashes device/kernel with Micron_2200_MTFDHBA1T0TCK

Added by Johan Hattne over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
packaging
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description of problem

smartmontools versions earlier than 7.2 crash the device/kernel when performing an LPO read on a Micron 2200 NVMe drive. Any OSD that depends on such a drive breaks as soon as smartctl is run. The problem is fixed in smartmontools 7.2 (verified with the package from Debian backports).
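As a rough illustration of the version boundary described above, the following sketch checks whether a smartctl banner reports a version earlier than 7.2. The helper names (`parse_smartctl_version`, `is_affected`, `installed_banner`) are hypothetical, and the assumption that the banner contains the text `smartctl X.Y` is based on typical `smartctl --version` output, not on anything stated in this report.

```python
import re
import subprocess

# Per this report, 7.2 is the first version verified not to crash the drive.
FIXED_IN = (7, 2)


def parse_smartctl_version(banner):
    """Return (major, minor) from the first 'smartctl X.Y' token, or None.

    Assumption: the banner looks like 'smartctl 7.1 2019-12-30 r5022 [...]'.
    """
    match = re.search(r"\bsmartctl (\d+)\.(\d+)\b", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_affected(banner):
    """True if the banner reports a version earlier than FIXED_IN."""
    version = parse_smartctl_version(banner)
    return version is not None and version < FIXED_IN


def installed_banner():
    """Return the output of 'smartctl --version' (does not touch any device)."""
    result = subprocess.run(
        ["smartctl", "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling `is_affected(installed_banner())` on a host or inside a container would indicate whether the installed smartctl predates the fix; it says nothing about whether an affected drive is actually present.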

Environment

  • ceph version string: 15.2.13
  • Platform (OS/distro/release): Debian 10.10
  • Cluster details (nodes, monitors, OSDs): 5 nodes, 3 monitors, 2 OSDs
  • Browser used (e.g.: Version 86.0.4240.198 (Official Build) (64-bit)): N/A

How reproducible

Let the cluster run long enough for a health check to be scheduled on a drive.

Actual results

/var/log/kern.log:
Jul 19 17:06:05 REDACTED kernel: [1841014.347303] DMAR: DRHD: handling fault status reg 2
Jul 19 17:06:05 REDACTED kernel: [1841014.347372] DMAR: [DMA Read] Request device [06:00.0] PASID ffffffff fault addr ffbc0000 [fault reason 06] PTE Read access is not set

After these messages, /dev/nvme0 disappears. In this case, /dev/nvme0 was used for the WAL, so the OSD breaks at this point.
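To spot the same symptom on other nodes, the kernel log can be scanned for DMAR fault lines. The sketch below is only an illustration: the regular expression is derived solely from the two log lines quoted above, other kernel versions may format these messages differently, and `find_dmar_faults` is a hypothetical helper name.

```python
import re

# Pattern modelled on the single fault line quoted in this report; it is an
# assumption that other DMAR fault messages follow the same layout.
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] "
    r"Request device \[(?P<device>[0-9a-fA-F:.]+)\]"
    r".*?fault addr (?P<addr>[0-9a-fA-F]+)"
    r".*?\[fault reason (?P<reason>\d+)\] (?P<text>.*)$"
)


def find_dmar_faults(lines):
    """Return one dict per matching line with op, device, addr, reason, text."""
    faults = []
    for line in lines:
        match = DMAR_FAULT.search(line)
        if match:
            faults.append(match.groupdict())
    return faults
```

A typical use would be `find_dmar_faults(open("/var/log/kern.log"))`; the `device` field holds the PCI address (`06:00.0` in the log above), which can then be compared against the address of the NVMe controller.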

Expected results

smartctl exits successfully without crashing the kernel.

Additional info

Upstream: https://www.smartmontools.org/ticket/1404

ceph-users thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/4EGK5AJ2OG2SYFTQ2ROX24VKVJGOJBYN

History

#1 Updated by Johan Hattne over 2 years ago

I'd be happy to submit a ticket at https://bugs.centos.org if this is the best way to have this problem fixed in Ceph.

#2 Updated by Loïc Dachary over 2 years ago

  • Target version deleted (v15.2.14)

#3 Updated by Neha Ojha over 2 years ago

  • Status changed from New to Closed

It appears that https://www.smartmontools.org/ticket/1404 has been addressed; you could open a bug with Debian to get it resolved. I am closing this issue since it is not a Ceph bug.

#4 Updated by Johan Hattne over 2 years ago

Yes, this is a Ceph bug, because smartmontools in the Ceph Docker images still has the problem. It does not matter whether the Debian package is fixed or not.

I don't see how to reopen a bug; if I can't figure it out, I'll just have to open a new bug pointing to this issue...

#5 Updated by Neha Ojha over 2 years ago

  • Subject changed from mgr/dashboard: smartctl 7.1 crashes device/kernel with Micron_2200_MTFDHBA1T0TCK to smartctl 7.1 crashes device/kernel with Micron_2200_MTFDHBA1T0TCK
  • Status changed from Closed to New
