Bug #52011
smartctl 7.1 crashes device/kernel with Micron_2200_MTFDHBA1T0TCK
Description
Description of problem
smartmontools <7.2 crashes device/kernel with LPO read on Micron 2200 NVMe. This breaks any OSD that depends on such NVMes as soon as smartctl is run. The problem is fixed in smartmontools 7.2 (verified with package from Debian backports).
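As a rough illustration of the version boundary described above, the following sketch checks whether an installed smartctl predates 7.2. It assumes the first line of `smartctl --version` output starts with `smartctl <major>.<minor>`; the helper names (`parse_smartctl_version`, `is_affected`, `installed_banner`) are hypothetical and not part of Ceph or smartmontools.

```python
import re
import subprocess

# First release that this report says no longer triggers the crash.
FIXED_IN = (7, 2)


def parse_smartctl_version(banner):
    """Return (major, minor) parsed from a smartctl version banner, or None.

    Assumes a banner line beginning with "smartctl <major>.<minor>".
    """
    match = re.search(r"^smartctl\s+(\d+)\.(\d+)", banner, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_affected(banner):
    """True if the banner reports a version older than FIXED_IN."""
    version = parse_smartctl_version(banner)
    return version is not None and version < FIXED_IN


def installed_banner():
    """Fetch the banner from the local smartctl binary.

    Printing the version does not query any drive.
    """
    result = subprocess.run(
        ["smartctl", "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling `is_affected(installed_banner())` inside the container that actually runs smartctl (not only on the host) would show which copy is in use, since the two can differ.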
Environment
- ceph version string: 15.2.13
- Platform (OS/distro/release): Debian 10.10
- Cluster details (nodes, monitors, OSDs): 5 nodes, 3 monitors, 2 OSDs
- Browser used (e.g.: Version 86.0.4240.198 (Official Build) (64-bit)): N/A
How reproducible
Let the cluster run for long enough to schedule a health check on a drive.
Actual results
/var/log/kern.log:
Jul 19 17:06:05 REDACTED kernel: [1841014.347303] DMAR: DRHD: handling fault status reg 2
Jul 19 17:06:05 REDACTED kernel: [1841014.347372] DMAR: [DMA Read] Request device [06:00.0] PASID ffffffff fault addr ffbc0000 [fault reason 06] PTE Read access is not set
And /dev/nvme0 disappears. In this case, /dev/nvme0 was used for WAL, so the OSD breaks at this point.
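To check whether other nodes have hit the same fault, a log scan along these lines may help. This is only a sketch: the regular expression is modelled on the two kern.log lines quoted above, other kernel versions may format DMAR messages differently, and `find_dmar_faults` is a hypothetical helper name.

```python
import re

# Pattern modelled on the "DMAR: [DMA Read] Request device ..." line quoted
# in this report; treat the exact layout as an assumption.
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<device>[0-9a-f:.]+)\]"
    r".*?fault addr (?P<addr>[0-9a-f]+)"
    r".*?\[fault reason (?P<reason>\d+)\]\s*(?P<text>.*)"
)


def find_dmar_faults(lines):
    """Return one dict per DMAR DMA fault line found in an iterable of log lines."""
    faults = []
    for line in lines:
        match = DMAR_FAULT.search(line)
        if match:
            faults.append(match.groupdict())
    return faults
```

For example, `find_dmar_faults(open("/var/log/kern.log"))` returns the PCI address, fault address, and reason for each matching line; the `device` field can then be compared with the PCI address of the NVMe controller.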
Expected results
smartctl exits successfully without crashing the kernel.
Additional info
Upstream: https://www.smartmontools.org/ticket/1404
ceph-users thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/4EGK5AJ2OG2SYFTQ2ROX24VKVJGOJBYN
History
#1 Updated by Johan Hattne over 2 years ago
I'd be happy to submit a ticket at https://bugs.centos.org if this is the best way to have this problem fixed in Ceph.
#2 Updated by Loïc Dachary over 2 years ago
- Target version deleted (v15.2.14)
#3 Updated by Neha Ojha over 2 years ago
- Status changed from New to Closed
It appears that https://www.smartmontools.org/ticket/1404 has been addressed; you could open a bug with Debian to get it resolved. I am closing this issue since it is not a Ceph bug.
#4 Updated by Johan Hattne over 2 years ago
Yes, this is a Ceph bug, because smartmontools in the Ceph Docker images still have the problem. It does not matter whether the Debian package is fixed or not.
I don't see how to reopen a bug; if I can't figure it out, I'll just have to open a new bug pointing to this issue...
#5 Updated by Neha Ojha over 2 years ago
- Subject changed from mgr/dashboard: smartctl 7.1 crashes device/kernel with Micron_2200_MTFDHBA1T0TCK to smartctl 7.1 crashes device/kernel with Micron_2200_MTFDHBA1T0TCK
- Status changed from Closed to New
#6 Updated by Johan Hattne over 2 years ago
Reported with CentOS: https://bugs.centos.org/view.php?id=18304