Project

General

Profile

Bug #48270

libstoragemgmt calls fatally wound Areca RAID controllers on mira

Added by Dan Mick 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Preface: mira is very old hardware and has shown a tendency to be buggy WRT SCSI commands in the past (since the drives are emulated in firmware; there is no direct passthrough)

Updating the long-running cluster to the latest octopus build (a7b62340ad) had a catastrophic effect on the three mira machines in the cluster (055, 060, 093): after about 90 minutes, all disk operations were damaged to all drives, including sda/root, which made diagnosis a bit dicey. A large team cooperated to find the culprit appeared to be periodic execution of "cephadm ceph-volume inventory". Disabling cephadm on the cluster has kept the hosts up, and while it was disabled, minimal-reproduction efforts led to libstoragemgmt code in ceph-volume.

A small reproducer is

from ceph_volume.util.lsmdisk import LSMDisk

disk = LSMDisk('/dev/sdb')
print(disk.json_report())

The problem appears when LSMDisk.json_report() accesses self.transport, which causes a call to the "link_type_get" function; this issues a bunch of SCSI commands through sg that make the Areca driver very angry. In the most recent invocation I was able to capture some dmesg output that showed the kernel low-memory corruption detector fired with many hundreds of messages like

[  922.619161] Corrupted low memory at 000000009f51663e (5fa0 phys) = 340247800000000
, followed by scsi abort commands and bus resets, etc.:
[  947.227830] ---[ end trace 62b80c80056598c2 ]---
[  951.311530] arcmsr4: abort device command of scsi id = 0 lun = 0
[  954.948616] arcmsr4: abort device command of scsi id = 0 lun = 0
[  958.605631] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 3
[  998.629132] arcmsr4: wait 'abort all outstanding command' timeout
[  998.629138] arcmsr4: executing hw bus reset .....
[ 1051.721691] arcmsr4: wait 'get adapter firmware                      miscellaneous data' timeout
[ 1088.438320] INFO: task jbd2/sda1-8:243 blocked for more than 120 seconds.

filesystems started to be mounted readonly, more and more processes hung or misbehaved, etc.

It seems that at least this lsm call is incompatible with the Areca hardware and driver.


Related issues

Copied to ceph-volume - Backport #48366: octopus: libstoragemgmt calls fatally wound Areca RAID controllers on mira Resolved
Copied to ceph-volume - Backport #48367: nautilus: libstoragemgmt calls fatally wound Areca RAID controllers on mira Rejected

History

#1 Updated by Dan Mick 7 months ago

I've reserved mira083 installed with CentOS 7 (we don't yet have a CentOS 8 image for mira), and can reproduce the bus-reset problem with

#!/usr/bin/python
from lsm import LocalDisk

disk = LocalDisk()
disk.link_type_get('/dev/sdb')

The program also fails with a similar message to the CentOS 8 container run:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/lsm/_local_disk.py", line 299, in link_type_get
    return _use_c_lib_function(_local_disk_link_type_get, disk_path)
  File "/usr/lib/python2.7/site-packages/lsm/_local_disk.py", line 34, in _use_c_lib_function
    raise LsmError(err_no, err_msg)
lsm._common.LsmError: NO_SUPPORT(153): Malformed SCSI data: VPD page code '0x00', should be 0x83

#2 Updated by Jan Fajerski 7 months ago

  • Status changed from New to Fix Under Review
  • Backport set to octopus,nautilus
  • Pull request ID set to 38149

Oh wow...that is not very nice. I pushed a PR to make this data retrieval optional and opt-in.

We should probably also report this upstream for libstoragemgmt?

#3 Updated by Dan Mick 7 months ago

Yes, I intend to follow up with libstoragemgmt maintainers as well. It's possible there's a way to make this code friendlier.

For now, I guess opt-in is a reasonable middle ground, but I worry that someone who actually tries to opt-in won't be prepared for the possibly-horrid complications. Arecas were really popular once, and who knows what other hardware will misbehave here; I just don't have a feel for how wide this might be.

#4 Updated by Brad Hubbard 7 months ago

That's a good point Dan. Could we document this in the release notes (as prominently as possible) as a 'known issue'?

#5 Updated by Brad Hubbard 7 months ago

  • Status changed from Fix Under Review to In Progress

#6 Updated by Nathan Cutler 7 months ago

  • Assignee set to Jan Fajerski

#7 Updated by Dan Mick 7 months ago

Have been debugging the libstoragmgmt code behind link_type_get. It appears the command failure/bus reset occurs on the very first sg operation, _sg_io_vpd, which is an INQUIRY command to get the "supported VPD pages"...so the code is trying hard to be compliant but the Areca firmware just muffs that "tell me what you can tell me" standard command. (INQUIRY, page code 0).

Weirdly, the ioctl doesn't return an error, and it appears as though the command completes successfully (no mode sense data, and the page comes back with good fields, length 4, supported pages 0, 0x80, 0x83, 0xc7, which correspond to this page (the "supported pages" page), serial number, device id, and the unknown c7.

I note also that the 'sg_vpd' command seems to be able to cause the same damage, but seems like it does not unless you set the "max response length" higher than the default.

#8 Updated by Dan Mick 7 months ago

In fact

sg_vpd -p 0 -m 16384 /dev/sde

(0x4000) succeeds without damage, and
sg_vpd -p 0 -m 16385 /dev/sde

causes harm.

(libstoragemgmt uses a length of 0xFFFF, which is the max legal length.)

This is probably enough to go to libstoragemgmt with. The calling code cannot control this length (it shouldn't need to, but that is to say there's no workaround possible from the ceph-volume Python code).

#10 Updated by Jan Fajerski 7 months ago

  • Status changed from In Progress to Pending Backport

#11 Updated by Jan Fajerski 7 months ago

  • Copied to Backport #48366: octopus: libstoragemgmt calls fatally wound Areca RAID controllers on mira added

#12 Updated by Jan Fajerski 7 months ago

  • Copied to Backport #48367: nautilus: libstoragemgmt calls fatally wound Areca RAID controllers on mira added

#13 Updated by Jan Fajerski 7 months ago

  • Backport changed from octopus,nautilus to octopus

#14 Updated by Dan Mick 7 months ago

The libstoragemgmt bug has a fix upstream: https://github.com/libstorage/libstoragemgmt/pull/444

#15 Updated by Jan Fajerski 7 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF