libstoragemgmt calls fatally wound Areca RAID controllers on mira
Preface: mira is very old hardware and has shown a tendency to be buggy WRT SCSI commands in the past (since the drives are emulated in firmware; there is no direct passthrough)
Updating the long-running cluster to the latest octopus build (a7b62340ad) had a catastrophic effect on the three mira machines in the cluster (055, 060, 093): after about 90 minutes, disk I/O to all drives was damaged, including sda/root, which made diagnosis a bit dicey. A large team cooperated to narrow it down, and the culprit appeared to be the periodic execution of "cephadm ceph-volume inventory". Disabling cephadm on the cluster has kept the hosts up, and while it was disabled, minimal-reproduction efforts led to the libstoragemgmt code in ceph-volume.
A small reproducer is:

from ceph_volume.util.lsmdisk import LSMDisk

disk = LSMDisk('/dev/sdb')
print(disk.json_report())
The problem appears when LSMDisk.json_report() accesses self.transport, which causes a call to the "link_type_get" function; this issues a bunch of SCSI commands through sg that make the Areca driver very angry. In the most recent invocation I was able to capture some dmesg output that showed the kernel low-memory corruption detector fired with many hundreds of messages like
[ 922.619161] Corrupted low memory at 000000009f51663e (5fa0 phys) = 340247800000000

followed by SCSI command aborts, bus resets, etc.:

[ 947.227830] ---[ end trace 62b80c80056598c2 ]---
[ 951.311530] arcmsr4: abort device command of scsi id = 0 lun = 0
[ 954.948616] arcmsr4: abort device command of scsi id = 0 lun = 0
[ 958.605631] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 3
[ 998.629132] arcmsr4: wait 'abort all outstanding command' timeout
[ 998.629138] arcmsr4: executing hw bus reset .....
[ 1051.721691] arcmsr4: wait 'get adapter firmware miscellaneous data' timeout
[ 1088.438320] INFO: task jbd2/sda1-8:243 blocked for more than 120 seconds.
Filesystems started to be mounted read-only, more and more processes hung or misbehaved, etc.
It seems that at least this lsm call is incompatible with the Areca hardware and driver.
I've reserved mira083 installed with CentOS 7 (we don't yet have a CentOS 8 image for mira), and can reproduce the bus-reset problem with
#!/usr/bin/python
from lsm import LocalDisk

disk = LocalDisk()
disk.link_type_get('/dev/sdb')
The program also fails with a message similar to the one from the CentOS 8 container run:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/lsm/_local_disk.py", line 299, in link_type_get
    return _use_c_lib_function(_local_disk_link_type_get, disk_path)
  File "/usr/lib/python2.7/site-packages/lsm/_local_disk.py", line 34, in _use_c_lib_function
    raise LsmError(err_no, err_msg)
lsm._common.LsmError: NO_SUPPORT(153): Malformed SCSI data: VPD page code '0x00', should be 0x83
#2 Updated by Jan Fajerski 7 months ago
- Status changed from New to Fix Under Review
- Backport set to octopus,nautilus
- Pull request ID set to 38149
Oh wow...that is not very nice. I pushed a PR to make this data retrieval optional and opt-in.
We should probably also report this upstream for libstoragemgmt?
Yes, I intend to follow up with libstoragemgmt maintainers as well. It's possible there's a way to make this code friendlier.
For now, I guess opt-in is a reasonable middle ground, but I worry that someone who actually opts in won't be prepared for the possibly-horrid complications. Arecas were really popular once, and who knows what other hardware will misbehave here; I just don't have a feel for how widespread this might be.
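For reference, the opt-in pattern being discussed can be sketched roughly as below. This is a hypothetical illustration only: the actual flag name and report shape in PR 38149 may differ, and the probe callable stands in for the real libstoragemgmt calls.

```python
# Hypothetical sketch of an opt-in guard around the libstoragemgmt probe.
# The flag name (with_lsm) and report layout are illustrative, not the
# actual ceph-volume API from PR 38149.
def device_report(path, probe_lsm, with_lsm=False):
    report = {"path": path}
    if with_lsm:
        # Only touch libstoragemgmt when explicitly requested, since the
        # SCSI commands it issues can wedge buggy firmware like the Arecas.
        try:
            report["lsm"] = probe_lsm(path)
        except Exception as e:
            report["lsm_error"] = str(e)
    return report

print(device_report("/dev/sdb", lambda p: {"link_type": "unknown"}))
# {'path': '/dev/sdb'}  -- lsm data is skipped unless opted in
```

The try/except matters here: even when a user opts in, a probe failure should degrade to a partial report rather than abort the whole inventory run.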
I have been debugging the libstoragemgmt code behind link_type_get. It appears the command failure/bus reset occurs on the very first sg operation, _sg_io_vpd, an INQUIRY command to get the "supported VPD pages" page; the code is trying hard to be compliant, but the Areca firmware just muffs that "tell me what you can tell me" standard command (INQUIRY, page code 0).
Weirdly, the ioctl doesn't return an error, and the command appears to complete successfully: there is no sense data, and the page comes back with good fields (length 4; supported pages 0x00, 0x80, 0x83, and 0xc7, which correspond to this page (the "supported pages" page), the serial number page, the device identification page, and the unknown 0xc7, a vendor-specific page code).
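For anyone decoding these responses by hand, the layout of the page is simple. The sketch below parses a "supported VPD pages" buffer; the sample bytes are constructed to match the fields reported above (length 4; pages 0x00, 0x80, 0x83, 0xc7), not a real capture from the Areca.

```python
def parse_supported_vpd_pages(buf):
    # SCSI "Supported VPD Pages" page (0x00): byte 0 is the peripheral
    # qualifier/device type, byte 1 the page code, bytes 2-3 a big-endian
    # page length, followed by one byte per supported page code.
    if buf[1] != 0x00:
        raise ValueError("not the supported-pages page: 0x%02x" % buf[1])
    length = (buf[2] << 8) | buf[3]
    return list(buf[4:4 + length])

# Constructed sample matching the observed fields; the first byte
# (peripheral qualifier/type) is a placeholder.
sample = bytes([0x00, 0x00, 0x00, 0x04, 0x00, 0x80, 0x83, 0xc7])
print([hex(p) for p in parse_supported_vpd_pages(sample)])
# ['0x0', '0x80', '0x83', '0xc7']
```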
I note also that the 'sg_vpd' command can cause the same damage, but only if the "max response length" is set higher than the default.
sg_vpd -p 0 -m 16384 /dev/sde
(0x4000) succeeds without damage, while

sg_vpd -p 0 -m 16385 /dev/sde

triggers the damage. (libstoragemgmt uses a length of 0xFFFF, which is the maximum legal length.)
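To see why 0xFFFF is the maximum legal length: the 6-byte INQUIRY CDB only has a 16-bit field for the allocation length. A minimal sketch of building that CDB (this is an illustration of the SPC-defined wire format, not libstoragemgmt's actual code):

```python
def inquiry_evpd_cdb(page_code, alloc_len):
    # 6-byte INQUIRY CDB per SPC: opcode 0x12, EVPD bit set in byte 1,
    # requested VPD page in byte 2, 16-bit big-endian allocation length
    # in bytes 3-4, control byte 0. The 16-bit field caps alloc_len
    # at 0xFFFF (65535).
    if not 0 <= alloc_len <= 0xFFFF:
        raise ValueError("allocation length must fit in 16 bits")
    return bytes([0x12, 0x01, page_code,
                  (alloc_len >> 8) & 0xFF, alloc_len & 0xFF, 0x00])

print(inquiry_evpd_cdb(0x00, 0xFFFF).hex())  # 120100ffff00
```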
This is probably enough to go to libstoragemgmt with. The calling code cannot control this length (nor should it need to), which is to say there's no workaround possible from the ceph-volume Python side.
The libstoragemgmt bug has a fix upstream: https://github.com/libstorage/libstoragemgmt/pull/444