Bug #46285
osd: error from smartctl is always reported as invalid JSON
Description
When smartctl returns an error, the OSD always reports it as invalid JSON. We meant to give a better error, but the conditional here is incorrect: https://github.com/ceph/ceph/pull/28848/files#diff-14b6f2e2d1fee0fc37c98071f13661f4R738-R739
For example:
[root@extensa003 ~]# ceph device get-health-metrics AVAGO_SMC3108_00f11da416efd1fc2200d36c23800403
{
  "20200629-001032": {
    "dev": "/dev/sdg",
    "error": "smartctl returned invalid JSON",
    "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231",
    "nvme_smart_health_information_add_log_error_code": -22,
    "nvme_vendor": "avago"
  },
  "20200630-000823": {
    "dev": "/dev/sdg",
    "error": "smartctl returned invalid JSON",
    "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231",
    "nvme_smart_health_information_add_log_error_code": -22,
    "nvme_vendor": "avago"
  }
}
[root@mero007 ~]# smartctl -a --json "/dev/sdg"
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      0
    ],
    "svn_revision": "4883",
    "platform_info": "x86_64-linux-3.10.0-1062.7.1.el7.x86_64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-a",
      "--json",
      "/dev/sdg"
    ],
    "exit_status": 4
  },
  "device": {
    "name": "/dev/sdg",
    "info_name": "/dev/sdg",
    "type": "scsi",
    "protocol": "SCSI"
  },
  "vendor": "AVAGO",
  "product": "SMC3108",
  "model_name": "AVAGO SMC3108",
  "revision": "4.68",
  "scsi_version": "SPC-3",
  "user_capacity": {
    "blocks": 15626993664,
    "bytes": 8001020755968
  },
  "logical_block_size": 512,
  "physical_block_size": 4096,
  "serial_number": "00f11da416efd1fc2200d36c23800403",
  "device_type": {
    "scsi_value": 0,
    "name": "disk"
  },
  "local_time": {
    "time_t": 1593500798,
    "asctime": "Tue Jun 30 07:06:38 2020 UTC"
  },
  "temperature": {
    "current": 0,
    "drive_trip": 0
  }
}
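Note that the smartctl output above is valid JSON even though "exit_status" is 4. The smartctl exit status is a bitmask, so a non-zero value does not by itself mean the output is unusable. A minimal Python sketch for decoding it follows; the bit descriptions are paraphrased from the EXIT STATUS section of the smartctl(8) man page, and the function names are hypothetical helpers, not part of Ceph or smartmontools.

```python
# Paraphrased from the EXIT STATUS section of smartctl(8).
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed or device did not identify itself",
    2: "a SMART or other command to the disk failed, or a checksum error occurred",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "SMART status OK, but attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status):
    """Return the descriptions of all bits set in a smartctl exit status."""
    return [
        message
        for bit, message in sorted(SMARTCTL_EXIT_BITS.items())
        if status & (1 << bit)
    ]


def exit_status_from_report(report):
    """Read smartctl.exit_status from a parsed `smartctl --json` report."""
    return report.get("smartctl", {}).get("exit_status", 0)
```

For the report above, exit_status_from_report() yields 4, which decodes to bit 2 only: some command to the disk failed, while the rest of the report is still present and parseable.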
History
#1 Updated by Yaarit Hatuka about 3 years ago
Which version is this cluster running?
I would expect to see this "output" key in the command's output:
https://github.com/ceph/ceph/pull/28848/files#diff-14b6f2e2d1fee0fc37c98071f13661f4R750
Is the cluster containerized?
If so, the smartctl version inside the container might be old (meaning --json is indeed an invalid option there), whereas the smartctl version outside the container is 7.0 and supports --json.
Btw, the osd also returns other smartctl errors, for example:
{
  "dev": "/dev/sdg",
  "error": "smartctl failed",
  "host_id": "obfuscated here",
  "nvme_vendor": "lsi",
  "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n",
  "smartctl_error_code": -22,
  "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
  "nvme_smart_health_information_add_log_error_code": -22
}
(taken from telemetry device report of a HW RAID controller)
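The two error strings seen in this ticket ("smartctl failed" and "smartctl returned invalid JSON") describe different failures. As an illustration only, here is a Python sketch of one way to tell them apart. This is not the OSD's actual C++ logic; the function name and the decision order are assumptions, and only the key names ("error", "smartctl_output", "smartctl_error_code", "output") are borrowed from the reports in this ticket.

```python
import json


def classify_smartctl_result(returncode, stdout, stderr=""):
    """Hypothetical classifier for the result of running smartctl --json.

    Returns the parsed report when stdout is a JSON object, otherwise an
    error record that distinguishes a failed command from unparseable output.
    """
    try:
        parsed = json.loads(stdout)
    except ValueError:
        parsed = None

    # Valid JSON wins even if the exit status is non-zero, because smartctl
    # can set exit-status bits while still emitting a complete report.
    if isinstance(parsed, dict):
        return parsed

    if returncode != 0:
        return {
            "error": "smartctl failed",
            "smartctl_error_code": returncode,
            "smartctl_output": "stderr:\n%s\nstdout:\n%s" % (stderr, stdout),
        }

    # The command succeeded but did not print a JSON object.
    return {"error": "smartctl returned invalid JSON", "output": stdout}
```

With this ordering, a run like the one in the description (exit status 4, well-formed JSON) is returned as a report, while an empty stdout with a non-zero exit is labelled "smartctl failed" rather than "invalid JSON".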
#2 Updated by Josh Durgin about 3 years ago
- Status changed from New to Rejected
Turns out the report was from an earlier version (it did not contain the 'output' key).
#3 Updated by Daniël Vos about 3 years ago
Yaarit Hatuka wrote:
Is the cluster containerized?
If so, the smartctl version inside the container might be old (meaning --json is indeed an invalid option there), whereas the smartctl version outside the container is 7.0 and supports --json.
smartmontools in the ceph/ceph:latest image is indeed version 6.6
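A quick way to confirm this kind of mismatch is to compare the version banner printed by smartctl inside and outside the container. The Python sketch below parses the first line of that banner and checks it against 7.0, the release in which the --json option appeared. The helper names and the sample banner strings in the usage note are illustrative assumptions, not captured output from the hosts in this ticket.

```python
import re

# smartmontools release that introduced the --json option.
JSON_MIN_VERSION = (7, 0)


def parse_smartctl_version(version_output):
    """Extract (major, minor) from a banner such as 'smartctl 7.0 2018-12-30 ...'.

    Returns None when the text does not start with a smartctl version banner.
    """
    match = re.match(r"\s*smartctl\s+(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_json(version_output):
    """True when the reported smartctl version is new enough for --json."""
    version = parse_smartctl_version(version_output)
    return version is not None and version >= JSON_MIN_VERSION
```

For example, supports_json("smartctl 6.6 ...") is False and supports_json("smartctl 7.0 ...") is True, which matches the situation described here: 6.6 in the image, 7.0 on the host.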