Bug #43006
Device monitoring - get-health-metrics - json parse error
Status:
New
Priority:
Normal
Assignee:
-
Category:
Monitoring
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Hi,
My Ceph nodes run Ubuntu 18.04 with the 4.15.0-66-generic kernel.
I installed smartmontools like this:
wget http://launchpadlibrarian.net/425861623/smartmontools_7.0-0ubuntu1~ubuntu18.04.1_amd64.deb
dpkg -i smartmontools_7.0-0ubuntu1~ubuntu18.04.1_amd64.deb
My device list:
ceph device ls
DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY
ATA_HGST_HUS726040AL_K3G1BXMB weil02:sdf osd.7
ATA_ST2000VX008-2E31_Z5232C5P weil02:sdg osd.11
ATA_ST4000NM0004-1FT_Z4F04GEK weil02:sde osd.6
HGST_HUS726020ALA610_K5HKP7YD weil01:sdc osd.3
HGST_HUS726040ALA610_K4H4WG8B weil01:sdb osd.2
Hitachi_HUA722020ALA330_JK1151YAHB9EJZ weil04:sdb osd.10
Hitachi_HUA722020ALA330_JK1151YAHL1MEZ weil04:sda osd.9
Hitachi_HUA722020ALA330_JK1151YAHL7GXZ weil04:sdd osd.0
SEAGATE_ST2000NM0023_Z1X1C7GD weil02:sdd osd.5
ST2000NM0033-9ZM175_Z1X0RFXY weil04:sdc osd.8
WDC_WD2003FZEX-00Z4SA0_WD-WCC5C53H4FE0 weil01:sdd osd.4
Some disks report an error, like this one (osd.3):
ceph device info HGST_HUS726020ALA610_K5HKP7YD
device HGST_HUS726020ALA610_K5HKP7YD
attachment weil01:sdc
daemons osd.3
ceph device get-health-metrics HGST_HUS726020ALA610_K5HKP7YD
"20191121-132728": {
"nvme_smart_health_information_add_log_error_code": -22,
"nvme_vendor": "hgst_hus726020ala610",
"nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231",
"dev": "/dev/sdc",
"error": "smartctl returned invalid JSON"
},
- This is strange, because another drive, osd.2 (same model, same host), returns some information, although its JSON output contains the same nvme error. The output is in the attached file HGST_output_K4H4WG8B.txt:
ceph device get-health-metrics HGST_HUS726040ALA610_K4H4WG8B
- The SMART output of osd.3:
smartctl -a --json /dev/sdc
(the output is in the attached files: output_smart_K5HKP7YD.txt)
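To narrow down whether the smartctl output itself is unparseable, the command above can be run by hand and its stdout fed through a JSON parser. The following is only a minimal sketch of that check and is not the code the Ceph mgr module uses; the function names `parse_smartctl_stdout` and `check_device` are made up for illustration. It assumes Python 3.7+ and that `smartctl` is on the PATH and run with sufficient privileges.

```python
import json
import subprocess
import sys


def parse_smartctl_stdout(stdout):
    """Return (data, None) if stdout is a JSON object, else (None, reason)."""
    if not stdout.strip():
        return None, "empty output"
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level JSON value is not an object"
    return data, None


def check_device(device):
    """Run `smartctl -a --json <device>` and report exit status and parse result."""
    proc = subprocess.run(
        ["smartctl", "-a", "--json", device],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit status does not necessarily mean there is no JSON
    )
    data, reason = parse_smartctl_stdout(proc.stdout)
    return proc.returncode, data, reason


# Only runs when a device path is given, e.g.: python3 check_smart.py /dev/sdc
if __name__ == "__main__" and len(sys.argv) > 1:
    code, data, reason = check_device(sys.argv[1])
    print("exit status:", code)
    print("JSON OK" if data is not None else f"JSON problem: {reason}")
```

If this reports valid JSON for a device where `ceph device get-health-metrics` stores "smartctl returned invalid JSON", the difference is more likely in how the command is invoked by the daemon (for example through sudo) than in the drive's output.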
History
#1 Updated by Itay Ringler over 3 years ago
Hi,
I'm experiencing the same problem on my setup.
It runs CentOS Linux release 7.7.1908 (Core).
My smartmontools version is: smartmontools release 7.0 dated 2018-12-30 at 14:47:55 UTC
Ceph version:
ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
smartmontools configure arguments: '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--localstatedir=/var' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-selinux' '--with-libcap-ng=yes' '--with-libsystemd' '--with-systemdsystemunitdir=/usr/lib/systemd/system' '--sysconfdir=/etc/smartmontools/' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'LDFLAGS=-Wl,-z,relro ' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
My device list:
ceph device ls
DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY
SanDisk_SSD_PLUS_240_GB_174302800028 overcloud-ovscompute-0:sdb osd.1
SanDisk_SSD_PLUS_240_GB_174352803082 overcloud-ovscompute-0:sda osd.0
I don't have errors on my disks but I still get:
ceph device get-health-metrics SanDisk_SSD_PLUS_240_GB_174302800028
{
"20200113-094027": {
"nvme_smart_health_information_add_log_error_code": -22,
"nvme_vendor": "lvm",
"nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
"dev": "/dev/sdb",
"error": "smartctl returned invalid JSON"
}
}
#2 Updated by Ernesto Puerta over 2 years ago
- Project changed from mgr to Dashboard
- Category changed from 148 to Monitoring
#3 Updated by Janek Bevendorff over 1 year ago
Any progress on this? We have the same issue with all our 10 TB SAS disks. Running ceph device get-health-metrics HGST_HUH721010AL5200_7JKS257G
prints
{
...
"20220511-120407": {
"dev": "/dev/sdp",
"error": "smartctl failed",
"nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
"nvme_smart_health_information_add_log_error_code": -22,
"nvme_vendor": "hgst",
"smartctl_error_code": -22,
"smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
},
"20220512-003147": {
"dev": "/dev/sdp",
"error": "smartctl failed",
"nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
"nvme_smart_health_information_add_log_error_code": -22,
"nvme_vendor": "hgst",
"smartctl_error_code": -22,
"smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
}
}
whereas
smartctl -aj /dev/sdp
gives me
{
"json_format_version": [
1,
0
],
"smartctl": {
"version": [
7,
1
],
"svn_revision": "5022",
"platform_info": "x86_64-linux-5.4.0-110-generic",
"build_info": "(local build)",
"argv": [
"smartctl",
"-aj",
"/dev/sdp"
],
"exit_status": 0
},
"device": {
"name": "/dev/sdp",
"info_name": "/dev/sdp",
"type": "scsi",
"protocol": "SCSI"
},
"vendor": "HGST",
"product": "HUH721010AL5200",
"model_name": "HGST HUH721010AL5200",
"revision": "LS17",
"scsi_version": "SPC-4",
"user_capacity": {
"blocks": 19134414848,
"bytes": 9796820402176
},
"logical_block_size": 512,
"physical_block_size": 4096,
"rotation_rate": 7200,
"form_factor": {
"scsi_value": 2,
"name": "3.5 inches"
},
"serial_number": "7JKB2RYC",
"device_type": {
"scsi_value": 0,
"name": "disk"
},
"local_time": {
"time_t": 1652341574,
"asctime": "Thu May 12 09:46:14 2022 CEST"
},
"smart_status": {
"passed": true
},
"temperature": {
"current": 47,
"drive_trip": 50
},
"scsi_grown_defect_list": 0,
"scsi_error_counter_log": {
"read": {
"errors_corrected_by_eccfast": 0,
"errors_corrected_by_eccdelayed": 1588,
"errors_corrected_by_rereads_rewrites": 0,
"total_errors_corrected": 3317,
"correction_algorithm_invocations": 4616120,
"gigabytes_processed": "277434.907",
"total_uncorrected_errors": 0
},
"write": {
"errors_corrected_by_eccfast": 0,
"errors_corrected_by_eccdelayed": 0,
"errors_corrected_by_rereads_rewrites": 0,
"total_errors_corrected": 0,
"correction_algorithm_invocations": 2211120,
"gigabytes_processed": "47340.664",
"total_uncorrected_errors": 0
},
"verify": {
"errors_corrected_by_eccfast": 0,
"errors_corrected_by_eccdelayed": 0,
"errors_corrected_by_rereads_rewrites": 0,
"total_errors_corrected": 0,
"correction_algorithm_invocations": 230731,
"gigabytes_processed": "0.138",
"total_uncorrected_errors": 0
}
}
}
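When many samples have accumulated, it can help to list only the ones that recorded a failure. The sketch below assumes the `get-health-metrics` output has the shape shown in this ticket (a JSON object keyed by timestamp, with failed samples carrying an "error" key); the helper name `failed_samples` is hypothetical.

```python
import json
import sys


def failed_samples(metrics):
    """Return (timestamp, device, error) for every sample that has an "error" key."""
    return [
        (ts, sample.get("dev"), sample["error"])
        for ts, sample in sorted(metrics.items())
        if isinstance(sample, dict) and "error" in sample
    ]


# Usage: ceph device get-health-metrics <devid> > metrics.json
#        python3 failed_samples.py metrics.json
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        for ts, dev, error in failed_samples(json.load(fh)):
            print(ts, dev, error)
```

Sorting by key works here because the timestamps use the `YYYYMMDD-HHMMSS` form, which orders chronologically as plain strings.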