Bug #43006

open

Device monitoring - get-health-metrics - json parse error

Added by Olivier Sauzet over 4 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Monitoring
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

My Ceph nodes run Ubuntu 18.04 with the 4.15.0-66-generic kernel.

I installed smartmontools like this:

wget http://launchpadlibrarian.net/425861623/smartmontools_7.0-0ubuntu1~ubuntu18.04.1_amd64.deb
dpkg -i smartmontools_7.0-0ubuntu1~ubuntu18.04.1_amd64.deb

My device list:

 ceph device ls
DEVICE                                 HOST:DEV   DAEMONS LIFE EXPECTANCY 
ATA_HGST_HUS726040AL_K3G1BXMB          weil02:sdf osd.7                   
ATA_ST2000VX008-2E31_Z5232C5P          weil02:sdg osd.11                  
ATA_ST4000NM0004-1FT_Z4F04GEK          weil02:sde osd.6                   
HGST_HUS726020ALA610_K5HKP7YD          weil01:sdc osd.3                   
HGST_HUS726040ALA610_K4H4WG8B          weil01:sdb osd.2                   
Hitachi_HUA722020ALA330_JK1151YAHB9EJZ weil04:sdb osd.10                  
Hitachi_HUA722020ALA330_JK1151YAHL1MEZ weil04:sda osd.9                   
Hitachi_HUA722020ALA330_JK1151YAHL7GXZ weil04:sdd osd.0                   
SEAGATE_ST2000NM0023_Z1X1C7GD          weil02:sdd osd.5                   
ST2000NM0033-9ZM175_Z1X0RFXY           weil04:sdc osd.8                   
WDC_WD2003FZEX-00Z4SA0_WD-WCC5C53H4FE0 weil01:sdd osd.4           

Some disks report errors, like this one (osd.3):

ceph device info HGST_HUS726020ALA610_K5HKP7YD
device HGST_HUS726020ALA610_K5HKP7YD
attachment weil01:sdc
daemons osd.3
ceph device get-health-metrics HGST_HUS726020ALA610_K5HKP7YD
    "20191121-132728": {
        "nvme_smart_health_information_add_log_error_code": -22, 
        "nvme_vendor": "hgst_hus726020ala610", 
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 231", 
        "dev": "/dev/sdc", 
        "error": "smartctl returned invalid JSON" 
    }, 
  • It's strange, because another drive, osd.2 (same model family, same host), does return some information, although its JSON output contains the same nvme error. The output is in the attached file HGST_output_K4H4WG8B.txt:
    ceph device get-health-metrics HGST_HUS726040ALA610_K4H4WG8B

  • The smartctl output for osd.3 (in the attached file output_smart_K5HKP7YD.txt):
    smartctl -a --json /dev/sdc
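
To reproduce the check outside Ceph, the following Python sketch runs the same smartctl command shown above and reports whether its standard output parses as JSON. The function names check_json and run_smartctl are hypothetical, and the sketch assumes a smartctl with --json support (7.0 or later) is on the PATH. It is only a diagnostic aid, not a description of how the Ceph manager collects the metrics.

```python
import json
import subprocess


def check_json(text):
    """Return (True, parsed) if text is valid JSON, else (False, reason)."""
    if not text.strip():
        return False, "empty output"
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        reason = f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        return False, reason


def run_smartctl(dev):
    """Run 'smartctl -a --json <dev>' and return (exit_status, stdout, stderr).

    A nonzero exit status is not treated as fatal here, because smartctl
    uses its exit status as a bit mask and may still print valid JSON.
    """
    proc = subprocess.run(
        ["smartctl", "-a", "--json", dev],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Passing the stdout returned by run_smartctl("/dev/sdc") to check_json shows whether the raw smartctl output is parseable on the affected host. If it is, the "invalid JSON" error is more likely introduced somewhere between smartctl and the manager than by the drive itself.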

Files

HGST_output_K4H4WG8B.txt (96.6 KB): health-metrics HGST_HUS726040ALA610_K4H4WG8B (Olivier Sauzet, 11/25/2019 12:53 PM)
output_smart_K5HKP7YD.txt (23 KB): smart output for HGST_HUS726020ALA610_K5HKP7YD (Olivier Sauzet, 11/25/2019 12:56 PM)
Actions #1

Updated by Itay Ringler over 4 years ago

Hi,

I'm experiencing the same problem on my setup.

It runs on CentOS Linux release 7.7.1908 (Core).

My smartmontools version is: smartmontools release 7.0 dated 2018-12-30 at 14:47:55 UTC

Ceph version:

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)

smartmontools configure arguments: '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--localstatedir=/var' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-selinux' '--with-libcap-ng=yes' '--with-libsystemd' '--with-systemdsystemunitdir=/usr/lib/systemd/system' '--sysconfdir=/etc/smartmontools/' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic' 'LDFLAGS=-Wl,-z,relro ' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'

My device list:

 ceph device ls
DEVICE                               HOST:DEV                   DAEMONS LIFE EXPECTANCY
SanDisk_SSD_PLUS_240_GB_174302800028 overcloud-ovscompute-0:sdb osd.1
SanDisk_SSD_PLUS_240_GB_174352803082 overcloud-ovscompute-0:sda osd.0

I don't have errors on my disks but I still get:

ceph device get-health-metrics SanDisk_SSD_PLUS_240_GB_174302800028
{
    "20200113-094027": {
        "nvme_smart_health_information_add_log_error_code": -22,
        "nvme_vendor": "lvm",
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
        "dev": "/dev/sdb",
        "error": "smartctl returned invalid JSON" 
    }
}
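
A quick way to see which collected samples failed is to filter the command's JSON output for entries carrying an "error" key. The Python sketch below does that; the helper name failed_samples is hypothetical, and it assumes the output has the shape shown above, that is, an object keyed by timestamp whose values are per-sample objects.

```python
import json


def failed_samples(metrics):
    """Map timestamp -> error message for every sample that has an 'error' key."""
    return {
        stamp: sample["error"]
        for stamp, sample in sorted(metrics.items())
        if isinstance(sample, dict) and "error" in sample
    }


# Sample copied from the get-health-metrics output above.
raw = """
{
    "20200113-094027": {
        "nvme_smart_health_information_add_log_error_code": -22,
        "nvme_vendor": "lvm",
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
        "dev": "/dev/sdb",
        "error": "smartctl returned invalid JSON"
    }
}
"""

for stamp, message in failed_samples(json.loads(raw)).items():
    print(f"{stamp}: {message}")
```

In practice the input would be the saved output of ceph device get-health-metrics for one device rather than an inline string.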

Actions #2

Updated by Ernesto Puerta about 3 years ago

  • Project changed from mgr to Dashboard
  • Category changed from 148 to Monitoring
Actions #3

Updated by Janek Bevendorff almost 2 years ago

Any progress on this? We have the same issue with all our 10 TB SAS disks. Running ceph device get-health-metrics HGST_HUH721010AL5200_7JKS257G prints:

{

    ...

    "20220511-120407": {
        "dev": "/dev/sdp",
        "error": "smartctl failed",
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
        "nvme_smart_health_information_add_log_error_code": -22,
        "nvme_vendor": "hgst",
        "smartctl_error_code": -22,
        "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n" 
    },
    "20220512-003147": {
        "dev": "/dev/sdp",
        "error": "smartctl failed",
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
        "nvme_smart_health_information_add_log_error_code": -22,
        "nvme_vendor": "hgst",
        "smartctl_error_code": -22,
        "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n" 
    }
}

whereas smartctl -aj /dev/sdp gives me:

{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      1
    ],
    "svn_revision": "5022",
    "platform_info": "x86_64-linux-5.4.0-110-generic",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-aj",
      "/dev/sdp" 
    ],
    "exit_status": 0
  },
  "device": {
    "name": "/dev/sdp",
    "info_name": "/dev/sdp",
    "type": "scsi",
    "protocol": "SCSI" 
  },
  "vendor": "HGST",
  "product": "HUH721010AL5200",
  "model_name": "HGST HUH721010AL5200",
  "revision": "LS17",
  "scsi_version": "SPC-4",
  "user_capacity": {
    "blocks": 19134414848,
    "bytes": 9796820402176
  },
  "logical_block_size": 512,
  "physical_block_size": 4096,
  "rotation_rate": 7200,
  "form_factor": {
    "scsi_value": 2,
    "name": "3.5 inches" 
  },
  "serial_number": "7JKB2RYC",
  "device_type": {
    "scsi_value": 0,
    "name": "disk" 
  },
  "local_time": {
    "time_t": 1652341574,
    "asctime": "Thu May 12 09:46:14 2022 CEST" 
  },
  "smart_status": {
    "passed": true
  },
  "temperature": {
    "current": 47,
    "drive_trip": 50
  },
  "scsi_grown_defect_list": 0,
  "scsi_error_counter_log": {
    "read": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 1588,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 3317,
      "correction_algorithm_invocations": 4616120,
      "gigabytes_processed": "277434.907",
      "total_uncorrected_errors": 0
    },
    "write": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 2211120,
      "gigabytes_processed": "47340.664",
      "total_uncorrected_errors": 0
    },
    "verify": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 230731,
      "gigabytes_processed": "0.138",
      "total_uncorrected_errors": 0
    }
  }
}
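
The "exit_status": 0 in the JSON above is worth comparing with the status Ceph reports. The smartctl man page documents the exit status as a bit mask, and the Python sketch below decodes it (the name decode_exit_status and the shortened descriptions are illustrative). Whether the "exit status: 1" in the Ceph output comes from smartctl itself or from the sudo wrapper is not established in this report, so the decoding only applies if the value really is smartctl's.

```python
# Meaning of each bit in smartctl's exit status, paraphrased from the
# EXIT STATUS section of the smartctl man page (bit 0 is the lowest bit).
SMARTCTL_EXIT_BITS = [
    "command line did not parse",
    "device open failed or device did not identify itself",
    "a SMART or other command to the disk failed, or checksum error",
    "SMART status check returned DISK FAILING",
    "prefail attributes at or below threshold",
    "some attributes were at or below threshold in the past",
    "device error log contains records of errors",
    "device self-test log contains records of errors",
]


def decode_exit_status(status):
    """Return the descriptions of all bits set in a smartctl exit status."""
    return [
        message
        for bit, message in enumerate(SMARTCTL_EXIT_BITS)
        if status & (1 << bit)
    ]


for value in (0, 1, 64):
    print(value, decode_exit_status(value))
```

A status of 0 decodes to an empty list, which matches the successful manual run above.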