Bug #58316

open

Ceph health metric Scraping still broken

Added by Janek Bevendorff over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was already brought up in #46285, but that issue was marked as rejected.

When I run ceph device scrape-health-metrics HGST_HUH721010AL5200_7JKMZYKG to collect SMART metrics for a device and then list them via ceph device get-health-metrics HGST_HUH721010AL5200_7JKMZYKG, I only get:

{
    "20221220-090607": {
        "dev": "/dev/sdd",
        "error": "smartctl failed",
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
        "nvme_smart_health_information_add_log_error_code": -22,
        "nvme_vendor": "hgst",
        "smartctl_error_code": -22,
        "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
    }
}

The device is NOT an NVMe drive; it is a SAS-attached spinning disk. The same happens for ALL other (SAS) devices in our cluster. In fact, this has been happening since the device health feature first came out. I have been waiting for it to be fixed, but the issue is still there.

I am running the latest Pacific release and smartmontools 7.1.
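
For anyone who wants to check how widespread this is on their own cluster, here is a minimal sketch that scans the JSON printed by ceph device get-health-metrics and lists the scrapes that failed. The function name `failed_scrapes` and the `nvme_path_attempted` flag are made up for illustration; the field names are taken from the output shown above, and this is not how the devicehealth module itself works.

```python
import json


def failed_scrapes(metrics_json: str) -> dict:
    """Return {timestamp: summary} for every scrape entry that reports an error.

    Hypothetical helper: it only inspects the field names seen in this report.
    """
    metrics = json.loads(metrics_json)
    failures = {}
    for timestamp, entry in sorted(metrics.items()):
        if "error" not in entry:
            continue  # this scrape produced usable data
        failures[timestamp] = {
            "dev": entry.get("dev"),
            "error": entry["error"],
            "smartctl_error_code": entry.get("smartctl_error_code"),
            # True when the entry carries nvme_* fields, i.e. an NVMe-specific
            # collection step apparently ran for this device.
            "nvme_path_attempted": any(key.startswith("nvme_") for key in entry),
        }
    return failures


# The entry from this report (raw string so the JSON \n escapes stay intact).
SAMPLE = r'''
{
    "20221220-090607": {
        "dev": "/dev/sdd",
        "error": "smartctl failed",
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
        "nvme_smart_health_information_add_log_error_code": -22,
        "nvme_vendor": "hgst",
        "smartctl_error_code": -22,
        "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
    }
}
'''

if __name__ == "__main__":
    for ts, info in failed_scrapes(SAMPLE).items():
        print(ts, info)
```

Feeding it the output for each device ID would show whether the NVMe fields appear for every SAS disk or only for some of them.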

Actions #1

Updated by Janek Bevendorff over 1 year ago

BTW this is the output of smartctl -a --json on the device:

{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      1
    ],
    "svn_revision": "5022",
    "platform_info": "x86_64-linux-5.4.0-135-generic",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-a",
      "--json",
      "/dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:3:0"
    ],
    "exit_status": 0
  },
  "device": {
    "name": "/dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:3:0",
    "info_name": "/dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:3:0",
    "type": "scsi",
    "protocol": "SCSI"
  },
  "vendor": "HGST",
  "product": "HUH721010AL5200",
  "model_name": "HGST HUH721010AL5200",
  "revision": "LS21",
  "scsi_version": "SPC-4",
  "user_capacity": {
    "blocks": 19134414848,
    "bytes": 9796820402176
  },
  "logical_block_size": 512,
  "physical_block_size": 4096,
  "rotation_rate": 7200,
  "form_factor": {
    "scsi_value": 2,
    "name": "3.5 inches"
  },
  "serial_number": "7JKMZYKG",
  "device_type": {
    "scsi_value": 0,
    "name": "disk"
  },
  "local_time": {
    "time_t": 1671527999,
    "asctime": "Tue Dec 20 10:19:59 2022 CET"
  },
  "smart_status": {
    "passed": true
  },
  "temperature": {
    "current": 30,
    "drive_trip": 50
  },
  "scsi_grown_defect_list": 608,
  "scsi_error_counter_log": {
    "read": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 2602945,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 3435317,
      "correction_algorithm_invocations": 54429661,
      "gigabytes_processed": "386635.708",
      "total_uncorrected_errors": 13
    },
    "write": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 665823,
      "gigabytes_processed": "15771.422",
      "total_uncorrected_errors": 0
    },
    "verify": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 2357059,
      "gigabytes_processed": "0.275",
      "total_uncorrected_errors": 0
    }
  }
}

I understand that this is not what you would expect from a normal ATA drive, but there are still health metrics. The devicehealth module should be able to use them, or at least show a proper error message explaining why this data is unsupported.
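
To illustrate that the SCSI output does carry usable health data, here is a rough sketch that pulls a few fields out of a smartctl -a --json report like the one above and fails with a clear message for any other protocol. The function name `summarize_scsi_health` and the shape of the returned dictionary are assumptions for illustration only; the JSON keys are the ones visible in the report, and none of this reflects what the devicehealth module actually does.

```python
def summarize_scsi_health(report: dict) -> dict:
    """Extract a few health indicators from a smartctl JSON report of a SCSI disk.

    Hypothetical helper: raises ValueError with an explicit reason instead of
    returning an opaque failure when the report is not from a SCSI device.
    """
    protocol = report.get("device", {}).get("protocol")
    if protocol != "SCSI":
        raise ValueError(f"unsupported protocol {protocol!r}: expected 'SCSI'")

    counters = report.get("scsi_error_counter_log", {})
    return {
        "smart_passed": report.get("smart_status", {}).get("passed"),
        "temperature_current": report.get("temperature", {}).get("current"),
        "grown_defects": report.get("scsi_grown_defect_list"),
        # Missing operations map to None rather than being silently dropped.
        "uncorrected_errors": {
            op: counters.get(op, {}).get("total_uncorrected_errors")
            for op in ("read", "write", "verify")
        },
    }


# Trimmed copy of the report above, keeping only the fields that are read.
REPORT = {
    "device": {"type": "scsi", "protocol": "SCSI"},
    "smart_status": {"passed": True},
    "temperature": {"current": 30, "drive_trip": 50},
    "scsi_grown_defect_list": 608,
    "scsi_error_counter_log": {
        "read": {"total_errors_corrected": 3435317, "total_uncorrected_errors": 13},
        "write": {"total_errors_corrected": 0, "total_uncorrected_errors": 0},
        "verify": {"total_errors_corrected": 0, "total_uncorrected_errors": 0},
    },
}

if __name__ == "__main__":
    print(summarize_scsi_health(REPORT))
```

Even this small subset (grown defects and uncorrected read errors) would be more useful than the current "smartctl failed" entry.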
