Bug #48604

closed

orchestrator: query-daemon-health-metrics fails, no smartctl output

Added by Volker Theile over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Running 'ceph device query-daemon-health-metrics' returns errors, but smartctl_output does not contain any helpful information. As far as I understand the C++ code, the text after 'stdout:' should contain the smartctl output, but it is empty.

https://github.com/ceph/ceph/blob/octopus/src/common/blkdev.cc#L728
https://github.com/ceph/ceph/blob/octopus/src/common/blkdev.cc#L758

 :~ # ceph device query-daemon-health-metrics osd.6
{
    "HUH721010ALE600______00YK043D7A01892LEN_1EK70PSZ" : {
        "dev" : "/dev/sdc",
        "error" : "smartctl failed",
        "nvme_smart_health_information_add_log_error" : "nvme returned an error: sudo: exit status: 1",
        "nvme_smart_health_information_add_log_error_code" : -22,
        "nvme_vendor" : "ata",
        "smartctl_error_code" : -22,
        "smartctl_output" : "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n" 
    },
    "KCM51VUG800G_79M0A01PTZZF" : {
        "dev" : "/dev/nvme1n1",
        "error" : "smartctl failed",
        "nvme_smart_health_information_add_log_error" : "nvme returned an error: sudo: exit status: 1",
        "nvme_smart_health_information_add_log_error_code" : -22,
        "nvme_vendor" : "lvm",
        "smartctl_error_code" : -22,
        "smartctl_output" : "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n" 
    }
}
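
To make the empty 'stdout:' part easier to spot across many devices, the report can be scanned with a short script. This is only a sketch: the function name failed_devices is hypothetical, and the "stderr:\n...\nstdout:\n..." layout is inferred from the sample output above, not from a documented format.

```python
import json

def failed_devices(report_text):
    """Return {device_id: (dev_path, stderr, stdout)} for entries that report an error."""
    report = json.loads(report_text)
    failures = {}
    for dev_id, info in report.items():
        if "error" not in info:
            continue
        raw = info.get("smartctl_output", "")
        # Assumed layout, taken from the sample in this report:
        # "<message> stderr:\n<stderr text>\nstdout:\n<stdout text>"
        _, _, rest = raw.partition("stderr:\n")
        stderr, _, stdout = rest.partition("\nstdout:\n")
        failures[dev_id] = (info.get("dev"), stderr.strip(), stdout.strip())
    return failures

# Shortened copy of one entry from the output above.
sample = '''{
  "KCM51VUG800G_79M0A01PTZZF": {
    "dev": "/dev/nvme1n1",
    "error": "smartctl failed",
    "smartctl_error_code": -22,
    "smartctl_output": "smartctl returned an error (1): stderr:\\nsudo: exit status: 1\\nstdout:\\n"
  }
}'''

for dev_id, (dev, stderr, stdout) in failed_devices(sample).items():
    print(dev_id, dev, "stderr:", repr(stderr), "stdout:", repr(stdout))
```

For the entry above, the stderr part only carries the sudo exit status and the stdout part is an empty string, which matches the problem described here.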

It is possible to run smartctl manually without problems.

{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      0
    ],
    "svn_revision": "4917",
    "platform_info": "x86_64-linux-5.3.18-24.37-default",
    "build_info": "(SUSE RPM)",
    "argv": [
      "smartctl",
      "-a",
      "--json=o",
      "/dev/sdc" 
    ],
    "output": [
      "smartctl 7.0 2019-05-21 r4917 [x86_64-linux-5.3.18-24.37-default] (SUSE RPM)",
      "Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org",
      "",
      "=== START OF INFORMATION SECTION ===",
      "Device Model:     HUH721010ALE600      00YK043D7A01892LEN",
      "Serial Number:    1EK70PSZ",
      "LU WWN Device Id: 5 000cca 27eed77be",
      "Firmware Version: LHGNK9Q7",
      "User Capacity:    10,000,831,348,736 bytes [10.0 TB]",
      "Sector Sizes:     512 bytes logical, 4096 bytes physical",
      "Rotation Rate:    7200 rpm",
      "Form Factor:      3.5 inches",
      "Device is:        Not in smartctl database [for details use: -P showall]",
      "ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4",
      "SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)",
      "Local Time is:    Tue Nov 24 09:55:24 2020 GMT",
      "SMART support is: Available - device has SMART capability.",
      "SMART support is: Enabled",
      "",
      "=== START OF READ SMART DATA SECTION ===",
      "SMART overall-health self-assessment test result: PASSED",
      "",
      "General SMART Values:",
      "Offline data collection status:  (0x82)\tOffline data collection activity",
      "\t\t\t\t\twas completed without error.",
      "\t\t\t\t\tAuto Offline Data Collection: Enabled.",
      "Self-test execution status:      (   0)\tThe previous self-test routine completed",
      "\t\t\t\t\twithout error or no self-test has ever ",
      "\t\t\t\t\tbeen run.",
      "Total time to complete Offline ",
      "data collection: \t\t(   93) seconds.",
      "Offline data collection",
      "capabilities: \t\t\t (0x5b) SMART execute Offline immediate.",
      "\t\t\t\t\tAuto Offline data collection on/off support.",
      "\t\t\t\t\tSuspend Offline collection upon new",
      "\t\t\t\t\tcommand.",
      "\t\t\t\t\tOffline surface scan supported.",
      "\t\t\t\t\tSelf-test supported.",
      "\t\t\t\t\tNo Conveyance Self-test supported.",
      "\t\t\t\t\tSelective Self-test supported.",
      "SMART capabilities:            (0x0003)\tSaves SMART data before entering",
      "\t\t\t\t\tpower-saving mode.",
      "\t\t\t\t\tSupports SMART auto save timer.",
      "Error logging capability:        (0x01)\tError logging supported.",
      "\t\t\t\t\tGeneral Purpose Logging supported.",
      "Short self-test routine ",
      "recommended polling time: \t (   2) minutes.",
      "Extended self-test routine",
      "recommended polling time: \t (1105) minutes.",
      "SCT capabilities: \t       (0x003d)\tSCT Status supported.",
      "\t\t\t\t\tSCT Error Recovery Control supported.",
      "\t\t\t\t\tSCT Feature Control supported.",
      ...
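
With '--json=o' the plain-text report is carried line by line in the "output" array, as shown above. The following sketch pulls the "key: value" pairs out of the information section; the function name info_section is hypothetical, and the sample is a shortened copy of the output above.

```python
import json

def info_section(smartctl_json_text):
    """Return (key, value) pairs from the INFORMATION SECTION of smartctl --json=o output."""
    doc = json.loads(smartctl_json_text)
    pairs = []
    in_info = False
    for line in doc["smartctl"]["output"]:
        if line.startswith("=== START OF INFORMATION SECTION"):
            in_info = True
            continue
        if line.startswith("==="):
            # Any other section header ends the information section.
            in_info = False
            continue
        if in_info and ":" in line:
            key, _, value = line.partition(":")
            pairs.append((key.strip(), value.strip()))
    return pairs

sample = json.dumps({
    "smartctl": {
        "output": [
            "=== START OF INFORMATION SECTION ===",
            "Serial Number:    1EK70PSZ",
            "Rotation Rate:    7200 rpm",
            "",
            "=== START OF READ SMART DATA SECTION ===",
            "SMART overall-health self-assessment test result: PASSED",
        ]
    }
})

for key, value in info_section(sample):
    print(f"{key} = {value}")
```

A list of pairs is used instead of a dict because some keys, such as "SMART support is", appear more than once in the real output.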

Related issues 1 (0 open, 1 closed)

Copied to Ceph - Backport #48737: octopus: orchestrator: query-daemon-health-metrics fails, no smartctl output (Resolved, Nathan Cutler)
Actions #1

Updated by Volker Theile over 3 years ago

  • Description updated (diff)
Actions #2

Updated by Nathan Cutler over 3 years ago

  • Status changed from New to Triaged

"It is possible to run smartctl manually without problems."

This statement only holds if the smartmontools RPM is installed...

The problem would seem to be that the container image used in this case was built using a process that excludes runtime dependencies that are merely recommended (soft dependencies) as opposed to required (hard dependencies). Since the ceph-osd and ceph-mon packages only recommend smartmontools, container images built using this process do not contain the smartmontools package.
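
A quick way to confirm this in a given environment is to check whether the tools can be found on PATH at all. The sketch below is an assumption-laden illustration: the helper name missing_tools is hypothetical, and the default tool list (smartctl, nvme, sudo) is only inferred from the error messages in the description. It would need to be run inside the container in question to say anything about that image.

```python
import shutil

def missing_tools(tools=("smartctl", "nvme", "sudo")):
    """Return the names of the given tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("not found on PATH:", ", ".join(missing))
else:
    print("all tools found on PATH")
```

If smartctl is reported as missing, that would be consistent with an image built without the recommended smartmontools package.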

Actions #3

Updated by Nathan Cutler over 3 years ago

  • Status changed from Triaged to In Progress
  • Assignee set to Nathan Cutler
Actions #4

Updated by Nathan Cutler over 3 years ago

  • Backport set to octopus
Actions #5

Updated by Nathan Cutler over 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 38603
Actions #6

Updated by Nathan Cutler over 3 years ago

  • Project changed from Orchestrator to Ceph
  • Category deleted (orchestrator)

Moving to the generic Ceph project so we can use the backporting workflow on it.

Actions #7

Updated by Nathan Cutler over 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by Backport Bot over 3 years ago

  • Copied to Backport #48737: octopus: orchestrator: query-daemon-health-metrics fails, no smartctl output added
Actions #9

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved