Bug #51554

mgr/devicehealth: health warning caused by AttributeError: 'NoneType' object has no attribute 'get'

Added by Robert Sander 2 months ago. Updated 21 days ago.

Status: New
Priority: Normal
Assignee:
Category: devicehealth module
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

health warning caused by AttributeError: 'NoneType' object has no attribute 'get'

The cluster status is not healthy because the devicehealth module throws an exception.

Environment

  • ceph version string: 16.2.4
  • Platform (OS/distro/release): Container images from docker.io/ceph/ceph

How reproducible

Restarting the mgr containers does not fix the issue.

Actual results

Jun 30 16:07:09 al111 bash171790: debug 2021-06-30T14:07:09.939+0000 7f2a31d64700 -1 devicehealth.serve:
Jun 30 16:07:09 al111 bash171790: debug 2021-06-30T14:07:09.939+0000 7f2a31d64700 -1 Traceback (most recent call last):
Jun 30 16:07:09 al111 bash171790: File "/usr/share/ceph/mgr/devicehealth/module.py", line 330, in serve
Jun 30 16:07:09 al111 bash171790: self.scrape_all()
Jun 30 16:07:09 al111 bash171790: File "/usr/share/ceph/mgr/devicehealth/module.py", line 390, in scrape_all
Jun 30 16:07:09 al111 bash171790: self.put_device_metrics(ioctx, device, data)
Jun 30 16:07:09 al111 bash171790: File "/usr/share/ceph/mgr/devicehealth/module.py", line 477, in put_device_metrics
Jun 30 16:07:09 al111 bash171790: wear_level = get_ata_wear_level(data)
Jun 30 16:07:09 al111 bash171790: File "/usr/share/ceph/mgr/devicehealth/module.py", line 33, in get_ata_wear_level
Jun 30 16:07:09 al111 bash171790: if page.get("number") != 7:
Jun 30 16:07:09 al111 bash171790: AttributeError: 'NoneType' object has no attribute 'get'
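The traceback shows that `page` was None when `page.get("number")` was evaluated. The following is a minimal sketch of a lookup that tolerates such entries, not the module's actual code. The function name `safe_ata_wear_level`, the `pages` and `table` keys, and the "Percentage Used Endurance Indicator" item name are assumptions about the smartctl JSON layout.

```python
from typing import Any, Dict, Optional


def safe_ata_wear_level(data: Dict[str, Any]) -> Optional[float]:
    """Return the wear level as a fraction (0.0-1.0), or None if unavailable.

    Hypothetical helper; the key names below are assumed, not taken from
    the devicehealth module.
    """
    # "or {}" / "or []" also cover keys that are present but set to null.
    stats = data.get("ata_device_statistics") or {}
    for page in stats.get("pages") or []:
        # A null page is what makes page.get(...) raise AttributeError,
        # so anything that is not a dict is skipped.
        if not isinstance(page, dict) or page.get("number") != 7:
            continue
        for item in page.get("table") or []:
            if not isinstance(item, dict):
                continue
            if item.get("name") == "Percentage Used Endurance Indicator":
                value = item.get("value")
                if isinstance(value, (int, float)):
                    return value / 100.0
    return None
```

With this shape, a report that contains a null page yields None (or the value from a later valid page) instead of raising.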

Expected results

No Python exception.

History

#1 Updated by Stefan Fleischmann about 1 month ago

Same problem here with Ceph 16.2.5. Is someone looking into this?

#2 Updated by Yaarit Hatuka 25 days ago

Thanks, Robert, Stefan, for reporting this.

This looks like nonstandard output from the smartctl command.

Can you please share the output of smartctl on the device where this happens?
Specifically:
  • the vendor and model of this device
  • the entire content of 'ata_device_statistics' key
  • smartctl version
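The three items requested above could be pulled out of a smartctl JSON report (for example one produced with `smartctl -x --json <device>`) roughly as follows. This is a sketch only: the helper name `summarize_smartctl_report` is hypothetical, and the `vendor`, `product`, `model_name` and `smartctl.version` keys are assumptions about the JSON layout that may differ between smartctl versions and device types.

```python
import json
from typing import Any, Dict


def summarize_smartctl_report(raw_json: str) -> Dict[str, Any]:
    """Collect vendor, model, smartctl version and ata_device_statistics.

    Missing keys are reported as None rather than raising.
    """
    report = json.loads(raw_json)
    smartctl = report.get("smartctl") or {}
    # Assumed to be a list of integers such as [7, 1].
    version = smartctl.get("version") or []
    return {
        "vendor": report.get("vendor"),
        # ATA reports are assumed to use "model_name", SCSI/SAS "product".
        "model": report.get("model_name") or report.get("product"),
        "smartctl_version": ".".join(str(part) for part in version) or None,
        "ata_device_statistics": report.get("ata_device_statistics"),
    }
```

A None value for `ata_device_statistics` in the result means the report did not contain that key at all.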

#3 Updated by Robert Sander 25 days ago

Yaarit Hatuka wrote:

  • the entire content of 'ata_device_statistics' key

Where do I find this information?

#4 Updated by Michael Wodniok 21 days ago

Robert Sander wrote:

Yaarit Hatuka wrote:

  • the entire content of 'ata_device_statistics' key

Where do I find this information?

The problem is: how do you know which disk causes the error?

We have several disk types in use, and here is one for which no tabular SMART data is available in Ceph:

root@rz2b-cn11:~# cephadm shell --fsid 41902fa4-3ecf-11eb-94ef-258486fe8a0f -c /etc/ceph/ceph.conf -n osd.3 -- smartctl -a /dev/sde
Using recent ceph image ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb
smartctl 7.1 2020-04-05 r5049 [x86_64-linux-5.4.0-81-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST1000NX0323
Revision:             K002
Compliance:           SPC-4
User Capacity:        1,000,204,886,016 bytes [1.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c5007f260653
Serial number:        S4700YQR0000J507296Q
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Aug 30 09:07:02 2021 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     34 C
Drive Trip Temperature:        60 C

Manufactured in week 01 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  32
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2067
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 3465542296
  Blocks received from initiator = 1594710387
  Blocks read from cache and sent to initiator = 119140210
  Number of read and write commands whose size <= segment size = 37058510
  Number of read and write commands whose size > segment size = 82828

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 41393.57
  number of minutes until next internal SMART test = 31

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2659695438        0         0  2659695438          0      14194.861           0
write:         0        0         0         0          0       6539.483           0

Non-medium error count:       84

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No Self-tests have been logged

As you can see, no SMART data is listed in tabular form. Could this cause the issue?
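To narrow down which disk is involved, one option is to collect the smartctl JSON report for every device and sort the reports by how their `ata_device_statistics` key looks. The sketch below assumes the reports are already available as a mapping from device name to parsed JSON; the function name `classify_reports`, the category labels, and the `pages` key are hypothetical. It does not establish which of these cases actually occurs on this cluster.

```python
from typing import Any, Dict


def classify_reports(reports: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Label each device by the shape of its ata_device_statistics data."""
    result: Dict[str, str] = {}
    for device, report in reports.items():
        if "ata_device_statistics" not in report:
            # For example a SAS disk whose report has no ATA statistics.
            result[device] = "no ata_device_statistics"
            continue
        stats = report["ata_device_statistics"]
        pages = stats.get("pages") if isinstance(stats, dict) else None
        if not isinstance(pages, list):
            result[device] = "malformed ata_device_statistics"
        elif any(page is None for page in pages):
            # A null entry would make page.get("number") raise AttributeError.
            result[device] = "null page entry"
        else:
            result[device] = "ok"
    return result
```

Devices labelled "null page entry" are the ones that would match the AttributeError in the traceback under these assumptions.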
