Bug #55142

closed

[ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error

Added by Vikhyat Umrao about 2 years ago. Updated about 1 year ago.

Status:
Duplicate
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
pacific, quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2022-03-31T00:15:17.829+0000 7fcf56511700  0 [balancer INFO root] ceph osd pg-upmap-items 204.c4d mappings [{'from': 878, 'to': 191}]
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db:  lock failed: (108) Cannot send after transport endpoint shutdown
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 devicehealth.serve:
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 373, in serve
    self.scrape_all()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 425, in scrape_all
    self.put_device_metrics(device, data)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 500, in put_device_metrics
    self._create_device(devid)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 487, in _create_device
    cursor = self.db.execute(SQL, (devid,))
sqlite3.OperationalError: disk I/O error

2022-03-31T00:15:18.568+0000 7fcff717d700 -1 mgr handle_mgr_map I was active but no longer am
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  e: '/usr/bin/ceph-mgr'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  0: '/usr/bin/ceph-mgr'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  1: '-n'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  2: 'mgr.gibba002.nzpbzu'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  3: '-f'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  4: '--setuser'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  5: 'ceph'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  6: '--setgroup'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  7: 'ceph'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  8: '--default-log-to-file=false'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  9: '--default-log-to-journald=true'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  10: '--default-log-to-stderr=false'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  exe_path /proc/self/exe
2022-03-31T00:15:19.967+0000 7f2c2da44000  0 ceph version 17.1.0-138-g723fda64 (723fda64a662bb79871e590698268007049bcf7f) quincy (stable), process ceph-mgr, pid 8
2022-03-31T00:15:19.967+0000 7f2c2da44000  0 pidfile_write: ignore empty --pid-file
2022-03-31T00:15:21.569+0000 7f2c2da44000  1 mgr[py] Loading python module 'mirroring'
2022-03-31T00:15:22.569+0000 7f2c2da44000  1 mgr[py] Loading python module 'stats'

Related issues 1 (0 open, 1 closed)

Is duplicate of cephsqlite - Bug #55606: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown (Status: Resolved, Assignee: Patrick Donnelly)

Actions #1

Updated by Yaarit Hatuka about 2 years ago

  • Project changed from mgr to cephsqlite
  • Category deleted (devicehealth module)
  • Assignee changed from Yaarit Hatuka to Venky Shankar
  • Source set to Development
  • Backport set to pacific, quincy
  • Affected Versions v16.2.7 added

I tried to reproduce it on the gibba cluster by scraping all devices (with `sudo ceph device scrape-health-metrics`), but the exception did not appear in the logs again.

The following implies that the error relates to a locking issue in src/SimpleRADOSStriper.cc:

2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db: lock failed: (108) Cannot send after transport endpoint shutdown

Venky, can you please take a look?
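The link between the lock failure and the devicehealth exception can be checked mechanically: both lines carry the same timestamp and thread id (7fcf49cf8700). The following is a minimal sketch, assuming the ceph-mgr log layout seen in the description (timestamp, thread id, level, message). The function name `correlate` and both regular expressions are illustrative and not part of Ceph.

```python
import re

# Splits a ceph-mgr log line into timestamp, thread id, level, and message.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+)\s+(?P<msg>.*)$"
)
# Extracts the object name, errno, and text of a SimpleRADOSStriper lock failure.
LOCK_RE = re.compile(
    r"SimpleRADOSStriper: lock: (?P<obj>\S+):\s+lock failed: "
    r"\((?P<errno>\d+)\) (?P<text>.*)$"
)


def correlate(lines):
    """Pair each lock failure with an unhandled module exception that was
    logged by the same thread at the same timestamp."""
    failures = {}  # (timestamp, thread) -> parsed lock failure
    pairs = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # traceback continuation lines have no timestamp
        key = (m["ts"], m["thread"])
        lock = LOCK_RE.search(m["msg"])
        if lock:
            failures[key] = {
                "object": lock["obj"],
                "errno": int(lock["errno"]),
                "text": lock["text"],
            }
        elif "Unhandled exception from module" in m["msg"] and key in failures:
            pairs.append((failures[key], m["msg"]))
    return pairs


# The two relevant lines from the description of this report.
SAMPLE = [
    "2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: "
    "SimpleRADOSStriper: lock: main.db:  lock failed: (108) "
    "Cannot send after transport endpoint shutdown",
    "2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 log_channel(cluster) log "
    "[ERR] : Unhandled exception from module 'devicehealth' while running on "
    "mgr.gibba002.nzpbzu: disk I/O error",
]

for failure, message in correlate(SAMPLE):
    print(failure["errno"], failure["object"], "->", message)
```

Matching on timestamp plus thread id is a deliberately strict heuristic. On a busier log it may be more useful to allow a small time window within the same thread.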

Actions #2

Updated by Venky Shankar about 2 years ago

  • Assignee changed from Venky Shankar to Patrick Donnelly

Yaarit Hatuka wrote:

I tried to reproduce it on the gibba cluster by scraping all devices (with `sudo ceph device scrape-health-metrics`), but the exception did not appear in the logs again.

The following implies that the error relates to a locking issue in src/SimpleRADOSStriper.cc:

2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db: lock failed: (108) Cannot send after transport endpoint shutdown

Venky, can you please take a look?

This is (most likely) not related to CephFS, so I'm probably not the intended assignee for this tracker.

A quick check of `src/SimpleRADOSStriper.cc` shows Patrick Donnelly as the author. He works on CephFS, but I'm pretty sure cephsqlite was developed as a standalone project rather than as part of CephFS.

Assigning to Patrick (who is on PTO until Mayish, so this might take a while to be looked into).

Actions #3

Updated by Yaarit Hatuka almost 2 years ago

  • Related to Bug #55606: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown added
Actions #4

Updated by Patrick Donnelly almost 2 years ago

  • Status changed from New to Need More Info

This error is generated when the cephsqlite RADOS instance is blocklisted. So this is likely a symptom and not a bug.
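To illustrate what treating this as a symptom could look like on the caller side, here is a minimal sketch that catches `sqlite3.OperationalError` and retries on a freshly opened connection. It is not what the devicehealth module does and not the fix tracked in Bug #55606. The helper `execute_with_reopen` and its `connect` factory argument are hypothetical names introduced for this example. Whether reopening actually helps after a blocklist depends on how the cephsqlite layer re-establishes its RADOS connection, which this sketch does not model.

```python
import sqlite3


def execute_with_reopen(connect, sql, params=(), attempts=2):
    """Run `sql` and return all rows, retrying on a new connection if SQLite
    raises OperationalError (for example "disk I/O error").

    `connect` is a hypothetical zero-argument factory that returns a new
    sqlite3-style connection on every call.
    """
    last_error = None
    for _ in range(attempts):
        db = connect()
        try:
            return db.execute(sql, params).fetchall()
        except sqlite3.OperationalError as exc:
            # The handle is assumed unusable; remember the error and retry
            # with a connection opened from scratch.
            last_error = exc
        finally:
            db.close()
    raise last_error


# Usage with an ordinary in-memory database (no failure occurs here).
rows = execute_with_reopen(lambda: sqlite3.connect(":memory:"), "SELECT 1")
print(rows)
```

The retry count is kept small on purpose: if the second attempt also fails, the error is re-raised so the caller still sees it.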

Actions #5

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-05-27_21:59:17-rados-wip-yuri-testing-2022-05-27-0934-distro-default-smithi/6851244

Actions #6

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-06-09_22:06:32-rados-wip-yuri3-testing-2022-06-09-1314-distro-default-smithi/6871541

Actions #7

Updated by Laura Flores almost 2 years ago

  • Subject changed from [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O er ror to [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error
Actions #8

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6944298/

Actions #9

Updated by Neha Ojha over 1 year ago

/a/yuriw-2022-09-15_17:53:16-rados-quincy-release-distro-default-smithi/7034360

Actions #10

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-09-29_16:44:24-rados-wip-lflores-testing-distro-default-smithi/7048202

Actions #11

Updated by Patrick Donnelly about 1 year ago

  • Is duplicate of Bug #55606: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown added
Actions #12

Updated by Patrick Donnelly about 1 year ago

  • Related to deleted (Bug #55606: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown)
Actions #13

Updated by Patrick Donnelly about 1 year ago

  • Status changed from Need More Info to Duplicate
Actions #14

Updated by Laura Flores about 1 year ago

  • Tags set to test-failure

/a/lflores-2023-03-27_20:42:09-rados-wip-aclamk-bs-elastic-shared-blob-quincy-distro-default-smithi/7221723
