Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error - cephsqlite - Ceph

Bug #55142

closed

[ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error

Added by Vikhyat Umrao about 2 years ago. Updated about 1 year ago.

Status:

Duplicate

Priority:

Normal

Assignee:

Patrick Donnelly

Target version:

% Done:

Source:

Development

Tags:

Backport:

pacific, quincy

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v16.2.7

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

2022-03-31T00:15:17.829+0000 7fcf56511700  0 [balancer INFO root] ceph osd pg-upmap-items 204.c4d mappings [{'from': 878, 'to': 191}]
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db:  lock failed: (108) Cannot send after transport endpoint shutdown
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 devicehealth.serve:
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 373, in serve
    self.scrape_all()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 425, in scrape_all
    self.put_device_metrics(device, data)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 500, in put_device_metrics
    self._create_device(devid)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 487, in _create_device
    cursor = self.db.execute(SQL, (devid,))
sqlite3.OperationalError: disk I/O error

2022-03-31T00:15:18.568+0000 7fcff717d700 -1 mgr handle_mgr_map I was active but no longer am
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  e: '/usr/bin/ceph-mgr'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  0: '/usr/bin/ceph-mgr'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  1: '-n'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  2: 'mgr.gibba002.nzpbzu'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  3: '-f'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  4: '--setuser'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  5: 'ceph'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  6: '--setgroup'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  7: 'ceph'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  8: '--default-log-to-file=false'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  9: '--default-log-to-journald=true'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  10: '--default-log-to-stderr=false'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  exe_path /proc/self/exe
2022-03-31T00:15:19.967+0000 7f2c2da44000  0 ceph version 17.1.0-138-g723fda64 (723fda64a662bb79871e590698268007049bcf7f) quincy (stable), process ceph-mgr, pid 8
2022-03-31T00:15:19.967+0000 7f2c2da44000  0 pidfile_write: ignore empty --pid-file
2022-03-31T00:15:21.569+0000 7f2c2da44000  1 mgr[py] Loading python module 'mirroring'
2022-03-31T00:15:22.569+0000 7f2c2da44000  1 mgr[py] Loading python module 'stats'

Related issues 1 (0 open — 1 closed)

Updated by Yaarit Hatuka about 2 years ago

Project changed from mgr to cephsqlite
Category deleted (~~devicehealth module~~)
Assignee changed from Yaarit Hatuka to Venky Shankar
Source set to Development
Backport set to pacific, quincy
Affected Versions v16.2.7 added

I tried to reproduce it on the gibba cluster by scraping all devices (with `sudo ceph device scrape-health-metrics`), but the exception did not appear in the logs again.

The following implies that the error relates to a locking issue in src/SimpleRADOSStriper.cc:

2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db: lock failed: (108) Cannot send after transport endpoint shutdown

Venky, can you please take a look?
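
For anyone triaging similar logs, the lock-failure line quoted above can be picked apart mechanically. The following is a minimal sketch, not part of Ceph: the regular expression, the function name `parse_lock_failure`, and the field labels are all assumptions made for illustration. It only extracts the client ID, the database object name, and the numeric error code from the line.

```python
import errno
import re

# Pattern for the SimpleRADOSStriper lock-failure line quoted above.
# The group names are illustrative labels, not Ceph terminology.
LOCK_FAILED = re.compile(
    r"(?P<client>client\.\d+): SimpleRADOSStriper: lock: (?P<object>\S+):\s+"
    r"lock failed: \((?P<code>\d+)\) (?P<message>.+)$"
)


def parse_lock_failure(line):
    """Return the fields of a lock-failure log line, or None if it does not match."""
    m = LOCK_FAILED.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    return {
        "client": m.group("client"),
        "object": m.group("object"),
        "errno": code,
        # Symbolic name as known to the local platform; errno numbering
        # differs between operating systems, so this may not match the host
        # that produced the log.
        "name": errno.errorcode.get(code, "unknown"),
        "message": m.group("message"),
    }


line = (
    "2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: "
    "SimpleRADOSStriper: lock: main.db: lock failed: (108) "
    "Cannot send after transport endpoint shutdown"
)
info = parse_lock_failure(line)
print(info)
```

On Linux, error 108 is `ESHUTDOWN`, which matches the message text in the log line.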

Updated by Venky Shankar about 2 years ago

Assignee changed from Venky Shankar to Patrick Donnelly

Yaarit Hatuka wrote:

> I tried to reproduce it on the gibba cluster by scraping all devices (with `sudo ceph device scrape-health-metrics`), but the exception did not appear in the logs again.
>
> The following implies that the error relates to a locking issue in src/SimpleRADOSStriper.cc:
>
> 2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db: lock failed: (108) Cannot send after transport endpoint shutdown
>
> Venky, can you please take a look?

This is (most likely) not related to CephFS, so I'm probably not the intended assignee for this tracker.

A quick check of `src/SimpleRADOSStriper.cc` shows Patrick Donnelly as the author. He works on CephFS, but I'm pretty sure cephsqlite was developed as a standalone project rather than as part of CephFS.

Assigning to Patrick (who is on PTO until Mayish, so this might take a while to be looked into).

Updated by Yaarit Hatuka almost 2 years ago

Related to Bug #55606: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown added

Updated by Patrick Donnelly almost 2 years ago

Status changed from New to Need More Info

This error is generated when the cephsqlite RADOS instance is blocklisted. So this is likely a symptom and not a bug.
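
To illustrate the "symptom" reading: in the traceback from the description, the `sqlite3.OperationalError` raised by `self.db.execute(...)` propagates uncaught out of `serve()`, which is why it is logged as an unhandled exception. Below is a minimal sketch of treating that error as a transient condition instead. It is not the devicehealth code and not the fix applied upstream; `FlakyDB`, `create_device`, the `Device` table, and the SQL statement are hypothetical stand-ins, and the I/O error is simulated rather than caused by a real blocklisting.

```python
import sqlite3


class FlakyDB:
    """Hypothetical stand-in for a module's DB handle: fails once, then works."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE Device (devid TEXT PRIMARY KEY)")
        self._fail_next = True

    def execute(self, sql, params=()):
        if self._fail_next:
            # Simulate the error seen in the traceback.
            self._fail_next = False
            raise sqlite3.OperationalError("disk I/O error")
        return self._conn.execute(sql, params)


def create_device(db, devid):
    """Insert a device row; return False instead of raising on an I/O error."""
    sql = "INSERT OR IGNORE INTO Device (devid) VALUES (?)"
    try:
        db.execute(sql, (devid,))
        return True
    except sqlite3.OperationalError as e:
        if "disk I/O error" in str(e):
            # Assumed to be transient here (e.g. the backing store went away);
            # the caller can retry on its next pass.
            return False
        raise


db = FlakyDB()
first = create_device(db, "dev0")   # simulated failure is swallowed
second = create_device(db, "dev0")  # retry succeeds
print(first, second)
```

Whether swallowing the error is appropriate depends on what happens to the underlying connection after a blocklisting; the sketch only shows where the exception could be intercepted.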

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-05-27_21:59:17-rados-wip-yuri-testing-2022-05-27-0934-distro-default-smithi/6851244

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-06-09_22:06:32-rados-wip-yuri3-testing-2022-06-09-1314-distro-default-smithi/6871541

Updated by Laura Flores almost 2 years ago

Subject changed from [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O er ror to [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error

Tags set to test-failure

/a/lflores-2023-03-27_20:42:09-rados-wip-aclamk-bs-elastic-shared-blob-quincy-distro-default-smithi/7221723
