Bug #55142

[ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error

Added by Vikhyat Umrao 8 months ago. Updated 2 months ago.

Status: Need More Info
Priority: Normal
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: pacific, quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2022-03-31T00:15:17.829+0000 7fcf56511700  0 [balancer INFO root] ceph osd pg-upmap-items 204.c4d mappings [{'from': 878, 'to': 191}]
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db:  lock failed: (108) Cannot send after transport endpoint shutdown
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 devicehealth.serve:
2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 373, in serve
    self.scrape_all()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 425, in scrape_all
    self.put_device_metrics(device, data)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 500, in put_device_metrics
    self._create_device(devid)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 487, in _create_device
    cursor = self.db.execute(SQL, (devid,))
sqlite3.OperationalError: disk I/O error

2022-03-31T00:15:18.568+0000 7fcff717d700 -1 mgr handle_mgr_map I was active but no longer am
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  e: '/usr/bin/ceph-mgr'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  0: '/usr/bin/ceph-mgr'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  1: '-n'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  2: 'mgr.gibba002.nzpbzu'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  3: '-f'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  4: '--setuser'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  5: 'ceph'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  6: '--setgroup'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  7: 'ceph'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  8: '--default-log-to-file=false'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  9: '--default-log-to-journald=true'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  10: '--default-log-to-stderr=false'
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2022-03-31T00:15:18.568+0000 7fcff717d700  1 mgr respawn  exe_path /proc/self/exe
2022-03-31T00:15:19.967+0000 7f2c2da44000  0 ceph version 17.1.0-138-g723fda64 (723fda64a662bb79871e590698268007049bcf7f) quincy (stable), process ceph-mgr, pid 8
2022-03-31T00:15:19.967+0000 7f2c2da44000  0 pidfile_write: ignore empty --pid-file
2022-03-31T00:15:21.569+0000 7f2c2da44000  1 mgr[py] Loading python module 'mirroring'
2022-03-31T00:15:22.569+0000 7f2c2da44000  1 mgr[py] Loading python module 'stats'

Related issues

Related to cephsqlite - Bug #55606: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown (status: New)

History

#1 Updated by Yaarit Hatuka 8 months ago

  • Project changed from mgr to cephsqlite
  • Category deleted (devicehealth module)
  • Assignee changed from Yaarit Hatuka to Venky Shankar
  • Source set to Development
  • Backport set to pacific, quincy
  • Affected Versions v16.2.7 added

I tried to reproduce it on the gibba cluster by scraping all devices (with `sudo ceph device scrape-health-metrics`), but the exception did not appear in the logs again.

The following implies that the error relates to a locking issue in src/SimpleRADOSStriper.cc:

2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db: lock failed: (108) Cannot send after transport endpoint shutdown

Venky, can you please take a look?
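To check other mgr logs for the same pattern, the correlation described above (a SimpleRADOSStriper lock failure followed by an unhandled module exception on the same thread) can be scripted. The following is only a rough sketch: the regular expression is derived from the single log line quoted in this tracker, and the helper names `parse_lock_failure` and `correlate` are made up for illustration, not part of Ceph.

```python
import re

# Pattern inferred from the one lock-failure line quoted in this tracker;
# other Ceph versions may format the message differently (assumption).
LOCK_FAILED = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>[0-9a-f]+)\s+-?\d+\s+"
    r"(?P<client>client\.\d+): SimpleRADOSStriper: lock: (?P<obj>[^:]+):\s+"
    r"lock failed: \((?P<errno>\d+)\) (?P<msg>.*)$"
)


def parse_lock_failure(line):
    """Return the fields of a lock-failure line as a dict, or None."""
    m = LOCK_FAILED.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["errno"] = int(fields["errno"])
    return fields


def correlate(lines):
    """Pair each unhandled module exception with the most recent
    lock failure logged earlier on the same thread id."""
    last_failure = {}  # thread id -> most recent lock failure on that thread
    pairs = []
    for line in lines:
        failure = parse_lock_failure(line)
        if failure is not None:
            last_failure[failure["thread"]] = failure
            continue
        if "Unhandled exception from module" in line:
            parts = line.split()
            thread = parts[1] if len(parts) > 1 else None
            if thread in last_failure:
                pairs.append((last_failure[thread], line.strip()))
    return pairs
```

Feeding it the lines of a mgr log (for example `correlate(open(path))`, with `path` pointing at whatever log file is being inspected) returns the matching pairs. An empty result only means the pattern was not found in that file.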

#2 Updated by Venky Shankar 8 months ago

  • Assignee changed from Venky Shankar to Patrick Donnelly

Yaarit Hatuka wrote:

I tried to reproduce it on the gibba cluster by scraping all devices (with `sudo ceph device scrape-health-metrics`), but the exception did not appear in the logs again.

The following implies that the error relates to a locking issue in src/SimpleRADOSStriper.cc:

2022-03-31T00:15:18.419+0000 7fcf49cf8700 -1 client.1735696: SimpleRADOSStriper: lock: main.db: lock failed: (108) Cannot send after transport endpoint shutdown

Venky, can you please take a look?

This is (most likely) not related to CephFS, so I'm probably not the intended assignee for this tracker.

A quick check of `src/SimpleRADOSStriper.cc` shows Patrick Donnelly as the author. He works on CephFS, but I'm pretty sure cephsqlite was developed as a standalone project rather than as part of CephFS.

Assigning to Patrick (who is on PTO until Mayish, so this might take a while to be looked into).

#3 Updated by Yaarit Hatuka 7 months ago

  • Related to Bug #55606: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown added

#4 Updated by Patrick Donnelly 7 months ago

  • Status changed from New to Need More Info

This error is generated when the cephsqlite RADOS instance is blocklisted. So this is likely a symptom and not a bug.
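If the exception is a symptom as described, one way a caller could tolerate it is to treat `sqlite3.OperationalError` as possibly transient and retry on a freshly opened connection instead of letting the exception escape. The snippet below is a minimal sketch of that idea using only the standard `sqlite3` module. It is not the devicehealth module's code or any actual fix; `execute_with_retry` and `get_db` are hypothetical names, and whether reopening actually recovers from a blocklisted RADOS instance depends on cephsqlite behaviour that is not modelled here.

```python
import sqlite3
import time


def execute_with_retry(get_db, sql, params=(), attempts=3, delay=0.0):
    """Run one statement, reopening the connection on OperationalError.

    get_db is a hypothetical factory that returns a new connection object
    on every call. The last error is re-raised once attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc = None
    for _ in range(attempts):
        db = get_db()
        try:
            return db.execute(sql, params).fetchall()
        except sqlite3.OperationalError as exc:
            # "disk I/O error" surfaces here; drop this handle and try again.
            last_exc = exc
            db.close()
            time.sleep(delay)
    raise last_exc
```

A bounded retry count keeps a persistent failure (for example, a client that stays blocklisted) visible to the operator rather than hiding it behind an endless loop.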

#5 Updated by Laura Flores 6 months ago

/a/yuriw-2022-05-27_21:59:17-rados-wip-yuri-testing-2022-05-27-0934-distro-default-smithi/6851244

#6 Updated by Laura Flores 6 months ago

/a/yuriw-2022-06-09_22:06:32-rados-wip-yuri3-testing-2022-06-09-1314-distro-default-smithi/6871541

#7 Updated by Laura Flores 6 months ago

  • Subject changed from [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O er ror to [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error

#8 Updated by Kamoltat (Junior) Sirivadhna 4 months ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6944298/

#9 Updated by Neha Ojha 3 months ago

/a/yuriw-2022-09-15_17:53:16-rados-quincy-release-distro-default-smithi/7034360

#10 Updated by Laura Flores 2 months ago

/a/yuriw-2022-09-29_16:44:24-rados-wip-lflores-testing-distro-default-smithi/7048202
