Bug #56239

closed

crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample)

Added by Telemetry Bot almost 2 years ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Target version:
% Done:

0%

Source:
Telemetry
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):

372a820cbfc5af971785d9b6af2a345a1670c04429583dc564e357c04a53cf64
8f6cf6368e0ca8ac93beab8b45a0d5013805b9ef39286850ba17798f822e180c
d0ea52fbf30312347be61ce51cd1f6c5483dfaba1767a0eb62791d1f194f3381
ef43174c3be0e2b9ccb951f18b2301de313327d53325698fe20fbd29db555a38
2364791fa429f484e2ac788d520a6c4752a9e95983682b39f621373401ca0734


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=c609351efcea4c1028865dbbc028e313b0b918697fb0a5b7a6cf46b171d33b27

Sanitized backtrace:

    File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample)
    File "mgr/devicehealth/module.py", in _get_device_metrics: with self._db_lock, self.db:
    File "mgr/mgr_module.py", in db: raise MgrDBNotReady();
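
The sanitized lines above differ from the raw frames in the crash dump only in that the install prefix and the line number are dropped. A minimal sketch of that normalization follows. It is an illustration only: sanitize_frame and its regular expression are hypothetical and are not the telemetry service's actual sanitizer.

```python
import re

# Hypothetical pattern for one raw Python traceback frame as it appears in the
# "backtrace" array of a crash dump: a 'File "...", line N, in func' header,
# a newline, then the indented source line.
_FRAME_RE = re.compile(
    r'^\s*File "(?P<path>[^"]+)", line \d+, in (?P<func>\S+)\n\s*(?P<code>.*)$'
)


def sanitize_frame(frame: str, prefix: str = "/usr/share/ceph/") -> str:
    """Drop the install prefix and the line number from a raw frame.

    Entries that are not frames (for example the trailing exception name)
    are returned unchanged.
    """
    m = _FRAME_RE.match(frame)
    if m is None:
        return frame
    path = m.group("path")
    if path.startswith(prefix):
        path = path[len(prefix):]
    return f'File "{path}", in {m.group("func")}: {m.group("code")}'


# The same frame as reported by 17.2.0 (line 1203) and 17.2.6 (line 1233).
frame_17_2_0 = '  File "/usr/share/ceph/mgr/mgr_module.py", line 1203, in db\n    raise MgrDBNotReady();'
frame_17_2_6 = '  File "/usr/share/ceph/mgr/mgr_module.py", line 1233, in db\n    raise MgrDBNotReady();'

print(sanitize_frame(frame_17_2_0))
print(sanitize_frame(frame_17_2_6))
```

Because the line number is removed, frames from different releases reduce to the same sanitized text even when the surrounding source has shifted.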

Crash dump sample:
{
    "archived": "2022-06-19 10:32:56.076950",
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 764, in get_recent_device_metrics\n    return self._get_device_metrics(devid, min_sample=min_sample)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 553, in _get_device_metrics\n    with self._db_lock, self.db:",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1203, in db\n    raise MgrDBNotReady();",
        "<redacted>" 
    ],
    "ceph_version": "17.2.0",
    "crash_id": "2022-06-18T19:09:19.112675Z_db7d5934-7e5a-4ee8-908e-4ee606f9dd1c",
    "entity_name": "mgr.8db3d30b2fe0f2dc446f5bc8b03f08b697cf9f58",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "ActivePyModule::dispatch_remote get_recent_device_metrics",
    "mgr_python_exception": "MgrDBNotReady",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "bb14694bacd8d2b1a934cf4a3f4a27f50f27e160354c2f796b64991db731505e",
    "timestamp": "2022-06-18T19:09:19.112675Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-39-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#42-Ubuntu SMP Thu Jun 9 23:42:32 UTC 2022" 
}


Files

ceph-mgr.gibba001.nkuepu.log.gz (49.5 KB), Laura Flores, 05/30/2023 10:21 PM

Related issues 3 (0 open, 3 closed)

Copied to cephsqlite - Backport #61834: quincy: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) (Resolved, Patrick Donnelly)
Copied to cephsqlite - Backport #61835: pacific: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) (Rejected, Patrick Donnelly)
Copied to cephsqlite - Backport #61836: reef: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) (Resolved, Patrick Donnelly)

Actions #1

Updated by Telemetry Bot almost 2 years ago

  • Crash signature (v1) updated
  • Crash signature (v2) updated
  • Affected Versions v17.0.0, v17.1.0, v17.2.0 added
Actions #2

Updated by Telemetry Bot almost 2 years ago

  • Crash signature (v1) updated
  • Affected Versions v17.2.1, v17.2.2 added
Actions #3

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated
  • Affected Versions v17.2.3, v17.2.4, v17.2.5, v17.2.6 added
Actions #4

Updated by Laura Flores 11 months ago

  • Crash signature (v1) updated

Happened in the gibba cluster:

[lflores@gibba001 ~]$ sudo ceph -s
  cluster:
    id:     5363501e-fdf2-11ed-bac8-3cecef3d8fb8
    health: HEALTH_WARN
            1 pool(s) do not have an application enabled
            1 mgr modules have recently crashed

  services:
    mon: 5 daemons, quorum gibba001,gibba002,gibba005,gibba003,gibba004 (age 38h)
    mgr: gibba006.afdywy(active, since 38h), standbys: gibba008.nemumh
    osd: 62 osds: 62 up (since 38h), 62 in (since 38h); 18 remapped pgs
    rgw: 6 daemons active (6 hosts, 1 zones)

  data:
    pools:   6 pools, 257 pgs
    objects: 83.37M objects, 318 GiB
    usage:   1.1 TiB used, 9.4 TiB / 11 TiB avail
    pgs:     20809893/250109739 objects misplaced (8.320%)
             239 active+clean
             18  active+remapped+backfilling

  io:
    client:   63 KiB/s rd, 0 B/s wr, 63 op/s rd, 42 op/s wr
    recovery: 1.0 MiB/s, 266 objects/s

  progress:
    Global Recovery Event (0s)
      [............................] 

[lflores@gibba001 ~]$ sudo ceph health detail
HEALTH_WARN 1 pool(s) do not have an application enabled; 1 mgr modules have recently crashed
[WRN] POOL_APP_NOT_ENABLED: 1 pool(s) do not have an application enabled
    application not enabled on pool 'foo'
    use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
[WRN] RECENT_MGR_MODULE_CRASH: 1 mgr modules have recently crashed
    mgr module devicehealth crashed in daemon mgr.gibba001.nkuepu on host gibba001 at 2023-05-29T07:32:20.873598Z

[lflores@gibba001 ~]$ sudo ceph crash info 2023-05-29T07:32:20.873598Z_0465ae2d-0220-4d9b-9ef8-debf2e6a5d70
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 764, in get_recent_device_metrics\n    return self._get_device_metrics(devid, min_sample=min_sample)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 553, in _get_device_metrics\n    with self._db_lock, self.db:",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1233, in db\n    raise MgrDBNotReady();",
        "mgr_module.MgrDBNotReady" 
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-05-29T07:32:20.873598Z_0465ae2d-0220-4d9b-9ef8-debf2e6a5d70",
    "entity_name": "mgr.gibba001.nkuepu",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "ActivePyModule::dispatch_remote get_recent_device_metrics",
    "mgr_python_exception": "MgrDBNotReady",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "fbbc6a4724a20738af8118fb5d84831008735002870daa3a76853a0dcaaa3f92",
    "timestamp": "2023-05-29T07:32:20.873598Z",
    "utsname_hostname": "gibba001",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-301.1.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Apr 13 16:24:22 UTC 2021" 
}
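
The gibba crash and the telemetry sample in the description share the same module, caller, and exception, but carry different stack_sig values (the reported line in mgr_module.py is 1203 in 17.2.0 and 1233 in 17.2.6). A small sketch of comparing two reports on the stable fields only is shown here. The crash_key helper is hypothetical and is not part of the ceph CLI or the crash module; the dictionaries are trimmed copies of the two dumps in this ticket.

```python
from typing import Optional, Tuple


def crash_key(report: dict) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Hypothetical grouping key: module, remote caller, and exception name."""
    return (
        report.get("mgr_module"),
        report.get("mgr_module_caller"),
        report.get("mgr_python_exception"),
    )


# Trimmed copy of the telemetry sample from the description (ceph 17.2.0).
telemetry_sample = {
    "ceph_version": "17.2.0",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "ActivePyModule::dispatch_remote get_recent_device_metrics",
    "mgr_python_exception": "MgrDBNotReady",
    "stack_sig": "bb14694bacd8d2b1a934cf4a3f4a27f50f27e160354c2f796b64991db731505e",
}

# Trimmed copy of the gibba crash info (ceph 17.2.6).
gibba_crash = {
    "ceph_version": "17.2.6",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "ActivePyModule::dispatch_remote get_recent_device_metrics",
    "mgr_python_exception": "MgrDBNotReady",
    "stack_sig": "fbbc6a4724a20738af8118fb5d84831008735002870daa3a76853a0dcaaa3f92",
}

same_failure = crash_key(telemetry_sample) == crash_key(gibba_crash)
same_signature = telemetry_sample["stack_sig"] == gibba_crash["stack_sig"]
print(same_failure, same_signature)
```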

From the mgr log:

2023-05-29T07:32:20.746+0000 7fe13d427700  0 [telemetry INFO root] Compiling and sending report to https://telemetry.ceph.com/report
2023-05-29T07:32:20.764+0000 7fe13d427700  0 [telemetry INFO root] Sending ceph report to: https://telemetry.ceph.com/report
2023-05-29T07:32:20.796+0000 7fe15c602700  0 [progress WARNING root] complete: ev c158f0be-5ee5-43ec-9dc4-5754658550ba does not exist
2023-05-29T07:32:20.796+0000 7fe15c602700  0 [progress WARNING root] complete: ev b16c5b1b-f70c-4902-a80a-58955b08c131 does not exist
2023-05-29T07:32:20.796+0000 7fe15c602700  0 [progress WARNING root] complete: ev d8460a9b-583b-4f9d-849c-3ed28768bbff does not exist
2023-05-29T07:32:20.796+0000 7fe15c602700  0 [progress WARNING root] complete: ev fbae7d8f-22a6-4ca3-8304-18a178d62c55 does not exist
2023-05-29T07:32:20.796+0000 7fe15c602700  0 [progress WARNING root] complete: ev 91c8d6fc-a976-4651-84f9-72dbc59c52b5 does not exist
2023-05-29T07:32:20.797+0000 7fe15c602700  0 [progress WARNING root] complete: ev 12f6ceb0-d855-4345-95cf-616f4429160b does not exist
2023-05-29T07:32:20.797+0000 7fe15c602700  0 [progress WARNING root] complete: ev b9af52da-d16d-4106-89b2-eb2220aff415 does not exist
2023-05-29T07:32:20.797+0000 7fe15c602700  0 [progress WARNING root] complete: ev 40bdf7b1-80d7-4fd3-beb6-069b394d7f31 does not exist
2023-05-29T07:32:20.821+0000 7fe1843a6700  0 [prometheus INFO cherrypy.error] [29/May/2023:07:32:20] ENGINE Serving on http://:::9283
2023-05-29T07:32:20.821+0000 7fe1843a6700  0 [prometheus INFO cherrypy.error] [29/May/2023:07:32:20] ENGINE Bus STARTED
2023-05-29T07:32:20.821+0000 7fe1843a6700  0 [prometheus INFO root] Engine started.
2023-05-29T07:32:20.871+0000 7fe13d427700  0 [telemetry INFO root] Sent report to https://telemetry.ceph.com/report
2023-05-29T07:32:20.872+0000 7fe13d427700 -1 Remote method threw exception: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 764, in get_recent_device_metrics
    return self._get_device_metrics(devid, min_sample=min_sample)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 553, in _get_device_metrics
    with self._db_lock, self.db:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1233, in db
    raise MgrDBNotReady();
mgr_module.MgrDBNotReady

2023-05-29T07:32:20.872+0000 7fe13d427700  0 [telemetry ERROR root] Unable to get recent metrics from device with id "TOSHIBA_MG04ACA1_Y9I3K2IYF6XF": Remote method threw exception: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 764, in get_recent_device_metrics
    return self._get_device_metrics(devid, min_sample=min_sample)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 553, in _get_device_metrics
    with self._db_lock, self.db:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1233, in db
    raise MgrDBNotReady();
mgr_module.MgrDBNotReady

2023-05-29T07:32:20.872+0000 7fe13d427700  0 [telemetry ERROR root] Unable to send device report: Device channel is on, but the generated report was empty.

Actions #5

Updated by Laura Flores 11 months ago

  • Category set to devicehealth module
Actions #7

Updated by Laura Flores 11 months ago

Could this be an sqlite issue rather than a problem with the devicehealth module?

src/pybind/mgr/mgr_module.py

1223     @property
1224     def db(self) -> sqlite3.Connection:
1225         assert self._db_lock.locked()
1226         if self._db is not None:
1227             return self._db
1228         db_allowed = self.get_ceph_option("mgr_pool")
1229         if not db_allowed:
1230             raise MgrDBNotReady();
1231         self._db = self.open_db()
1232         if self._db is None:
1233             raise MgrDBNotReady();
1234         return self._db
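
To make the failure mode concrete, here is a self-contained sketch that mirrors the control flow of the property quoted above and shows one way a caller could tolerate the exception. Everything in it is a stand-in: FakeModule, its constructor flags, and get_metrics_or_empty are hypothetical, and this is not the change made in the pull request for this ticket, whose contents are not shown here.

```python
import threading
from typing import Optional


class MgrDBNotReady(RuntimeError):
    """Stand-in for mgr_module.MgrDBNotReady."""


class FakeModule:
    """Hypothetical stand-in that follows the same steps as the quoted `db` property."""

    def __init__(self, pool_enabled: bool, can_open: bool) -> None:
        self._db_lock = threading.Lock()
        self._db: Optional[dict] = None
        self._pool_enabled = pool_enabled  # stands in for get_ceph_option("mgr_pool")
        self._can_open = can_open          # controls whether open_db() succeeds

    def open_db(self) -> Optional[dict]:
        # A dict stands in for the sqlite3.Connection; None models a failed open.
        return {"ready": True} if self._can_open else None

    @property
    def db(self) -> dict:
        assert self._db_lock.locked()
        if self._db is not None:
            return self._db
        if not self._pool_enabled:
            raise MgrDBNotReady()
        self._db = self.open_db()
        if self._db is None:
            # The branch that corresponds to line 1233 in the gibba traceback.
            raise MgrDBNotReady()
        return self._db


def get_metrics_or_empty(module: FakeModule) -> dict:
    """Hypothetical caller that returns an empty result instead of propagating."""
    try:
        with module._db_lock:
            return dict(module.db)
    except MgrDBNotReady:
        return {}


print(get_metrics_or_empty(FakeModule(pool_enabled=True, can_open=False)))
print(get_metrics_or_empty(FakeModule(pool_enabled=True, can_open=True)))
```

In the sketch the exception is raised while the lock is held, and the with statement releases the lock on the way out, so a later call can retry once the database can be opened.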

Actions #8

Updated by Yaarit Hatuka 11 months ago

  • Project changed from mgr to cephsqlite
  • Category deleted (devicehealth module)

Looks like a sqlite issue; Patrick, can you please take a look?

Actions #9

Updated by Patrick Donnelly 11 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Patrick Donnelly
  • Target version set to v19.0.0
  • Backport set to reef,quincy,pacific
  • Pull request ID set to 51858
Actions #10

Updated by Patrick Donnelly 11 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #11

Updated by Backport Bot 11 months ago

  • Copied to Backport #61834: quincy: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Actions #12

Updated by Backport Bot 11 months ago

  • Copied to Backport #61835: pacific: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Actions #13

Updated by Backport Bot 11 months ago

  • Copied to Backport #61836: reef: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Actions #14

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed
Actions #15

Updated by Patrick Donnelly 6 months ago

  • Status changed from Pending Backport to Resolved