Bug #55606


[ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown

Added by Nitzan Mordechai almost 2 years ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: reef,quincy
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):

86fccb1331b555469509cb40ea7998ef724d7f67fa05541e6599b57aacbda12b
ac08e1ef6d35905f8dfb2e8bef5e63013092d4fd1812af45665b7a375c848d4f
41161d56a146d7bb9e51e8b458ced4b09761704adb81abc8819b2521798b5509
0e5a33107a4c9e75eb406fa58976e51f0d12e6429cd47f57338418c114c31aec
79b843397ff3a9e490e460f19f02c73f2348dcaf8946776a443641fadfd6a923
7673dfdf4352116147fdfb3a31a6dd64e8525ca1ac27047746c4d0b19a2b8ee1
1fe134bcc0da984504f6ff8534b02cfb33ee34e038f36fa318560969865ac0b8
7ea7ab36946ce7e1ee42c3db55b1f40c4e09f3f9a2cc99ffd99bb0a631c0f973
0be10189a1e1d49804974c592d340e7592230857f4ac8bc15c11845f9d829f98
1d5dd4d12786fb804db0e4c84384cc70ccc77ce77816086e0818123a91fdc1f9
c5efd9a11456f073d65f44e1fa633a76d4f99cc530cba6ab7357983ad052aa58
3ae3570ef19c6cebdcd5ece0d70c63cb0f91903c52e031c81f419514acfb656f
48c37619e84753c832296a9c285b309043d88fc3f1fc38e126054824623809b1
5e854d73b29ef14826356aacb6ea625cd128d8ed0362a0b117d0ef2b1d2736d0
1795c28174bea254aec378b8393504c5a1849bf13e04066c12af05fb82fcb054
2d141d1e83d9ea238b960e0d50bdc5f423259c1a0c58578de181aa36b942a3b5
c2da23cc7287af55584a694c0355fe8b7998d118c3c4caf81f67f2ef1c390720
b104ab4c8d1bf3668e82ed30637b204f3babe572b1a390a050df031dc65cc8e3


Description

/home/teuthworker/archive/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi/6813955

2022-04-29T19:34:39.756 INFO:teuthology.orchestra.run.smithi169.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2022-04-29T19:34:39.767 DEBUG:teuthology.orchestra.run.smithi155:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2022-04-29T19:34:39.787 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
2022-04-29T19:34:39.787 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 devicehealth.serve:
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 Traceback (most recent call last):
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:  File "/usr/share/ceph/mgr/devicehealth/module.py", line 376, in serve
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:    self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT))
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:  File "/usr/share/ceph/mgr/mgr_module.py", line 1117, in set_kv
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:    self.db.execute(SQL, (key, value))
2022-04-29T19:34:39.789 INFO:tasks.ceph.mgr.y.smithi155.stderr:sqlite3.InternalError: unknown operation
2022-04-29T19:34:39.789 INFO:tasks.ceph.mgr.y.smithi155.stderr:
2022-04-29T19:34:39.834 INFO:tasks.ceph.osd.3.smithi155.stderr:2022-04-29T19:34:39.833+0000 7f38469fe700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 3  (PID: 91832) UID: 0
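
For context, the failing call at the bottom of the traceback is the mgr's generic key/value helper, which issues a single SQL statement against the libcephsqlite-backed database. A minimal sketch of that write path, assuming a hypothetical table name and SQL text (not copied from mgr_module.py):

```python
import sqlite3

# Hypothetical sketch of the set_kv() write path seen in the traceback above.
# The table name and SQL text are illustrative assumptions.
SQL_SET_KV = "INSERT OR REPLACE INTO MgrModuleKV (key, value) VALUES (?, ?);"

def set_kv(db: sqlite3.Connection, key: str, value: str) -> None:
    # With libcephsqlite underneath, a lost RADOS session surfaces here as
    # sqlite3.InternalError ("unknown operation") or
    # sqlite3.OperationalError ("disk I/O error") rather than a clean failure.
    db.execute(SQL_SET_KV, (key, value))
```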


Related issues 17 (0 open, 17 closed)

Has duplicate cephsqlite - Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error (Duplicate, Patrick Donnelly)
Has duplicate cephsqlite - Bug #56844: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate mgr - Bug #58652: Module 'devicehealth' has failed: disk I/O error (Duplicate)
Has duplicate cephsqlite - Bug #56266: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) (Duplicate)
Has duplicate cephsqlite - Bug #56205: crash: File "mgr/devicehealth/module.py", in serve: self.scrape_all() (Duplicate)
Has duplicate cephsqlite - Bug #56279: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) (Duplicate)
Has duplicate cephsqlite - Bug #56287: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate cephsqlite - Bug #56291: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) (Duplicate)
Has duplicate cephsqlite - Bug #56297: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate cephsqlite - Bug #56312: crash: File "mgr/devicehealth/module.py", in serve: ls = self.get_kv('last_scrape') (Duplicate)
Has duplicate cephsqlite - Bug #56321: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate cephsqlite - Bug #56322: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate cephsqlite - Bug #56858: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) (Duplicate)
Has duplicate cephsqlite - Bug #56898: crash: File "mgr/devicehealth/module.py", in show_device_metrics: res = self._get_device_metrics(devid, sample=sample) (Duplicate)
Has duplicate cephsqlite - Bug #58351: Module 'devicehealth' has failed: unknown operation (Duplicate, Patrick Donnelly)
Copied to cephsqlite - Backport #62022: reef: [ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown (Resolved, Patrick Donnelly)
Copied to cephsqlite - Backport #62023: quincy: [ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown (Resolved, Patrick Donnelly)
Actions #1

Updated by Laura Flores almost 2 years ago

  • Project changed from RADOS to cephsqlite
Actions #2

Updated by Yaarit Hatuka almost 2 years ago

  • Related to Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error added
Actions #3

Updated by Laura Flores almost 2 years ago

Nitzan Mordechai wrote:

/home/teuthworker/archive/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi/6813955

[...]

mgr log at the time of the crash:

2022-04-29T19:34:39.785+0000 7f5664339700 10 MonCommandCompletion::finish()
2022-04-29T19:34:39.785+0000 7f5664339700 20 mgr Gil Switched to new thread state 0x559d5f25c000
2022-04-29T19:34:39.785+0000 7f5664339700 20 mgr ~Gil Destroying new thread state 0x559d5f25c000
2022-04-29T19:34:39.785+0000 7f5664339700 10 mgr notify_all notify_all: notify_all command
2022-04-29T19:34:39.785+0000 7f5664339700 15 mgr notify_all queuing notify (command) to restful
2022-04-29T19:34:39.785+0000 7f5604101700  0 [devicehealth DEBUG root] skipping duplicate INTEL_SSDPEDMD400G4_CVFT623300GW400BGN
2022-04-29T19:34:39.785+0000 7f5604101700 10 ceph_option_get device_failure_prediction_mode found: none
2022-04-29T19:34:39.785+0000 7f5604101700  0 [devicehealth DEBUG root] set_kv('last_scrape', '20220429-193321')
2022-04-29T19:34:39.785+0000 7f56297f6700 20 mgr Gil Switched to new thread state 0x559d5ea22400
2022-04-29T19:34:39.785+0000 7f56297f6700 20 mgr ~Gil Destroying new thread state 0x559d5ea22400
2022-04-29T19:34:39.785+0000 7f5604101700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
2022-04-29T19:34:39.785+0000 7f5604101700 -1 devicehealth.serve:
2022-04-29T19:34:39.785+0000 7f5604101700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 376, in serve
    self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT))
  File "/usr/share/ceph/mgr/mgr_module.py", line 1117, in set_kv
    self.db.execute(SQL, (key, value))
sqlite3.InternalError: unknown operation

Actions #4

Updated by Patrick Donnelly almost 2 years ago

Problem appears to be that libcephsqlite lost its RADOS lock:

2022-04-29T19:34:39.315+0000 7f5669343700  1 -- 172.21.15.155:0/832204322 <== osd.7 v1:172.21.15.169:6804/91796 154 ==== osd_op_reply(146 main.db.0000000000000000 [call] v59'104 uv103 ondisk = -2 ((2) No such file or directory)) v8 ==== 168+0+0 (unknown 3461770130 0 0) 0x559d5fa1d8c0 con 0x559d5f026000
2022-04-29T19:34:39.315+0000 7f55ef0d7700 -1 client.4184: SimpleRADOSStriper: lock_keeper_main: main.db: lock renewal failed: (2) No such file or directory

At this time, the mgr was reporting that some 10 PGs were inactive.

I will think about how we can make this mgr module more resilient to this type of situation.
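
One way to do that, sketched below under stated assumptions (`serve_once`, `now_string`, and `open_db` are hypothetical names, and this is an illustration of the idea rather than the fix that was ultimately merged): catch sqlite3 errors around the database work in the serve loop and reopen the connection instead of letting the thread die.

```python
import sqlite3

def serve_once(module) -> None:
    """Hypothetical hardening of the devicehealth serve loop: treat a failed
    statement as a sign that the libcephsqlite connection lost its RADOS
    lock, and reopen the database instead of dying with an unhandled
    exception."""
    try:
        module.scrape_all()
        module.set_kv('last_scrape', module.now_string())  # now_string: assumed helper
    except sqlite3.Error as e:
        # Both "unknown operation" (InternalError) and "disk I/O error"
        # (OperationalError) can indicate that the underlying
        # SimpleRADOSStriper session is gone, so rebuild the handle.
        module.log.warning("database error, reopening connection: %s", e)
        module.db = module.open_db()  # open_db: assumed helper re-creating the handle
```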

Actions #5

Updated by Yaarit Hatuka almost 2 years ago

Hi Patrick,

Can you please explain why PGs in an unknown state result in this behavior?

The commonality we see between these issues so far is that they happen after an upgrade (on gibba and the LRC), or due to daemon restarts.

Actions #6

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-12-07_15:47:33-rados-wip-yuri-testing-2022-12-06-1204-distro-default-smithi/7106555

Actions #7

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error added
Actions #8

Updated by Patrick Donnelly about 1 year ago

  • Related to deleted (Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error)
Actions #9

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56844: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #10

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #58652: Module 'devicehealth' has failed: disk I/O error added
Actions #11

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56266: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) added
Actions #12

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56205: crash: File "mgr/devicehealth/module.py", in serve: self.scrape_all() added
Actions #13

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56279: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Actions #14

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56287: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #15

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56291: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Actions #16

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56297: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #17

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56312: crash: File "mgr/devicehealth/module.py", in serve: ls = self.get_kv('last_scrape') added
Actions #18

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56321: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #19

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56322: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #20

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56858: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) added
Actions #21

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56898: crash: File "mgr/devicehealth/module.py", in show_device_metrics: res = self._get_device_metrics(devid, sample=sample) added
Actions #22

Updated by Patrick Donnelly about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Target version set to v18.0.0
  • Source set to Community (user)
  • Backport set to quincy
Actions #23

Updated by Patrick Donnelly about 1 year ago

  • Pull request ID set to 50291
Actions #24

Updated by Satoru Takeuchi about 1 year ago

I hit quite a similar problem. My cluster became unhealthy as follows.

```
$ kubectl exec -n ${NS} deployment/rook-ceph-tools -- ceph status
  cluster:
    id:     b52d5f3d-ba14-442e-a089-0bca47b83758
    health: HEALTH_ERR
            441 large omap objects
            Module 'devicehealth' has failed: unknown operation
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 1282/321355596 objects degraded (0.000%), 88 pgs degraded, 106 pgs undersized
            62 pgs not deep-scrubbed in time
            62 pgs not scrubbed in time
```

Before this problem occurred, the following message appeared in mgr.a's log.

```
2023-03-16 09:08:24 debug 2023-03-16T00:08:24.478+0000 7feacbcf5700 -1 client.107899129: SimpleRADOSStriper: lock_keeper_main: main.db: lock renewal failed: (2) No such file or directory
```

At that time, mgr failover didn't happen. In addition, this problem was resolved after restarting mgr.a.

I have two questions about this.

a. Can my problem be considered an instance of this issue's bug?
b. If so, is there any configuration that works around this problem until PR 50291 is merged?

software environment:
- ceph: v17.2.5
- rook: v1.10.7

Actions #25

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.0.0, v17.2.0, v17.2.3, v17.2.4, v17.2.5 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=396bcc462c052481a3b21c9f0a35da3582e33cbca357b481453a7b6fcfd3c9da

Sanitized backtrace:

    File "mgr/devicehealth/module.py", in serve: self.scrape_all()
    File "mgr/devicehealth/module.py", in scrape_all: self.put_device_metrics(device, data)
    File "mgr/devicehealth/module.py", in put_device_metrics: self._create_device(devid)
    File "mgr/devicehealth/module.py", in _create_device: cursor = self.db.execute(SQL, (devid,))

Crash dump sample:
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n    self.scrape_all()",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n    self.put_device_metrics(device, data)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n    self._create_device(devid)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n    cursor = self.db.execute(SQL, (devid,))",
        "<redacted>" 
    ],
    "ceph_version": "17.2.3",
    "crash_id": "2022-12-23T00:08:19.380001Z_532f16aa-9667-4774-9740-7eb6486407c1",
    "entity_name": "mgr.380e06557672265883c1723194701ca26b08aabe",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "InternalError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "92fb822a43775ec1de7d41c75c8d0ec0bbb72ba5429a46000f34101c1bc6524e",
    "timestamp": "2022-12-23T00:08:19.380001Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-87-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#98~18.04.1-Ubuntu SMP Wed Sep 22 10:45:04 UTC 2021" 
}

Actions #26

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.1.0, v17.2.1, v17.2.6 added
Actions #27

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #28

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #29

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #30

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #31

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #32

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #33

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #34

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #35

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #36

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #37

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #38

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #39

Updated by Patrick Donnelly 12 months ago

  • Status changed from In Progress to Fix Under Review
  • Target version changed from v18.0.0 to v19.0.0
  • Backport changed from quincy to reef,quincy
  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #40

Updated by Patrick Donnelly 12 months ago

  • Has duplicate Bug #58351: Module 'devicehealth' has failed: unknown operation added
Actions #41

Updated by Patrick Donnelly 10 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #42

Updated by Backport Bot 10 months ago

  • Copied to Backport #62022: reef: [ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown added
Actions #43

Updated by Backport Bot 10 months ago

  • Copied to Backport #62023: quincy: [ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown added
Actions #44

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
Actions #45

Updated by Patrick Donnelly 4 months ago

  • Status changed from Pending Backport to Resolved