Bug #55606


[ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown

Added by Nitzan Mordechai almost 2 years ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: reef,quincy
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):

86fccb1331b555469509cb40ea7998ef724d7f67fa05541e6599b57aacbda12b
ac08e1ef6d35905f8dfb2e8bef5e63013092d4fd1812af45665b7a375c848d4f
41161d56a146d7bb9e51e8b458ced4b09761704adb81abc8819b2521798b5509
0e5a33107a4c9e75eb406fa58976e51f0d12e6429cd47f57338418c114c31aec
79b843397ff3a9e490e460f19f02c73f2348dcaf8946776a443641fadfd6a923
7673dfdf4352116147fdfb3a31a6dd64e8525ca1ac27047746c4d0b19a2b8ee1
1fe134bcc0da984504f6ff8534b02cfb33ee34e038f36fa318560969865ac0b8
7ea7ab36946ce7e1ee42c3db55b1f40c4e09f3f9a2cc99ffd99bb0a631c0f973
0be10189a1e1d49804974c592d340e7592230857f4ac8bc15c11845f9d829f98
1d5dd4d12786fb804db0e4c84384cc70ccc77ce77816086e0818123a91fdc1f9
c5efd9a11456f073d65f44e1fa633a76d4f99cc530cba6ab7357983ad052aa58
3ae3570ef19c6cebdcd5ece0d70c63cb0f91903c52e031c81f419514acfb656f
48c37619e84753c832296a9c285b309043d88fc3f1fc38e126054824623809b1
5e854d73b29ef14826356aacb6ea625cd128d8ed0362a0b117d0ef2b1d2736d0
1795c28174bea254aec378b8393504c5a1849bf13e04066c12af05fb82fcb054
2d141d1e83d9ea238b960e0d50bdc5f423259c1a0c58578de181aa36b942a3b5
c2da23cc7287af55584a694c0355fe8b7998d118c3c4caf81f67f2ef1c390720
b104ab4c8d1bf3668e82ed30637b204f3babe572b1a390a050df031dc65cc8e3


Description

/home/teuthworker/archive/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi/6813955

2022-04-29T19:34:39.756 INFO:teuthology.orchestra.run.smithi169.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2022-04-29T19:34:39.767 DEBUG:teuthology.orchestra.run.smithi155:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2022-04-29T19:34:39.787 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
2022-04-29T19:34:39.787 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 devicehealth.serve:
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 Traceback (most recent call last):
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:  File "/usr/share/ceph/mgr/devicehealth/module.py", line 376, in serve
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:    self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT))
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:  File "/usr/share/ceph/mgr/mgr_module.py", line 1117, in set_kv
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:    self.db.execute(SQL, (key, value))
2022-04-29T19:34:39.789 INFO:tasks.ceph.mgr.y.smithi155.stderr:sqlite3.InternalError: unknown operation
2022-04-29T19:34:39.789 INFO:tasks.ceph.mgr.y.smithi155.stderr:
2022-04-29T19:34:39.834 INFO:tasks.ceph.osd.3.smithi155.stderr:2022-04-29T19:34:39.833+0000 7f38469fe700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 3  (PID: 91832) UID: 0
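
For context, the failing call at the bottom of the traceback is the mgr's generic key/value helper, which issues a single SQL statement against the libcephsqlite-backed database. A minimal sketch of that write path, assuming a hypothetical table name and SQL text (not copied from mgr_module.py):

```python
import sqlite3

# Hypothetical sketch of the set_kv() write path seen in the traceback above.
# The table name and SQL text are illustrative assumptions.
SQL_SET_KV = "INSERT OR REPLACE INTO MgrModuleKV (key, value) VALUES (?, ?);"

def set_kv(db: sqlite3.Connection, key: str, value: str) -> None:
    # With libcephsqlite underneath, a lost RADOS session surfaces here as
    # sqlite3.InternalError ("unknown operation") or
    # sqlite3.OperationalError ("disk I/O error") rather than a clean failure.
    db.execute(SQL_SET_KV, (key, value))
```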


Related issues 17 (0 open, 17 closed)

Has duplicate cephsqlite - Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error (Duplicate, Patrick Donnelly)
Has duplicate cephsqlite - Bug #56844: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate mgr - Bug #58652: Module 'devicehealth' has failed: disk I/O error (Duplicate)
Has duplicate cephsqlite - Bug #56266: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) (Duplicate)
Has duplicate cephsqlite - Bug #56205: crash: File "mgr/devicehealth/module.py", in serve: self.scrape_all() (Duplicate)
Has duplicate cephsqlite - Bug #56279: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) (Duplicate)
Has duplicate cephsqlite - Bug #56287: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate cephsqlite - Bug #56291: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) (Duplicate)
Has duplicate cephsqlite - Bug #56297: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate cephsqlite - Bug #56312: crash: File "mgr/devicehealth/module.py", in serve: ls = self.get_kv('last_scrape') (Duplicate)
Has duplicate cephsqlite - Bug #56321: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate cephsqlite - Bug #56322: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: (Duplicate)
Has duplicate cephsqlite - Bug #56858: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) (Duplicate)
Has duplicate cephsqlite - Bug #56898: crash: File "mgr/devicehealth/module.py", in show_device_metrics: res = self._get_device_metrics(devid, sample=sample) (Duplicate)
Has duplicate cephsqlite - Bug #58351: Module 'devicehealth' has failed: unknown operation (Duplicate, Patrick Donnelly)
Copied to cephsqlite - Backport #62022: reef: [ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown (Resolved, Patrick Donnelly)
Copied to cephsqlite - Backport #62023: quincy: [ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown (Resolved, Patrick Donnelly)
Actions #1

Updated by Laura Flores almost 2 years ago

  • Project changed from RADOS to cephsqlite
Actions #2

Updated by Yaarit Hatuka almost 2 years ago

  • Related to Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error added
Actions #3

Updated by Laura Flores almost 2 years ago

Nitzan Mordechai wrote:

/home/teuthworker/archive/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi/6813955

[...]

mgr log at the time of the crash:

2022-04-29T19:34:39.785+0000 7f5664339700 10 MonCommandCompletion::finish()
2022-04-29T19:34:39.785+0000 7f5664339700 20 mgr Gil Switched to new thread state 0x559d5f25c000
2022-04-29T19:34:39.785+0000 7f5664339700 20 mgr ~Gil Destroying new thread state 0x559d5f25c000
2022-04-29T19:34:39.785+0000 7f5664339700 10 mgr notify_all notify_all: notify_all command
2022-04-29T19:34:39.785+0000 7f5664339700 15 mgr notify_all queuing notify (command) to restful
2022-04-29T19:34:39.785+0000 7f5604101700  0 [devicehealth DEBUG root] skipping duplicate INTEL_SSDPEDMD400G4_CVFT623300GW400BGN
2022-04-29T19:34:39.785+0000 7f5604101700 10 ceph_option_get device_failure_prediction_mode found: none
2022-04-29T19:34:39.785+0000 7f5604101700  0 [devicehealth DEBUG root] set_kv('last_scrape', '20220429-193321')
2022-04-29T19:34:39.785+0000 7f56297f6700 20 mgr Gil Switched to new thread state 0x559d5ea22400
2022-04-29T19:34:39.785+0000 7f56297f6700 20 mgr ~Gil Destroying new thread state 0x559d5ea22400
2022-04-29T19:34:39.785+0000 7f5604101700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
2022-04-29T19:34:39.785+0000 7f5604101700 -1 devicehealth.serve:
2022-04-29T19:34:39.785+0000 7f5604101700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 376, in serve
    self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT))
  File "/usr/share/ceph/mgr/mgr_module.py", line 1117, in set_kv
    self.db.execute(SQL, (key, value))
sqlite3.InternalError: unknown operation

Actions #4

Updated by Patrick Donnelly almost 2 years ago

Problem appears to be that libcephsqlite lost its RADOS lock:

2022-04-29T19:34:39.315+0000 7f5669343700  1 -- 172.21.15.155:0/832204322 <== osd.7 v1:172.21.15.169:6804/91796 154 ==== osd_op_reply(146 main.db.0000000000000000 [call] v59'104 uv103 ondisk = -2 ((2) No such file or directory)) v8 ==== 168+0+0 (unknown 3461770130 0 0) 0x559d5fa1d8c0 con 0x559d5f026000
2022-04-29T19:34:39.315+0000 7f55ef0d7700 -1 client.4184: SimpleRADOSStriper: lock_keeper_main: main.db: lock renewal failed: (2) No such file or directory

At this time, the mgr was reporting that some 10 PGs were inactive.

I will think about how we can make this mgr module more resilient to this type of situation.
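
One way to do that, sketched below under stated assumptions (`serve_once`, `now_string`, and `open_db` are hypothetical names, and this is an illustration of the idea rather than the fix that was ultimately merged): catch sqlite3 errors around the database work in the serve loop and reopen the connection instead of letting the thread die.

```python
import sqlite3

def serve_once(module) -> None:
    """Hypothetical hardening of the devicehealth serve loop: treat a failed
    statement as a sign that the libcephsqlite connection lost its RADOS
    lock, and reopen the database instead of dying with an unhandled
    exception."""
    try:
        module.scrape_all()
        module.set_kv('last_scrape', module.now_string())  # now_string: assumed helper
    except sqlite3.Error as e:
        # Both "unknown operation" (InternalError) and "disk I/O error"
        # (OperationalError) can indicate that the underlying
        # SimpleRADOSStriper session is gone, so rebuild the handle.
        module.log.warning("database error, reopening connection: %s", e)
        module.db = module.open_db()  # open_db: assumed helper re-creating the handle
```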

Actions #5

Updated by Yaarit Hatuka almost 2 years ago

Hi Patrick,

Can you please explain why PGs in an unknown state result in this behavior?

The commonality we see between these issues so far is that they happen after an upgrade (on gibba and the LRC), or due to daemon restarts.

Actions #6

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-12-07_15:47:33-rados-wip-yuri-testing-2022-12-06-1204-distro-default-smithi/7106555

Actions #7

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error added
Actions #8

Updated by Patrick Donnelly about 1 year ago

  • Related to deleted (Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error)
Actions #9

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56844: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #10

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #58652: Module 'devicehealth' has failed: disk I/O error added
Actions #11

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56266: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) added
Actions #12

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56205: crash: File "mgr/devicehealth/module.py", in serve: self.scrape_all() added
Actions #13

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56279: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Actions #14

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56287: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #15

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56291: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Actions #16

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56297: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #17

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56312: crash: File "mgr/devicehealth/module.py", in serve: ls = self.get_kv('last_scrape') added
Actions #18

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56321: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #19

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56322: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Actions #20

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56858: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) added
Actions #21

Updated by Patrick Donnelly about 1 year ago

  • Has duplicate Bug #56898: crash: File "mgr/devicehealth/module.py", in show_device_metrics: res = self._get_device_metrics(devid, sample=sample) added
Actions #22

Updated by Patrick Donnelly about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Target version set to v18.0.0
  • Source set to Community (user)
  • Backport set to quincy
Actions #23

Updated by Patrick Donnelly about 1 year ago

  • Pull request ID set to 50291
Actions #24

Updated by Satoru Takeuchi about 1 year ago

I hit quite a similar problem. My cluster became unhealthy as follows.

```
$ kubectl exec -n ${NS} deployment/rook-ceph-tools -- ceph status
  cluster:
    id:     b52d5f3d-ba14-442e-a089-0bca47b83758
    health: HEALTH_ERR
            441 large omap objects
            Module 'devicehealth' has failed: unknown operation
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 1282/321355596 objects degraded (0.000%), 88 pgs degraded, 106 pgs undersized
            62 pgs not deep-scrubbed in time
            62 pgs not scrubbed in time
```

Before this problem occurred, the following message appeared in mgr.a's log.

```
2023-03-16 09:08:24 debug 2023-03-16T00:08:24.478+0000 7feacbcf5700 -1 client.107899129: SimpleRADOSStriper: lock_keeper_main: main.db: lock renewal failed: (2) No such file or directory
```

At that time, mgr failover didn't happen. In addition, this problem was resolved after restarting mgr.a.

I have two questions about this.

a. Can my problem be considered an instance of this issue's bug?
b. If so, is there any configuration that works around this problem until PR 50291 is merged?

software environment:
- ceph: v17.2.5
- rook: v1.10.7

Actions #25

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.0.0, v17.2.0, v17.2.3, v17.2.4, v17.2.5 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=396bcc462c052481a3b21c9f0a35da3582e33cbca357b481453a7b6fcfd3c9da

Sanitized backtrace:

    File "mgr/devicehealth/module.py", in serve: self.scrape_all()
    File "mgr/devicehealth/module.py", in scrape_all: self.put_device_metrics(device, data)
    File "mgr/devicehealth/module.py", in put_device_metrics: self._create_device(devid)
    File "mgr/devicehealth/module.py", in _create_device: cursor = self.db.execute(SQL, (devid,))

Crash dump sample:
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n    self.scrape_all()",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n    self.put_device_metrics(device, data)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n    self._create_device(devid)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n    cursor = self.db.execute(SQL, (devid,))",
        "<redacted>" 
    ],
    "ceph_version": "17.2.3",
    "crash_id": "2022-12-23T00:08:19.380001Z_532f16aa-9667-4774-9740-7eb6486407c1",
    "entity_name": "mgr.380e06557672265883c1723194701ca26b08aabe",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "InternalError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "92fb822a43775ec1de7d41c75c8d0ec0bbb72ba5429a46000f34101c1bc6524e",
    "timestamp": "2022-12-23T00:08:19.380001Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-87-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#98~18.04.1-Ubuntu SMP Wed Sep 22 10:45:04 UTC 2021" 
}

Actions #26

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.1.0, v17.2.1, v17.2.6 added
Actions #27

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #28

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #29

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #30

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #31

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #32

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #33

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #34

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #35

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #36

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #37

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #38

Updated by Telemetry Bot 12 months ago

  • Crash signature (v2) updated (diff)
Actions #39

Updated by Patrick Donnelly 12 months ago

  • Status changed from In Progress to Fix Under Review
  • Target version changed from v18.0.0 to v19.0.0
  • Backport changed from quincy to reef,quincy
  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #40

Updated by Patrick Donnelly 12 months ago

  • Has duplicate Bug #58351: Module 'devicehealth' has failed: unknown operation added
Actions #41

Updated by Patrick Donnelly 10 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #42

Updated by Backport Bot 10 months ago

  • Copied to Backport #62022: reef: [ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown added
Actions #43

Updated by Backport Bot 10 months ago

  • Copied to Backport #62023: quincy: [ERR] Unhandled exception from module 'devicehealth' while running on mgr.y: unknown added
Actions #44

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
Actions #45

Updated by Patrick Donnelly 4 months ago

  • Status changed from Pending Backport to Resolved