Bug #65494
ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists"
Description
We hit a random error (about 1 in 200 deploys): after creating a rook
cephcluster and cephblockpool successfully, configuring rbd mirroring, and
adding a cephrbdmirror, the cephrbdmirror never becomes ready (we waited a few hours).
Looking at ceph status shows:
  cluster:
    id: dbf6c8b8-dd8b-4117-933e-93778b1a7274
    health: HEALTH_ERR
            Module 'devicehealth' has failed: table Device already exists
In rook-ceph-mgr-a pod logs we see:
debug 2024-04-09T13:05:48.947+0000 7f0632607700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 524, in check
    return func(self, *args, **kwargs)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 350, in _do_serve
    if self.db_ready() and self.enable_monitoring:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1271, in db_ready
    return self.db is not None
  File "/usr/share/ceph/mgr/mgr_module.py", line 1283, in db
    self._db = self.open_db()
  File "/usr/share/ceph/mgr/mgr_module.py", line 1264, in open_db
    self.configure_db(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1241, in configure_db
    self.load_schema(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1230, in load_schema
    self.maybe_upgrade(db, int(row['value']))
  File "/usr/share/ceph/mgr/mgr_module.py", line 1207, in maybe_upgrade
    db.executescript(self.SCHEMA)
sqlite3.OperationalError: table Device already exists

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 394, in serve
    self._do_serve()
  File "/usr/share/ceph/mgr/mgr_module.py", line 532, in check
    self.open_db();
  File "/usr/share/ceph/mgr/mgr_module.py", line 1264, in open_db
    self.configure_db(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1241, in configure_db
    self.load_schema(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1230, in load_schema
    self.maybe_upgrade(db, int(row['value']))
  File "/usr/share/ceph/mgr/mgr_module.py", line 1207, in maybe_upgrade
    db.executescript(self.SCHEMA)
sqlite3.OperationalError: table Device already exists
Restarting the ceph-mgr pod does not help, rbd-mirroring is broken and
we don't have any workaround.
For testing ramen this is not that bad, since we can delete the environment and
recreate it in 10 minutes, but for a real deployment this looks bad.
See also
- Upstream issue: https://github.com/RamenDR/ramen/issues/1298
Updated by Nir Soffer 15 days ago
Thread from rook slack:
https://rook-io.slack.com/archives/CK9CF5H2R/p1711467112958679
Updated by Yaarit Hatuka 15 days ago
- Project changed from mgr to cephsqlite
Looks like a sqlite issue. Patrick, can you take a look please?
Updated by Nir Soffer 15 days ago
- image: quay.io/ceph/ceph:v18
- imageID: quay.io/ceph/ceph@sha256:8c1697a0a924bbd625c9f1b33893bbc47b97b8a8c66666a715fe60b353b1d93e
Updated by Patrick Donnelly 15 days ago
debug 2024-04-09T13:05:18.368+0000 7f644947d700 0 log_channel(cluster) log [DBG] : pgmap v7: 1 pgs: 1 creating+peering; 0 B data, 26 MiB used, 50 GiB / 50 GiB avail
debug 2024-04-09T13:05:18.654+0000 7f64180bc700 0 [devicehealth INFO root] creating main.db for devicehealth
ignoring --setuser ceph since I am not root
ignoring --setgroup ceph since I am not root
debug 2024-04-09T13:05:18.784+0000 7fd37cd18200 0 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable), process ceph-mgr, pid 1
mgr restarted while db was being initialized
debug 2024-04-09T13:05:48.913+0000 7f0632607700 0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.913+0000 7f0632607700 0 [devicehealth ERROR root] Caught fatal database error: table Device already exists
There is a nuance of sqlite transactions with Python that I was not aware of. I'll write up a fix soon.
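For illustration of that nuance, here is a minimal sketch using plain sqlite3 (the mgr actually runs on libcephsqlite over RADOS, and the two-table schema below is a stand-in, not the real devicehealth DDL): executescript() commits any pending transaction before running the script but does not wrap the script itself in a transaction, so a mgr restart mid-script can leave a partially created schema behind, and the next startup re-runs the full schema and trips over the leftover Device table.

import sqlite3

SCHEMA = """
CREATE TABLE Device (devid TEXT PRIMARY KEY);
CREATE TABLE DeviceHealthMetrics (devid TEXT, raw_smart TEXT);
"""

db = sqlite3.connect(":memory:")

# Simulate a mgr restart in the middle of schema creation: executescript()
# commits any pending transaction before running the script but does not
# wrap the script itself in one, so each statement takes effect as it runs
# and an interrupted run leaves the tables created so far behind.
db.execute("CREATE TABLE Device (devid TEXT PRIMARY KEY)")
db.commit()

# The next startup re-applies the full schema and hits the leftover table.
try:
    db.executescript(SCHEMA)
except sqlite3.OperationalError as e:
    print(e)  # table Device already exists

# If the script carried its own BEGIN/COMMIT (which executescript() does not
# add for you), an interrupted run would roll back and a retry would start
# from a clean database; guarded DDL is another way to tolerate leftovers:
db.executescript(SCHEMA.replace("CREATE TABLE", "CREATE TABLE IF NOT EXISTS"))

The IF NOT EXISTS rewrite at the end is purely illustrative of an idempotent variant; the actual fix is in the pull request linked below.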
Updated by Patrick Donnelly 15 days ago
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
- Target version set to v20.0.0
- Source set to Community (dev)
- Backport set to squid,reef
Updated by Ilya Dryomov 15 days ago
Nir Soffer wrote:
Restarting the ceph-mgr pod does not help, rbd-mirroring is broken and
we don't have any workaround.
Hi Nir,
I don't think this is related to RBD mirroring. From the manager log, it's clear that MirrorSnapshotScheduleHandler in rbd_support module continued running after that exception was thrown in devicehealth module:
debug 2024-04-09T13:05:48.913+0000 7f0632607700 0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.913+0000 7f0632607700 0 [devicehealth ERROR root] Caught fatal database error: table Device already exists
debug 2024-04-09T13:05:48.947+0000 7f0632607700 0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.947+0000 7f0632607700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.a: table Device already exists
[...]
debug 2024-04-09T13:06:36.965+0000 7f06225e7700 0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T13:06:36.966+0000 7f06225e7700 0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T13:06:36.967+0000 7f06225e7700 0 [rbd_support INFO root] load_schedule: 2 [ { "interval": "2m", "start_time": "14:00:00-05:00" } ]
[...]
debug 2024-04-09T13:07:36.980+0000 7f06225e7700 0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T13:07:36.980+0000 7f06225e7700 0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T13:07:36.985+0000 7f06225e7700 0 [rbd_support INFO root] load_schedule: 2 [ { "interval": "2m", "start_time": "14:00:00-05:00" } ]
[...]
debug 2024-04-09T15:41:38.808+0000 7f06225e7700 0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T15:41:38.811+0000 7f06225e7700 0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T15:41:38.813+0000 7f06225e7700 0 [rbd_support INFO root] load_schedule: 2 [ { "interval": "2m", "start_time": "14:00:00-05:00" } ]
Have you tried restarting rbd-mirror pod like you did in https://tracker.ceph.com/issues/65487? Could the RBD mirroring part of this ticket be a duplicate of that?
Updated by Nir Soffer 15 days ago
Ilya Dryomov wrote in #note-6:
Nir Soffer wrote:
Restarting the ceph-mgr pod does not help, rbd-mirroring is broken and
we don't have any workaround.

Hi Nir,
I don't think this is related to RBD mirroring. From the manager log, it's clear that MirrorSnapshotScheduleHandler in rbd_support module continued running after that exception was thrown in devicehealth module:
[...]
Have you tried restarting rbd-mirror pod like you did in https://tracker.ceph.com/issues/65487? Could the RBD mirroring part of this ticket be a duplicate of that?
I probably tried to restart rbd-mirror, but this is not the same issue. In this case the
cephrbdmirror never becomes Ready - it never reports any status.
In https://tracker.ceph.com/issues/65487 the cephrbdmirror becomes Ready and everything
looks normal, but the cephblockpool .status.mirroringStatus.summary.daemon_health never
becomes OK.
In our test log you can see:
cephcluster is created successfully:
2024-04-09 16:02:40,569 DEBUG [dr2/0] Deploying rook ceph cluster
2024-04-09 16:02:41,236 DEBUG [dr2/0] cephblockpool.ceph.rook.io/builtin-mgr created
2024-04-09 16:02:41,250 DEBUG [dr2/0] cephcluster.ceph.rook.io/my-cluster created
2024-04-09 16:02:41,253 DEBUG [dr2/0] Waiting until rook ceph cluster is ready
2024-04-09 16:02:41,318 DEBUG [dr2/0] 'cephcluster/my-cluster' output='jsonpath={.status.phase}' found in 0.06 seconds
2024-04-09 16:05:07,151 DEBUG [dr2/0] cephcluster.ceph.rook.io/my-cluster condition met
2024-04-09 16:05:07,234 DEBUG [dr2/0] ceph cluster status:
2024-04-09 16:05:07,234 DEBUG [dr2/0]   conditions:
2024-04-09 16:05:07,234 DEBUG [dr2/0]   - lastHeartbeatTime: '2024-04-09T13:05:07Z'
2024-04-09 16:05:07,234 DEBUG [dr2/0]     lastTransitionTime: '2024-04-09T13:05:07Z'
2024-04-09 16:05:07,234 DEBUG [dr2/0]     message: Cluster created successfully
2024-04-09 16:05:07,234 DEBUG [dr2/0]     reason: ClusterCreated
2024-04-09 16:05:07,234 DEBUG [dr2/0]     status: 'True'
2024-04-09 16:05:07,234 DEBUG [dr2/0]     type: Ready
2024-04-09 16:05:07,235 DEBUG [dr2/0]   message: Cluster created successfully
2024-04-09 16:05:07,235 DEBUG [dr2/0]   observedGeneration: 2
2024-04-09 16:05:07,235 DEBUG [dr2/0]   phase: Ready
2024-04-09 16:05:07,235 DEBUG [dr2/0]   state: Created
2024-04-09 16:05:07,235 DEBUG [dr2/0]   storage:
2024-04-09 16:05:07,235 DEBUG [dr2/0]     osd:
2024-04-09 16:05:07,235 DEBUG [dr2/0]       storeType:
2024-04-09 16:05:07,235 DEBUG [dr2/0]         bluestore: 1
2024-04-09 16:05:07,235 DEBUG [dr2/0]   version:
2024-04-09 16:05:07,235 DEBUG [dr2/0]     image: quay.io/ceph/ceph:v18
2024-04-09 16:05:07,235 DEBUG [dr2/0]     version: 18.2.2-0
ceph status after cluster is created:
2024-04-09 16:05:10,870 DEBUG [dr2/0] deployment "rook-ceph-tools" successfully rolled out
2024-04-09 16:05:10,872 DEBUG [dr2/0] ceph status:
2024-04-09 16:05:11,683 DEBUG [dr2/0]   cluster:
2024-04-09 16:05:11,684 DEBUG [dr2/0]     id: 2aeb5e7d-352f-4227-9c99-6d2d422751f9
2024-04-09 16:05:11,684 DEBUG [dr2/0]     health: HEALTH_OK
2024-04-09 16:05:11,684 DEBUG [dr2/0]
2024-04-09 16:05:11,684 DEBUG [dr2/0]   services:
2024-04-09 16:05:11,684 DEBUG [dr2/0]     mon: 1 daemons, quorum a (age 42s)
2024-04-09 16:05:11,684 DEBUG [dr2/0]     mgr: a(active, since 3s)
2024-04-09 16:05:11,684 DEBUG [dr2/0]     osd: 1 osds: 0 up, 1 in (since 10s)
2024-04-09 16:05:11,684 DEBUG [dr2/0]
2024-04-09 16:05:11,684 DEBUG [dr2/0]   data:
2024-04-09 16:05:11,684 DEBUG [dr2/0]     pools: 0 pools, 0 pgs
2024-04-09 16:05:11,684 DEBUG [dr2/0]     objects: 0 objects, 0 B
2024-04-09 16:05:11,684 DEBUG [dr2/0]     usage: 0 B used, 0 B / 0 B avail
2024-04-09 16:05:11,684 DEBUG [dr2/0]     pgs:
cephblockpool created successfully:
2024-04-09 16:05:11,827 DEBUG [dr2/0] Creating rbd pool and storage class
2024-04-09 16:05:12,020 DEBUG [dr2/0] cephblockpool.ceph.rook.io/replicapool created
2024-04-09 16:05:12,033 DEBUG [dr2/0] storageclass.storage.k8s.io/rook-ceph-block created
2024-04-09 16:05:12,036 DEBUG [dr2/0] Waiting until ceph block pool is ready
2024-04-09 16:05:16,353 DEBUG [dr2/0] 'cephblockpool/replicapool' output='jsonpath={.status.phase}' found in 4.32 seconds
2024-04-09 16:05:41,886 DEBUG [dr2/0] cephblockpool.ceph.rook.io/replicapool condition met
2024-04-09 16:05:41,888 DEBUG [dr2/0] Waiting for replica pool peer token
2024-04-09 16:05:42,048 DEBUG [dr2/0] cephblockpool.ceph.rook.io/replicapool condition met
2024-04-09 16:05:42,127 DEBUG [dr2/0] ceph pool status:
2024-04-09 16:05:42,127 DEBUG [dr2/0]   info:
2024-04-09 16:05:42,127 DEBUG [dr2/0]     rbdMirrorBootstrapPeerSecretName: pool-peer-token-replicapool
2024-04-09 16:05:42,127 DEBUG [dr2/0]   mirroringInfo:
2024-04-09 16:05:42,127 DEBUG [dr2/0]     lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG [dr2/0]     mode: image
2024-04-09 16:05:42,127 DEBUG [dr2/0]     site_name: 2aeb5e7d-352f-4227-9c99-6d2d422751f9
2024-04-09 16:05:42,127 DEBUG [dr2/0]   mirroringStatus:
2024-04-09 16:05:42,127 DEBUG [dr2/0]     lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG [dr2/0]     summary:
2024-04-09 16:05:42,127 DEBUG [dr2/0]       daemon_health: UNKNOWN
2024-04-09 16:05:42,127 DEBUG [dr2/0]       health: UNKNOWN
2024-04-09 16:05:42,127 DEBUG [dr2/0]       image_health: OK
2024-04-09 16:05:42,127 DEBUG [dr2/0]       states: {}
2024-04-09 16:05:42,127 DEBUG [dr2/0]   observedGeneration: 1
2024-04-09 16:05:42,127 DEBUG [dr2/0]   phase: Ready
2024-04-09 16:05:42,127 DEBUG [dr2/0]   snapshotScheduleStatus:
2024-04-09 16:05:42,127 DEBUG [dr2/0]     lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG [dr2/0]     snapshotSchedules:
2024-04-09 16:05:42,127 DEBUG [dr2/0]     - image: '-'
2024-04-09 16:05:42,127 DEBUG [dr2/0]       items:
2024-04-09 16:05:42,127 DEBUG [dr2/0]       - interval: 2m
2024-04-09 16:05:42,127 DEBUG [dr2/0]         start_time: 14:00:00-05:00
2024-04-09 16:05:42,127 DEBUG [dr2/0]       namespace: '-'
2024-04-09 16:05:42,127 DEBUG [dr2/0]       pool: replicapool
Configuring rbd mirroring on both clusters:
2024-04-09 16:06:50,002 DEBUG [rdr/0] Getting mirroring info site name for cluster 'dr1'
2024-04-09 16:06:50,084 DEBUG [rdr/0] 'cephblockpools.ceph.rook.io/replicapool' output='jsonpath={.status.mirroringInfo.site_name}' found in 0.08 seconds
2024-04-09 16:06:50,084 DEBUG [rdr/0] Getting rbd mirror boostrap peer secret name for cluster 'dr1'
2024-04-09 16:06:50,183 DEBUG [rdr/0] Getting secret pool-peer-token-replicapool token for cluster 'dr1'
2024-04-09 16:06:50,253 DEBUG [rdr/0] Getting mirroring info site name for cluster 'dr2'
2024-04-09 16:06:50,369 DEBUG [rdr/0] 'cephblockpools.ceph.rook.io/replicapool' output='jsonpath={.status.mirroringInfo.site_name}' found in 0.12 seconds
2024-04-09 16:06:50,369 DEBUG [rdr/0] Getting rbd mirror boostrap peer secret name for cluster 'dr2'
2024-04-09 16:06:50,454 DEBUG [rdr/0] Getting secret pool-peer-token-replicapool token for cluster 'dr2'
2024-04-09 16:06:50,547 DEBUG [rdr/0] Setting up mirroring from 'dr2' to 'dr1'
2024-04-09 16:06:50,547 DEBUG [rdr/0] Applying rbd mirror secret in cluster 'dr1'
2024-04-09 16:06:50,708 DEBUG [rdr/0] secret/2aeb5e7d-352f-4227-9c99-6d2d422751f9 created
2024-04-09 16:06:50,711 DEBUG [rdr/0] Configure peers for cluster 'dr1'
2024-04-09 16:06:50,842 DEBUG [rdr/0] cephblockpool.ceph.rook.io/replicapool patched
2024-04-09 16:06:50,847 DEBUG [rdr/0] Apply rbd mirror to cluster 'dr1'
2024-04-09 16:06:50,984 DEBUG [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror created
2024-04-09 16:06:50,990 DEBUG [rdr/0] Setting up mirroring from 'dr1' to 'dr2'
2024-04-09 16:06:50,990 DEBUG [rdr/0] Applying rbd mirror secret in cluster 'dr2'
2024-04-09 16:06:51,146 DEBUG [rdr/0] secret/c5123696-8638-4c9e-aaf4-2a59333bf389 created
2024-04-09 16:06:51,152 DEBUG [rdr/0] Configure peers for cluster 'dr2'
2024-04-09 16:06:51,241 DEBUG [rdr/0] cephblockpool.ceph.rook.io/replicapool patched
2024-04-09 16:06:51,244 DEBUG [rdr/0] Apply rbd mirror to cluster 'dr2'
2024-04-09 16:06:51,399 DEBUG [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror created
Waiting for cephrbdmirror for cluster dr1 succeeded in 5 seconds:
2024-04-09 16:06:51,406 DEBUG [rdr/0] Waiting until rbd mirror is ready in cluster 'dr1'
2024-04-09 16:06:55,809 DEBUG [rdr/0] 'cephrbdmirror/my-rbd-mirror' output='jsonpath={.status.phase}' found in 4.40 seconds
2024-04-09 16:06:55,986 DEBUG [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror condition met
2024-04-09 16:06:56,057 DEBUG [rdr/0] Cluster '{cluster}' rbd mirror status:
2024-04-09 16:06:56,057 DEBUG [rdr/0]   observedGeneration: 1
2024-04-09 16:06:56,057 DEBUG [rdr/0]   phase: Ready
2024-04-09 16:06:56,057 DEBUG [rdr/0]
Waiting for cephrbdmirror .status.phase for cluster dr2 times out after 300 seconds;
this resource is never reconciled:
2024-04-09 16:06:56,057 DEBUG [rdr/0] Waiting until rbd mirror is ready in cluster 'dr2'
2024-04-09 16:11:56,172 ERROR Command failed
Traceback (most recent call last):
  ...
drenv.commands.Error: Command failed:
  command: ('addons/rbd-mirror/start', 'dr1', 'dr2')
  exitcode: 1
  error:
    Traceback (most recent call last):
      ...
    RuntimeError: Timeout waiting for 'cephrbdmirror/my-rbd-mirror' output='jsonpath={.status.phase}' namespace='rook-ceph' profile='dr2' timeout=300
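For reference, the wait that times out here polls a jsonpath expression until it returns something. A simplified sketch of that kind of loop is below; this is not drenv's actual code, and the function name and the use of a kubeconfig context (instead of a drenv profile) are illustrative only:

import subprocess
import time

def wait_for_output(resource, jsonpath, namespace, context, timeout=300, interval=5):
    """Poll kubectl until the jsonpath expression returns a non-empty value."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["kubectl", "get", resource,
             "--namespace", namespace,
             "--context", context,
             "--output", f"jsonpath={jsonpath}"],
            capture_output=True, text=True,
        ).stdout.strip()
        if out:
            return out
        time.sleep(interval)
    raise RuntimeError(
        f"Timeout waiting for {resource!r} output='jsonpath={jsonpath}' "
        f"namespace={namespace!r} timeout={timeout}"
    )

# On the failing cluster the cephrbdmirror is never reconciled, so
# .status.phase stays empty and this can only time out:
# wait_for_output("cephrbdmirror/my-rbd-mirror", "{.status.phase}",
#                 namespace="rook-ceph", context="dr2")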
Updated by Patrick Donnelly 12 days ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 56997
Updated by Patrick Donnelly about 2 hours ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 1 hour ago
- Copied to Backport #65730: squid: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" added
Updated by Backport Bot about 1 hour ago
- Copied to Backport #65731: reef: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" added