Bug #65494


ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists"

Added by Nir Soffer 15 days ago. Updated about 1 hour ago.

Status: Pending Backport
Priority: Normal
Target version: v20.0.0
% Done: 0%
Source: Community (dev)
Tags: backport_processed
Backport: squid,reef
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 56997
Crash signature (v1):
Crash signature (v2):

Description

We hit a random error (about 1 in 200 deploys): after successfully creating a rook
cephcluster and cephblockpool, configuring rbd mirroring, and adding a cephrbdmirror,
the cephrbdmirror never becomes ready (we waited a few hours).

Looking at ceph status, we see:

  cluster:
    id:     dbf6c8b8-dd8b-4117-933e-93778b1a7274
    health: HEALTH_ERR
            Module 'devicehealth' has failed: table Device already exists

In rook-ceph-mgr-a pod logs we see:

debug 2024-04-09T13:05:48.947+0000 7f0632607700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 524, in check
    return func(self, *args, **kwargs)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 350, in _do_serve
    if self.db_ready() and self.enable_monitoring:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1271, in db_ready
    return self.db is not None
  File "/usr/share/ceph/mgr/mgr_module.py", line 1283, in db
    self._db = self.open_db()
  File "/usr/share/ceph/mgr/mgr_module.py", line 1264, in open_db
    self.configure_db(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1241, in configure_db
    self.load_schema(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1230, in load_schema
    self.maybe_upgrade(db, int(row['value']))
  File "/usr/share/ceph/mgr/mgr_module.py", line 1207, in maybe_upgrade
    db.executescript(self.SCHEMA)
sqlite3.OperationalError: table Device already exists

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 394, in serve
    self._do_serve()
  File "/usr/share/ceph/mgr/mgr_module.py", line 532, in check
    self.open_db();
  File "/usr/share/ceph/mgr/mgr_module.py", line 1264, in open_db
    self.configure_db(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1241, in configure_db
    self.load_schema(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1230, in load_schema
    self.maybe_upgrade(db, int(row['value']))
  File "/usr/share/ceph/mgr/mgr_module.py", line 1207, in maybe_upgrade
    db.executescript(self.SCHEMA)
sqlite3.OperationalError: table Device already exists
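
For reference, the error itself is easy to reproduce with plain sqlite3 in Python. The snippet below is a hypothetical minimal example (not the actual devicehealth schema), showing the error raised when a CREATE TABLE script is re-applied to a database where the table already exists:

import sqlite3

# Hypothetical one-table schema; the real devicehealth SCHEMA contains more tables.
SCHEMA = "CREATE TABLE Device (devid TEXT PRIMARY KEY);"

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)  # first application succeeds
db.executescript(SCHEMA)  # raises sqlite3.OperationalError: table Device already exists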

Restarting the ceph-mgr pod does not help; rbd-mirroring is broken and
we don't have any workaround.

For testing ramen this is not that bad, since we can delete the environment and
recreate it in 10 minutes, but for a real deployment this looks bad.

See also

- Upstream issue: https://github.com/RamenDR/ramen/issues/1298


Files

rook-ceph-mgr-a-7868f9cbdd-2cn62.log.gz (87.2 KB) - rook-ceph-mgr-a-7868f9cbdd-2cn62 log - Nir Soffer, 04/15/2024 07:13 PM

Related issues (2 open, 0 closed)

Copied to cephsqlite - Backport #65730: squid: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" (In Progress, Patrick Donnelly)
Copied to cephsqlite - Backport #65731: reef: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" (In Progress, Patrick Donnelly)
Actions #2

Updated by Yaarit Hatuka 15 days ago

  • Project changed from mgr to cephsqlite

Looks like a sqlite issue. Patrick, can you take a look please?

Actions #3

Updated by Nir Soffer 15 days ago

Tested with:
  • image: quay.io/ceph/ceph:v18
  • imageID: quay.io/ceph/ceph@sha256:8c1697a0a924bbd625c9f1b33893bbc47b97b8a8c66666a715fe60b353b1d93e
Actions #4

Updated by Patrick Donnelly 15 days ago

debug 2024-04-09T13:05:18.368+0000 7f644947d700  0 log_channel(cluster) log [DBG] : pgmap v7: 1 pgs: 1 creating+peering; 0 B data, 26 MiB used, 50 GiB / 50 GiB avail
debug 2024-04-09T13:05:18.654+0000 7f64180bc700  0 [devicehealth INFO root] creating main.db for devicehealth
ignoring --setuser ceph since I am not root
ignoring --setgroup ceph since I am not root
debug 2024-04-09T13:05:18.784+0000 7fd37cd18200  0 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable), process ceph-mgr, pid 1

The mgr restarted while the db was being initialized:

debug 2024-04-09T13:05:48.913+0000 7f0632607700  0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.913+0000 7f0632607700  0 [devicehealth ERROR root] Caught fatal database error: table Device already exists

There is a nuance of sqlite transactions with Python that I was not aware of. I'll write up a fix soon.
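
For context, the likely nuance (hedged, based on the Python docs rather than the Ceph code): sqlite3.Connection.executescript() performs no transaction control of its own, it only commits any pending transaction before running the script, so each DDL statement in a multi-statement schema script commits individually in autocommit mode. If the mgr restarts partway through, the Device table can already exist while the stored schema version was never bumped, and the next startup re-runs the full SCHEMA and hits "table Device already exists". A minimal illustrative sketch of one defensive pattern (not the actual Ceph fix):

import sqlite3

# Simplified, illustrative schema -- not the real devicehealth SCHEMA.
SCHEMA = """
CREATE TABLE Device (devid TEXT PRIMARY KEY);
CREATE TABLE DeviceHealthMetrics (
    time DATETIME DEFAULT (strftime('%s', 'now')),
    devid TEXT NOT NULL REFERENCES Device (devid),
    raw_smart TEXT NOT NULL
);
"""

def apply_schema(db: sqlite3.Connection) -> None:
    # executescript() adds no BEGIN/COMMIT of its own, so wrap the whole script in
    # one explicit transaction; an interrupted run then leaves nothing behind and
    # can simply be retried on the next startup.
    db.executescript("BEGIN;\n" + SCHEMA + "\nCOMMIT;")

db = sqlite3.connect(":memory:")
apply_schema(db)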

Actions #5

Updated by Patrick Donnelly 15 days ago

  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Target version set to v20.0.0
  • Source set to Community (dev)
  • Backport set to squid,reef
Actions #6

Updated by Ilya Dryomov 15 days ago

Nir Soffer wrote:

Restarting the ceph-mgr pod does not help; rbd-mirroring is broken and
we don't have any workaround.

Hi Nir,

I don't think this is related to RBD mirroring. From the manager log, it's clear that MirrorSnapshotScheduleHandler in rbd_support module continued running after that exception was thrown in devicehealth module:

debug 2024-04-09T13:05:48.913+0000 7f0632607700  0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.913+0000 7f0632607700  0 [devicehealth ERROR root] Caught fatal database error: table Device already exists
debug 2024-04-09T13:05:48.947+0000 7f0632607700  0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.947+0000 7f0632607700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.a: table Device already exists

[...]

debug 2024-04-09T13:06:36.965+0000 7f06225e7700  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T13:06:36.966+0000 7f06225e7700  0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T13:06:36.967+0000 7f06225e7700  0 [rbd_support INFO root] load_schedule: 2 [
    {
        "interval": "2m",
        "start_time": "14:00:00-05:00" 
    }
]

[...]

debug 2024-04-09T13:07:36.980+0000 7f06225e7700  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T13:07:36.980+0000 7f06225e7700  0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T13:07:36.985+0000 7f06225e7700  0 [rbd_support INFO root] load_schedule: 2 [
    {
        "interval": "2m",
        "start_time": "14:00:00-05:00" 
    }
]

[...]

debug 2024-04-09T15:41:38.808+0000 7f06225e7700  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T15:41:38.811+0000 7f06225e7700  0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T15:41:38.813+0000 7f06225e7700  0 [rbd_support INFO root] load_schedule: 2 [
    {
        "interval": "2m",
        "start_time": "14:00:00-05:00" 
    }
]

Have you tried restarting rbd-mirror pod like you did in https://tracker.ceph.com/issues/65487? Could the RBD mirroring part of this ticket be a duplicate of that?

Actions #7

Updated by Nir Soffer 15 days ago

Ilya Dryomov wrote in #note-6:

Nir Soffer wrote:

Restarting the ceph-mgr pod does not help; rbd-mirroring is broken and
we don't have any workaround.

Hi Nir,

I don't think this is related to RBD mirroring. From the manager log, it's clear that MirrorSnapshotScheduleHandler in rbd_support module continued running after that exception was thrown in devicehealth module:

[...]

Have you tried restarting rbd-mirror pod like you did in https://tracker.ceph.com/issues/65487? Could the RBD mirroring part of this ticket be a duplicate of that?

I probably tried to restart rbd-mirror, but this is not the same issue. In this case the
cephrbdmirror never becomes Ready - it never reports any status.

In https://tracker.ceph.com/issues/65487 the cephrbdmirror becomes Ready and everything
looks normal, but the cephblockpool .status.mirroringStatus.summary.daemon_health never
becomes OK.

In our test log you can see:

cephcluster is created successfully:

2024-04-09 16:02:40,569 DEBUG   [dr2/0] Deploying rook ceph cluster
2024-04-09 16:02:41,236 DEBUG   [dr2/0] cephblockpool.ceph.rook.io/builtin-mgr created
2024-04-09 16:02:41,250 DEBUG   [dr2/0] cephcluster.ceph.rook.io/my-cluster created
2024-04-09 16:02:41,253 DEBUG   [dr2/0] Waiting until rook ceph cluster is ready
2024-04-09 16:02:41,318 DEBUG   [dr2/0] 'cephcluster/my-cluster' output='jsonpath={.status.phase}' found in 0.06 seconds
2024-04-09 16:05:07,151 DEBUG   [dr2/0] cephcluster.ceph.rook.io/my-cluster condition met
2024-04-09 16:05:07,234 DEBUG   [dr2/0] ceph cluster status:
2024-04-09 16:05:07,234 DEBUG   [dr2/0]   conditions:
2024-04-09 16:05:07,234 DEBUG   [dr2/0]   - lastHeartbeatTime: '2024-04-09T13:05:07Z'
2024-04-09 16:05:07,234 DEBUG   [dr2/0]     lastTransitionTime: '2024-04-09T13:05:07Z'
2024-04-09 16:05:07,234 DEBUG   [dr2/0]     message: Cluster created successfully
2024-04-09 16:05:07,234 DEBUG   [dr2/0]     reason: ClusterCreated
2024-04-09 16:05:07,234 DEBUG   [dr2/0]     status: 'True'
2024-04-09 16:05:07,234 DEBUG   [dr2/0]     type: Ready
2024-04-09 16:05:07,235 DEBUG   [dr2/0]   message: Cluster created successfully
2024-04-09 16:05:07,235 DEBUG   [dr2/0]   observedGeneration: 2
2024-04-09 16:05:07,235 DEBUG   [dr2/0]   phase: Ready
2024-04-09 16:05:07,235 DEBUG   [dr2/0]   state: Created
2024-04-09 16:05:07,235 DEBUG   [dr2/0]   storage:
2024-04-09 16:05:07,235 DEBUG   [dr2/0]     osd:
2024-04-09 16:05:07,235 DEBUG   [dr2/0]       storeType:
2024-04-09 16:05:07,235 DEBUG   [dr2/0]         bluestore: 1
2024-04-09 16:05:07,235 DEBUG   [dr2/0]   version:
2024-04-09 16:05:07,235 DEBUG   [dr2/0]     image: quay.io/ceph/ceph:v18
2024-04-09 16:05:07,235 DEBUG   [dr2/0]     version: 18.2.2-0

ceph status after cluster is created:

2024-04-09 16:05:10,870 DEBUG   [dr2/0] deployment "rook-ceph-tools" successfully rolled out
2024-04-09 16:05:10,872 DEBUG   [dr2/0] ceph status:
2024-04-09 16:05:11,683 DEBUG   [dr2/0]   cluster:
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     id:     2aeb5e7d-352f-4227-9c99-6d2d422751f9
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     health: HEALTH_OK
2024-04-09 16:05:11,684 DEBUG   [dr2/0] 
2024-04-09 16:05:11,684 DEBUG   [dr2/0]   services:
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     mon: 1 daemons, quorum a (age 42s)
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     mgr: a(active, since 3s)
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     osd: 1 osds: 0 up, 1 in (since 10s)
2024-04-09 16:05:11,684 DEBUG   [dr2/0] 
2024-04-09 16:05:11,684 DEBUG   [dr2/0]   data:
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     pools:   0 pools, 0 pgs
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     objects: 0 objects, 0 B
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     usage:   0 B used, 0 B / 0 B avail
2024-04-09 16:05:11,684 DEBUG   [dr2/0]     pgs:

cephblockpool created successfully:

2024-04-09 16:05:11,827 DEBUG   [dr2/0] Creating rbd pool and storage class
2024-04-09 16:05:12,020 DEBUG   [dr2/0] cephblockpool.ceph.rook.io/replicapool created
2024-04-09 16:05:12,033 DEBUG   [dr2/0] storageclass.storage.k8s.io/rook-ceph-block created
2024-04-09 16:05:12,036 DEBUG   [dr2/0] Waiting until ceph block pool is ready
2024-04-09 16:05:16,353 DEBUG   [dr2/0] 'cephblockpool/replicapool' output='jsonpath={.status.phase}' found in 4.32 seconds
2024-04-09 16:05:41,886 DEBUG   [dr2/0] cephblockpool.ceph.rook.io/replicapool condition met
2024-04-09 16:05:41,888 DEBUG   [dr2/0] Waiting for replica pool peer token
2024-04-09 16:05:42,048 DEBUG   [dr2/0] cephblockpool.ceph.rook.io/replicapool condition met
2024-04-09 16:05:42,127 DEBUG   [dr2/0] ceph pool status:
2024-04-09 16:05:42,127 DEBUG   [dr2/0]   info:
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     rbdMirrorBootstrapPeerSecretName: pool-peer-token-replicapool
2024-04-09 16:05:42,127 DEBUG   [dr2/0]   mirroringInfo:
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     mode: image
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     site_name: 2aeb5e7d-352f-4227-9c99-6d2d422751f9
2024-04-09 16:05:42,127 DEBUG   [dr2/0]   mirroringStatus:
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     summary:
2024-04-09 16:05:42,127 DEBUG   [dr2/0]       daemon_health: UNKNOWN
2024-04-09 16:05:42,127 DEBUG   [dr2/0]       health: UNKNOWN
2024-04-09 16:05:42,127 DEBUG   [dr2/0]       image_health: OK
2024-04-09 16:05:42,127 DEBUG   [dr2/0]       states: {}
2024-04-09 16:05:42,127 DEBUG   [dr2/0]   observedGeneration: 1
2024-04-09 16:05:42,127 DEBUG   [dr2/0]   phase: Ready
2024-04-09 16:05:42,127 DEBUG   [dr2/0]   snapshotScheduleStatus:
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     snapshotSchedules:
2024-04-09 16:05:42,127 DEBUG   [dr2/0]     - image: '-'
2024-04-09 16:05:42,127 DEBUG   [dr2/0]       items:
2024-04-09 16:05:42,127 DEBUG   [dr2/0]       - interval: 2m
2024-04-09 16:05:42,127 DEBUG   [dr2/0]         start_time: 14:00:00-05:00
2024-04-09 16:05:42,127 DEBUG   [dr2/0]       namespace: '-'
2024-04-09 16:05:42,127 DEBUG   [dr2/0]       pool: replicapool

Configuring rbd mirroring on both clusters:

2024-04-09 16:06:50,002 DEBUG   [rdr/0] Getting mirroring info site name for cluster 'dr1'
2024-04-09 16:06:50,084 DEBUG   [rdr/0] 'cephblockpools.ceph.rook.io/replicapool' output='jsonpath={.status.mirroringInfo.site_name}' found in 0.08 seconds
2024-04-09 16:06:50,084 DEBUG   [rdr/0] Getting rbd mirror boostrap peer secret name for cluster 'dr1'
2024-04-09 16:06:50,183 DEBUG   [rdr/0] Getting secret pool-peer-token-replicapool token for cluster 'dr1'
2024-04-09 16:06:50,253 DEBUG   [rdr/0] Getting mirroring info site name for cluster 'dr2'
2024-04-09 16:06:50,369 DEBUG   [rdr/0] 'cephblockpools.ceph.rook.io/replicapool' output='jsonpath={.status.mirroringInfo.site_name}' found in 0.12 seconds
2024-04-09 16:06:50,369 DEBUG   [rdr/0] Getting rbd mirror boostrap peer secret name for cluster 'dr2'
2024-04-09 16:06:50,454 DEBUG   [rdr/0] Getting secret pool-peer-token-replicapool token for cluster 'dr2'
2024-04-09 16:06:50,547 DEBUG   [rdr/0] Setting up mirroring from 'dr2' to 'dr1'
2024-04-09 16:06:50,547 DEBUG   [rdr/0] Applying rbd mirror secret in cluster 'dr1'
2024-04-09 16:06:50,708 DEBUG   [rdr/0] secret/2aeb5e7d-352f-4227-9c99-6d2d422751f9 created
2024-04-09 16:06:50,711 DEBUG   [rdr/0] Configure peers for cluster 'dr1'
2024-04-09 16:06:50,842 DEBUG   [rdr/0] cephblockpool.ceph.rook.io/replicapool patched
2024-04-09 16:06:50,847 DEBUG   [rdr/0] Apply rbd mirror to cluster 'dr1'
2024-04-09 16:06:50,984 DEBUG   [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror created
2024-04-09 16:06:50,990 DEBUG   [rdr/0] Setting up mirroring from 'dr1' to 'dr2'
2024-04-09 16:06:50,990 DEBUG   [rdr/0] Applying rbd mirror secret in cluster 'dr2'
2024-04-09 16:06:51,146 DEBUG   [rdr/0] secret/c5123696-8638-4c9e-aaf4-2a59333bf389 created
2024-04-09 16:06:51,152 DEBUG   [rdr/0] Configure peers for cluster 'dr2'
2024-04-09 16:06:51,241 DEBUG   [rdr/0] cephblockpool.ceph.rook.io/replicapool patched
2024-04-09 16:06:51,244 DEBUG   [rdr/0] Apply rbd mirror to cluster 'dr2'
2024-04-09 16:06:51,399 DEBUG   [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror created

Waiting for cephrbdmirror for cluster dr1 succeeded in 5 seconds:

2024-04-09 16:06:51,406 DEBUG   [rdr/0] Waiting until rbd mirror is ready in cluster 'dr1'
2024-04-09 16:06:55,809 DEBUG   [rdr/0] 'cephrbdmirror/my-rbd-mirror' output='jsonpath={.status.phase}' found in 4.40 seconds
2024-04-09 16:06:55,986 DEBUG   [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror condition met
2024-04-09 16:06:56,057 DEBUG   [rdr/0] Cluster '{cluster}' rbd mirror status:
2024-04-09 16:06:56,057 DEBUG   [rdr/0]   observedGeneration: 1
2024-04-09 16:06:56,057 DEBUG   [rdr/0]   phase: Ready
2024-04-09 16:06:56,057 DEBUG   [rdr/0] 

Waiting for cephrbdmirror .status.phase for cluster dr2 times out after 300 seconds;
this resource is never reconciled:

2024-04-09 16:06:56,057 DEBUG   [rdr/0] Waiting until rbd mirror is ready in cluster 'dr2'
2024-04-09 16:11:56,172 ERROR   Command failed
Traceback (most recent call last):
  ...
drenv.commands.Error: Command failed:
   command: ('addons/rbd-mirror/start', 'dr1', 'dr2')
   exitcode: 1
   error:
      Traceback (most recent call last):
        ...
      RuntimeError: Timeout waiting for 'cephrbdmirror/my-rbd-mirror' output='jsonpath={.status.phase}' namespace='rook-ceph' profile='dr2' timeout=300
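
For reference, the wait that times out is essentially a poll of .status.phase on the CR. A rough, hypothetical equivalent of what the test harness does (not the actual drenv code):

import subprocess
import time

def wait_for_phase(resource, namespace="rook-ceph", timeout=300, interval=5):
    # Poll `kubectl get -o jsonpath={.status.phase}` until the field is populated.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        phase = subprocess.run(
            ["kubectl", "get", resource, "-n", namespace,
             "-o", "jsonpath={.status.phase}"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if phase:
            return phase  # dr1 reports "Ready" within a few seconds
        time.sleep(interval)
    raise RuntimeError(f"Timeout waiting for {resource} .status.phase")

# On dr2 the operator never reconciles the CR, so .status.phase stays empty and
# this times out after 300 seconds.
wait_for_phase("cephrbdmirror/my-rbd-mirror")
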
Actions #8

Updated by Patrick Donnelly 12 days ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 56997
Actions #9

Updated by Patrick Donnelly about 2 hours ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by Backport Bot about 1 hour ago

  • Copied to Backport #65730: squid: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" added
Actions #11

Updated by Backport Bot about 1 hour ago

  • Copied to Backport #65731: reef: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" added
Actions #12

Updated by Backport Bot about 1 hour ago

  • Tags set to backport_processed