Bug #65487


rbd-mirror daemon in ERROR state, requires manual restart

Added by Nir Soffer about 1 month ago. Updated 9 days ago.

Status: Pending Backport
Priority: High
Assignee:
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy, reef, squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We experience a random error in the rbd-mirror daemon, occurring once or twice per 100 deployments.

When it happens, after adding a CephRBDMirror resource, watching the CephBlockPool
.status.mirroringStatus.summary shows daemon_health: ERROR indefinitely (we waited a few hours).
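For reference, the stuck field can be queried from the pool resource directly; a minimal sketch, assuming the rook-ceph namespace and the replicapool pool name from our deployment:

```shell
#!/bin/sh
# Query daemon_health from the CephBlockPool status (namespace and pool
# name are assumptions from our deployment; adjust as needed):
#
#   kubectl --context dr2 -n rook-ceph get cephblockpool replicapool \
#     -o jsonpath='{.status.mirroringStatus.summary.daemon_health}'
#
# The same field extracted from captured status output, so the sketch is
# runnable without a cluster:
status_line="daemon_health: ERROR"
health=$(printf '%s\n' "$status_line" | awk -F': ' '{print $2}')
echo "daemon_health is $health"
```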

$ kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 0 total

DAEMONS
service 4361:
  instance_id: 4408
  client_id: a
  hostname: dr2
  version: 18.2.2
  leader: true
  health: ERROR
  callouts: unable to connect to remote cluster

In rbd-mirror logs we see:

8287-356f-4f81-87dc-51bb05942553.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin

debug 2024-04-07T05:18:11.585+0000 7fc86d4808c0  0 rbd::mirror::PoolReplayer: 0x5589c90dc000
init_rados: reverting global config option override: mon_host:
[v2:192.168.122.98:3300,v1:192.168.122.98:6789] ->

unable to get monitor info from DNS SRV with service name: ceph-mon

debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 failed for service _ceph-mon._tcp

debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 monclient: get_monmap_and_config cannot
identify monitors to contact
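The "failed for service _ceph-mon._tcp" line is the monclient falling back to a DNS SRV lookup: after the mon_host override is reverted (second log line above), mon_host is left empty, and Ceph only attempts SRV resolution when no monitors are configured. For reference (this restates the log, it is not a fix for the underlying bug), a ceph.conf fragment with an explicit mon_host matching the addresses in the log would avoid the SRV path entirely:

```ini
# Hypothetical fragment; addresses taken from the log excerpt above.
[global]
mon_host = [v2:192.168.122.98:3300,v1:192.168.122.98:6789]
```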

While rbd-mirror is not functional, ceph status does not report any issue:

  cluster:
    id:     8e054339-dedf-4ea0-8936-682ccfd323ca
    health: HEALTH_OK

  services:
    mon:        1 daemons, quorum a (age 8h)
    mgr:        a(active, since 8h)
    osd:        1 osds: 1 up (since 8h), 1 in (since 8h)
    rbd-mirror: 1 daemon active (1 hosts)

  data:
    pools:   2 pools, 64 pgs
    objects: 7 objects, 463 KiB
    usage:   31 MiB used, 50 GiB / 50 GiB avail
    pgs:     64 active+clean

  io:
    client:   1019 B/s rd, 84 B/s wr, 1 op/s rd, 0 op/s wr
It looks like we have several issues:
  • ceph status does not report the actual status
  • rbd-mirror does not handle errors correctly. It should retry failing operations and recover, or, if it cannot recover, terminate with an error.
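The recovery behavior we would expect from the daemon can be sketched as a retry-with-backoff loop; a hypothetical illustration (the `retry` and `attempt` helpers are ours, not part of rbd-mirror):

```shell
#!/bin/sh
# Sketch of retry-and-recover: retry a failing operation with a growing
# delay, and give up with an error only after exhausting attempts.
retry() {
    max=$1; shift
    i=1
    while true; do
        "$@" && return 0
        [ "$i" -ge "$max" ] && { echo "giving up after $i attempts" >&2; return 1; }
        sleep "$i"            # back off a little longer each time
        i=$((i + 1))
    done
}

# Example: a fake operation that succeeds on its 3rd invocation,
# using a counter file to keep state between attempts.
cnt_file=$(mktemp)
attempt() {
    n=$(cat "$cnt_file" 2>/dev/null); n=${n:-0}
    n=$((n + 1)); echo "$n" > "$cnt_file"
    [ "$n" -ge 3 ]
}
retry 5 attempt && echo "recovered after $(cat "$cnt_file") attempts"
rm -f "$cnt_file"
```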

Workaround

Restarting the rbd-mirror daemon (e.g. with kubectl rollout restart) fixes the issue.
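Spelled out, the workaround looks roughly like this; the deployment name rook-ceph-rbd-mirror-a is inferred from the attached pod log, and context/namespace are from our setup:

```shell
#!/bin/sh
# Hypothetical restart helper. Commands are printed rather than executed
# so the sketch runs anywhere; drop the echo to apply them for real.
ctx=dr2
ns=rook-ceph
deploy=rook-ceph-rbd-mirror-a
echo kubectl --context "$ctx" -n "$ns" rollout restart "deploy/$deploy"
echo kubectl --context "$ctx" -n "$ns" rollout status "deploy/$deploy"
```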

See also

- Ramen upstream issue: https://github.com/RamenDR/ramen/issues/1332


Files

rook-ceph-rbd-mirror-a-696c4594fd-tsdnv.log (2.97 KB) - rbd-mirror-a pod log. Nir Soffer, 04/15/2024 01:21 PM
rbd-mirror.log.gz (55.9 KB) - example output from kubectl logs. Nir Soffer, 04/16/2024 01:03 PM
rbd-mirror-logs.tar.gz (67.5 KB) - rbd mirror logs from both clusters. Nir Soffer, 04/20/2024 03:05 PM

Related issues (3: 2 open, 1 closed)

Copied to rbd - Backport #65817: squid: rbd-mirror daemon in ERROR state, require manual restart (Resolved, Ilya Dryomov)
Copied to rbd - Backport #65818: quincy: rbd-mirror daemon in ERROR state, require manual restart (In Progress, Ilya Dryomov)
Copied to rbd - Backport #65819: reef: rbd-mirror daemon in ERROR state, require manual restart (In Progress, Ilya Dryomov)
