Bug #65487


rbd-mirror daemon in ERROR state, requires manual restart

Added by Nir Soffer about 1 month ago. Updated 9 days ago.

Status: Pending Backport
Priority: High
Assignee:
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy, reef, squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We experience a random error in the rbd-mirror daemon, occurring once or twice per 100 deployments.

When it happens, after adding a CephRBDMirror resource, watching the CephBlockPool
.status.mirroringStatus.summary shows daemon_health: ERROR indefinitely (we waited a few hours).
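For reference, the stuck field can be queried from the pool resource directly; a minimal sketch, assuming the rook-ceph namespace and the replicapool pool name from our deployment:

```shell
#!/bin/sh
# Query daemon_health from the CephBlockPool status (namespace and pool
# name are assumptions from our deployment; adjust as needed):
#
#   kubectl --context dr2 -n rook-ceph get cephblockpool replicapool \
#     -o jsonpath='{.status.mirroringStatus.summary.daemon_health}'
#
# The same field extracted from captured status output, so the sketch is
# runnable without a cluster:
status_line="daemon_health: ERROR"
health=$(printf '%s\n' "$status_line" | awk -F': ' '{print $2}')
echo "daemon_health is $health"
```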

$ kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 0 total

DAEMONS
service 4361:
  instance_id: 4408
  client_id: a
  hostname: dr2
  version: 18.2.2
  leader: true
  health: ERROR
  callouts: unable to connect to remote cluster

In rbd-mirror logs we see:

8287-356f-4f81-87dc-51bb05942553.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin

debug 2024-04-07T05:18:11.585+0000 7fc86d4808c0  0 rbd::mirror::PoolReplayer: 0x5589c90dc000
init_rados: reverting global config option override: mon_host:
[v2:192.168.122.98:3300,v1:192.168.122.98:6789] ->

unable to get monitor info from DNS SRV with service name: ceph-mon

debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 failed for service _ceph-mon._tcp

debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 monclient: get_monmap_and_config cannot
identify monitors to contact
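The "failed for service _ceph-mon._tcp" line is the monclient falling back to a DNS SRV lookup: after the mon_host override is reverted (second log line above), mon_host is left empty, and Ceph only attempts SRV resolution when no monitors are configured. For reference (this restates the log, it is not a fix for the underlying bug), a ceph.conf fragment with an explicit mon_host matching the addresses in the log would avoid the SRV path entirely:

```ini
# Hypothetical fragment; addresses taken from the log excerpt above.
[global]
mon_host = [v2:192.168.122.98:3300,v1:192.168.122.98:6789]
```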

While rbd-mirror is not functional, ceph status does not report any issue:

  cluster:
    id:     8e054339-dedf-4ea0-8936-682ccfd323ca
    health: HEALTH_OK

  services:
    mon:        1 daemons, quorum a (age 8h)
    mgr:        a(active, since 8h)
    osd:        1 osds: 1 up (since 8h), 1 in (since 8h)
    rbd-mirror: 1 daemon active (1 hosts)

  data:
    pools:   2 pools, 64 pgs
    objects: 7 objects, 463 KiB
    usage:   31 MiB used, 50 GiB / 50 GiB avail
    pgs:     64 active+clean

  io:
    client:   1019 B/s rd, 84 B/s wr, 1 op/s rd, 0 op/s wr
It looks like we have several issues:
  • ceph status does not report the actual status
  • rbd-mirror does not handle errors correctly. It should retry failing operations and recover, or, if it cannot recover, terminate with an error.
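The recovery behavior we would expect from the daemon can be sketched as a retry-with-backoff loop; a hypothetical illustration (the `retry` and `attempt` helpers are ours, not part of rbd-mirror):

```shell
#!/bin/sh
# Sketch of retry-and-recover: retry a failing operation with a growing
# delay, and give up with an error only after exhausting attempts.
retry() {
    max=$1; shift
    i=1
    while true; do
        "$@" && return 0
        [ "$i" -ge "$max" ] && { echo "giving up after $i attempts" >&2; return 1; }
        sleep "$i"            # back off a little longer each time
        i=$((i + 1))
    done
}

# Example: a fake operation that succeeds on its 3rd invocation,
# using a counter file to keep state between attempts.
cnt_file=$(mktemp)
attempt() {
    n=$(cat "$cnt_file" 2>/dev/null); n=${n:-0}
    n=$((n + 1)); echo "$n" > "$cnt_file"
    [ "$n" -ge 3 ]
}
retry 5 attempt && echo "recovered after $(cat "$cnt_file") attempts"
rm -f "$cnt_file"
```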

Workaround

Restarting the rbd-mirror daemon (e.g. with kubectl rollout restart) fixes the issue.
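Spelled out, the workaround looks roughly like this; the deployment name rook-ceph-rbd-mirror-a is inferred from the attached pod log, and context/namespace are from our setup:

```shell
#!/bin/sh
# Hypothetical restart helper. Commands are printed rather than executed
# so the sketch runs anywhere; drop the echo to apply them for real.
ctx=dr2
ns=rook-ceph
deploy=rook-ceph-rbd-mirror-a
echo kubectl --context "$ctx" -n "$ns" rollout restart "deploy/$deploy"
echo kubectl --context "$ctx" -n "$ns" rollout status "deploy/$deploy"
```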

See also

- Ramen upstream issue: https://github.com/RamenDR/ramen/issues/1332


Files

rook-ceph-rbd-mirror-a-696c4594fd-tsdnv.log (2.97 KB) - rbd-mirror-a pod log. Nir Soffer, 04/15/2024 01:21 PM
rbd-mirror.log.gz (55.9 KB) - example output from kubectl logs. Nir Soffer, 04/16/2024 01:03 PM
rbd-mirror-logs.tar.gz (67.5 KB) - rbd mirror logs from both clusters. Nir Soffer, 04/20/2024 03:05 PM

Related issues (3: 2 open, 1 closed)

Copied to rbd - Backport #65817: squid: rbd-mirror daemon in ERROR state, require manual restart (Resolved, Ilya Dryomov)
Copied to rbd - Backport #65818: quincy: rbd-mirror daemon in ERROR state, require manual restart (In Progress, Ilya Dryomov)
Copied to rbd - Backport #65819: reef: rbd-mirror daemon in ERROR state, require manual restart (In Progress, Ilya Dryomov)
