Bug #65487


rbd-mirror daemon in ERROR state, requires manual restart

Added by Nir Soffer 16 days ago. Updated about 15 hours ago.

Status: Fix Under Review
Priority: High
Assignee: Ilya Dryomov
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,reef,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 57082
Crash signature (v1):
Crash signature (v2):

Description

We experience a random error in the rbd-mirror daemon, occurring 1-2 times per 100 deployments.

When it happens, after adding a cephrbdmirror resource, watching the cephblockpool
.status.mirroringStatus.summary shows daemon_health: ERROR forever (we waited a few hours).
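
For reference, this is roughly how we watch the status (a sketch; it assumes the CephBlockPool is named replicapool in the rook-ceph namespace, matching the pool below):

$ kubectl get cephblockpool replicapool -n rook-ceph --context dr2 \
    --watch -o jsonpath='{.status.mirroringStatus.summary}{"\n"}'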

$ kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 0 total

DAEMONS
service 4361:
  instance_id: 4408
  client_id: a
  hostname: dr2
  version: 18.2.2
  leader: true
  health: ERROR
  callouts: unable to connect to remote cluster

In rbd-mirror logs we see:

8287-356f-4f81-87dc-51bb05942553.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin

debug 2024-04-07T05:18:11.585+0000 7fc86d4808c0  0 rbd::mirror::PoolReplayer: 0x5589c90dc000
init_rados: reverting global config option override: mon_host:
[v2:192.168.122.98:3300,v1:192.168.122.98:6789] ->

unable to get monitor info from DNS SRV with service name: ceph-mon

debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 failed for service _ceph-mon._tcp

debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 monclient: get_monmap_and_config cannot
identify monitors to contact

While rbd mirror is not functional, ceph status does not report any issue:

  cluster:
    id:     8e054339-dedf-4ea0-8936-682ccfd323ca
    health: HEALTH_OK

  services:
    mon:        1 daemons, quorum a (age 8h)
    mgr:        a(active, since 8h)
    osd:        1 osds: 1 up (since 8h), 1 in (since 8h)
    rbd-mirror: 1 daemon active (1 hosts)

  data:
    pools:   2 pools, 64 pgs
    objects: 7 objects, 463 KiB
    usage:   31 MiB used, 50 GiB / 50 GiB avail
    pgs:     64 active+clean

  io:
    client:   1019 B/s rd, 84 B/s wr, 1 op/s rd, 0 op/s wr
Looks like we have several issues:
  • ceph status does not report the actual status
  • rbd-mirror is not handling errors correctly. It should retry failing operations and recover, or, if it cannot recover, terminate with an error.

Workaround

Restarting rbd-mirror daemon (e.g. kubectl rollout restart) fixes the issue.
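
For example (using the deployment name from this setup, shown in note #5; adjust per cluster):

$ kubectl rollout restart deploy/rook-ceph-rbd-mirror-a -n rook-ceph --context dr2
$ kubectl rollout status deploy/rook-ceph-rbd-mirror-a -n rook-ceph --context dr2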

See also

- Ramen upstream issue: https://github.com/RamenDR/ramen/issues/1332


Files

rook-ceph-rbd-mirror-a-696c4594fd-tsdnv.log (2.97 KB) - rbd-mirror-a pod log - Nir Soffer, 04/15/2024 01:21 PM
rbd-mirror.log.gz (55.9 KB) - example output from kubectl logs - Nir Soffer, 04/16/2024 01:03 PM
rbd-mirror-logs.tar.gz (67.5 KB) - rbd mirror logs from both clusters - Nir Soffer, 04/20/2024 03:05 PM
Actions #1

Updated by Ilya Dryomov 16 days ago

  • Status changed from New to Need More Info
  • Assignee set to Ilya Dryomov

Hi Nir,

rbd-mirror daemon states that it was unable to connect to the remote cluster. Could it be some kind of pod networking issue that goes away upon restart?

There is nothing to go on in the attached log -- those errors aren't necessarily fatal. In an ODF environment, the debug levels are typically bumped to "debug ms = 1", "debug rbd = 15" and "debug rbd mirror = 30", both for the local client and the remote client. Also, rbd-mirror daemon is configured to log to actual files with "log file = /var/log/ceph/$cluster-$name.log" and we are supposed to get two files: ceph-client.rbd-mirror.a.log and e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log. Did that not happen here?

Actions #2

Updated by Nir Soffer 16 days ago

Tested with:

  • image: quay.io/ceph/ceph:v18
  • imageID: quay.io/ceph/ceph@sha256:06ddc3ef5b66f2dcc6d16e41842d33a3d7f497849981b0842672ef9014a96726
Actions #3

Updated by Nir Soffer 16 days ago

Ilya Dryomov wrote in #note-1:

Hi Nir,

rbd-mirror daemon states that it was unable to connect to the remote cluster. Could it be some kind of pod networking issue that goes away upon restart?

It can be, but rbd mirror should fail (and restart) if pod networking is broken, no?

There is nothing to go on in the attached log -- those errors aren't necessarily fatal. In an ODF environment, the debug levels are typically bumped to "debug ms = 1", "debug rbd = 15" and "debug rbd mirror = 30", both for the local client and the remote client.

This is not an ODF environment, this is an upstream rook environment.

You can see the way we deploy and configure here:

What do we have to change to bump the log level as you mention?

Also, rbd-mirror daemon is configured to log to actual files with "log file = /var/log/ceph/$cluster-$name.log" and we are supposed to get two files: ceph-client.rbd-mirror.a.log and e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log. Did that not happen here?

I don't know -- the environment was destroyed, and I don't know if it is configured to create these logs.

I can check if these logs exist on a new deployment.

Actions #4

Updated by Ilya Dryomov 16 days ago

Nir Soffer wrote in #note-3:

It can be, but rbd mirror should fail (and restart) if pod networking is broken, no?

There is code in rbd-mirror for restarting the failed pool replayer, so it's supposed to do that at least in some cases. There is no way to tell whether that code kicked in here or not without verbose logs.

rbd-mirror daemon will never restart itself, so if this is indeed a pod networking issue that disappears when the pod is restarted, then no matter what rbd-mirror is doing internally, it's never going to be able to resolve it the way "kubectl rollout restart" does.

This is not an ODF environment, this is an upstream rook environment.

You can see the way we deploy and configure here:

What do we have to change to bump the log level as you mention?

It's automated for ODF here:

https://github.com/red-hat-storage/ocs-operator/blob/4a0325d824a409e84fac21ffbf0a1338971d1a70/controllers/storagecluster/cephrbdmirrors.go#L142-L232

You would need to do something close to that.

Actions #5

Updated by Nir Soffer 16 days ago

Ilya Dryomov wrote in #note-4:

This is not an ODF environment, this is an upstream rook environment.

You can see the way we deploy and configure here:

What do we have to change to bump the log level as you mention?

It's automated for ODF here:

https://github.com/red-hat-storage/ocs-operator/blob/4a0325d824a409e84fac21ffbf0a1338971d1a70/controllers/storagecluster/cephrbdmirrors.go#L142-L232

You would need to do something close to that.

Thanks, we will improve logging and reproduce again.

The current setup looks like this:

$ kubectl describe deploy rook-ceph-rbd-mirror-a -n rook-ceph --context dr1
Name:                   rook-ceph-rbd-mirror-a
Namespace:              rook-ceph
CreationTimestamp:      Tue, 16 Apr 2024 00:22:11 +0300
Labels:                 app=rook-ceph-rbd-mirror
                        app.kubernetes.io/component=cephrbdmirrors.ceph.rook.io
                        app.kubernetes.io/created-by=rook-ceph-operator
                        app.kubernetes.io/instance=a
                        app.kubernetes.io/managed-by=rook-ceph-operator
                        app.kubernetes.io/name=ceph-rbd-mirror
                        app.kubernetes.io/part-of=my-rbd-mirror
                        ceph-version=18.2.2-0
                        ceph_daemon_id=a
                        ceph_daemon_type=rbd-mirror
                        rbd-mirror=a
                        rook-version=v1.13.8
                        rook.io/operator-namespace=rook-ceph
                        rook_cluster=rook-ceph
Annotations:            banzaicloud.com/last-applied:
                          UEsDBBQACAAIAAAAAAAAAAAAAAAAAAAAAAAIAAAAb3JpZ2luYWzsWNtu2zgTfpUfxH+xC4jyoWmaCPBF6rhbo/VhnXQXiyIwaGpkc02R6pByYgR+98VIimM7ttNDLhaL3gQy5zzzzQ...
                        deployment.kubernetes.io/revision: 1
Selector:               app=rook-ceph-rbd-mirror,app.kubernetes.io/component=cephrbdmirrors.ceph.rook.io,app.kubernetes.io/created-by=rook-ceph-operator,app.kubernetes.io/instance=a,app.kubernetes.io/managed-by=rook-ceph-operator,app.kubernetes.io/name=ceph-rbd-mirror,app.kubernetes.io/part-of=my-rbd-mirror,ceph_daemon_id=a,ceph_daemon_type=rbd-mirror,rbd-mirror=a,rook.io/operator-namespace=rook-ceph,rook_cluster=rook-ceph
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=rook-ceph-rbd-mirror
           app.kubernetes.io/component=cephrbdmirrors.ceph.rook.io
           app.kubernetes.io/created-by=rook-ceph-operator
           app.kubernetes.io/instance=a
           app.kubernetes.io/managed-by=rook-ceph-operator
           app.kubernetes.io/name=ceph-rbd-mirror
           app.kubernetes.io/part-of=my-rbd-mirror
           ceph_daemon_id=a
           ceph_daemon_type=rbd-mirror
           rbd-mirror=a
           rook.io/operator-namespace=rook-ceph
           rook_cluster=rook-ceph
  Init Containers:
   chown-container-data-dir:
    Image:      quay.io/ceph/ceph:v18
    Port:       <none>
    Host Port:  <none>
    Command:
      chown
    Args:
      --verbose
      --recursive
      ceph:ceph
      /var/log/ceph
      /var/lib/ceph/crash
      /run/ceph
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-rbd-mirror-a-keyring (ro)
      /run/ceph from ceph-daemons-sock-dir (rw)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
  Containers:
   rbd-mirror:
    Image:      quay.io/ceph/ceph:v18
    Port:       <none>
    Host Port:  <none>
    Command:
      rbd-mirror
    Args:
      --fsid=ea6c70f1-21d0-4991-829d-bc0c3aad6a2c
      --keyring=/etc/ceph/keyring-store/keyring
      --default-log-to-stderr=true
      --default-err-to-stderr=true
      --default-mon-cluster-log-to-stderr=true
      --default-log-stderr-prefix=debug 
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --setuser=ceph
      --setgroup=ceph
      --foreground
      --name=client.rbd-mirror.a
    Environment:
      CONTAINER_IMAGE:                quay.io/ceph/ceph:v18
      POD_NAME:                        (v1:metadata.name)
      POD_NAMESPACE:                   (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               node allocatable (limits.memory)
      POD_MEMORY_REQUEST:             0 (requests.memory)
      POD_CPU_LIMIT:                  node allocatable (limits.cpu)
      POD_CPU_REQUEST:                0 (requests.cpu)
      CEPH_USE_RANDOM_NONCE:          true
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-rbd-mirror-a-keyring (ro)
      /run/ceph from ceph-daemons-sock-dir (rw)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
  Volumes:
   rook-config-override:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      rook-config-override
    ConfigMapOptional:  <nil>
   rook-ceph-rbd-mirror-a-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-rbd-mirror-a-keyring
    Optional:    false
   ceph-daemons-sock-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /data/rook/exporter
    HostPathType:  DirectoryOrCreate
   rook-ceph-log:
    Type:          HostPath (bare host directory volume)
    Path:          /data/rook/rook-ceph/log
    HostPathType:  
   rook-ceph-crash:
    Type:          HostPath (bare host directory volume)
    Path:          /data/rook/rook-ceph/crash
    HostPathType:  
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   rook-ceph-rbd-mirror-a-7c677bd5bf (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  28m   deployment-controller  Scaled up replica set rook-ceph-rbd-mirror-a-7c677bd5bf to 1
Actions #6

Updated by Nir Soffer 15 days ago

Nir Soffer wrote in #note-5:

https://github.com/red-hat-storage/ocs-operator/blob/4a0325d824a409e84fac21ffbf0a1338971d1a70/controllers/storagecluster/cephrbdmirrors.go#L142-L232

You would need to do something close to that.

Thanks, we will improve logging and reproduce again.

I tried this rook-config-override [1] with this configuration:

kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [client.rbd-mirror.a]
    debug_ms = 1
    debug_rbd = 15
    debug_rbd_mirror = 30
    log_to_file = true
    log_file = /var/log/ceph/$cluster-$name.log
    [client.rbd-mirror-peer]
    debug_ms = 1
    debug_rbd = 15
    debug_rbd_mirror = 30
    log_to_file = true
    log_file = /var/log/ceph/$cluster-$name.log
    [mgr]
    mgr/rbd_support/log_level = debug

I added log_to_file = true (not used in ODF); without it, no log file is created.

With this applied after deploying the rook operator, these logs are created:

$ minikube ssh -p dr1 'ls -lh /data/rook/rook-ceph/log'
total 74M
-rw-r--r-- 1 167 167 74M Apr 16 12:46 ceph-client.rbd-mirror.a.log
-rw-r--r-- 1 167 167 17K Apr 16 01:02 ceph-volume.log

The log file includes log level up to 20:

$ awk '{print $3}' /data/rook/rook-ceph/log/ceph-client.rbd-mirror.a.log | sort -rn | head -1
20

The other log file (e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log) is
not created.

A few questions:
- Is any change needed to get enough info when we reproduce the issue again?
- Why do we need debug_rbd_mirror = 30 when the highest log level is 20? (It is also documented to be 20.)
- Do we really need the log file? It is much easier to work with kubectl logs, and this setup is used for short-lived clusters during CI or manual testing. We don't expect to debug issues that happened more than 1-2 hours ago.

Attaching example of log from:

kubectl logs deploy/rook-ceph-rbd-mirror-a -n rook-ceph --context dr1 > rbd-mirror.log

[1] https://rook.io/docs/rook/latest-release/Storage-Configuration/Advanced/ceph-configuration/#example

Actions #7

Updated by Ilya Dryomov 15 days ago

Nir Soffer wrote in #note-6:

The other log file (e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log) is
not created.

Hi Nir,

The above needs to be done on both clusters. The configuration on cluster A affects logging on cluster B because rbd-mirror daemon is just another client from cluster A's perspective.

A few questions:
- Is any change needed to get enough info when we reproduce the issue again?

Let's obtain both log files and go from there.

- Why do we need debug_rbd_mirror = 30 when the highest log level is 20? (It is also documented to be 20.)

I don't think there is such a thing as the highest log level. rbd-mirror daemon certainly has a few log messages at 30.

Overall, Ceph is pretty lax about log levels and how they are used varies greatly from subsystem to subsystem.

- Do we really need the log file? It is much easier to work with kubectl logs, and this setup is used for short-lived clusters during CI or manual testing. We don't expect to debug issues that happened more than 1-2 hours ago.

Unfortunately, we do. Once you get to rbd-mirror actually syncing an image, you would see that kubectl can sustain only 1-2 minutes' worth of verbose logs.

Actions #8

Updated by Nir Soffer 15 days ago

Ilya Dryomov wrote in #note-7:

Nir Soffer wrote in #note-6:

The other log file (e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log) is
not created.

Hi Nir,

The above needs to be done on both clusters. The configuration on cluster A affects logging on cluster B because rbd-mirror daemon is just another client from cluster A's perspective.

Yes, the configuration is applied to both clusters. If I understand correctly,
the mentioned 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log
file is the ceph-client.rbd-mirror.a.log from the other cluster? Maybe ODF
must-gather is creating this file?

A few questions:
- Is any change needed to get enough info when we reproduce the issue again?

Let's obtain both log files and go from there.

- Why do we need debug_rbd_mirror = 30 when the highest log level is 20? (It is also documented to be 20.)

I don't think there is such a thing as the highest log level. rbd-mirror daemon certainly has a few log messages at 30.

OK, I was confused by the note here saying "Log levels range from 0 to 20"
https://docs.ceph.com/en/latest/rados/configuration/ceph-conf/#runtime-changes

- Do we really need the log file? It is much easier to work with kubectl logs, and this setup is used for short-lived clusters during CI or manual testing. We don't expect to debug issues that happened more than 1-2 hours ago.

Unfortunately, we do. Once you get to rbd-mirror actually syncing an image, you would see that kubectl can sustain only 1-2 minutes' worth of verbose logs.

OK

I'll try to reproduce with this configuration and collect the full log from
both clusters on errors.

Actions #9

Updated by Ilya Dryomov 15 days ago

Nir Soffer wrote in #note-8:

Yes, the configuration is applied to both clusters. If I understand correctly,
the mentioned 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log
file is the ceph-client.rbd-mirror.a.log from the other cluster?

62f28287-356f-4f81-87dc-51bb05942553 was just an example -- I copied that from the log that you provided previously. Since the clusters have presumably been redeployed from scratch, look for any file named <UUID>-client.rbd-mirror-peer.log.

Maybe ODF must-gather is creating this file?

No, it's created by rbd-mirror daemon.
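
On your setup it should appear under the Rook log hostPath used in note #6, e.g. (path and profile copied from there, so adjust per cluster):

$ minikube ssh -p dr1 'ls /data/rook/rook-ceph/log/ | grep client.rbd-mirror-peer'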

Actions #10

Updated by Nir Soffer 15 days ago

Ilya Dryomov wrote in #note-9:

Nir Soffer wrote in #note-8:

Yes, the configuration is applied to both clusters. If I understand correctly,
the mentioned 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log
file is the ceph-client.rbd-mirror.a.log from the other cluster?

62f28287-356f-4f81-87dc-51bb05942553 was just an example -- I copied that from the log that you provided previously. Since the clusters have presumably been redeployed from scratch, look for any file named <UUID>-client.rbd-mirror-peer.log.

Maybe ODF must-gather is creating this file?

No, it's created by rbd-mirror daemon.

I see, but it was not created using the configuration I shared here.

Both clusters are configured, and I ran a full DR flow with failover and relocate.
At every step we wait for the first replication. Maybe something is wrong with this
configuration?

    [client.rbd-mirror-peer]
    debug_ms = 1
    debug_rbd = 15
    debug_rbd_mirror = 30
    log_to_file = true
    log_file = /var/log/ceph/$cluster-$name.log
Actions #11

Updated by Ilya Dryomov 14 days ago

Hi Nir,

I think the problem is the method you used to set these config options. Note that the way it's done in ODF is via "ceph config set" commands, while you did it through a "data: config: " override which simply injects the snippet verbatim into ceph.conf file. Ordinarily there would be almost no difference, but in this case it matters because there are two different clusters in play and their names don't match up, at least as far as rbd-mirror daemon is concerned.
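
For illustration, the "ceph config set" route looks roughly like this (a sketch using the kubectl rook-ceph plugin from earlier in this thread; repeat per option, per client section, and on both clusters):

$ kubectl rook-ceph --context dr1 ceph config set client.rbd-mirror.a debug_ms 1
$ kubectl rook-ceph --context dr1 ceph config set client.rbd-mirror.a debug_rbd 15
$ kubectl rook-ceph --context dr1 ceph config set client.rbd-mirror.a debug_rbd_mirror 30
$ kubectl rook-ceph --context dr1 ceph config set client.rbd-mirror-peer debug_rbd_mirror 30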

Actions #12

Updated by Nir Soffer 14 days ago

Ilya Dryomov wrote in #note-11:

Hi Nir,

I think the problem is the method you used to set these config options. Note that the way it's done in ODF is via "ceph config set" commands, while you did it through a "data: config: " override which simply injects the snippet verbatim into ceph.conf file. Ordinarily there would be almost no difference, but in this case it matters because there are two different clusters in play and their names don't match up, at least as far as rbd-mirror daemon is concerned.

OK, I will try to make the same changes using the same method as ODF.

Actions #13

Updated by Nir Soffer 14 days ago

Nir Soffer wrote in #note-12:

Ilya Dryomov wrote in #note-11:

Hi Nir,

I think the problem is the method you used to set these config options. Note that the way it's done in ODF is via "ceph config set" commands, while you did it through a "data: config: " override which simply injects the snippet verbatim into ceph.conf file. Ordinarily there would be almost no difference, but in this case it matters because there are two different clusters in play and their names don't match up, at least as far as rbd-mirror daemon is concerned.

Changing to dynamic configuration [1], the rbd-mirror-peer log is created:

$ minikube ssh -p dr1 'ls /data/rook/rook-ceph/log'
9e67de2a-c3eb-4a06-a8b5-444bff716362-client.rbd-mirror-peer.log  ceph-client.rbd-mirror.a.log  ceph-volume.log

However, unlike ODF, the log files are not created without log_to_file = true. Maybe ODF runs the container with different arguments?

[1] https://github.com/RamenDR/ramen/pull/1346

Actions #14

Updated by Ilya Dryomov 13 days ago

log_to_file gets set to true by Rook as part of enabling the log collector:

https://github.com/rook/rook/blob/a9fded234575895285779d57763d39086dc497d4/pkg/operator/ceph/config/config.go#L120-L129

The log collector rotates logs. If you don't have it enabled but have verbose logging set up, you would quickly run out of space.
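
For reference, a sketch of enabling it via the CephCluster resource (assuming the CR is named rook-ceph; periodicity and size are illustrative):

$ kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"logCollector":{"enabled":true,"periodicity":"daily","maxLogSize":"500M"}}}'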

Actions #15

Updated by Nir Soffer 13 days ago

Ilya Dryomov wrote in #note-14:

log_to_file gets set to true by Rook as part of enabling the log collector:

https://github.com/rook/rook/blob/a9fded234575895285779d57763d39086dc497d4/pkg/operator/ceph/config/config.go#L120-L129

The log collector rotates logs. If you don't have it enabled but have verbose logging set up, you would quickly run out of space.

Thanks! This should not be a problem in most cases, since the test environment rarely lives more than one day, but it may be one for people keeping a test environment running for many days.

Actions #16

Updated by Nir Soffer 11 days ago

I reproduced the issue again with debug logs.

Tested flow:
- Configure rbd mirroring on both clusters
- Wait for cephrbdmirror to become Ready on both clusters
- Wait for cephblockpool .status.mirroringStatus.summary.daemon_health to become OK on cluster dr1

Wait logs:

2024-04-20 14:03:18,306 DEBUG   [rdr/0] Waiting for mirroring health in cluster 'dr1' (1/1)
2024-04-20 14:03:18,439 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'UNKNOWN', 'health': 'UNKNOWN', 'image_health': 'OK', 'states': {}}
2024-04-20 14:04:14,124 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
2024-04-20 14:05:14,492 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
2024-04-20 14:06:14,989 DEBUG   [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}

The wait timed out after 180 seconds, then logs were gathered from both clusters.
(I disabled restarting the daemon after the timeout).

Reproducing was much harder with debug logs: I ran 257 deployments until it failed
on a VMware VM. On another KVM VM I ran 300 deployments without error.

Attached rbd mirror logs from both clusters.

Actions #17

Updated by Ilya Dryomov 9 days ago

  • Status changed from Need More Info to In Progress

Hi Nir,

Thanks for providing verbose logs. For now, I have all the information I need.

Due to rbd-mirror daemon startup racing with mirror peer bootstrapping in your setup, the first time rbd-mirror daemon attempts to connect to a remote cluster, it immediately fails because it doesn't know the monitor addresses and the key for that cluster at that point:

2024-04-20T11:03:13.333+0000 7f9a9d2eb8c0 10 rbd::mirror::ClusterWatcher:0x55c56c2b3130 resolve_peer_site_config_keys: retrieving config-key: pool_id=2, pool_name=replicapool, peer_uuid=05d512ce-7722-4dd8-875e-c7df6fb49203
2024-04-20T11:03:13.333+0000 7f9a9d2eb8c0  1 -- 192.168.122.100:0/2413595472 --> [v2:192.168.122.100:3300/0,v1:192.168.122.100:6789/0] -- mon_command({"prefix": "config-key get", "key": "rbd/mirror/peer/2/05d512ce-7722-4dd8-875e-c7df6fb49203"} v 0) v1 -- 0x55c56c2e8180 con 0x55c56c28e800
2024-04-20T11:03:13.339+0000 7f9a97679700  1 -- 192.168.122.100:0/2413595472 <== mon.0 v2:192.168.122.100:3300/0 7 ==== mon_command_ack([{"prefix": "config-key get", "key": "rbd/mirror/peer/2/05d512ce-7722-4dd8-875e-c7df6fb49203"}]=-2  v39) v1 ==== 127+0+0 (secure 0 0 0) 0x55c56d07b520 con 0x55c56c28e800
2024-04-20T11:03:13.339+0000 7f9a9d2eb8c0 10 rbd::mirror::ClusterWatcher:0x55c56c2b3130 read_site_name:
2024-04-20T11:03:13.344+0000 7f9a9d2eb8c0  1 -- 192.168.122.100:0/2413595472 --> [v2:192.168.122.100:3300/0,v1:192.168.122.100:6789/0] -- mon_command({"prefix": "config-key get", "key": "rbd/mirror/site_name"} v 0) v1 -- 0x55c56c2e8480 con 0x55c56c28e800
2024-04-20T11:03:13.355+0000 7f9a97679700  1 -- 192.168.122.100:0/2413595472 <== mon.0 v2:192.168.122.100:3300/0 8 ==== mon_command_ack([{"prefix": "config-key get", "key": "rbd/mirror/site_name"}]=-2  v39) v1 ==== 93+0+0 (secure 0 0 0) 0x55c56d07b6c0 con 0x55c56c28e800
2024-04-20T11:03:13.358+0000 7f9a9d2eb8c0 20 rbd::mirror::Mirror: 0x55c56c561c20 update_pool_replayers: enter
2024-04-20T11:03:13.358+0000 7f9a9d2eb8c0 20 rbd::mirror::Mirror: 0x55c56c561c20 update_pool_replayers: starting pool replayer for uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
2024-04-20T11:03:13.358+0000 7f9a9d2eb8c0 10 rbd::mirror::PoolReplayer: 0x55c56d0fa900 init: replaying for uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
2024-04-20T11:03:13.501+0000 7f9a9d2eb8c0  0 rbd::mirror::PoolReplayer: 0x55c56d0fa900 init_rados: reverting global config option override: keyring: /etc/ceph/keyring-store/keyring -> /etc/ceph/2d560a25-76f7-406a-a4b0-b4875c52eaa8.client.rbd-mirror-peer.keyring,/etc/ceph/2d560a25-76f7-406a-a4b0-b4875c52eaa8.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
2024-04-20T11:03:13.501+0000 7f9a9d2eb8c0  0 rbd::mirror::PoolReplayer: 0x55c56d0fa900 init_rados: reverting global config option override: mon_host: [v2:192.168.122.100:3300,v1:192.168.122.100:6789] ->
2024-04-20T11:03:13.514+0000 7f9a9d2eb8c0 -1 rbd::mirror::PoolReplayer: 0x55c56d0fa900 init_rados: error connecting to remote peer uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer: (2) No such file or directory
2024-04-20T11:03:13.514+0000 7f9a9d2eb8c0 20 rbd::mirror::ServiceDaemon: 0x55c56c3d2ab0 add_or_update_callout: pool_id=2, callout_id=0, callout_level=error, text=unable to connect to remote cluster

After 30 seconds, rbd-mirror daemon restarts the failed pool replayer, and since the mirror peer information is available by then, it succeeds:

2024-04-20T11:03:43.593+0000 7f9a9d2eb8c0 10 rbd::mirror::ClusterWatcher:0x55c56c2b3130 resolve_peer_site_config_keys: retrieving config-key: pool_id=2, pool_name=replicapool, peer_uuid=05d512ce-7722-4dd8-875e-c7df6fb49203
2024-04-20T11:03:43.593+0000 7f9a9d2eb8c0  1 -- 192.168.122.100:0/2413595472 --> [v2:192.168.122.100:3300/0,v1:192.168.122.100:6789/0] -- mon_command({"prefix": "config-key get", "key": "rbd/mirror/peer/2/05d512ce-7722-4dd8-875e-c7df6fb49203"} v 0) v1 -- 0x55c56c3e6900 con 0x55c56c28e800
2024-04-20T11:03:43.597+0000 7f9a97679700  1 -- 192.168.122.100:0/2413595472 <== mon.0 v2:192.168.122.100:3300/0 10 ==== mon_command_ack([{"prefix": "config-key get", "key": "rbd/mirror/peer/2/05d512ce-7722-4dd8-875e-c7df6fb49203"}]=0  v41) v1 ==== 127+0+120 (secure 0 0 0) 0x55c56d07b860 con 0x55c56c28e800
2024-04-20T11:03:43.597+0000 7f9a9d2eb8c0 10 rbd::mirror::ClusterWatcher:0x55c56c2b3130 read_site_name:
2024-04-20T11:03:43.600+0000 7f9a9d2eb8c0  1 -- 192.168.122.100:0/2413595472 --> [v2:192.168.122.100:3300/0,v1:192.168.122.100:6789/0] -- mon_command({"prefix": "config-key get", "key": "rbd/mirror/site_name"} v 0) v1 -- 0x55c56c3e6a80 con 0x55c56c28e800
2024-04-20T11:03:43.605+0000 7f9a97679700  1 -- 192.168.122.100:0/2413595472 <== mon.0 v2:192.168.122.100:3300/0 11 ==== mon_command_ack([{"prefix": "config-key get", "key": "rbd/mirror/site_name"}]=-2  v41) v1 ==== 93+0+0 (secure 0 0 0) 0x55c56d07ba00 con 0x55c56c28e800
2024-04-20T11:03:43.606+0000 7f9a9d2eb8c0 20 rbd::mirror::Mirror: 0x55c56c561c20 update_pool_replayers: enter
2024-04-20T11:03:43.606+0000 7f9a9d2eb8c0 20 rbd::mirror::Mirror: 0x55c56c561c20 update_pool_replayers: starting pool replayer for uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
2024-04-20T11:03:43.606+0000 7f9a9d2eb8c0 10 rbd::mirror::PoolReplayer: 0x55c56de05200 init: replaying for uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
2024-04-20T11:03:43.719+0000 7f9a9d2eb8c0  0 rbd::mirror::PoolReplayer: 0x55c56de05200 init_rados: reverting global config option override: keyring: /etc/ceph/keyring-store/keyring -> /etc/ceph/2d560a25-76f7-406a-a4b0-b4875c52eaa8.client.rbd-mirror-peer.keyring,/etc/ceph/2d560a25-76f7-406a-a4b0-b4875c52eaa8.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
2024-04-20T11:03:43.719+0000 7f9a9d2eb8c0  0 rbd::mirror::PoolReplayer: 0x55c56de05200 init_rados: reverting global config option override: mon_host: [v2:192.168.122.100:3300,v1:192.168.122.100:6789] ->
2024-04-20T11:03:43.837+0000 7f9a9d2eb8c0 10 rbd::mirror::PoolReplayer: 0x55c56de05200 init: connected to uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer

However, the "unable to connect to remote cluster" error isn't cleared and you continue to see "daemon_health: ERROR" status due to a bug in the restart code. I think I see what it is, I'll confirm and post a fix in the coming days.

Actions #18

Updated by Ilya Dryomov 7 days ago · Edited

  • Priority changed from Normal to High

Ilya Dryomov wrote in #note-17:

However, the "unable to connect to remote cluster" error isn't cleared and you continue to see "daemon_health: ERROR" status due to a bug in the restart code. I think I see what it is, I'll confirm and post a fix in the coming days.

In addition to the error not getting cleared, I identified another bug which causes pool replayers to not get cleaned up in some cases (this ticket's scenario being one of them).

Actions #19

Updated by Ilya Dryomov 6 days ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to quincy,reef,squid
  • Pull request ID set to 57082
Actions #20

Updated by Ilya Dryomov 6 days ago

Hi Nir,

I built a container based on 18.2.3 (an upcoming release). It would be great if you could try it: podman pull quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3

Disable verbose logging since that made the issue harder to reproduce -- I'm interested in just pass/fail (or rather the number of deployments you are able to clock without restarting the daemon).
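
If it helps, a sketch of pointing an existing Rook cluster at this build (assuming the CephCluster CR is named rook-ceph):

$ kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"cephVersion":{"image":"quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3","allowUnsupported":true}}}'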

Actions #21

Updated by Nir Soffer about 21 hours ago

Ilya Dryomov wrote in #note-20:

Hi Nir,

I built a container based on 18.2.3 (an upcoming release). It would be great if you could try it: podman pull quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3

Disable verbose logging since that made the issue harder to reproduce -- I'm interested in just pass/fail (or rather the number of deployments you are able to clock without restarting the daemon).

Thanks, Ilya! I will try this image later this week.

Actions #22

Updated by Nir Soffer about 15 hours ago

Ilya Dryomov wrote in #note-20:

Hi Nir,

I built a container based on 18.2.3 (an upcoming release). It would be great if you could try it: podman pull quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3

This registry does not have a good enough connection - it takes 7-8 minutes to pull the image
in the lab used for testing:

rook-ceph                             0s                      Normal    Pulled                             Pod/rook-ceph-detect-version-ms4hn                                                 Successfully pulled image "quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3" in 8m40.521s (8m40.521s including waiting)
...
rook-ceph                             0s                      Normal    Pulled                             Pod/rook-ceph-detect-version-kjbrf                                                 Successfully pulled image "quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3" in 7m5.058s (7m5.058s including waiting)

This causes a timeout when creating the ceph cluster. We probably need to set up a local registry in this lab.
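
As a stopgap, pre-pulling the image once and loading it into the minikube nodes should avoid the repeated slow pulls (a sketch for the dr1/dr2 profiles used here):

$ podman pull quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3
$ podman save -o ceph-test.tar quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3
$ minikube image load ceph-test.tar -p dr1
$ minikube image load ceph-test.tar -p dr2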
