Bug #65487
rbd-mirror daemon in ERROR state, requires manual restart
Description¶
We experience a random error in the rbd-mirror daemon, occurring 1-2 times per 100 deployments.
When it happens, after adding a cephrbdmirror resource, watching cephblockpool
.status.mirroringStatus.summary shows daemon_health: ERROR forever (we waited a few hours).
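The status can be checked with something like this (an illustrative invocation; pool name and context are from the output below, the namespace is the one used later in this thread):

$ kubectl get cephblockpool replicapool -n rook-ceph --context dr2 \
    -o jsonpath='{.status.mirroringStatus.summary}'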
$ kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 0 total

DAEMONS
service 4361:
  instance_id: 4408
  client_id: a
  hostname: dr2
  version: 18.2.2
  leader: true
  health: ERROR
  callouts: unable to connect to remote cluster
In the rbd-mirror logs we see:
8287-356f-4f81-87dc-51bb05942553.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
debug 2024-04-07T05:18:11.585+0000 7fc86d4808c0 0 rbd::mirror::PoolReplayer: 0x5589c90dc000 init_rados: reverting global config option override: mon_host: [v2:192.168.122.98:3300,v1:192.168.122.98:6789] ->
unable to get monitor info from DNS SRV with service name: ceph-mon
debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 failed for service _ceph-mon._tcp
debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
While rbd-mirror is not functional, ceph status does not report any issue:
cluster:
  id:     8e054339-dedf-4ea0-8936-682ccfd323ca
  health: HEALTH_OK

services:
  mon: 1 daemons, quorum a (age 8h)
  mgr: a(active, since 8h)
  osd: 1 osds: 1 up (since 8h), 1 in (since 8h)
  rbd-mirror: 1 daemon active (1 hosts)

data:
  pools:   2 pools, 64 pgs
  objects: 7 objects, 463 KiB
  usage:   31 MiB used, 50 GiB / 50 GiB avail
  pgs:     64 active+clean

io:
  client: 1019 B/s rd, 84 B/s wr, 1 op/s rd, 0 op/s wr

Looks like we have several issues:
- ceph status does not report actual status
- rbd-mirror is not handling errors correctly. It should retry failing operations and recover, or, if it cannot recover, terminate with an error.
Workaround¶
Restarting the rbd-mirror daemon (e.g. kubectl rollout restart) fixes the issue.
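For example, with the deployment Rook creates for the cephrbdmirror resource (name as shown in the kubectl describe output later in this thread):

$ kubectl rollout restart deploy/rook-ceph-rbd-mirror-a -n rook-ceph --context dr2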
See also¶
- Ramen upstream issue: https://github.com/RamenDR/ramen/issues/1332
Updated by Ilya Dryomov 16 days ago
- Status changed from New to Need More Info
- Assignee set to Ilya Dryomov
Hi Nir,
rbd-mirror daemon states that it was unable to connect to the remote cluster. Could it be some kind of pod networking issue that goes away upon restart?
There is nothing to go on in the attached log -- those errors aren't necessarily fatal. In an ODF environment, the debug levels are typically bumped to "debug ms = 1", "debug rbd = 15" and "debug rbd mirror = 30", both for the local client and the remote client. Also, rbd-mirror daemon is configured to log to actual files with "log file = /var/log/ceph/$cluster-$name.log" and we are supposed to get two files: ceph-client.rbd-mirror.a.log and e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log. Did that not happen here?
Updated by Nir Soffer 16 days ago
Tested with:
- image: quay.io/ceph/ceph:v18
- imageID: quay.io/ceph/ceph@sha256:06ddc3ef5b66f2dcc6d16e41842d33a3d7f497849981b0842672ef9014a96726
Updated by Nir Soffer 16 days ago
Ilya Dryomov wrote in #note-1:
Hi Nir,
rbd-mirror daemon states that it was unable to connect to the remote cluster. Could it be some kind of pod networking issue that goes away upon restart?
It can be, but rbd mirror should fail (and restart) if pod networking is broken, no?
There is nothing to go on in the attached log -- those errors aren't necessarily fatal. In an ODF environment, the debug levels are typically bumped to "debug ms = 1", "debug rbd = 15" and "debug rbd mirror = 30", both for the local client and the remote client.
This is not an ODF environment, this is an upstream Rook environment.
You can see the way we deploy and configure here:
- deploying rook operator: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-operator
- creating ceph cluster: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-cluster
- creating ceph block pool: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-pool
What do we have to change to bump the log level as you mention?
Also, rbd-mirror daemon is configured to log to actual files with "log file = /var/log/ceph/$cluster-$name.log" and we are supposed to get two files: ceph-client.rbd-mirror.a.log and e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log. Did that not happen here?
I don't know -- the environment was destroyed, and I don't know if it is configured to create these logs.
I can check if these logs exist on a new deployment.
Updated by Ilya Dryomov 16 days ago
Nir Soffer wrote in #note-3:
It can be, but rbd mirror should fail (and restart) if pod networking is broken, no?
There is code in rbd-mirror for restarting the failed pool replayer, so it's supposed to do that at least in some cases. There is no way to tell whether that code kicked in here or not without verbose logs.
rbd-mirror daemon will never restart itself, so if this is indeed a pod networking issue that disappears when the pod is restarted, no matter what rbd-mirror is doing internally it's never going to be able to resolve it like "kubectl rollout restart" does.
This is not an ODF environment, this is an upstream Rook environment.
You can see the way we deploy and configure here:
- deploying rook operator: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-operator
- creating ceph cluster: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-cluster
- creating ceph block pool: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-pool
What do we have to change to bump the log level as you mention?
It's automated for ODF here:
You would need to do something close to that.
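For reference, a sketch of roughly equivalent commands using "ceph config set" (the method described in #note-11 below), with the option names from #note-1 -- run against both clusters:

# local client and remote (peer) client of the rbd-mirror daemon
ceph config set client.rbd-mirror.a debug_ms 1
ceph config set client.rbd-mirror.a debug_rbd 15
ceph config set client.rbd-mirror.a debug_rbd_mirror 30
ceph config set client.rbd-mirror-peer debug_ms 1
ceph config set client.rbd-mirror-peer debug_rbd 15
ceph config set client.rbd-mirror-peer debug_rbd_mirror 30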
Updated by Nir Soffer 16 days ago
Ilya Dryomov wrote in #note-4:
This is not an ODF environment, this is an upstream Rook environment.
You can see the way we deploy and configure here:
- deploying rook operator: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-operator
- creating ceph cluster: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-cluster
- creating ceph block pool: https://github.com/RamenDR/ramen/tree/main/test/addons/rook-pool
What do we have to change to bump the log level as you mention?
It's automated for ODF here:
You would need to do something close to that.
Thanks, we will improve logging and reproduce again.
The current setup looks like this:
$ kubectl describe deploy rook-ceph-rbd-mirror-a -n rook-ceph --context dr1
Name:               rook-ceph-rbd-mirror-a
Namespace:          rook-ceph
CreationTimestamp:  Tue, 16 Apr 2024 00:22:11 +0300
Labels:             app=rook-ceph-rbd-mirror
                    app.kubernetes.io/component=cephrbdmirrors.ceph.rook.io
                    app.kubernetes.io/created-by=rook-ceph-operator
                    app.kubernetes.io/instance=a
                    app.kubernetes.io/managed-by=rook-ceph-operator
                    app.kubernetes.io/name=ceph-rbd-mirror
                    app.kubernetes.io/part-of=my-rbd-mirror
                    ceph-version=18.2.2-0
                    ceph_daemon_id=a
                    ceph_daemon_type=rbd-mirror
                    rbd-mirror=a
                    rook-version=v1.13.8
                    rook.io/operator-namespace=rook-ceph
                    rook_cluster=rook-ceph
Annotations:        banzaicloud.com/last-applied: UEsDBBQACAAIAAAAAAAAAAAAAAAAAAAAAAAIAAAAb3JpZ2luYWzsWNtu2zgTfpUfxH+xC4jyoWmaCPBF6rhbo/VhnXQXiyIwaGpkc02R6pByYgR+98VIimM7ttNDLhaL3gQy5zzzzQ...
                    deployment.kubernetes.io/revision: 1
Selector:           app=rook-ceph-rbd-mirror,app.kubernetes.io/component=cephrbdmirrors.ceph.rook.io,app.kubernetes.io/created-by=rook-ceph-operator,app.kubernetes.io/instance=a,app.kubernetes.io/managed-by=rook-ceph-operator,app.kubernetes.io/name=ceph-rbd-mirror,app.kubernetes.io/part-of=my-rbd-mirror,ceph_daemon_id=a,ceph_daemon_type=rbd-mirror,rbd-mirror=a,rook.io/operator-namespace=rook-ceph,rook_cluster=rook-ceph
Replicas:           1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:       RollingUpdate
MinReadySeconds:    0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=rook-ceph-rbd-mirror
           app.kubernetes.io/component=cephrbdmirrors.ceph.rook.io
           app.kubernetes.io/created-by=rook-ceph-operator
           app.kubernetes.io/instance=a
           app.kubernetes.io/managed-by=rook-ceph-operator
           app.kubernetes.io/name=ceph-rbd-mirror
           app.kubernetes.io/part-of=my-rbd-mirror
           ceph_daemon_id=a
           ceph_daemon_type=rbd-mirror
           rbd-mirror=a
           rook.io/operator-namespace=rook-ceph
           rook_cluster=rook-ceph
  Init Containers:
   chown-container-data-dir:
    Image:      quay.io/ceph/ceph:v18
    Port:       <none>
    Host Port:  <none>
    Command:
      chown
    Args:
      --verbose
      --recursive
      ceph:ceph
      /var/log/ceph
      /var/lib/ceph/crash
      /run/ceph
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-rbd-mirror-a-keyring (ro)
      /run/ceph from ceph-daemons-sock-dir (rw)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
  Containers:
   rbd-mirror:
    Image:      quay.io/ceph/ceph:v18
    Port:       <none>
    Host Port:  <none>
    Command:
      rbd-mirror
    Args:
      --fsid=ea6c70f1-21d0-4991-829d-bc0c3aad6a2c
      --keyring=/etc/ceph/keyring-store/keyring
      --default-log-to-stderr=true
      --default-err-to-stderr=true
      --default-mon-cluster-log-to-stderr=true
      --default-log-stderr-prefix=debug
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --setuser=ceph
      --setgroup=ceph
      --foreground
      --name=client.rbd-mirror.a
    Environment:
      CONTAINER_IMAGE:                quay.io/ceph/ceph:v18
      POD_NAME:                       (v1:metadata.name)
      POD_NAMESPACE:                  (v1:metadata.namespace)
      NODE_NAME:                      (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               node allocatable (limits.memory)
      POD_MEMORY_REQUEST:             0 (requests.memory)
      POD_CPU_LIMIT:                  node allocatable (limits.cpu)
      POD_CPU_REQUEST:                0 (requests.cpu)
      CEPH_USE_RANDOM_NONCE:          true
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>  Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-rbd-mirror-a-keyring (ro)
      /run/ceph from ceph-daemons-sock-dir (rw)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
  Volumes:
   rook-config-override:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      rook-config-override
    ConfigMapOptional:  <nil>
   rook-ceph-rbd-mirror-a-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-rbd-mirror-a-keyring
    Optional:    false
   ceph-daemons-sock-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /data/rook/exporter
    HostPathType:  DirectoryOrCreate
   rook-ceph-log:
    Type:          HostPath (bare host directory volume)
    Path:          /data/rook/rook-ceph/log
    HostPathType:
   rook-ceph-crash:
    Type:          HostPath (bare host directory volume)
    Path:          /data/rook/rook-ceph/crash
    HostPathType:
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   rook-ceph-rbd-mirror-a-7c677bd5bf (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  28m   deployment-controller  Scaled up replica set rook-ceph-rbd-mirror-a-7c677bd5bf to 1
Updated by Nir Soffer 15 days ago
- File rbd-mirror.log.gz rbd-mirror.log.gz added
Nir Soffer wrote in #note-5:
You would need to do something close to that.
Thanks, we will improve logging and reproduce again.
I tried this rook-config-override [1] with this configuration:
kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [client.rbd-mirror.a]
    debug_ms = 1
    debug_rbd = 15
    debug_rbd_mirror = 30
    log_to_file = true
    log_file = /var/log/ceph/$cluster-$name.log

    [client.rbd-mirror-peer]
    debug_ms = 1
    debug_rbd = 15
    debug_rbd_mirror = 30
    log_to_file = true
    log_file = /var/log/ceph/$cluster-$name.log

    [mgr]
    mgr/rbd_support/log_level = debug
I added log_to_file = true (not used in ODF); without it no log file is created.
With this applied after deploying the rook operator, these logs are created:
$ minikube ssh -p dr1 'ls -lh /data/rook/rook-ceph/log'
total 74M
-rw-r--r-- 1 167 167 74M Apr 16 12:46 ceph-client.rbd-mirror.a.log
-rw-r--r-- 1 167 167 17K Apr 16 01:02 ceph-volume.log
The log file includes log levels up to 20:
$ awk '{print $3}' /data/rook/rook-ceph/log/ceph-client.rbd-mirror.a.log | sort -rn | head -1
20
The other log file (e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log) is not created.
A few questions:
- Any change needed to get enough info when we reproduce the issue again?
- Why do we need debug_rbd_mirror = 30 when the highest log level is 20? (it is also documented to be 20)
- Do we really need the log file? It is much easier to work with kubectl logs, and this setup is used for short-lived clusters during CI or manual testing. We don't expect to debug issues that happened more than 1-2 hours ago.
Attaching an example log from:
kubectl logs deploy/rook-ceph-rbd-mirror-a -n rook-ceph --context dr1 > rbd-mirror.log
[1] https://rook.io/docs/rook/latest-release/Storage-Configuration/Advanced/ceph-configuration/#example
Updated by Ilya Dryomov 15 days ago
Nir Soffer wrote in #note-6:
The other log file (e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log) is not created.
Hi Nir,
The above needs to be done on both clusters. The configuration on cluster A affects logging on cluster B because rbd-mirror daemon is just another client from cluster A's perspective.
A few questions:
- Any change needed to get enough info when we reproduce the issue again?
Let's obtain both log files and go from there.
- Why do we need debug_rbd_mirror = 30 when the highest log level is 20? (it is also documented to be 20)
I don't think there is such a thing as the highest log level. rbd-mirror daemon certainly has a few log messages at 30.
Overall, Ceph is pretty lax about log levels and how they are used varies greatly from subsystem to subsystem.
- Do we really need the log file? It is much easier to work with kubectl logs, and this setup is used for short-lived clusters during CI or manual testing. We don't expect to debug issues that happened more than 1-2 hours ago.
Unfortunately, we do. Once you get to rbd-mirror actually syncing an image, you would see that kubectl can sustain only 1-2 minutes worth of verbose logs.
Updated by Nir Soffer 15 days ago
Ilya Dryomov wrote in #note-7:
Nir Soffer wrote in #note-6:
The other log file (e.g. 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log) is not created.

Hi Nir,
The above needs to be done of both clusters. The configuration on cluster A affects logging on cluster B because rbd-mirror daemon is just another client from cluster A's perspective.
Yes, the configuration is applied to both clusters. If I understand correctly, the mentioned 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log file is the ceph-client.rbd-mirror.a.log from the other cluster? Maybe ODF must-gather is creating this file?
A few questions:
- Any change needed to get enough info when we reproduce the issue again?

Let's obtain both log files and go from there.
- Why do we need debug_rbd_mirror = 30 when the highest log level is 20? (it is also documented to be 20)

I don't think there is such a thing as the highest log level. rbd-mirror daemon certainly has a few log messages at 30.
OK, I was confused by the note here saying "Log levels range from 0 to 20"
https://docs.ceph.com/en/latest/rados/configuration/ceph-conf/#runtime-changes
- Do we really need the log file? It is much easier to work with kubectl logs, and this setup is used for short-lived clusters during CI or manual testing. We don't expect to debug issues that happened more than 1-2 hours ago.
Unfortunately, we do. Once you get to rbd-mirror actually syncing an image, you would see that kubectl can sustain only 1-2 minutes worth of verbose logs.
OK
I'll try to reproduce with this configuration and collect the full logs from both clusters on errors.
Updated by Ilya Dryomov 15 days ago
Nir Soffer wrote in #note-8:
Yes, the configuration is applied to both clusters. If I understand correctly, the mentioned 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log file is the ceph-client.rbd-mirror.a.log from the other cluster?
62f28287-356f-4f81-87dc-51bb05942553 was just an example -- I copied that from the log that you provided previously. Since the clusters have presumably been redeployed from scratch, look for any file named <UUID>-client.rbd-mirror-peer.log.
Maybe ODF must-gather is creating this file?
No, it's created by rbd-mirror daemon.
Updated by Nir Soffer 15 days ago
Ilya Dryomov wrote in #note-9:
Nir Soffer wrote in #note-8:
Yes, the configuration is applied to both clusters. If I understand correctly, the mentioned 62f28287-356f-4f81-87dc-51bb05942553-client.rbd-mirror-peer.log file is the ceph-client.rbd-mirror.a.log from the other cluster?

62f28287-356f-4f81-87dc-51bb05942553 was just an example -- I copied that from the log that you provided previously. Since the clusters have presumably been redeployed from scratch, look for any file named <UUID>-client.rbd-mirror-peer.log.
Maybe ODF must-gather is creating this file?
No, it's created by rbd-mirror daemon.
I see, but it was not created using the configuration I shared here.
Both clusters are configured, and I ran a full DR flow with failover and relocate. At every step we wait for the first replication. Maybe something is wrong with this configuration?
[client.rbd-mirror-peer]
debug_ms = 1
debug_rbd = 15
debug_rbd_mirror = 30
log_to_file = true
log_file = /var/log/ceph/$cluster-$name.log
Updated by Ilya Dryomov 14 days ago
Hi Nir,
I think the problem is the method you used to set these config options. Note that the way it's done in ODF is via "ceph config set" commands, while you did it through a "data: config: " override which simply injects the snippet verbatim into ceph.conf file. Ordinarily there would be almost no difference, but in this case it matters because there are two different clusters in play and their names don't match up, at least as far as rbd-mirror daemon is concerned.
Updated by Nir Soffer 14 days ago
Ilya Dryomov wrote in #note-11:
Hi Nir,
I think the problem is the method you used to set these config options. Note that the way it's done in ODF is via "ceph config set" commands, while you did it through a "data: config: " override which simply injects the snippet verbatim into ceph.conf file. Ordinarily there would be almost no difference, but in this case it matters because there are two different clusters in play and their names don't match up, at least as far as rbd-mirror daemon is concerned.
OK, I will try to make the same changes using the same method as ODF.
Updated by Nir Soffer 14 days ago
Nir Soffer wrote in #note-12:
Ilya Dryomov wrote in #note-11:
Hi Nir,
I think the problem is the method you used to set these config options. Note that the way it's done in ODF is via "ceph config set" commands, while you did it through a "data: config: " override which simply injects the snippet verbatim into ceph.conf file. Ordinarily there would be almost no difference, but in this case it matters because there are two different clusters in play and their names don't match up, at least as far as rbd-mirror daemon is concerned.
Changing to dynamic configuration (using "ceph config set"), the rbd-mirror-peer log is created:
$ minikube ssh -p dr1 'ls /data/rook/rook-ceph/log'
9e67de2a-c3eb-4a06-a8b5-444bff716362-client.rbd-mirror-peer.log
ceph-client.rbd-mirror.a.log
ceph-volume.log
However, unlike ODF, the log files are not created without log_to_file = true. Maybe ODF runs the container with different arguments?
Updated by Ilya Dryomov 13 days ago
log_to_file gets set to true by Rook as part of enabling the log collector:
The log collector rotates logs. If you don't have it enabled but have verbose logging set up, you would quickly run out of space.
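For reference, a sketch of enabling the log collector in the CephCluster CR (the field names below are an assumption about the current Rook API; check the Rook docs for your version):

spec:
  logCollector:
    enabled: true
    periodicity: daily  # how often logs are rotated (assumed field name)
    maxLogSize: 500M    # rotate earlier if a log grows past this size (assumed field name)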
Updated by Nir Soffer 13 days ago
Ilya Dryomov wrote in #note-14:
log_to_file gets set to true by Rook as part of enabling the log collector:
The log collector rotates logs. If you don't have it enabled but have verbose logging set up, you would quickly run out of space.
Thanks! This should not be a problem in most cases since the test environment rarely lives more than one day, but it may be for people keeping a test environment running for many days.
Updated by Nir Soffer 11 days ago
- File rbd-mirror-logs.tar.gz rbd-mirror-logs.tar.gz added
I reproduced the issue again with debug logs.
Tested flow:
- Configure rbd mirroring on both clusters
- Wait for cephrbdmirror to become Ready on both clusters
- Wait for cephblockpool .status.mirroringStatus.summary.daemon_health to become OK on cluster dr1
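The wait step is roughly equivalent to this shell sketch (hypothetical; the resource names and the 180-second timeout are taken from this run):

# poll .status.mirroringStatus.summary.daemon_health until OK or timeout
deadline=$((SECONDS + 180))
while [ "$SECONDS" -lt "$deadline" ]; do
    health=$(kubectl get cephblockpool replicapool -n rook-ceph --context dr1 \
        -o jsonpath='{.status.mirroringStatus.summary.daemon_health}')
    [ "$health" = OK ] && break
    sleep 60
done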
Wait logs:
2024-04-20 14:03:18,306 DEBUG [rdr/0] Waiting for mirroring health in cluster 'dr1' (1/1)
2024-04-20 14:03:18,439 DEBUG [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'UNKNOWN', 'health': 'UNKNOWN', 'image_health': 'OK', 'states': {}}
2024-04-20 14:04:14,124 DEBUG [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
2024-04-20 14:05:14,492 DEBUG [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
2024-04-20 14:06:14,989 DEBUG [rdr/0] Cluster 'dr1' mirroring status': {'daemon_health': 'ERROR', 'health': 'ERROR', 'image_health': 'OK', 'states': {}}
The wait timed out after 180 seconds, then logs were gathered from both clusters.
(I disabled restarting the daemon after the timeout).
Reproducing was much harder with debug logs: I ran 257 deployments until it failed on a VMware VM. On another KVM VM I ran 300 deployments without error.
Attached are the rbd-mirror logs from both clusters.
Updated by Ilya Dryomov 9 days ago
- Status changed from Need More Info to In Progress
Hi Nir,
Thanks for providing verbose logs. For now, I have all the information I need.
Due to rbd-mirror daemon startup racing with mirror peer bootstrapping in your setup, the first time rbd-mirror daemon attempts to connect to a remote cluster, it immediately fails because it doesn't know the monitor addresses and the key for that cluster at that point:
2024-04-20T11:03:13.333+0000 7f9a9d2eb8c0 10 rbd::mirror::ClusterWatcher:0x55c56c2b3130 resolve_peer_site_config_keys: retrieving config-key: pool_id=2, pool_name=replicapool, peer_uuid=05d512ce-7722-4dd8-875e-c7df6fb49203
2024-04-20T11:03:13.333+0000 7f9a9d2eb8c0 1 -- 192.168.122.100:0/2413595472 --> [v2:192.168.122.100:3300/0,v1:192.168.122.100:6789/0] -- mon_command({"prefix": "config-key get", "key": "rbd/mirror/peer/2/05d512ce-7722-4dd8-875e-c7df6fb49203"} v 0) v1 -- 0x55c56c2e8180 con 0x55c56c28e800
2024-04-20T11:03:13.339+0000 7f9a97679700 1 -- 192.168.122.100:0/2413595472 <== mon.0 v2:192.168.122.100:3300/0 7 ==== mon_command_ack([{"prefix": "config-key get", "key": "rbd/mirror/peer/2/05d512ce-7722-4dd8-875e-c7df6fb49203"}]=-2 v39) v1 ==== 127+0+0 (secure 0 0 0) 0x55c56d07b520 con 0x55c56c28e800
2024-04-20T11:03:13.339+0000 7f9a9d2eb8c0 10 rbd::mirror::ClusterWatcher:0x55c56c2b3130 read_site_name:
2024-04-20T11:03:13.344+0000 7f9a9d2eb8c0 1 -- 192.168.122.100:0/2413595472 --> [v2:192.168.122.100:3300/0,v1:192.168.122.100:6789/0] -- mon_command({"prefix": "config-key get", "key": "rbd/mirror/site_name"} v 0) v1 -- 0x55c56c2e8480 con 0x55c56c28e800
2024-04-20T11:03:13.355+0000 7f9a97679700 1 -- 192.168.122.100:0/2413595472 <== mon.0 v2:192.168.122.100:3300/0 8 ==== mon_command_ack([{"prefix": "config-key get", "key": "rbd/mirror/site_name"}]=-2 v39) v1 ==== 93+0+0 (secure 0 0 0) 0x55c56d07b6c0 con 0x55c56c28e800
2024-04-20T11:03:13.358+0000 7f9a9d2eb8c0 20 rbd::mirror::Mirror: 0x55c56c561c20 update_pool_replayers: enter
2024-04-20T11:03:13.358+0000 7f9a9d2eb8c0 20 rbd::mirror::Mirror: 0x55c56c561c20 update_pool_replayers: starting pool replayer for uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
2024-04-20T11:03:13.358+0000 7f9a9d2eb8c0 10 rbd::mirror::PoolReplayer: 0x55c56d0fa900 init: replaying for uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
2024-04-20T11:03:13.501+0000 7f9a9d2eb8c0 0 rbd::mirror::PoolReplayer: 0x55c56d0fa900 init_rados: reverting global config option override: keyring: /etc/ceph/keyring-store/keyring -> /etc/ceph/2d560a25-76f7-406a-a4b0-b4875c52eaa8.client.rbd-mirror-peer.keyring,/etc/ceph/2d560a25-76f7-406a-a4b0-b4875c52eaa8.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
2024-04-20T11:03:13.501+0000 7f9a9d2eb8c0 0 rbd::mirror::PoolReplayer: 0x55c56d0fa900 init_rados: reverting global config option override: mon_host: [v2:192.168.122.100:3300,v1:192.168.122.100:6789] ->
2024-04-20T11:03:13.514+0000 7f9a9d2eb8c0 -1 rbd::mirror::PoolReplayer: 0x55c56d0fa900 init_rados: error connecting to remote peer uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer: (2) No such file or directory
2024-04-20T11:03:13.514+0000 7f9a9d2eb8c0 20 rbd::mirror::ServiceDaemon: 0x55c56c3d2ab0 add_or_update_callout: pool_id=2, callout_id=0, callout_level=error, text=unable to connect to remote cluster
After 30 seconds, rbd-mirror daemon restarts the failed pool replayer and since the mirror peer information is available by then, it succeeds:
2024-04-20T11:03:43.593+0000 7f9a9d2eb8c0 10 rbd::mirror::ClusterWatcher:0x55c56c2b3130 resolve_peer_site_config_keys: retrieving config-key: pool_id=2, pool_name=replicapool, peer_uuid=05d512ce-7722-4dd8-875e-c7df6fb49203
2024-04-20T11:03:43.593+0000 7f9a9d2eb8c0 1 -- 192.168.122.100:0/2413595472 --> [v2:192.168.122.100:3300/0,v1:192.168.122.100:6789/0] -- mon_command({"prefix": "config-key get", "key": "rbd/mirror/peer/2/05d512ce-7722-4dd8-875e-c7df6fb49203"} v 0) v1 -- 0x55c56c3e6900 con 0x55c56c28e800
2024-04-20T11:03:43.597+0000 7f9a97679700 1 -- 192.168.122.100:0/2413595472 <== mon.0 v2:192.168.122.100:3300/0 10 ==== mon_command_ack([{"prefix": "config-key get", "key": "rbd/mirror/peer/2/05d512ce-7722-4dd8-875e-c7df6fb49203"}]=0 v41) v1 ==== 127+0+120 (secure 0 0 0) 0x55c56d07b860 con 0x55c56c28e800
2024-04-20T11:03:43.597+0000 7f9a9d2eb8c0 10 rbd::mirror::ClusterWatcher:0x55c56c2b3130 read_site_name:
2024-04-20T11:03:43.600+0000 7f9a9d2eb8c0 1 -- 192.168.122.100:0/2413595472 --> [v2:192.168.122.100:3300/0,v1:192.168.122.100:6789/0] -- mon_command({"prefix": "config-key get", "key": "rbd/mirror/site_name"} v 0) v1 -- 0x55c56c3e6a80 con 0x55c56c28e800
2024-04-20T11:03:43.605+0000 7f9a97679700 1 -- 192.168.122.100:0/2413595472 <== mon.0 v2:192.168.122.100:3300/0 11 ==== mon_command_ack([{"prefix": "config-key get", "key": "rbd/mirror/site_name"}]=-2 v41) v1 ==== 93+0+0 (secure 0 0 0) 0x55c56d07ba00 con 0x55c56c28e800
2024-04-20T11:03:43.606+0000 7f9a9d2eb8c0 20 rbd::mirror::Mirror: 0x55c56c561c20 update_pool_replayers: enter
2024-04-20T11:03:43.606+0000 7f9a9d2eb8c0 20 rbd::mirror::Mirror: 0x55c56c561c20 update_pool_replayers: starting pool replayer for uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
2024-04-20T11:03:43.606+0000 7f9a9d2eb8c0 10 rbd::mirror::PoolReplayer: 0x55c56de05200 init: replaying for uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
2024-04-20T11:03:43.719+0000 7f9a9d2eb8c0 0 rbd::mirror::PoolReplayer: 0x55c56de05200 init_rados: reverting global config option override: keyring: /etc/ceph/keyring-store/keyring -> /etc/ceph/2d560a25-76f7-406a-a4b0-b4875c52eaa8.client.rbd-mirror-peer.keyring,/etc/ceph/2d560a25-76f7-406a-a4b0-b4875c52eaa8.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
2024-04-20T11:03:43.719+0000 7f9a9d2eb8c0 0 rbd::mirror::PoolReplayer: 0x55c56de05200 init_rados: reverting global config option override: mon_host: [v2:192.168.122.100:3300,v1:192.168.122.100:6789] ->
2024-04-20T11:03:43.837+0000 7f9a9d2eb8c0 10 rbd::mirror::PoolReplayer: 0x55c56de05200 init: connected to uuid: 05d512ce-7722-4dd8-875e-c7df6fb49203 cluster: 2d560a25-76f7-406a-a4b0-b4875c52eaa8 client: client.rbd-mirror-peer
However, the "unable to connect to remote cluster" error isn't cleared and you continue to see "daemon_health: ERROR" status due to a bug in the restart code. I think I see what it is, I'll confirm and post a fix in the coming days.
Updated by Ilya Dryomov 7 days ago · Edited
- Priority changed from Normal to High
Ilya Dryomov wrote in #note-17:
However, the "unable to connect to remote cluster" error isn't cleared and you continue to see "daemon_health: ERROR" status due to a bug in the restart code. I think I see what it is, I'll confirm and post a fix in the coming days.
In addition to the error not getting cleared, I identified another bug which causes pool replayers to not get cleaned up in some cases (this ticket's scenario being one of them).
Updated by Ilya Dryomov 6 days ago
- Status changed from In Progress to Fix Under Review
- Backport set to quincy,reef,squid
- Pull request ID set to 57082
Updated by Ilya Dryomov 6 days ago
Hi Nir,
I built a container based on 18.2.3 (an upcoming release). It would be great if you could try it: podman pull quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3
Disable verbose logging since that made the issue harder to reproduce -- I'm interested in just pass/fail (or rather the number of deployments you are able to clock without restarting the daemon).
Updated by Nir Soffer about 21 hours ago
Ilya Dryomov wrote in #note-20:
Hi Nir,
I built a container based on 18.2.3 (an upcoming release). It would be great if you could try it: podman pull quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3
Disable verbose logging since that made the issue harder to reproduce -- I'm interested in just pass/fail (or rather the number of deployments you are able to clock without restarting the daemon).
Thanks Ilya! I will try this image later this week.
Updated by Nir Soffer about 15 hours ago
Ilya Dryomov wrote in #note-20:
Hi Nir,
I built a container based on 18.2.3 (an upcoming release). It would be great if you could try it: podman pull quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3
This registry does not have a good enough connection -- it takes 7-8 minutes to pull the image in the lab used for testing:
rook-ceph 0s Normal Pulled Pod/rook-ceph-detect-version-ms4hn Successfully pulled image "quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3" in 8m40.521s (8m40.521s including waiting)
...
rook-ceph 0s Normal Pulled Pod/rook-ceph-detect-version-kjbrf Successfully pulled image "quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3" in 7m5.058s (7m5.058s including waiting)
This causes a timeout when creating the ceph cluster. We probably need to set up a local registry in this lab.
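One possible approach (a hypothetical sketch -- we have not set this up yet) is to mirror the image once into a registry running in the lab and point the cluster spec at it; <lab-host> is a placeholder:

$ podman run -d --name local-registry -p 5000:5000 registry:2
$ podman pull quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3
$ podman tag quay.ceph.io/ceph-ci/ceph:wip-65487-on-18.2.3 <lab-host>:5000/ceph-ci/ceph:wip-65487-on-18.2.3
$ podman push --tls-verify=false <lab-host>:5000/ceph-ci/ceph:wip-65487-on-18.2.3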