Bug #49277
cephadm bootstrap --apply-spec <cluster.yaml> hangs (Status: Closed)
Description
The feature introduced by https://tracker.ceph.com/issues/44873 seems to have the following flaw.
If I bootstrap a cluster on node oc0-ceph-0 with --apply-spec, the bootstrap proceeds but the spec [1] is never applied, and the cephadm log shows it waiting to acquire a lock.
If I bootstrap a cluster on node oc0-ceph-0 without --apply-spec and then apply the same spec file [1] a few seconds later, the spec is applied flawlessly.
I have ansible tasks I can use to reproduce this easily [2][3] and ensure a consistent test. I used cephadm-15.2.5-0.el8.x86_64.rpm with the latest "docker.io/ceph/ceph:v15" as of Jan 10, 2021.
The command run by ansible is:
/usr/sbin/cephadm bootstrap \
  --ssh-private-key /home/ceph-admin/.ssh/id_rsa \
  --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
  --ssh-user ceph-admin \
  --output-keyring /etc/ceph/ceph.client.admin.keyring \
  --output-config /etc/ceph/ceph.conf \
  --fsid 77642368-c850-5eb9-ba49-e59024b4d0ab \
  --mon-ip 192.168.24.6
FWIW: This is not a major problem for TripleO's cephadm integration because we can bootstrap a single node and apply the spec afterwards.
[1] ceph_spec.yml
---
service_type: host
addr: oc0-ceph-1
hostname: oc0-ceph-1
---
service_type: host
addr: oc0-ceph-2
hostname: oc0-ceph-2
---
service_type: mon
placement:
  hosts:
    - oc0-ceph-0
    - oc0-ceph-1
    - oc0-ceph-2
---
service_type: osd
service_id: default_drive_group
placement:
  hosts:
    - oc0-ceph-0
    - oc0-ceph-1
    - oc0-ceph-2
data_devices:
  all: true
[2] https://review.opendev.org/c/openstack/tripleo-ansible/+/770674/54/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml
[3] https://review.opendev.org/c/openstack/tripleo-ansible/+/770674/54/tripleo_ansible/roles/tripleo_cephadm/tasks/apply_spec.yaml
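The spec in [1] is a multi-document YAML stream (documents separated by "---"). As a rough illustration of its structure, here is a stdlib-only Python sketch that splits such a stream and pulls out each document's service_type; real code would use yaml.safe_load_all instead, and the embedded sample spec is abbreviated from [1]:

```python
# Illustrative sketch only: split a multi-document YAML stream like
# ceph_spec.yml on "---" separators and extract each service_type.
# A real consumer would use PyYAML's yaml.safe_load_all.
SPEC = """\
---
service_type: host
addr: oc0-ceph-1
hostname: oc0-ceph-1
---
service_type: mon
placement:
  hosts:
    - oc0-ceph-0
"""

def split_docs(text):
    # Accumulate lines between "---" separators into separate documents.
    docs, current = [], []
    for line in text.splitlines():
        if line.strip() == "---":
            if current:
                docs.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        docs.append("\n".join(current))
    return docs

def service_types(text):
    # Collect the service_type value from each document.
    types = []
    for doc in split_docs(text):
        for line in doc.splitlines():
            if line.startswith("service_type:"):
                types.append(line.split(":", 1)[1].strip())
    return types

print(service_types(SPEC))  # ['host', 'mon']
```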
Files
Updated by John Fulton about 3 years ago
- File cephadm.log.1.gz cephadm.log.1.gz added
Reproduced just now; attaching logs and versions. Using the latest cephadm from https://download.ceph.com/rpm-octopus/el8/:
[root@oc0-ceph-2 ceph]# rpm -q cephadm
cephadm-15.2.8-0.el8.x86_64
[root@oc0-ceph-2 ceph]#
Exact container tag below:
[root@oc0-ceph-2 ceph]# cephadm ls
[
  {
    "style": "cephadm:v1",
    "name": "mon.oc0-ceph-2",
    "fsid": "ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135",
    "systemd_unit": "ceph-ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135@mon.oc0-ceph-2",
    "enabled": true,
    "state": "running",
    "container_id": "3d5694664c54d4d2c9cb0e7f143ebacbeb3541edd09b6caa0686cf06fa452318",
    "container_image_name": "undercloud.ctlplane.mydomain.tld:8787/ceph-ci/daemon:v5.0.7-stable-5.0-octopus-centos-8-x86_64",
    "container_image_id": "9dd970f9358d366d78edc20835503ce0d4bb2dcb735651c7589a4cea12c47ffd",
    "version": "15.2.8",
    "started": "2021-02-12T15:38:19.278934",
    "created": "2021-02-12T15:38:16.521453",
    "deployed": "2021-02-12T15:38:15.432359",
    "configured": "2021-02-12T15:38:16.521453"
  },
  {
    "style": "cephadm:v1",
    "name": "mgr.oc0-ceph-2.vlzoat",
    "fsid": "ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135",
    "systemd_unit": "ceph-ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135@mgr.oc0-ceph-2.vlzoat",
    "enabled": true,
    "state": "running",
    "container_id": "c20dc8fca9310d8f4982c8aa50d1f8b3a68c518708149dbd6c520d5fc253d3a2",
    "container_image_name": "undercloud.ctlplane.mydomain.tld:8787/ceph-ci/daemon:v5.0.7-stable-5.0-octopus-centos-8-x86_64",
    "container_image_id": "9dd970f9358d366d78edc20835503ce0d4bb2dcb735651c7589a4cea12c47ffd",
    "version": "15.2.8",
    "started": "2021-02-12T15:38:20.644009",
    "created": "2021-02-12T15:38:20.954838",
    "deployed": "2021-02-12T15:38:20.101764",
    "configured": "2021-02-12T15:38:20.954838"
  }
]
[root@oc0-ceph-2 ceph]#
Updated by Sebastian Wagner about 3 years ago
- Project changed from Ceph to Orchestrator
Updated by Sebastian Wagner about 3 years ago
- Description updated (diff)
- Priority changed from Normal to High
- Tags deleted (cephadm)
Updated by Sebastian Wagner about 3 years ago
2021-02-12 15:38:41,511 INFO Applying /home/ceph-admin/specs/ceph_spec.yaml to cluster
2021-02-12 15:38:41,511 INFO Adding ssh key to oc0-ceph-3
2021-02-12 15:38:41,511 DEBUG Running command: ssh-copy-id -f -i /home/ceph-admin/.ssh/id_rsa.pub ceph-admin@oc0-ceph-3
2021-02-12 15:38:41,519 DEBUG ssh-copy-id:stderr /bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ceph-admin/.ssh/id_rsa.pub"
2021-02-12 15:38:41,531 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,532 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,582 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,582 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,632 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,633 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,683 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,683 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,733 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,733 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,783 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,783 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,834 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,834 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
... snip ...
looks like we're doing too much synchronously here.
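For context, the "Acquiring lock ... not acquired ... waiting 0.05 seconds" loop above is cephadm serializing its runs with an exclusive lock on /run/cephadm/<fsid>.lock and polling until it frees. This is an illustrative Python reimplementation of that behavior, not the actual cephadm code (function name and timeout are assumptions):

```python
import fcntl
import os
import tempfile
import time

# Sketch of the lock loop seen in the log: try a non-blocking exclusive
# flock, and if another holder has it, sleep 0.05 s and retry. The hang in
# this bug corresponds to the holder never releasing, so a timeout is added
# here for demonstration.
def acquire_lock(path, timeout=1.0, poll=0.05):
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd  # lock held; caller releases it with os.close(fd)
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(poll)

path = os.path.join(tempfile.mkdtemp(), "demo.lock")
holder = acquire_lock(path)             # first caller wins
try:
    acquire_lock(path, timeout=0.2)     # second caller spins, then gives up
except TimeoutError as e:
    print("blocked:", e)
os.close(holder)                        # releasing unblocks future callers
```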
Updated by Sebastian Wagner about 3 years ago
- Related to Feature #44873: cephadm bootstrap: add --apply-spec <cluster.yaml> added
Updated by Daniel Pivonka about 3 years ago
- Status changed from New to Can't reproduce
Tested on 15.2.8 and 15.2.9; can't reproduce.
Updated by John Fulton about 3 years ago
I reproduced this with 15.2.10. I'll follow up on IRC to offer the assignee live access to the reproducer.
[root@oc0-ceph-2 ~]# cephadm ls
[
  {
    "style": "cephadm:v1",
    "name": "mon.oc0-ceph-2",
    "fsid": "ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135",
    "systemd_unit": "ceph-ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135@mon.oc0-ceph-2",
    "enabled": true,
    "state": "running",
    "container_id": "fd1caf3c583e5767dff00b9319d053c150f92ed572ac1f7ea3b23b1448784f18",
    "container_image_name": "quay.ceph.io/ceph-ci/daemon:v5.0.9-stable-5.0-octopus-centos-8-x86_64",
    "container_image_id": "1f47579e3c610d981f4dbe27f372fa7cca94a16620152ddad4bd20285c8c95e7",
    "version": "15.2.10",
    "started": "2021-04-08T19:17:26.194598Z",
    "created": "2021-04-08T19:17:23.868279Z",
    "deployed": "2021-04-08T19:17:22.814175Z",
    "configured": "2021-04-08T19:17:23.868279Z"
  },
  {
    "style": "cephadm:v1",
    "name": "mgr.oc0-ceph-2.mgogfr",
    "fsid": "ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135",
    "systemd_unit": "ceph-ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135@mgr.oc0-ceph-2.mgogfr",
    "enabled": true,
    "state": "running",
    "container_id": "6c597539ccfe12a7324c9eb4f1e0386f69865db392fefbec81dc1d99662d846e",
    "container_image_name": "quay.ceph.io/ceph-ci/daemon:v5.0.9-stable-5.0-octopus-centos-8-x86_64",
    "container_image_id": "1f47579e3c610d981f4dbe27f372fa7cca94a16620152ddad4bd20285c8c95e7",
    "version": "15.2.10",
    "started": "2021-04-08T19:17:27.372219Z",
    "created": "2021-04-08T19:17:27.536640Z",
    "deployed": "2021-04-08T19:17:26.869575Z",
    "configured": "2021-04-08T19:17:27.536640Z"
  }
]
[root@oc0-ceph-2 ~]# rpm -q cephadm
cephadm-15.2.10-4.el8.noarch
[root@oc0-ceph-2 ~]# cat /etc/redhat-release
CentOS Stream release 8
[root@oc0-ceph-2 ~]# uname -a
Linux oc0-ceph-2 4.18.0-294.el8.x86_64 #1 SMP Mon Mar 15 22:38:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@oc0-ceph-2 ~]# podman --version
podman version 3.0.2-dev
[root@oc0-ceph-2 ~]#
Updated by John Fulton about 3 years ago
We think this is a duplicate of https://tracker.ceph.com/issues/50041 and fixed by https://github.com/ceph/ceph/pull/40477/files
I reproduced the problem by running this:
sudo /usr/sbin/cephadm bootstrap \
  --ssh-private-key /home/ceph-admin/.ssh/id_rsa \
  --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
  --ssh-user ceph-admin \
  --apply-spec /home/ceph-admin/specs/ceph_spec.yaml \
  ...
and watching root fail to SSH into the hosts in the spec. I could run "ssh ceph-admin@oc0-ceph-4" on my system, but "sudo ssh ceph-admin@oc0-ceph-4" failed. ceph-admin's SSH keys were set up correctly, but because of sudo it was root attempting the SSH connection, and root's keys were not distributed. It didn't reproduce on dpivonka's system because root's SSH keys were already distributed there. Applying the spec outside of bootstrap does not try to copy the keys for you, which is why I didn't hit the issue when I applied the spec after bootstrap.
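The root-vs-ceph-admin confusion comes down to how the default SSH identity is resolved: ssh looks for keys under $HOME/.ssh, and under "sudo ssh" $HOME is root's home, not the invoking user's. A tiny illustrative sketch (the helper function is hypothetical; ssh does this resolution internally):

```python
import os.path

# Hypothetical helper illustrating the diagnosis above: the default SSH
# identity lives under the *effective* user's home directory, so
# "sudo ssh ..." offers /root/.ssh/id_rsa even though the key that was
# actually distributed belongs to ceph-admin.
def default_identity(home):
    return os.path.join(home, ".ssh", "id_rsa")

print(default_identity("/home/ceph-admin"))  # key that was distributed
print(default_identity("/root"))             # key a sudo'ed ssh offers instead
```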
Next steps:
1. Confirm I cannot reproduce #49277 with Pacific.
2. Close #49277 as a duplicate of #50041.
Updated by Sebastian Wagner about 3 years ago
- Related to Bug #50041: cephadm bootstrap with apply-spec anmd ssh-user option failed while adding the hosts added
Updated by John Fulton about 3 years ago
When using Pacific with the fixing patch https://github.com/ceph/ceph/pull/40477 from Bug #50041, the deployment fails better (it doesn't hang waiting for me to accept the SSH key), but it still fails [1]. The underlying root cause is similar: ceph-admin's SSH keys were set up correctly, but because of sudo, root was trying to SSH, and root's keys were not distributed. Thus, it tried to distribute root's keys as root and failed.
[ceph-admin@oc0-ceph-2 ~]$ sudo ssh oc0-ceph-3
The authenticity of host 'oc0-ceph-3 (192.168.24.14)' can't be established.
ECDSA key fingerprint is SHA256:VDnoF5dEU7gmwmT9RV6eJg/I1Hby0GhRN7VFCs6fZ0Q.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'oc0-ceph-3,192.168.24.14' (ECDSA) to the list of known hosts.
root@oc0-ceph-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[ceph-admin@oc0-ceph-2 ~]$
I understand cephadm defaulting to root for the SSH user and root's SSH keys, but maybe cephadm needs to look at these parameters:
--ssh-private-key /home/ceph-admin/.ssh/id_rsa \
--ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
--ssh-user ceph-admin \
and, if they are set, use them for the SSH connection instead, i.e. let the above override the default.
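The proposed precedence could be sketched like this. This is hypothetical pseudologic, not actual cephadm code (function name, defaults, and return shape are assumptions for illustration):

```python
from typing import Optional, Tuple

# Hypothetical sketch of the override John proposes: if --ssh-user and
# --ssh-private-key were given, cephadm should use them for every SSH
# connection it makes (including key distribution during --apply-spec)
# instead of falling back to root and root's keys.
def resolve_ssh(ssh_user: Optional[str] = None,
                ssh_private_key: Optional[str] = None) -> Tuple[str, str]:
    user = ssh_user or "root"
    home = "/root" if user == "root" else f"/home/{user}"
    key = ssh_private_key or f"{home}/.ssh/id_rsa"
    return user, key

# Explicit flags win over the root default:
print(resolve_ssh("ceph-admin", "/home/ceph-admin/.ssh/id_rsa"))
# Today's default when nothing is passed:
print(resolve_ssh())
```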
[1]
[ceph-admin@oc0-ceph-2 ~]$ sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 bootstrap \
  --skip-firewalld \
  --ssh-private-key /home/ceph-admin/.ssh/id_rsa \
  --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
  --ssh-user ceph-admin \
  --allow-fqdn-hostname \
  --output-keyring /etc/ceph/ceph.client.admin.keyring \
  --output-config /etc/ceph/ceph.conf \
  --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 \
  --apply-spec /home/ceph-admin/specs/ceph_spec.yaml \
  --config /home/ceph-admin/bootstrap_ceph.conf \
  --skip-monitoring-stack \
  --skip-dashboard \
  --mon-ip 192.168.24.22
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135
Verifying IP 192.168.24.22 port 3300 ...
Verifying IP 192.168.24.22 port 6789 ...
Mon IP 192.168.24.22 is in CIDR network 192.168.24.0/24
- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64...
Ceph version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.24.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding host oc0-ceph-2...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
Applying /home/ceph-admin/specs/ceph_spec.yaml to cluster
Adding ssh key to oc0-ceph-3
Adding ssh key to oc0-ceph-4
Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=oc0-ceph-2 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135:/var/log/ceph:z -v /tmp/ceph-tmpbfayxvwp:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjm_n51dx:/etc/ceph/ceph.conf:z -v /home/ceph-admin/specs/ceph_spec.yaml:/tmp/spec.yml:z quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 orch apply -i /tmp/spec.yml
/usr/bin/ceph: stderr Error EINVAL: Failed to connect to oc0-ceph-3 (oc0-ceph-3).
/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key
/usr/bin/ceph: stderr
/usr/bin/ceph: stderr To add the cephadm SSH key to the host:
/usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub
/usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@oc0-ceph-3
/usr/bin/ceph: stderr
/usr/bin/ceph: stderr To check that the host is reachable:
/usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config
/usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
/usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key
/usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@oc0-ceph-3
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 7924, in <module>
    main()
  File "/usr/sbin/cephadm", line 7912, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 1717, in _default_image
    return func(ctx)
  File "/usr/sbin/cephadm", line 4037, in command_bootstrap
    out = cli(['orch', 'apply', '-i', '/tmp/spec.yml'], extra_mounts=mounts)
  File "/usr/sbin/cephadm", line 3931, in cli
    ).run(timeout=timeout)
  File "/usr/sbin/cephadm", line 3174, in run
    desc=self.entrypoint, timeout=timeout)
  File "/usr/sbin/cephadm", line 1411, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=oc0-ceph-2 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135:/var/log/ceph:z -v /tmp/ceph-tmpbfayxvwp:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjm_n51dx:/etc/ceph/ceph.conf:z -v /home/ceph-admin/specs/ceph_spec.yaml:/tmp/spec.yml:z quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 orch apply -i /tmp/spec.yml
[ceph-admin@oc0-ceph-2 ~]$
Updated by Daniel Pivonka about 3 years ago
- Status changed from Can't reproduce to Duplicate
Updated by John Fulton about 3 years ago
Because I'm having the same issue in Pacific, let's use a new bug: https://tracker.ceph.com/issues/50306