Bug #50306
/etc/hosts is not passed to ceph containers. Clusters that were relying on /etc/hosts for name resolution will have strange behavior
Status: closed
Description
While using `cephadm bootstrap --apply-spec` to bootstrap a spec containing other hosts, cephadm attempts to set up SSH keys for root on those other hosts even though I passed the following options, which refer to an account and SSH keypair that are already working.
--ssh-private-key /home/ceph-admin/.ssh/id_rsa \
--ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
--ssh-user ceph-admin \
I understand that cephadm defaults to root for the SSH user and to root's SSH keys when sudo is used, but cephadm should probably use the account and keys from the above parameters to override those defaults when they are passed.
I hit this issue even with the fix for bug #50041 installed, as reported in Bug #49277.
[ceph-admin@oc0-ceph-2 ~]$ sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 --apply-spec /home/ceph-admin/specs/ceph_spec.yaml --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 192.168.24.22
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135
Verifying IP 192.168.24.22 port 3300 ...
Verifying IP 192.168.24.22 port 6789 ...
Mon IP 192.168.24.22 is in CIDR network 192.168.24.0/24
- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64...
Ceph version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.24.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding host oc0-ceph-2...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
Applying /home/ceph-admin/specs/ceph_spec.yaml to cluster
Adding ssh key to oc0-ceph-3
Adding ssh key to oc0-ceph-4
Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=oc0-ceph-2 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135:/var/log/ceph:z -v /tmp/ceph-tmpbfayxvwp:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjm_n51dx:/etc/ceph/ceph.conf:z -v /home/ceph-admin/specs/ceph_spec.yaml:/tmp/spec.yml:z quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 orch apply -i /tmp/spec.yml
/usr/bin/ceph: stderr Error EINVAL: Failed to connect to oc0-ceph-3 (oc0-ceph-3).
/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key
/usr/bin/ceph: stderr
/usr/bin/ceph: stderr To add the cephadm SSH key to the host:
/usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub
/usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@oc0-ceph-3
/usr/bin/ceph: stderr
/usr/bin/ceph: stderr To check that the host is reachable:
/usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config
/usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
/usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key
/usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@oc0-ceph-3
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 7924, in <module>
    main()
  File "/usr/sbin/cephadm", line 7912, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 1717, in _default_image
    return func(ctx)
  File "/usr/sbin/cephadm", line 4037, in command_bootstrap
    out = cli(['orch', 'apply', '-i', '/tmp/spec.yml'], extra_mounts=mounts)
  File "/usr/sbin/cephadm", line 3931, in cli
    ).run(timeout=timeout)
  File "/usr/sbin/cephadm", line 3174, in run
    desc=self.entrypoint, timeout=timeout)
  File "/usr/sbin/cephadm", line 1411, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=oc0-ceph-2 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135:/var/log/ceph:z -v /tmp/ceph-tmpbfayxvwp:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjm_n51dx:/etc/ceph/ceph.conf:z -v /home/ceph-admin/specs/ceph_spec.yaml:/tmp/spec.yml:z quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 orch apply -i /tmp/spec.yml
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ cat /home/ceph-admin/specs/ceph_spec.yaml
---
service_type: host
addr: oc0-ceph-3
hostname: oc0-ceph-3
---
service_type: host
addr: oc0-ceph-4
hostname: oc0-ceph-4
---
service_type: mon
placement:
  hosts:
    - oc0-ceph-2
    - oc0-ceph-3
    - oc0-ceph-4
---
service_type: osd
service_id: default_drive_group
placement:
  hosts:
    - oc0-ceph-2
    - oc0-ceph-3
    - oc0-ceph-4
data_devices:
  all: true
[ceph-admin@oc0-ceph-2 ~]$
Updated by Daniel Pivonka about 3 years ago
I was able to determine that this was caused by the hostname failing to resolve when trying to add hosts.
debug 2021-04-12T21:55:47.520+0000 7fc64b3d8700 0 log_channel(audit) log [DBG] : from='client.14196 -' entity='client.admin' cmd=[{"prefix": "orch host add", "hostname": "oc0-ceph-3", "target": ["mon-mgr", ""]}]: dispatch
ssh: Could not resolve hostname oc0-ceph-3: Name or service not known
debug 2021-04-12T21:55:47.537+0000 7fc657cc2700 -1 mgr.server reply reply (22) Invalid argument
-F /tmp/cephadm-conf-o3k_xelh -i /tmp/cephadm-identity-h6ki3b4e ceph-admin@oc0-ceph-3
I modified _remote_connection in serve.py to get the above to print.
All the keys were copied correctly; the problem here is not about using a combination of the --apply-spec, --ssh-private-key, --ssh-public-key, and --ssh-user flags.
This change, https://github.com/ceph/ceph/pull/40223, removed /etc/hosts from the mgr container.
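Since that PR starts the mgr container with podman's --no-hosts flag, any name that exists only in the host's /etc/hosts will no longer resolve inside it. As a minimal sketch (a hypothetical helper of mine, not part of cephadm), a pre-flight check run where the mgr will run can flag hostnames that depend on local resolution:

```python
import socket

def resolvable(name: str) -> bool:
    """Return True if `name` resolves via the system resolver.

    Inside a container started with --no-hosts, names that exist only
    in the host's /etc/hosts will NOT resolve, so this check must run
    in the same resolution context as the mgr.
    """
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

# Illustrative use with the hostnames from the spec above
for host in ["oc0-ceph-3", "oc0-ceph-4"]:
    if not resolvable(host):
        print(f"{host}: may not resolve inside the mgr container; "
              f"consider using an IP in the spec's addr: field")
```

This is only a diagnostic aid; the actual fix tracked by this bug is in cephadm itself.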
Updated by John Fulton about 3 years ago
If I use a spec with IPs then I can add my hosts after bootstrap [1] but not at bootstrap [2].
[1]
[ceph-admin@oc0-ceph-2 ~]$ cat specs/ceph_spec.yaml
---
service_type: host
addr: 192.168.24.14
hostname: oc0-ceph-3
---
service_type: host
addr: 192.168.24.10
hostname: oc0-ceph-4
---
service_type: mon
placement:
  hosts:
    - oc0-ceph-2
    - oc0-ceph-3
    - oc0-ceph-4
---
service_type: osd
service_id: default_drive_group
placement:
  hosts:
    - oc0-ceph-2
    - oc0-ceph-3
    - oc0-ceph-4
data_devices:
  all: true
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ sudo cephadm ls
[]
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 192.168.24.18
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135
Verifying IP 192.168.24.18 port 3300 ...
Verifying IP 192.168.24.18 port 6789 ...
Mon IP 192.168.24.18 is in CIDR network 192.168.24.0/24
- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64...
Ceph version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.24.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding host oc0-ceph-2...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
You can access the Ceph CLI with:
	sudo /usr/sbin/cephadm shell --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring
Please consider enabling telemetry to help improve Ceph:
	ceph telemetry on
For more information see:
	https://docs.ceph.com/docs/pacific/mgr/telemetry/
Bootstrap complete.
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ sudo podman images
REPOSITORY                   TAG                                        IMAGE ID      CREATED      SIZE
quay.ceph.io/ceph-ci/daemon  v6.0.0-stable-6.0-pacific-centos-8-x86_64  14fee0875498  11 days ago  1.17 GB
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ sudo podman run --rm --volume /etc/ceph:/etc/ceph:z --volume /home/ceph-admin/specs:/home/specs --entrypoint ceph 14fee0875498 orch apply -i /home/specs/ceph_spec.yaml
Added host 'oc0-ceph-3'
Added host 'oc0-ceph-4'
Scheduled mon update...
Scheduled osd.default_drive_group update...
[ceph-admin@oc0-ceph-2 ~]$
[2]
[ceph-admin@oc0-ceph-2 ~]$ sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 192.168.24.18 --apply-spec /home/ceph-admin/specs/ceph_spec.yaml
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135
Verifying IP 192.168.24.18 port 3300 ...
Verifying IP 192.168.24.18 port 6789 ...
Mon IP 192.168.24.18 is in CIDR network 192.168.24.0/24
- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64...
Ceph version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.24.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding host oc0-ceph-2...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
Applying /home/ceph-admin/specs/ceph_spec.yaml to cluster
Adding ssh key to oc0-ceph-3
Non-zero exit code 1 from ssh-copy-id -f -i /home/ceph-admin/.ssh/id_rsa.pub ceph-admin@oc0-ceph-3
ssh-copy-id: stderr /bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ceph-admin/.ssh/id_rsa.pub"
ssh-copy-id: stderr ceph-admin@oc0-ceph-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 7924, in <module>
    main()
  File "/usr/sbin/cephadm", line 7912, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 1717, in _default_image
    return func(ctx)
  File "/usr/sbin/cephadm", line 4032, in command_bootstrap
    out, err, code = call_throws(ctx, ['ssh-copy-id', '-f', '-i', ssh_key, '%s@%s' % (ctx.ssh_user, split[1])])
  File "/usr/sbin/cephadm", line 1411, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: ssh-copy-id -f -i /home/ceph-admin/.ssh/id_rsa.pub ceph-admin@oc0-ceph-3
[ceph-admin@oc0-ceph-2 ~]$
Updated by John Fulton about 3 years ago
Wait, I think I can't apply it at bootstrap because I am currently missing the fix for bug #50041 (I had rolled it back while testing). I will retest with that patch and then update the bug.
out, err, code = call_throws(ctx, ['ssh-copy-id', '-f', '-i', ssh_key, '%s@%s' % (ctx.ssh_user, split[1])])
Updated by John Fulton about 3 years ago
I confirm I could apply a spec on bootstrap. Thanks!
Conclusions:
- Ensure you have the fix for bug #50041
- Do not rely on the /etc/hosts file of the container host. Instead, set addr: to an actual IP in the host service entry
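The second conclusion can be checked mechanically before bootstrapping. A rough sketch (my own helper, not part of cephadm) that scans a spec for addr: values that are hostnames rather than IP literals, and which would therefore rely on name resolution inside the mgr container:

```python
import ipaddress

def non_ip_addrs(spec_text: str) -> list:
    """Return the addr: values in a cephadm spec that are not IP literals."""
    flagged = []
    for line in spec_text.splitlines():
        line = line.strip()
        if line.startswith("addr:"):
            value = line.split(":", 1)[1].strip()
            try:
                # ip_address() accepts IPv4/IPv6 literals and rejects names
                ipaddress.ip_address(value)
            except ValueError:
                flagged.append(value)
    return flagged

spec = """\
---
service_type: host
addr: 192.168.24.14
hostname: oc0-ceph-3
---
service_type: host
addr: oc0-ceph-4
hostname: oc0-ceph-4
"""
print(non_ip_addrs(spec))  # only the hostname-based addr is flagged
```

This deliberately uses naive line matching instead of a YAML parser to stay dependency-free; a real check would parse the multi-document spec properly.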
Updated by John Fulton about 3 years ago
FWIW I see nothing wrong with closing this bug as invalid.
Unless you want to follow up on https://github.com/ceph/ceph/pull/40223 to support /etc/hosts in the mgr container, users should probably just ensure they pass IPs in their spec file.
Updated by Daniel Pivonka about 3 years ago
- File gt7DGcXc.txt gt7DGcXc.txt added
- Subject changed from cephadm bootstrap --apply-spec uses root's ssh keys even if --ssh-{user,private-key,public-key} are passed to /etc/hosts is not passed to ceph containers. clusters that were relying on /etc/hosts for name resolution will have strange behavior
@john I'm keeping the bug open, just changing the subject and providing more details on the real problem here.
The log shows that /etc/hosts is not passed to the ceph containers but is passed to the shell container, resulting in unexpected behavior when checking SSH connections.
Two problems here:
1. The error from 'ceph orch host add' should have made it clearer that the hostname could not be resolved.
2. The troubleshooting steps printed by the failed 'ceph orch host add' suggested the connection should have worked (this is because the shell container has /etc/hosts).
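Problem 1 amounts to surfacing the resolution failure directly instead of generic SSH advice. A hedged sketch of the kind of check that could run before attempting the connection (the names HostAddError and check_host_resolves are mine for illustration, not cephadm's):

```python
import socket

class HostAddError(Exception):
    """Raised when a host cannot be added for a clearly diagnosable reason."""

def check_host_resolves(hostname: str, addr: str) -> None:
    """Fail fast with an explicit message when the target cannot be resolved.

    Running this before opening the SSH connection would turn the opaque
    EINVAL seen in this bug into an actionable error.
    """
    try:
        socket.getaddrinfo(addr, None)
    except socket.gaierror as e:
        raise HostAddError(
            f"Cannot resolve {addr!r} for host {hostname!r}: {e}. "
            f"Use an IP for addr:, or make the name resolvable via DNS "
            f"(the mgr container does not see the host's /etc/hosts).")

# Demonstration with a name that cannot resolve (.invalid never resolves)
try:
    check_host_resolves("oc0-ceph-3", "oc0-ceph-3.invalid")
except HostAddError as e:
    print(e)
```

The actual fix was done differently in the PR referenced below; this only illustrates the error-reporting gap.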
Updated by Daniel Pivonka about 3 years ago
- Status changed from New to Fix Under Review
- Backport set to pacific
- Pull request ID set to 40924
Updated by Sage Weil almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Sebastian Wagner almost 3 years ago
- Related to Bug #49654: iSCSI stops working after Upgrade 15.2.4 -> 15.2.9 added
Updated by Sebastian Wagner over 2 years ago
- Status changed from Pending Backport to Resolved