Bug #50306
/etc/hosts is not passed to ceph containers. Clusters that were relying on /etc/hosts for name resolution will have strange behavior
Status: closed
Description
While using `cephadm bootstrap --apply-spec` to bootstrap a spec containing other hosts, cephadm attempts to set up SSH keys for root on those other hosts even though I passed the following options, which refer to an account and SSH keypair that are already working.
--ssh-private-key /home/ceph-admin/.ssh/id_rsa \
--ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
--ssh-user ceph-admin \
I understand that cephadm defaults to root for the SSH user and to root's SSH keys when sudo is used, but cephadm should probably use the account and keys from the above parameters to override those defaults when they are passed.
I hit this issue even with the fix for bug #50041 installed, as reported in Bug #49277.
[ceph-admin@oc0-ceph-2 ~]$ sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 --apply-spec /home/ceph-admin/specs/ceph_spec.yaml --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 192.168.24.22
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135
Verifying IP 192.168.24.22 port 3300 ...
Verifying IP 192.168.24.22 port 6789 ...
Mon IP 192.168.24.22 is in CIDR network 192.168.24.0/24
- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64...
Ceph version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.24.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding host oc0-ceph-2...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
Applying /home/ceph-admin/specs/ceph_spec.yaml to cluster
Adding ssh key to oc0-ceph-3
Adding ssh key to oc0-ceph-4
Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=oc0-ceph-2 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135:/var/log/ceph:z -v /tmp/ceph-tmpbfayxvwp:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjm_n51dx:/etc/ceph/ceph.conf:z -v /home/ceph-admin/specs/ceph_spec.yaml:/tmp/spec.yml:z quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 orch apply -i /tmp/spec.yml
/usr/bin/ceph: stderr Error EINVAL: Failed to connect to oc0-ceph-3 (oc0-ceph-3).
/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key
/usr/bin/ceph: stderr
/usr/bin/ceph: stderr To add the cephadm SSH key to the host:
/usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub
/usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@oc0-ceph-3
/usr/bin/ceph: stderr
/usr/bin/ceph: stderr To check that the host is reachable:
/usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config
/usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
/usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key
/usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@oc0-ceph-3
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 7924, in <module>
    main()
  File "/usr/sbin/cephadm", line 7912, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 1717, in _default_image
    return func(ctx)
  File "/usr/sbin/cephadm", line 4037, in command_bootstrap
    out = cli(['orch', 'apply', '-i', '/tmp/spec.yml'], extra_mounts=mounts)
  File "/usr/sbin/cephadm", line 3931, in cli
    ).run(timeout=timeout)
  File "/usr/sbin/cephadm", line 3174, in run
    desc=self.entrypoint, timeout=timeout)
  File "/usr/sbin/cephadm", line 1411, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=oc0-ceph-2 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135:/var/log/ceph:z -v /tmp/ceph-tmpbfayxvwp:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjm_n51dx:/etc/ceph/ceph.conf:z -v /home/ceph-admin/specs/ceph_spec.yaml:/tmp/spec.yml:z quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 orch apply -i /tmp/spec.yml
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ cat /home/ceph-admin/specs/ceph_spec.yaml
---
service_type: host
addr: oc0-ceph-3
hostname: oc0-ceph-3
---
service_type: host
addr: oc0-ceph-4
hostname: oc0-ceph-4
---
service_type: mon
placement:
  hosts:
    - oc0-ceph-2
    - oc0-ceph-3
    - oc0-ceph-4
---
service_type: osd
service_id: default_drive_group
placement:
  hosts:
    - oc0-ceph-2
    - oc0-ceph-3
    - oc0-ceph-4
data_devices:
  all: true
[ceph-admin@oc0-ceph-2 ~]$
Updated by Daniel Pivonka about 3 years ago
I was able to determine that this was caused by the hostname failing to resolve when trying to add hosts.
debug 2021-04-12T21:55:47.520+0000 7fc64b3d8700 0 log_channel(audit) log [DBG] : from='client.14196 -' entity='client.admin' cmd=[{"prefix": "orch host add", "hostname": "oc0-ceph-3", "target": ["mon-mgr", ""]}]: dispatch
ssh: Could not resolve hostname oc0-ceph-3: Name or service not known
debug 2021-04-12T21:55:47.537+0000 7fc657cc2700 -1 mgr.server reply reply (22) Invalid argument
-F /tmp/cephadm-conf-o3k_xelh -i /tmp/cephadm-identity-h6ki3b4e ceph-admin@oc0-ceph-3
I modified _remote_connection in serve.py to get the above to print.
All the keys were copied correctly; the problem here is not about using a combination of the --apply-spec, --ssh-private-key, --ssh-public-key, and --ssh-user flags.
This change, https://github.com/ceph/ceph/pull/40223, removed /etc/hosts from the mgr container.
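Since that PR starts the mgr container with podman's --no-hosts flag, any name that exists only in the host's /etc/hosts will no longer resolve inside it. As a minimal sketch (a hypothetical helper of mine, not part of cephadm), a pre-flight check run where the mgr will run can flag hostnames that depend on local resolution:

```python
import socket

def resolvable(name: str) -> bool:
    """Return True if `name` resolves via the system resolver.

    Inside a container started with --no-hosts, names that exist only
    in the host's /etc/hosts will NOT resolve, so this check must run
    in the same resolution context as the mgr.
    """
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

# Illustrative use with the hostnames from the spec above
for host in ["oc0-ceph-3", "oc0-ceph-4"]:
    if not resolvable(host):
        print(f"{host}: may not resolve inside the mgr container; "
              f"consider using an IP in the spec's addr: field")
```

This is only a diagnostic aid; the actual fix tracked by this bug is in cephadm itself.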
Updated by John Fulton about 3 years ago
If I use a spec with IPs then I can add my hosts after bootstrap [1] but not at bootstrap [2].
[1]
[ceph-admin@oc0-ceph-2 ~]$ cat specs/ceph_spec.yaml
---
service_type: host
addr: 192.168.24.14
hostname: oc0-ceph-3
---
service_type: host
addr: 192.168.24.10
hostname: oc0-ceph-4
---
service_type: mon
placement:
  hosts:
    - oc0-ceph-2
    - oc0-ceph-3
    - oc0-ceph-4
---
service_type: osd
service_id: default_drive_group
placement:
  hosts:
    - oc0-ceph-2
    - oc0-ceph-3
    - oc0-ceph-4
data_devices:
  all: true
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ sudo cephadm ls
[]
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 192.168.24.18
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135
Verifying IP 192.168.24.18 port 3300 ...
Verifying IP 192.168.24.18 port 6789 ...
Mon IP 192.168.24.18 is in CIDR network 192.168.24.0/24
- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64...
Ceph version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.24.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding host oc0-ceph-2...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
You can access the Ceph CLI with:
	sudo /usr/sbin/cephadm shell --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring
Please consider enabling telemetry to help improve Ceph:
	ceph telemetry on
For more information see:
	https://docs.ceph.com/docs/pacific/mgr/telemetry/
Bootstrap complete.
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ sudo podman images
REPOSITORY                   TAG                                        IMAGE ID      CREATED      SIZE
quay.ceph.io/ceph-ci/daemon  v6.0.0-stable-6.0-pacific-centos-8-x86_64  14fee0875498  11 days ago  1.17 GB
[ceph-admin@oc0-ceph-2 ~]$
[ceph-admin@oc0-ceph-2 ~]$ sudo podman run --rm --volume /etc/ceph:/etc/ceph:z --volume /home/ceph-admin/specs:/home/specs --entrypoint ceph 14fee0875498 orch apply -i /home/specs/ceph_spec.yaml
Added host 'oc0-ceph-3'
Added host 'oc0-ceph-4'
Scheduled mon update...
Scheduled osd.default_drive_group update...
[ceph-admin@oc0-ceph-2 ~]$
[2]
[ceph-admin@oc0-ceph-2 ~]$ sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 192.168.24.18 --apply-spec /home/ceph-admin/specs/ceph_spec.yaml
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135
Verifying IP 192.168.24.18 port 3300 ...
Verifying IP 192.168.24.18 port 6789 ...
Mon IP 192.168.24.18 is in CIDR network 192.168.24.0/24
- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64...
Ceph version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.24.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding host oc0-ceph-2...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
Applying /home/ceph-admin/specs/ceph_spec.yaml to cluster
Adding ssh key to oc0-ceph-3
Non-zero exit code 1 from ssh-copy-id -f -i /home/ceph-admin/.ssh/id_rsa.pub ceph-admin@oc0-ceph-3
ssh-copy-id: stderr /bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ceph-admin/.ssh/id_rsa.pub"
ssh-copy-id: stderr ceph-admin@oc0-ceph-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 7924, in <module>
    main()
  File "/usr/sbin/cephadm", line 7912, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 1717, in _default_image
    return func(ctx)
  File "/usr/sbin/cephadm", line 4032, in command_bootstrap
    out, err, code = call_throws(ctx, ['ssh-copy-id', '-f', '-i', ssh_key, '%s@%s' % (ctx.ssh_user, split[1])])
  File "/usr/sbin/cephadm", line 1411, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: ssh-copy-id -f -i /home/ceph-admin/.ssh/id_rsa.pub ceph-admin@oc0-ceph-3
[ceph-admin@oc0-ceph-2 ~]$
Updated by John Fulton about 3 years ago
Wait, I think I can't apply it at bootstrap because I am currently missing the fix for bug #50041 (I had rolled it back while testing). I will retest with that patch and then update the bug.
out, err, code = call_throws(ctx, ['ssh-copy-id', '-f', '-i', ssh_key, '%s@%s' % (ctx.ssh_user, split[1])])
Updated by John Fulton about 3 years ago
I confirm I could apply a spec on bootstrap. Thanks!
Conclusions:
- Ensure you have the fix for bug #50041
- Do not rely on the /etc/hosts file of the container host. Instead, set addr: to an actual IP in the host service entry
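The second conclusion can be checked mechanically before bootstrapping. A rough sketch (my own helper, not part of cephadm) that scans a spec for addr: values that are hostnames rather than IP literals, and which would therefore rely on name resolution inside the mgr container:

```python
import ipaddress

def non_ip_addrs(spec_text: str) -> list:
    """Return the addr: values in a cephadm spec that are not IP literals."""
    flagged = []
    for line in spec_text.splitlines():
        line = line.strip()
        if line.startswith("addr:"):
            value = line.split(":", 1)[1].strip()
            try:
                # ip_address() accepts IPv4/IPv6 literals and rejects names
                ipaddress.ip_address(value)
            except ValueError:
                flagged.append(value)
    return flagged

spec = """\
---
service_type: host
addr: 192.168.24.14
hostname: oc0-ceph-3
---
service_type: host
addr: oc0-ceph-4
hostname: oc0-ceph-4
"""
print(non_ip_addrs(spec))  # only the hostname-based addr is flagged
```

This deliberately uses naive line matching instead of a YAML parser to stay dependency-free; a real check would parse the multi-document spec properly.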
Updated by John Fulton about 3 years ago
FWIW I see nothing wrong with closing this bug as invalid.
Unless you want to follow up on https://github.com/ceph/ceph/pull/40223 to support /etc/hosts in the mgr container, users should probably just ensure they pass IPs in their spec file.
Updated by Daniel Pivonka about 3 years ago
- File gt7DGcXc.txt gt7DGcXc.txt added
- Subject changed from cephadm bootstrap --apply-spec uses root's ssh keys even if --ssh-{user,private-key,public-key} are passed to /etc/hosts is not passed to ceph containers. clusters that were relying on /etc/hosts for name resolution will have strange behavior
@john I'm keeping the bug open, just changing the subject and providing more details on the real problem here.
The log shows that /etc/hosts is not passed to the ceph containers but is passed to the shell container, resulting in unexpected behavior when checking SSH connections.
Two problems here:
1. The error from 'ceph orch host add' should have made it clearer that the hostname could not be resolved.
2. The troubleshooting steps printed by the failed 'ceph orch host add' suggested the connection should have worked (this is because the shell container has /etc/hosts).
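Problem 1 amounts to surfacing the resolution failure directly instead of generic SSH advice. A hedged sketch of the kind of check that could run before attempting the connection (the names HostAddError and check_host_resolves are mine for illustration, not cephadm's):

```python
import socket

class HostAddError(Exception):
    """Raised when a host cannot be added for a clearly diagnosable reason."""

def check_host_resolves(hostname: str, addr: str) -> None:
    """Fail fast with an explicit message when the target cannot be resolved.

    Running this before opening the SSH connection would turn the opaque
    EINVAL seen in this bug into an actionable error.
    """
    try:
        socket.getaddrinfo(addr, None)
    except socket.gaierror as e:
        raise HostAddError(
            f"Cannot resolve {addr!r} for host {hostname!r}: {e}. "
            f"Use an IP for addr:, or make the name resolvable via DNS "
            f"(the mgr container does not see the host's /etc/hosts).")

# Demonstration with a name that cannot resolve (.invalid never resolves)
try:
    check_host_resolves("oc0-ceph-3", "oc0-ceph-3.invalid")
except HostAddError as e:
    print(e)
```

The actual fix was done differently in the PR referenced below; this only illustrates the error-reporting gap.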
Updated by Daniel Pivonka about 3 years ago
- Status changed from New to Fix Under Review
- Backport set to pacific
- Pull request ID set to 40924
Updated by Sage Weil almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Sebastian Wagner almost 3 years ago
- Related to Bug #49654: iSCSI stops working after Upgrade 15.2.4 -> 15.2.9 added
Updated by Sebastian Wagner over 2 years ago
- Status changed from Pending Backport to Resolved