Bug #49277


cephadm bootstrap --apply-spec <cluster.yaml> hangs

Added by John Fulton about 3 years ago. Updated about 3 years ago.

Status:
Duplicate
Priority:
High
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The feature introduced by https://tracker.ceph.com/issues/44873 seems to have the following flaw.

If I bootstrap a cluster on node oc0-ceph-0 with --apply-spec, then the bootstrap proceeds but the spec [1] is never applied and the cephadm log shows it waiting to acquire a lock.
If I bootstrap a cluster on node oc0-ceph-0 without --apply-spec and then apply the same spec file [1] a few seconds later, then the spec is applied flawlessly.

I have Ansible tasks [2][3] I can use to easily reproduce this and ensure a consistent test. I used cephadm-15.2.5-0.el8.x86_64.rpm with the latest "docker.io/ceph/ceph:v15" as of Jan 10, 2020.

The command run by ansible is:

/usr/sbin/cephadm bootstrap --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid 77642368-c850-5eb9-ba49-e59024b4d0ab --mon-ip 192.168.24.6

FWIW: This is not a major problem for TripleO's cephadm integration because we can bootstrap a single node and apply the spec afterwards.
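
For reference, the two-step approach looks roughly like this (a sketch only, assuming the ceph CLI can reach the new cluster through the keyring and config that bootstrap writes to /etc/ceph, e.g. from cephadm shell):

sudo cephadm bootstrap \
  --ssh-private-key /home/ceph-admin/.ssh/id_rsa \
  --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
  --ssh-user ceph-admin \
  --output-keyring /etc/ceph/ceph.client.admin.keyring \
  --output-config /etc/ceph/ceph.conf \
  --fsid 77642368-c850-5eb9-ba49-e59024b4d0ab \
  --mon-ip 192.168.24.6    # same command as above, just without --apply-spec
# a few seconds later, apply the same spec [1] separately:
ceph orch apply -i ceph_spec.yml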

[1] ceph_spec.yml
---

service_type: host
addr: oc0-ceph-1
hostname: oc0-ceph-1
---
service_type: host
addr: oc0-ceph-2
hostname: oc0-ceph-2
---
service_type: mon
placement:
  hosts:
    - oc0-ceph-0
    - oc0-ceph-1
    - oc0-ceph-2
---
service_type: osd
service_id: default_drive_group
placement:
  hosts:
    - oc0-ceph-0
    - oc0-ceph-1
    - oc0-ceph-2
data_devices:
  all: true

[2] https://review.opendev.org/c/openstack/tripleo-ansible/+/770674/54/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml
[3] https://review.opendev.org/c/openstack/tripleo-ansible/+/770674/54/tripleo_ansible/roles/tripleo_cephadm/tasks/apply_spec.yaml


Files

cephadm.log.1.gz (47.3 KB) - Log from cephadm initial bootstrap and wait for lock - John Fulton, 02/12/2021 03:44 PM
ceph_spec.yml (367 Bytes) - In the most recent reproduction the spec file's hostnames were aligned correctly - John Fulton, 02/12/2021 03:56 PM

Related issues 2 (0 open, 2 closed)

Related to Orchestrator - Feature #44873: cephadm bootstrap: add --apply-spec <cluster.yaml> (Resolved, Daniel Pivonka)

Related to Orchestrator - Bug #50041: cephadm bootstrap with apply-spec and ssh-user option failed while adding the hosts (Resolved, Daniel Pivonka)

Actions #1

Updated by John Fulton about 3 years ago

Reproduced just now and attaching logs and versions.

using latest cephadm from https://download.ceph.com/rpm-octopus/el8/

[root@oc0-ceph-2 ceph]# rpm -q cephadm
cephadm-15.2.8-0.el8.x86_64
[root@oc0-ceph-2 ceph]#

exact container tag below.

[root@oc0-ceph-2 ceph]# cephadm ls

[
    {
        "style": "cephadm:v1",
        "name": "mon.oc0-ceph-2",
        "fsid": "ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135",
        "systemd_unit": "ceph-ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135@mon.oc0-ceph-2",
        "enabled": true,
        "state": "running",
        "container_id": "3d5694664c54d4d2c9cb0e7f143ebacbeb3541edd09b6caa0686cf06fa452318",
        "container_image_name": "undercloud.ctlplane.mydomain.tld:8787/ceph-ci/daemon:v5.0.7-stable-5.0-octopus-centos-8-x86_64",
        "container_image_id": "9dd970f9358d366d78edc20835503ce0d4bb2dcb735651c7589a4cea12c47ffd",
        "version": "15.2.8",
        "started": "2021-02-12T15:38:19.278934",
        "created": "2021-02-12T15:38:16.521453",
        "deployed": "2021-02-12T15:38:15.432359",
        "configured": "2021-02-12T15:38:16.521453" 
    },
    {
        "style": "cephadm:v1",
        "name": "mgr.oc0-ceph-2.vlzoat",
        "fsid": "ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135",
        "systemd_unit": "ceph-ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135@mgr.oc0-ceph-2.vlzoat",
        "enabled": true,
        "state": "running",
        "container_id": "c20dc8fca9310d8f4982c8aa50d1f8b3a68c518708149dbd6c520d5fc253d3a2",
        "container_image_name": "undercloud.ctlplane.mydomain.tld:8787/ceph-ci/daemon:v5.0.7-stable-5.0-octopus-centos-8-x86_64",
        "container_image_id": "9dd970f9358d366d78edc20835503ce0d4bb2dcb735651c7589a4cea12c47ffd",
        "version": "15.2.8",
        "started": "2021-02-12T15:38:20.644009",
        "created": "2021-02-12T15:38:20.954838",
        "deployed": "2021-02-12T15:38:20.101764",
        "configured": "2021-02-12T15:38:20.954838" 
    }
]

[root@oc0-ceph-2 ceph]#

Actions #3

Updated by Sebastian Wagner about 3 years ago

  • Project changed from Ceph to Orchestrator
Actions #4

Updated by Sebastian Wagner about 3 years ago

  • Description updated (diff)
  • Priority changed from Normal to High
  • Tags deleted (cephadm)
Actions #5

Updated by Sebastian Wagner about 3 years ago

2021-02-12 15:38:41,511 INFO Applying /home/ceph-admin/specs/ceph_spec.yaml to cluster
2021-02-12 15:38:41,511 INFO Adding ssh key to oc0-ceph-3
2021-02-12 15:38:41,511 DEBUG Running command: ssh-copy-id -f -i /home/ceph-admin/.ssh/id_rsa.pub ceph-admin@oc0-ceph-3
2021-02-12 15:38:41,519 DEBUG ssh-copy-id:stderr /bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ceph-admin/.ssh/id_rsa.pub" 
2021-02-12 15:38:41,531 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,532 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,582 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,582 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,632 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,633 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,683 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,683 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,733 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,733 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,783 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,783 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
2021-02-12 15:38:41,834 DEBUG Acquiring lock 139806070650584 on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
2021-02-12 15:38:41,834 DEBUG Lock 139806070650584 not acquired on /run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock, waiting 0.05 seconds ...
... snip ... 

looks like we're doing too much synchronously here.
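
For context, the lock being waited on is the per-fsid file lock under /run/cephadm/ shown in the log. A rough shell analogue of the retry loop (illustrative only; cephadm implements this as a file lock in Python, not with flock):

lock=/run/cephadm/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135.lock
exec 9>"$lock"            # open the lock file on fd 9 (needs root to write under /run/cephadm)
until flock -n 9; do      # try a non-blocking lock
    echo "Lock not acquired on $lock, waiting 0.05 seconds ..."
    sleep 0.05            # retry every 0.05 seconds, the cadence seen in the log
done
# ... cephadm work that requires the lock ...

As long as whichever cephadm process holds the lock never releases it, a loop like this spins forever, which matches the hang reported here.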

Actions #6

Updated by Sebastian Wagner about 3 years ago

  • Related to Feature #44873: cephadm bootstrap: add --apply-spec <cluster.yaml> added
Actions #7

Updated by Sebastian Wagner about 3 years ago

  • Assignee set to Daniel Pivonka
Actions #8

Updated by Daniel Pivonka about 3 years ago

  • Status changed from New to Can't reproduce

Tested on 15.2.8 and 15.2.9; can't reproduce.

Actions #9

Updated by John Fulton about 3 years ago

I reproduced this with 15.2.10. I'll follow up on IRC to offer the assignee live access to the reproducer.

[root@oc0-ceph-2 ~]# cephadm ls
[
    {
        "style": "cephadm:v1",
        "name": "mon.oc0-ceph-2",
        "fsid": "ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135",
        "systemd_unit": "ceph-ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135@mon.oc0-ceph-2",
        "enabled": true,
        "state": "running",
        "container_id": "fd1caf3c583e5767dff00b9319d053c150f92ed572ac1f7ea3b23b1448784f18",
        "container_image_name": "quay.ceph.io/ceph-ci/daemon:v5.0.9-stable-5.0-octopus-centos-8-x86_64",
        "container_image_id": "1f47579e3c610d981f4dbe27f372fa7cca94a16620152ddad4bd20285c8c95e7",
        "version": "15.2.10",
        "started": "2021-04-08T19:17:26.194598Z",
        "created": "2021-04-08T19:17:23.868279Z",
        "deployed": "2021-04-08T19:17:22.814175Z",
        "configured": "2021-04-08T19:17:23.868279Z" 
    },
    {
        "style": "cephadm:v1",
        "name": "mgr.oc0-ceph-2.mgogfr",
        "fsid": "ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135",
        "systemd_unit": "ceph-ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135@mgr.oc0-ceph-2.mgogfr",
        "enabled": true,
        "state": "running",
        "container_id": "6c597539ccfe12a7324c9eb4f1e0386f69865db392fefbec81dc1d99662d846e",
        "container_image_name": "quay.ceph.io/ceph-ci/daemon:v5.0.9-stable-5.0-octopus-centos-8-x86_64",
        "container_image_id": "1f47579e3c610d981f4dbe27f372fa7cca94a16620152ddad4bd20285c8c95e7",
        "version": "15.2.10",
        "started": "2021-04-08T19:17:27.372219Z",
        "created": "2021-04-08T19:17:27.536640Z",
        "deployed": "2021-04-08T19:17:26.869575Z",
        "configured": "2021-04-08T19:17:27.536640Z" 
    }
]
[root@oc0-ceph-2 ~]# rpm -q cephadm
cephadm-15.2.10-4.el8.noarch
[root@oc0-ceph-2 ~]# cat /etc/redhat-release 
CentOS Stream release 8
[root@oc0-ceph-2 ~]# uname -a
Linux oc0-ceph-2 4.18.0-294.el8.x86_64 #1 SMP Mon Mar 15 22:38:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@oc0-ceph-2 ~]# podman --version
podman version 3.0.2-dev
[root@oc0-ceph-2 ~]# 

Actions #10

Updated by John Fulton about 3 years ago

We think this is a duplicate of https://tracker.ceph.com/issues/50041 and is fixed by https://github.com/ceph/ceph/pull/40477/files.

I reproduced the problem by running this:

sudo /usr/sbin/cephadm bootstrap \
  --ssh-private-key /home/ceph-admin/.ssh/id_rsa \
  --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
  --ssh-user ceph-admin \
  --apply-spec /home/ceph-admin/specs/ceph_spec.yaml \
  ...

and watching root fail to SSH into the hosts in the spec. I could run "ssh ceph-admin@oc0-ceph-4" on my system, but "sudo ssh ceph-admin@oc0-ceph-4" failed. ceph-admin's SSH keys were set up correctly, but because of sudo, root was the user making the SSH connection, and root's keys had not been distributed. It didn't reproduce on dpivonka's system because root's SSH keys were already distributed there. Applying the spec outside of bootstrap does not try to copy the keys for you, which is why I didn't hit the issue when I applied the spec after bootstrap.
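
The failure mode is easy to check by hand (these commands are mine, not from the playbooks): since bootstrap runs under sudo, it is root's SSH context, not ceph-admin's, that has to reach the hosts in the spec.

ssh ceph-admin@oc0-ceph-4 true        # succeeds: ceph-admin's public key is authorized on the host
sudo ssh ceph-admin@oc0-ceph-4 true   # fails: root offers its own keys, which were never distributed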

Next steps:
1. confirm I cannot reproduce 49277 with pacific
2. close 49277 as a duplicate of 50041

Actions #11

Updated by Sebastian Wagner about 3 years ago

  • Related to Bug #50041: cephadm bootstrap with apply-spec and ssh-user option failed while adding the hosts added
Actions #12

Updated by John Fulton about 3 years ago

When using Pacific with the fixing patch https://github.com/ceph/ceph/pull/40477 from Bug #50041, the deployment fails more gracefully (it no longer hangs waiting for me to accept the SSH host key), but it still fails [1]. The underlying root cause is similar: ceph-admin's SSH keys were set up correctly, but because of sudo, root was the user making the SSH connection, and root's keys were not distributed. Thus, it tried to distribute root's keys as root and failed.

[ceph-admin@oc0-ceph-2 ~]$ sudo ssh oc0-ceph-3
The authenticity of host 'oc0-ceph-3 (192.168.24.14)' can't be established.
ECDSA key fingerprint is SHA256:VDnoF5dEU7gmwmT9RV6eJg/I1Hby0GhRN7VFCs6fZ0Q.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'oc0-ceph-3,192.168.24.14' (ECDSA) to the list of known hosts.
root@oc0-ceph-3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
[ceph-admin@oc0-ceph-2 ~]$ 

I understand cephadm defaulting to root for the SSH user and root's SSH keys, but maybe cephadm needs to look at these parameters:

  --ssh-private-key /home/ceph-admin/.ssh/id_rsa \
  --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub \
  --ssh-user ceph-admin \

and, if they are set, use them for the SSH connection instead, i.e. let the provided values override the default?

[1]

[ceph-admin@oc0-ceph-2 ~]$ sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135 --apply-spec /home/ceph-admin/specs/ceph_spec.yaml --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 192.168.24.22                                            
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135
Verifying IP 192.168.24.22 port 3300 ...
Verifying IP 192.168.24.22 port 6789 ...
Mon IP 192.168.24.22 is in CIDR network 192.168.24.0/24
- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64...
Ceph version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 192.168.24.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding host oc0-ceph-2...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
Applying /home/ceph-admin/specs/ceph_spec.yaml to cluster
Adding ssh key to oc0-ceph-3
Adding ssh key to oc0-ceph-4
Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=oc0-ceph-2 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135:/var/log/ceph:z -v /tmp/ceph-tmpbfayxvwp:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjm_n51dx:/etc/ceph/ceph.conf:z -v /home/ceph-admin/specs/ceph_spec.yaml:/tmp/spec.yml:z quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 orch apply -i /tmp/spec.yml                                            
/usr/bin/ceph: stderr Error EINVAL: Failed to connect to oc0-ceph-3 (oc0-ceph-3).
/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key
/usr/bin/ceph: stderr
/usr/bin/ceph: stderr To add the cephadm SSH key to the host:
/usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub
/usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@oc0-ceph-3
/usr/bin/ceph: stderr
/usr/bin/ceph: stderr To check that the host is reachable:
/usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config
/usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
/usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key
/usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@oc0-ceph-3
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 7924, in <module>
    main()
  File "/usr/sbin/cephadm", line 7912, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 1717, in _default_image
    return func(ctx)
  File "/usr/sbin/cephadm", line 4037, in command_bootstrap
    out = cli(['orch', 'apply', '-i', '/tmp/spec.yml'], extra_mounts=mounts)
  File "/usr/sbin/cephadm", line 3931, in cli
    ).run(timeout=timeout)
  File "/usr/sbin/cephadm", line 3174, in run
    desc=self.entrypoint, timeout=timeout)
  File "/usr/sbin/cephadm", line 1411, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=oc0-ceph-2 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/ca9bf37b-ed0f-4e5a-bb21-e5b5f9b75135:/var/log/ceph:z -v /tmp/ceph-tmpbfayxvwp:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjm_n51dx:/etc/ceph/ceph.conf:z -v /home/ceph-admin/specs/ceph_spec.yaml:/tmp/spec.yml:z quay.ceph.io/ceph-ci/daemon:v6.0.0-stable-6.0-pacific-centos-8-x86_64 orch apply -i /tmp/spec.yml                                         
[ceph-admin@oc0-ceph-2 ~]$ 

Actions #13

Updated by Daniel Pivonka about 3 years ago

  • Status changed from Can't reproduce to Duplicate
Actions #14

Updated by John Fulton about 3 years ago

Because I'm having the same issue in Pacific, let's use a new bug: https://tracker.ceph.com/issues/50306
