Bug #51258


cephadm bootstrap: applying host specs suddenly removes the admin keyring from bootstrap host

Added by Francesco Pantano almost 3 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
cephadm
Target version:
-
% Done:

0%

Source:
Tags:
ux
Backport:
Regression:
Yes
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There's a job in OpenStack which is able to test the latest pacific bits for both ceph containers and cephadm.
Using [1] and [2], the job, which is supposed to deploy a new cluster, fails with [3]:

```
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
```

```
[root@standalone ~]# ls /etc/ceph/
[root@standalone ~]#
```


The problem I found is that /etc/ceph is empty, so that kind of failure is expected.
However, the bootstrap command [4] generates both the conf and the keyrings, and I can see them being created, but after some time they're gone.

In addition, you won't be able to interact with the Ceph cluster, and `cephadm shell` returns something like:

```

[vagrant@standalone ~]$ sudo cephadm shell
Inferring fsid 4b5c8c0a-ff60-454b-a1b4-9747aa737d19
Inferring config /var/lib/ceph/4b5c8c0a-ff60-454b-a1b4-9747aa737d19/mon.standalone.localdomain/config
Using recent ceph image quay.ceph.io/ceph-ci/daemon@sha256:ec271e81d73b6687ad2e097e6b8784066a8f092f2ce4c2cbc2ec2095ff0d8d27
[ceph: root@standalone /]# ceph -s
2021-06-17T09:54:00.382+0000 7f59def71700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-06-17T09:54:00.382+0000 7f59def71700 -1 AuthRegistry(0x7f59d805ed00) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
2021-06-17T09:54:00.382+0000 7f59def71700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-06-17T09:54:00.382+0000 7f59def71700 -1 AuthRegistry(0x7f59def6fea0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
2021-06-17T09:54:00.383+0000 7f59dcd0d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2021-06-17T09:54:00.383+0000 7f59def71700 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
[errno 13] RADOS permission denied (error connecting to the cluster)

```
```
[ceph: root@standalone /]# ls /etc/ceph/
ceph.conf  rbdmap
```

where ceph.conf is:
```

# minimal ceph.conf for 4b5c8c0a-ff60-454b-a1b4-9747aa737d19
[global]
        fsid = 4b5c8c0a-ff60-454b-a1b4-9747aa737d19
        mon_host = [v2:192.168.24.1:3300/0,v1:192.168.24.1:6789/0]
[mon.standalone.localdomain]
public network = 192.168.24.0/24

```
This means you have the ceph config file, because it's taken from /var/lib/ceph/4b5c8c0a-ff60-454b-a1b4-9747aa737d19/mon.standalone.localdomain/config,
but there's no keyring.

Note that the bootstrap command (which happens before applying any OSD) works properly: at that point you can
interact with the cluster using `cephadm shell` or any other client, and /etc/ceph contains both ceph.conf and
the keyring. Something happens, however, when the OSDs are applied using a spec like the following (see the sketch after the spec for how it is typically applied):

```
---

addr: 192.168.24.1
hostname: standalone.localdomain
labels:
- osd
- mgr
- mon
service_type: host
---
placement:
  hosts:
  - standalone.localdomain
service_id: mon
service_name: mon
service_type: mon
---
placement:
  hosts:
  - standalone.localdomain
service_id: mgr
service_name: mgr
service_type: mgr
---
data_devices:
  paths:
  - /dev/ceph_vg/ceph_lv_data
placement:
  hosts:
  - standalone.localdomain
service_id: default_drive_group
service_name: osd.default_drive_group
service_type: osd

```
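For context, a spec like the one above is typically handed to cephadm either at bootstrap time or through the orchestrator afterwards. Below is a minimal sketch of both paths; the file name `ceph_spec.yaml` is hypothetical and the exact invocation used by [4] may differ:

```
# Spec applied as part of bootstrap (processed right after the cluster is created);
# ceph_spec.yaml is a hypothetical file name holding the spec above:
cephadm bootstrap --mon-ip 192.168.24.1 --apply-spec ceph_spec.yaml

# Or applied against an already bootstrapped cluster, from a host with admin access:
ceph orch apply -i ceph_spec.yaml
```

Either way, the host entries in the spec re-add the hosts to cephadm, which is the step that (at the time of this report) dropped the _admin label set at bootstrap and, with it, the admin keyring.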

[1] container: quay.ceph.io/ceph-ci/daemon:latest-pacific-devel
[2] cephadm version(s):
a. https://cbs.centos.org/koji/buildinfo?buildID=33232
b. https://cbs.centos.org/koji/buildinfo?buildID=33140
[3] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_50e/778915/40/check/tripleo-ci-centos-8-scenario004-standalone/50ea241/logs/undercloud/home/zuul/standalone-ansible-d4qzhga2/cephadm/cephadm_command.log
[4] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml#L57-L58


Related issues 1 (0 open, 1 closed)

Related to Orchestrator - Bug #51277: cephadm bootstrap: unable to set up admin label (Can't reproduce)

Actions #1

Updated by Francesco Pantano almost 3 years ago

When that spec is applied I see (cephadm in debug mode):

```
2021-06-17T10:46:46.346580+0000 mgr.standalone.localdomain.uruacd [INF] Saving service mgr spec with placement standalone.localdomain
2021-06-17T10:46:46.346887+0000 mgr.standalone.localdomain.uruacd [DBG] mon_command: 'auth get' 0 in 0.003s
2021-06-17T10:46:46.347159+0000 mgr.standalone.localdomain.uruacd [WRN] unable to calc client keyring client.admin placement PlacementSpec(label='_admin'): Cannot place <ServiceSpec for service_name=mon>: No matching hosts for label _admin
2021-06-17T10:46:46.347592+0000 mgr.standalone.localdomain.uruacd [INF] Removing standalone.localdomain:/etc/ceph/ceph.conf
2021-06-17T10:46:46.347721+0000 mgr.standalone.localdomain.uruacd [DBG] Have connection to 192.168.24.1
2021-06-17T10:46:46.352670+0000 mgr.standalone.localdomain.uruacd [DBG] _kick_serve_loop
2021-06-17T10:46:46.352803+0000 mgr.standalone.localdomain.uruacd [INF] Marking host: standalone.localdomain for OSDSpec preview refresh.
2021-06-17T10:46:46.353114+0000 mgr.standalone.localdomain.uruacd [INF] Saving service osd.default_drive_group spec with placement standalone.localdomain
2021-06-17T10:46:46.354999+0000 mgr.standalone.localdomain.uruacd [INF] Removing standalone.localdomain:/etc/ceph/ceph.client.admin.keyring
2021-06-17T10:46:46.355105+0000 mgr.standalone.localdomain.uruacd [DBG] Have connection to 192.168.24.1
2021-06-17T10:46:46.360030+0000 mgr.standalone.localdomain.uruacd [DBG] _kick_serve_loop
2021-06-17T10:46:46.371412+0000 mgr.standalone.localdomain.uruacd [DBG] _check_for_strays
2021-06-17T10:46:46.371559+0000 mgr.standalone.localdomain.uruacd [DBG] 0 OSDs are scheduled for removal: []
2021-06-17T10:46:46.371625+0000 mgr.standalone.localdomain.uruacd [DBG] Saving [] to store
2021-06-17T10:46:46.377148+0000 mgr.standalone.localdomain.uruacd [DBG] Applying service mon spec
2021-06-17T10:46:46.377326+0000 mgr.standalone.localdomain.uruacd [DBG] mon public_network(s) is ['192.168.24.0/24']
2021-06-17T10:46:46.377583+0000 mgr.standalone.localdomain.uruacd [DBG] Provided hosts: [DaemonPlacement(daemon_type='mon', hostname='standalone.localdomain', network='', name='', ip=None, ports=[], rank=None, rank_generation=None)]
2021-06-17T10:46:46.377639+0000 mgr.standalone.localdomain.uruacd [DBG] Add [], remove []
2021-06-17T10:46:46.377763+0000 mgr.standalone.localdomain.uruacd [DBG] Hosts that will receive new daemons: []
2021-06-17T10:46:46.377857+0000 mgr.standalone.localdomain.uruacd [DBG] Daemons that will be removed: []
2021-06-17T10:46:46.378085+0000 mgr.standalone.localdomain.uruacd [DBG] Applying service mgr spec
2021-06-17T10:46:46.378326+0000 mgr.standalone.localdomain.uruacd [DBG] Provided hosts: [DaemonPlacement(daemon_type='mgr', hostname='standalone.localdomain', network='', name='', ip=None, ports=[], rank=None, rank_generation=None)]
2021-06-17T10:46:46.378387+0000 mgr.standalone.localdomain.uruacd [DBG] Add [], remove []
2021-06-17T10:46:46.378492+0000 mgr.standalone.localdomain.uruacd [DBG] Hosts that will receive new daemons: []
2021-06-17T10:46:46.378570+0000 mgr.standalone.localdomain.uruacd [DBG] Daemons that will be removed: []
2021-06-17T10:46:46.378772+0000 mgr.standalone.localdomain.uruacd [DBG] Applying service crash spec
2021-06-17T10:46:46.379000+0000 mgr.standalone.localdomain.uruacd [DBG] Provided hosts: [DaemonPlacement(daemon_type='crash', hostname='standalone.localdomain', network='', name='', ip=None, ports=[], rank=None, rank_generation=None)]
2021-06-17T10:46:46.379056+0000 mgr.standalone.localdomain.uruacd [DBG] Add [], remove []
2021-06-17T10:46:46.379129+0000 mgr.standalone.localdomain.uruacd [DBG] Hosts that will receive new daemons: []
2021-06-17T10:46:46.379205+0000 mgr.standalone.localdomain.uruacd [DBG] Daemons that will be removed: []
2021-06-17T10:46:46.379376+0000 mgr.standalone.localdomain.uruacd [DBG] Applying service osd.default_drive_group spec
2021-06-17T10:46:46.379540+0000 mgr.standalone.localdomain.uruacd [DBG] Processing DriveGroup DriveGroupSpec(name=default_drive_group->placement=PlacementSpec(hosts=[HostPlacementSpec(hostname='standalone.localdomain', network='', name='')]), service_id='default_drive_group', service_type='osd', data_devices=DeviceSelection(paths=[<ceph.deployment.inventory.Device object at 0x7fd2e1eb3390>], all=False), osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False)
2021-06-17T10:46:46.380609+0000 mgr.standalone.localdomain.uruacd [DBG] mon_command: 'osd tree' - 0 in 0.001s
2021-06-17T10:46:46.380880+0000 mgr.standalone.localdomain.uruacd [DBG] Checking matching hosts - ['standalone.localdomain']
2021-06-17T10:46:46.381037+0000 mgr.standalone.localdomain.uruacd [DBG] Found inventory for host []
2021-06-17T10:46:46.381174+0000 mgr.standalone.localdomain.uruacd [DBG] device filter is using explicit paths
2021-06-17T10:46:46.381263+0000 mgr.standalone.localdomain.uruacd [DBG] device_filter is None
2021-06-17T10:46:46.381342+0000 mgr.standalone.localdomain.uruacd [DBG] device_filter is None
2021-06-17T10:46:46.381423+0000 mgr.standalone.localdomain.uruacd [DBG] device_filter is None
2021-06-17T10:46:46.381506+0000 mgr.standalone.localdomain.uruacd [DBG] Found drive selection <ceph.deployment.drive_selection.selector.DriveSelection object at 0x7fd2eb6e1eb8>
2021-06-17T10:46:46.381935+0000 mgr.standalone.localdomain.uruacd [DBG] Translating DriveGroup <DriveGroupSpec(name=default_drive_group->placement=PlacementSpec(hosts=[HostPlacementSpec(hostname='standalone.localdomain', network='', name='')]), service_id='default_drive_group', service_type='osd', data_devices=DeviceSelection(paths=[<ceph.deployment.inventory.Device object at 0x7fd2e1eb3390>], all=False), osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False)> to ceph-volume command
```

So at some point you can see:

[INF] Removing standalone.localdomain:/etc/ceph/ceph.conf
[INF] Removing standalone.localdomain:/etc/ceph/ceph.client.admin.keyring
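In case it helps anyone hitting the same situation: once the keyring is gone, one way to recover should be to re-add the _admin label so cephadm redistributes the client files. This is only a sketch based on the orchestrator CLI and was not verified in this tracker; it needs to be run from a node (or a `cephadm shell`) that still has a working admin keyring:

```
# Check which labels the bootstrap host currently carries
ceph orch host ls

# Re-add the _admin label; cephadm should then place ceph.conf and
# ceph.client.admin.keyring back under /etc/ceph on that host
ceph orch host label add standalone.localdomain _admin
```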

Actions #2

Updated by Sebastian Wagner almost 3 years ago

  • Description updated (diff)
Actions #3

Updated by Sebastian Wagner almost 3 years ago

  • Project changed from CephFS to Orchestrator
  • Subject changed from cephadm fails after osds are applied against a new deployed cluster to cephadm bootstrap: applying host specs suddenly removes the admin keyring from bootstrap host
  • Category set to cephadm
  • Tags set to ux

This might be a nasty trap: if you add hosts via YAML files during bootstrap, cephadm now suddenly removes the admin keyring from all hosts.

Actions #4

Updated by Francesco Pantano almost 3 years ago

Hello,
Thanks Sebastian for the quick reply.
As per our previous conversation, cephadm adds the _admin label on the bootstrap node when the bootstrap command is executed.
I can confirm the label is properly applied.
However, when the spec pasted in the description is applied (keep in mind it contains all the cluster nodes when it's generated),
the new labels are applied, and the _admin one is deleted.
It sounds like this is not a Ceph bug; we just need to adapt the TripleO deployer to the new behavior introduced by the _admin
label. We therefore submitted the patch [1], which is supposed to fix this by generating the following spec:

```
---
addr: 192.168.24.1
hostname: standalone.localdomain
labels:
- _admin
- osd
- mgr
- mon
service_type: host
---
placement:
  hosts:
  - standalone.localdomain
service_id: mon
service_name: mon
service_type: mon
---
placement:
  hosts:
  - standalone.localdomain
service_id: mgr
service_name: mgr
service_type: mgr
---
data_devices:
  paths:
  - /dev/ceph_vg/ceph_lv_data
placement:
  hosts:
  - standalone.localdomain
service_id: default_drive_group
service_name: osd.default_drive_group
service_type: osd

```

We're currently testing it via [2], and we can update (and close) this tracker once [2] is verified to work.

[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/796677
[2] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/778915

Actions #5

Updated by Sebastian Wagner almost 3 years ago

Right, for now you need to add the _admin label when applying a spec.

Actions #7

Updated by Sebastian Wagner almost 3 years ago

Francesco Pantano wrote:

> Hello,
> I confirm we can close this tracker.

This is a real UX bug that needs to be resolved eventually.

Actions #8

Updated by Francesco Pantano almost 3 years ago

Agreed on that; if the UX can be improved, it's a good chance to add it to the backlog.
I just wanted to make sure that current features and the deployment are not affected as long as that
label is explicitly added.
The ceph.conf and keyring distribution is a different story, but that can be analyzed
in a separate tracker.

Actions #9

Updated by Sebastian Wagner almost 3 years ago

  • Related to Bug #51277: cephadm bootstrap: unable to set up admin label added
Actions #10

Updated by Adam King about 2 years ago

  • Status changed from New to Resolved
  • Pull request ID set to 42772

Fixed in pacific as of 16.2.7. The linked PR stops labels from being removed when re-adding hosts, so using a host spec as done in this tracker should no longer remove the _admin label and cause the keyring to be removed.
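As a quick sanity check of the fixed behavior (a sketch only; `host_spec.yaml` is a hypothetical file containing the host entry from the description, without _admin in its labels):

```
# On 16.2.7 or later, re-applying a host entry should no longer strip
# labels that were already set, so _admin should still appear in the
# LABELS column afterwards:
ceph orch apply -i host_spec.yaml
ceph orch host ls
```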
