Bug #57016 (closed): cephadm bootstrap begs robustness

Added by greg mott over 1 year ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Category: cephadm (binary)
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 51718
Crash signature (v1):
Crash signature (v2):

Description

How do i just start over?

The first time i tried cephadm bootstrap and it failed, i guessed it wasn't good with PermitRootLogin set to no in /etc/ssh/sshd_config, so i set that to yes; i also thought it best to upgrade all packages and reboot.
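For reference, the sshd change amounts to roughly this (the sed one-liner is just illustrative; editing the file by hand works the same):

# set PermitRootLogin to yes so cephadm can SSH in as root
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
systemctl restart sshd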

Next try it complained that /etc/ceph/ceph.conf already exists, and then /etc/ceph/ceph.client.admin.keyring, and then /etc/ceph/ceph.pub, so i moved them to a souvenir bin.

Next it complained "Cannot bind to IP 192.168.176.13 port 3300: [Errno 98] Address already in use", so i killed off the ceph processes and tried again.

Next it complained "Waiting for mon to start... Waiting for mon...
Non-zero exit code 13 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/ceph:v17 -e NODE_NAME=tangelo -e CEPH_USE_RANDOM_NONCE=1 -v /var/lib/ceph/d0e93cf8-12a0-11ed-83c2-8cdcd4320acb/mon.tangelo:/var/lib/ceph/mon/ceph-tangelo:z -v /tmp/ceph-tmp0cemkjph:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpkh0pixyn:/etc/ceph/ceph.conf:z quay.io/ceph/ceph:v17 status
/usr/bin/ceph: stderr 2022-08-02T20:23:07.594+0000 7f3ece419700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
/usr/bin/ceph: stderr [errno 13] RADOS permission denied (error connecting to the cluster)
ERROR: mon not available after 15 tries"

How do i just start over?


Related issues 1 (0 open, 1 closed)

Copied to Orchestrator - Backport #61963: reef: cephadm bootstrap begs robustness (Resolved, Adam King)
Actions #1

Updated by greg mott over 1 year ago

this is on alma (rhel) 8.6
the commands i've given are:

curl --silent --remote-name --location https://github.com/ceph/ceph/raw/quincy/src/cephadm/cephadm
chmod +x cephadm
cephadm bootstrap --mon-ip 192.168...

Actions #2

Updated by greg mott over 1 year ago

i also gave these commands before cephadm bootstrap:

./cephadm add-repo --release quincy
./cephadm install

Actions #3

Updated by Dhairya Parmar over 1 year ago

greg mott wrote:

How do i just start over?

The first time i tried cephadm bootstrap and it failed, i guessed it wasn't good with PermitRootLogin set to no in /etc/ssh/sshd_config, so i set that to yes; i also thought it best to upgrade all packages and reboot.

Next try it complained that /etc/ceph/ceph.conf already exists, and then /etc/ceph/ceph.client.admin.keyring, and then /etc/ceph/ceph.pub, so i moved them to a souvenir bin.

Next it complained "Cannot bind to IP 192.168.176.13 port 3300: [Errno 98] Address already in use", so i killed off the ceph processes and tried again.

Next it complained "Waiting for mon to start... Waiting for mon...
Non-zero exit code 13 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/ceph:v17 -e NODE_NAME=tangelo -e CEPH_USE_RANDOM_NONCE=1 -v /var/lib/ceph/d0e93cf8-12a0-11ed-83c2-8cdcd4320acb/mon.tangelo:/var/lib/ceph/mon/ceph-tangelo:z -v /tmp/ceph-tmp0cemkjph:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpkh0pixyn:/etc/ceph/ceph.conf:z quay.io/ceph/ceph:v17 status
/usr/bin/ceph: stderr 2022-08-02T20:23:07.594+0000 7f3ece419700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
/usr/bin/ceph: stderr [errno 13] RADOS permission denied (error connecting to the cluster)
ERROR: mon not available after 15 tries"

How do i just start over?

Did you try running "lsof -i :3300" or maybe "nmap -p 3300 192.168.176.13"? I have actually run into a similar issue, and the solution has always been to make sure those processes are not running anymore. Did you try restarting that machine?
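Something along these lines, as a rough sketch (the ceph.target unit only exists if ceph systemd units were installed):

ss -lntp | grep 3300          # show which process is bound to the mon port
lsof -i :3300                 # the same check via lsof
systemctl stop ceph.target    # stop any leftover ceph daemons, if that unit exists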

Actions #4

Updated by greg mott over 1 year ago

yes indeed, i restarted the machine, killed the ceph processes, and then it complained "...mon not available..."

How do i just start over?

Actions #5

Updated by greg mott over 1 year ago

Ok i've worked out how to "start over":
Just move the following files to a souvenir bin (or delete them): /etc/{ceph,logr*,sy*/sy*/{,/mu*}}/ceph*
Then restart, and reissue the command: cephadm bootstrap --mon-ip 192.168...
So with that i've now got "Bootstrap complete."
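Spelled out, that glob covers roughly these paths; the souvenir directory name below is just illustrative:

mkdir /root/ceph-souvenirs
mv /etc/ceph/ceph* /root/ceph-souvenirs/               # conf, admin keyring, ssh pub key
mv /etc/logrotate.d/ceph* /root/ceph-souvenirs/        # logrotate config
mv /etc/systemd/system/ceph* /root/ceph-souvenirs/     # unit files and targets
mv /etc/systemd/system/multi-user.target.wants/ceph* /root/ceph-souvenirs/   # enabled-unit symlinks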

Actions #6

Updated by Dhairya Parmar over 1 year ago

greg mott wrote:

Ok i've worked out how to "start over":
Just move the following files to a souvenir bin (or delete them): /etc/{ceph,logr*,sy*/sy*/{,/mu*}}/ceph*
Then restart, and reissue the command: cephadm bootstrap --mon-ip 192.168...
So with that i've now got "Bootstrap complete."

Good to know it finally worked out for you.

Actions #7

Updated by Dhairya Parmar over 1 year ago

  • Status changed from New to Resolved
Actions #8

Updated by Dhairya Parmar over 1 year ago

Changing the status to resolved. You can re-open it if you hit something again.

Actions #9

Updated by Adam King over 1 year ago

  • Project changed from Ceph to Orchestrator
  • Status changed from Resolved to New

Going to re-open this and move it to the Orchestrator component, to track adding some mechanism in cephadm for helping users clean up failed bootstrap attempts, since currently they're sort of on their own for figuring out what to clean up.

Actions #10

Updated by Redouane Kachach Elhichou over 1 year ago

Maybe what happened is that the ceph cluster was only partially installed. In that case, what you have to do is remove the faulty cluster by using rm-cluster:

cephadm rm-cluster --force --zap-osds --fsid <fsid>

You can get the fsid from the conf file (/etc/ceph/ceph.conf) or from the bootstrap logs; look for a line similar to:

Cluster fsid: 72ed6c66-1d4d-11ed-96f4-5254001aecae

This should clean up all the files that were installed as part of the bootstrap.
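Putting it together, a minimal sketch (assuming /etc/ceph/ceph.conf still has its usual "fsid = <uuid>" line):

fsid=$(awk '/^[[:space:]]*fsid/ {print $3}' /etc/ceph/ceph.conf)   # grab the uuid
cephadm rm-cluster --force --zap-osds --fsid "$fsid"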

More info at:
https://docs.ceph.com/en/latest/cephadm/operations/#purging-a-cluster

Actions #11

Updated by Redouane Kachach Elhichou over 1 year ago

  • Category set to cephadm (binary)
Actions #12

Updated by Redouane Kachach Elhichou 11 months ago

  • Status changed from New to In Progress
  • Assignee set to Redouane Kachach Elhichou
Actions #13

Updated by Redouane Kachach Elhichou 11 months ago

  • Pull request ID set to 51718
Actions #14

Updated by Redouane Kachach Elhichou 10 months ago

  • Status changed from In Progress to Fix Under Review
Actions #15

Updated by Adam King 10 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef
Actions #16

Updated by Backport Bot 10 months ago

  • Copied to Backport #61963: reef: cephadm bootstrap begs robustness
Actions #17

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
Actions #18

Updated by Redouane Kachach Elhichou 3 months ago

  • Status changed from Pending Backport to Resolved