Bug #51629
cephadm reports nodes offline after rolling reboot
Description
Hello,
On my 8-node Octopus 15.2.13 cluster, which I installed using cephadm with podman for containers, I did a rolling reboot in order to apply Linux kernel security updates (Ubuntu 20.04 LTS). After that rolling reboot, cephadm reports that all nodes are offline except one. I am using CephFS on this cluster and CephFS still works fine, but cephadm is somehow in a confused state, as you can see below from the "ceph health detail" output:
HEALTH_WARN 7 hosts fail cephadm check
[WRN] CEPHADM_HOST_CHECK_FAILED: 7 hosts fail cephadm check
    host ceph1c failed check: Can't communicate with remote host `ceph1c`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1e failed check: Can't communicate with remote host `ceph1e`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1d failed check: Can't communicate with remote host `ceph1d`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1f failed check: Can't communicate with remote host `ceph1f`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1g failed check: Can't communicate with remote host `ceph1g`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1h failed check: Can't communicate with remote host `ceph1h`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1b failed check: Can't communicate with remote host `ceph1b`, possibly because python3 is not installed there: [Errno 32] Broken pipe
And here the output of "ceph orch host ls":
HOST    ADDR    LABELS      STATUS
ceph1a  ceph1a  _admin mon
ceph1b  ceph1b  _admin mon  Offline
ceph1c  ceph1c  _admin mon  Offline
ceph1d  ceph1d              Offline
ceph1e  ceph1e              Offline
ceph1f  ceph1f              Offline
ceph1g  ceph1g  mds         Offline
ceph1h  ceph1h  mds         Offline
So I checked, and all nodes are reachable under their hostnames (ping and SSH), and all nodes have python3 installed and working properly. My conclusion is that this is a bug within cephadm, hence this bug report. I also already posted this case on the mailing list but did not get any answer.
Thank you in advance for your help.
History
#1 Updated by Ian Merrick over 2 years ago
Hi,
This could be because the IP addresses for the Ceph hosts are not defined and you are relying on /etc/hosts for name resolution.
You could try adding the IP for each host:
ceph orch host set-addr [hostname] [host IP]
The IP will then be listed in the ADDR column of the 'ceph orch host ls' output.
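For example, assuming ceph1b resolves to 192.168.1.12 on your network (the IPs below are just placeholders, substitute the real addresses of your hosts), that would look like:

ceph orch host set-addr ceph1b 192.168.1.12
ceph orch host set-addr ceph1c 192.168.1.13

and so on for the remaining hosts.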
It is the case that Pacific no longer uses the host's /etc/hosts file for DNS resolution in podman containers (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/KO4KEZKKBKCU3ML5OQ36IDKIIXXTGUAK/). I realise you are using Octopus though, so perhaps it is not the issue here, but it looks similar.
#2 Updated by Sebastian Wagner over 2 years ago
Does failing all MGR daemons (calling `ceph mgr fail ...`) help?
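Something like the following, where the mgr daemon name is only an example; `ceph mgr stat` or `ceph status` will show the name of the actual active mgr:

ceph mgr fail ceph1a.xxxxxx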
#3 Updated by M B over 2 years ago
Ian Merrick wrote:
This could be because the IP address for the ceph hosts are not defined, and if you are relying on /etc/hosts for name resolution.
I see. I wasn't aware of that, and I can't remember having read that in the cephadm documentation; also, in the past I already did some reboots and this behavior did not happen.
Anyway, your trick with "ceph orch host set-addr" worked nicely and cephadm sees all nodes back online again.
There is just one issue left: the Ceph web dashboard reports the following:
Could not reach Alertmanager's API on http://ceph1e:9093/api/v1
Could not reach Prometheus's API on http://ceph1b:9095/api/v1
Again these URLs are reachable. Do I also need to define the IPs for these services somewhere?
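Maybe something along these lines via the dashboard module (just guessing on my side, the host names/ports below are placeholders)?

ceph dashboard set-alertmanager-api-host 'http://<alertmanager-host>:9093'
ceph dashboard set-prometheus-api-host 'http://<prometheus-host>:9095'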
#4 Updated by Redouane Kachach Elhichou almost 2 years ago
- Priority changed from Normal to Low