Bug #51629

cephadm reports nodes offline after rolling reboot

Added by M B over 2 years ago. Updated almost 2 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
cephadm
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

On my 8-node Octopus 15.2.13 cluster, which I installed using cephadm with podman for containers, I did a rolling reboot in order to apply Linux kernel security updates (Ubuntu 20.04 LTS). After that rolling reboot, cephadm reports that all nodes are offline except one. I am using CephFS on this cluster and CephFS still works fine, but cephadm is somehow in a confused state, as you can see below from the "ceph health detail" output:

HEALTH_WARN 7 hosts fail cephadm check
[WRN] CEPHADM_HOST_CHECK_FAILED: 7 hosts fail cephadm check
    host ceph1c failed check: Can't communicate with remote host `ceph1c`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1e failed check: Can't communicate with remote host `ceph1e`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1d failed check: Can't communicate with remote host `ceph1d`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1f failed check: Can't communicate with remote host `ceph1f`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1g failed check: Can't communicate with remote host `ceph1g`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1h failed check: Can't communicate with remote host `ceph1h`, possibly because python3 is not installed there: [Errno 32] Broken pipe
    host ceph1b failed check: Can't communicate with remote host `ceph1b`, possibly because python3 is not installed there: [Errno 32] Broken pipe

And here is the output of "ceph orch host ls":

HOST    ADDR    LABELS      STATUS   
ceph1a  ceph1a  _admin mon           
ceph1b  ceph1b  _admin mon  Offline  
ceph1c  ceph1c  _admin mon  Offline  
ceph1d  ceph1d              Offline  
ceph1e  ceph1e              Offline  
ceph1f  ceph1f              Offline  
ceph1g  ceph1g  mds         Offline  
ceph1h  ceph1h  mds         Offline  

I checked and all nodes are reachable under their hostname (ping and SSH), and all nodes have python3 installed and working properly (see the checks sketched below). My conclusion is therefore that this is a bug within cephadm, hence this issue. I also already posted this case to the mailing list but did not get any answer.
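
For reference, the manual checks looked roughly like this, run from the admin node ceph1a (ceph1c shown as an example, the same was done for the other hosts); the last command is cephadm's own per-host check, if it is available in this Octopus release:

ping -c 1 ceph1c
ssh ceph1c 'python3 --version'
ceph cephadm check-host ceph1c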

Thank you in advance for your help.

History

#1 Updated by Ian Merrick over 2 years ago

Hi,

This could be because the IP addresses for the Ceph hosts are not defined and you are relying on /etc/hosts for name resolution.

You could try adding the IP for each host:

ceph orch host set-addr [hostname] [host IP]

The IP will then be listed in the ADDR column of the 'ceph orch host ls' output.
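
A quick way to do this for all hosts is something like the following sketch (it assumes the hostnames resolve correctly on the admin node, e.g. via /etc/hosts, and that the resolved IPs are the ones cephadm should use):

for h in ceph1a ceph1b ceph1c ceph1d ceph1e ceph1f ceph1g ceph1h; do
    ip=$(getent hosts "$h" | awk '{print $1}')   # first field is the resolved IP
    ceph orch host set-addr "$h" "$ip"
done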

Note that Pacific no longer uses the host's /etc/hosts file for DNS resolution inside podman containers (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/KO4KEZKKBKCU3ML5OQ36IDKIIXXTGUAK/). I realise you are using Octopus, so perhaps that is not the issue here, but it looks similar.

#2 Updated by Sebastian Wagner over 2 years ago

Does failing all MGR daemons (calling `ceph mgr fail ...`) help?
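
Roughly like this (the daemon name below is only an example; the actual active mgr is shown by `ceph mgr stat` or `ceph -s`):

ceph mgr stat                  # shows the currently active mgr
ceph mgr fail ceph1a.xyzabc    # example name; a standby mgr takes over and rebuilds its host state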

#3 Updated by M B over 2 years ago

Ian Merrick wrote:

This could be because the IP address for the ceph hosts are not defined, and if you are relying on /etc/hosts for name resolution.

I see. I wasn't aware of that and I can't remember having read it in the cephadm documentation; I have also done reboots in the past without this behavior occurring.

Anyway, your trick with "ceph orch host set-addr" worked nicely and cephadm sees all nodes as online again.

There is just one issue left: the Ceph web dashboard reports the following:

Could not reach Alertmanager's API on http://ceph1e:9093/api/v1
Could not reach Prometheus's API on http://ceph1b:9095/api/v1

Again, these URLs are reachable. Do I also need to define the IPs for these services somewhere?
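
For reference, I assume the relevant dashboard settings would be something like the following (placeholder IPs; I have not confirmed these are the right knobs):

ceph dashboard set-alertmanager-api-host 'http://192.168.1.15:9093'   # placeholder IP for ceph1e
ceph dashboard set-prometheus-api-host 'http://192.168.1.12:9095'     # placeholder IP for ceph1b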

#4 Updated by Redouane Kachach Elhichou almost 2 years ago

  • Priority changed from Normal to Low