Bug #57800

ceph orch upgrade does not appear to work with FQDNs.

Added by Brian Woods 4 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is purely speculative on my part, but after attempting an upgrade from 17.2.3 to 17.2.4, it just sits there doing nothing. Checking the logs shows:

2022-10-09T00:03:51.821500+0000 mgr.ceph01.domain.local.miydsy (mgr.14186) 587419 : cephadm [ERR] check-host failed for 'ceph02.domain.local'
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1042, in check_host
    error_ok=True, no_fsid=True))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 590, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1273, in _run_cephadm
    await self.mgr.ssh._remote_connection(host, addr)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 66, in _remote_connection
    raise OrchestratorError("host address is empty")
orchestrator._interface.OrchestratorError: host address is empty
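For context, the traceback ends in a guard in cephadm's SSH layer that refuses to connect when no address was resolved for the host. A simplified, hypothetical sketch of that guard (not the actual Ceph source; `remote_connection` here is a stand-in):

```python
# Simplified, hypothetical sketch of the guard behind the
# "host address is empty" error -- not the actual Ceph source.
class OrchestratorError(Exception):
    pass

def remote_connection(host, addr):
    # cephadm resolves the host to its stored address before opening
    # SSH; an empty/None address aborts instead of attempting to connect.
    if not addr:
        raise OrchestratorError("host address is empty")
    return "connecting to %s at %s" % (host, addr)
```

So the error means the orchestrator's stored inventory had no address for that hostname, not that SSH or DNS itself failed.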

I assume something might not be escaped correctly and the periods in the FQDN are breaking something?

log.txt (272 KB) - Brian Woods, 10/25/2022 07:00 PM

History

#1 Updated by Adam King 4 months ago

What does `ceph orch host ls` report for this host? This error should only be raised if we can't find any IP stored for the host. You could also look at `ceph config-key get mgr/cephadm/inventory`, which should be a JSON struct that includes all the hosts with their names, addresses, etc., and see if it lists an actual address for that host (as opposed to just listing the hostname as the addr). If it does look like there is no address for the host, the `ceph orch host set-addr` command might be able to fix it.

#2 Updated by Brian Woods 4 months ago

So, I did notice that I had set the domain name on one of the nodes to "oldname.local" (when I was doing the find/replace to scrub this), but that shouldn't impact DNS. I confirmed that all names resolve from all hosts (DNS is provided by DHCP in this case). And it looks like it is seeing all the correct IPs.

ceph orch host ls:

HOST                   ADDR            LABELS                  STATUS  
ceph03.domain.local    192.168.10.80   rbd                             
ceph01.domain.local    192.168.10.210  _admin rgw grafana mds          
ceph02.oldname.local   192.168.10.51   mon mgr mds _admin              
3 hosts in cluster

ceph config-key get mgr/cephadm/inventory:

{
   "ceph01.domain.local":{
      "hostname":"ceph01.domain.local",
      "addr":"192.168.10.210",
      "labels":[
         "_admin",
         "rgw",
         "grafana",
         "mds" 
      ],
      "status":"" 
   },
   "ceph03.domain.local":{
      "hostname":"ceph03.domain.local",
      "addr":"192.168.10.80",
      "labels":[
         "rbd" 
      ],
      "status":"" 
   },
   "ceph02.oldname.local":{
      "hostname":"ceph02.oldname.local",
      "addr":"192.168.10.51",
      "labels":[
         "mon",
         "mgr",
         "mds",
         "_admin" 
      ],
      "status":"" 
   }
}
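One quick way to sanity-check that dump is to run it through a small script that flags hosts whose `addr` is empty or merely repeats the hostname, which is the condition the "host address is empty" path trips on. A minimal sketch (`find_bad_addrs` is a hypothetical helper; the JSON shape matches the inventory dump above):

```python
import json

def find_bad_addrs(inventory_json):
    """Return hosts whose stored addr is empty or merely repeats
    the hostname (hypothetical helper for eyeballing the inventory)."""
    inv = json.loads(inventory_json)
    return [
        host
        for host, info in inv.items()
        if not info.get("addr") or info.get("addr") == info.get("hostname")
    ]

sample = '{"ceph01.domain.local": {"hostname": "ceph01.domain.local", "addr": "192.168.10.210"}}'
print(find_bad_addrs(sample))  # []
```

Against the inventory above, all three hosts have real IPs stored, so this would report nothing.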

I did health checks on both the host, as well as the cephadm shell container:

root@ceph01:/var/log# ceph cephadm check-host ceph03.domain.local
ceph03.domain.local (None) ok
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Hostname "ceph03.domain.local" matches what is expected.
Host looks OK

root@ceph01:/var/log# ceph cephadm check-host ceph01.domain.local
ceph01.domain.local (None) ok
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Hostname "ceph01.domain.local" matches what is expected.
Host looks OK

root@ceph01:/var/log# ceph cephadm check-host ceph02.oldname.local
ceph02.oldname.local (None) ok
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Hostname "ceph02.oldname.local" matches what is expected.
Host looks OK

Thoughts?

#3 Updated by Adam King 4 months ago

It's odd that the hostname it reports not having an address for, "ceph02.domain.local", isn't even a hostname it has stored. It seems at this point that it probably failed to find an address for "ceph02.domain.local" since it doesn't even have an entry for it. The question is why it was trying to reach that hostname at all. Perhaps something was cached that shouldn't have been there anymore? Does stopping the upgrade, running "ceph mgr fail", and starting the upgrade again make this happen again? It might also be worth checking in `orch ps` output that "ceph02.domain.local" isn't reported as the hostname for any of the daemons, and that no service spec placements explicitly reference "ceph02.domain.local" either.

#4 Updated by Brian Woods 4 months ago

I added DNS entries for all combinations. So both ceph02.oldname.local and ceph02.domain.local are now valid names, but the host is still configured as "oldname.local".

After confirming all combos worked, I then ran these, waiting about a minute between each command:

root@ceph01# ceph orch upgrade stop
Stopped upgrade to quay.io/ceph/ceph:v17.2.4
root@ceph01# ceph mgr fail
root@ceph01# ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.4
Initiating upgrade to quay.io/ceph/ceph:v17.2.4
root@ceph01# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v17.2.4",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "" 
}

No movement in the process after about a half hour.
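As an aside, a stalled upgrade is recognizable from that status JSON: `in_progress` is true while `progress` and `services_complete` stay empty. A minimal check of that condition (a sketch; `looks_stalled` is a hypothetical helper, and the JSON shape is the one `ceph orch upgrade status` printed above):

```python
import json

def looks_stalled(status_json):
    """True when `ceph orch upgrade status` claims in_progress but
    shows no progress string and no completed services."""
    status = json.loads(status_json)
    return (
        bool(status.get("in_progress"))
        and not status.get("progress")
        and not status.get("services_complete")
    )
```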

Current ceph ps:

# ceph orch ps
NAME                                   HOST                           PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
alertmanager.ceph01                    ceph01.domain.local            *:9093,9094  running (3d)      4m ago  13d    29.0M        -           ba2b418f427c  16eaa667487b  
crash.ceph03                           ceph03.domain.local                         running (3d)      4m ago  13d    8999k        -  17.2.3   0912465dcea5  60d65abff255  
crash.ceph01                           ceph01.domain.local                         running (3d)      4m ago  13d    8928k        -  17.2.3   0912465dcea5  2463cf388cd7  
crash.ceph02                           ceph02.oldname.local                        running (11d)     4m ago  11d    10.2M        -  17.2.3   0912465dcea5  a24e3c222e35  
grafana.ceph01                         ceph01.domain.local            *:3000       running (3d)      4m ago  13d    86.0M        -  8.3.5    dad864ee21e9  50ed1c829566  
mds.mds-default.ceph03.ptkjle          ceph03.domain.local                         running (3d)      4m ago  13d    1994M        -  17.2.3   0912465dcea5  0db1f329b706  
mds.mds-default.ceph01.zrrptd          ceph01.domain.local                         running (3d)      4m ago  13d    30.3M        -  17.2.3   0912465dcea5  496c9f753bc7  
mgr.ceph03.haayqy                      ceph03.domain.local            *:8443,9283  running (3d)      4m ago  13d     111M        -  17.2.3   0912465dcea5  3fe9424055ee  
mgr.ceph01.domain.local.miydsy         ceph01.domain.local            *:9283       running (3d)      4m ago  13d     469M        -  17.2.3   0912465dcea5  5bfed8d219fd  
mon.ceph03                             ceph03.domain.local                         running (3d)      4m ago  13d     476M    2048M  17.2.3   0912465dcea5  1119bcfc84af  
mon.ceph01.domain.local                ceph01.domain.local                         running (3d)      4m ago  13d     472M    2048M  17.2.3   0912465dcea5  3da27dc943f4  
mon.ceph02                             ceph02.oldname.local                        running (11d)     4m ago  11d     578M    2048M  17.2.3   0912465dcea5  88d7bfbcd9f5  
node-exporter.ceph03                   ceph03.domain.local            *:9100       running (3d)      4m ago  13d    20.2M        -           1dbe0e931976  8e6130b088db  
node-exporter.ceph01                   ceph01.domain.local            *:9100       running (3d)      4m ago  13d    21.3M        -           1dbe0e931976  8d01b76dda13  
node-exporter.ceph02                   ceph02.oldname.local           *:9100       running (11d)     4m ago  11d    4620k        -           1dbe0e931976  fa2b46930880  
osd.0                                  ceph01.domain.local                         running (3d)      4m ago  13d    3022M    4096M  17.2.3   0912465dcea5  d23b707f5f44  
osd.1                                  ceph01.domain.local                         running (3d)      4m ago  11d    6731M    4096M  17.2.3   0912465dcea5  af7b429509e7  
osd.2                                  ceph01.domain.local                         running (3d)      4m ago  11d    4897M    4096M  17.2.3   0912465dcea5  2bf8a273ffa9  
osd.3                                  ceph01.domain.local                         running (3d)      4m ago  11d    4897M    4096M  17.2.3   0912465dcea5  57e198c87d82  
osd.4                                  ceph01.domain.local                         running (3d)      4m ago  11d    4842M    4096M  17.2.3   0912465dcea5  90023164d14d  
osd.5                                  ceph01.domain.local                         running (3d)      4m ago  11d    4460M    4096M  17.2.3   0912465dcea5  0c6a9a34ff72  
osd.6                                  ceph03.domain.local                         running (3d)      4m ago  11d    2241M    4096M  17.2.3   0912465dcea5  537b839a31b7  
osd.7                                  ceph03.domain.local                         running (3d)      4m ago  11d    3894M    4096M  17.2.3   0912465dcea5  8a30f14aa72c  
osd.8                                  ceph03.domain.local                         running (3d)      4m ago  11d    3191M    4096M  17.2.3   0912465dcea5  5bcc089677a6  
osd.9                                  ceph03.domain.local                         running (3d)      4m ago  11d    3717M    4096M  17.2.3   0912465dcea5  6e42ca8325d8  
osd.10                                 ceph03.domain.local                         running (3d)      4m ago  11d    2406M    4096M  17.2.3   0912465dcea5  95858a805de8  
osd.12                                 ceph03.domain.local                         running (3d)      4m ago  12d    3355M    4096M  17.2.3   0912465dcea5  3c8cc41e1dce  
prometheus.ceph01                      ceph01.domain.local            *:9095       running (3d)      4m ago  13d     138M        -           514e6a882f6e  95e532fae898  

Also saw this in the logs; not sure what I was doing at the time, but:

2022-10-11T04:23:59.481304+0000 mgr.ceph03.haayqy (mgr.5537070) 1331 : cephadm [ERR] check-host failed for '192.168.10.210'
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1042, in check_host
    error_ok=True, no_fsid=True))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 590, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1273, in _run_cephadm
    await self.mgr.ssh._remote_connection(host, addr)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 66, in _remote_connection
    raise OrchestratorError("host address is empty")
orchestrator._interface.OrchestratorError: host address is empty

#5 Updated by Brian Woods 4 months ago

Oh, by all combinations, I mean I created DNS entries for all hosts, not just ceph02.

#6 Updated by Adam King 4 months ago

alright, looking back at the original traceback

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1042, in check_host
    error_ok=True, no_fsid=True))

That particular check_host function is the one that is called when directly running "ceph cephadm check-host". I hadn't read carefully earlier and confused it with the _check_host elsewhere that gets called internally.

I think the FQDNs and the hostnames are likely not the cause of the real issue here, which is that the upgrade stalled out. If you try another upgrade and wait for a bit, what does "ceph log last 100 info cephadm" say? If that doesn't give anything useful, it could be worth setting the log level to debug with "ceph config set mgr mgr/cephadm/log_to_cluster_level debug" and then doing the same thing but running "ceph log last 200 debug cephadm" instead. I think we need to go back and try to diagnose generally why the upgrade is getting stuck rather than continuing to look at this hostname stuff.
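The cluster-log lines pasted throughout this ticket follow a fixed `timestamp name (id) seq : channel [LVL] message` shape, so the output of `ceph log last` can be filtered by level with a short script (a sketch, assuming that line format holds; `filter_level` is a hypothetical helper):

```python
import re

# Matches lines like:
# 2022-10-09T00:03:51.821500+0000 mgr.ceph01... (mgr.14186) 587419 : cephadm [ERR] msg
LOG_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<who>\S+)\s+\((?P<id>[^)]+)\)\s+(?P<seq>\d+)\s+:\s+"
    r"(?P<channel>\S+)\s+\[(?P<level>\w+)\]\s+(?P<msg>.*)$"
)

def filter_level(lines, level):
    """Yield messages whose bracketed level matches, e.g. ERR or DBG."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("level") == level:
            yield m.group("msg")
```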

#7 Updated by Brian Woods 3 months ago

Seems I haven't seen the "host address is empty" error in about 10 days now... Not sure if that is because of DNS or what. So, good news?

The bad news: even with debug logging enabled and the upgrade restarted, even hours later, there are zero new entries:

2022-10-21T22:45:14.884185+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8661 : cephadm [DBG]  mgr option ssh_config_file = None
2022-10-21T22:45:14.884260+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8662 : cephadm [DBG]  mgr option device_cache_timeout = 1800
2022-10-21T22:45:14.884307+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8663 : cephadm [DBG]  mgr option device_enhanced_scan = False
2022-10-21T22:45:14.884350+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8664 : cephadm [DBG]  mgr option daemon_cache_timeout = 600
2022-10-21T22:45:14.884392+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8665 : cephadm [DBG]  mgr option facts_cache_timeout = 60
2022-10-21T22:45:14.884434+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8666 : cephadm [DBG]  mgr option host_check_interval = 600
2022-10-21T22:45:14.884474+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8667 : cephadm [DBG]  mgr option mode = root
2022-10-21T22:45:14.884516+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8668 : cephadm [DBG]  mgr option container_image_base = quay.io/ceph/ceph
2022-10-21T22:45:14.884557+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8669 : cephadm [DBG]  mgr option container_image_prometheus = quay.io/prometheus/prometheus:v2.33.4
2022-10-21T22:45:14.884597+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8670 : cephadm [DBG]  mgr option container_image_grafana = quay.io/ceph/ceph-grafana:8.3.5
2022-10-21T22:45:14.884636+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8671 : cephadm [DBG]  mgr option container_image_alertmanager = quay.io/prometheus/alertmanager:v0.23.0
2022-10-21T22:45:14.884675+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8672 : cephadm [DBG]  mgr option container_image_node_exporter = quay.io/prometheus/node-exporter:v1.3.1
2022-10-21T22:45:14.884714+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8673 : cephadm [DBG]  mgr option container_image_loki = docker.io/grafana/loki:2.4.0
2022-10-21T22:45:14.884753+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8674 : cephadm [DBG]  mgr option container_image_promtail = docker.io/grafana/promtail:2.4.0
2022-10-21T22:45:14.884793+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8675 : cephadm [DBG]  mgr option container_image_haproxy = docker.io/library/haproxy:2.3
2022-10-21T22:45:14.884832+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8676 : cephadm [DBG]  mgr option container_image_keepalived = docker.io/arcts/keepalived
2022-10-21T22:45:14.884871+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8677 : cephadm [DBG]  mgr option container_image_snmp_gateway = docker.io/maxwo/snmp-notifier:v1.2.1
2022-10-21T22:45:14.884911+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8678 : cephadm [DBG]  mgr option warn_on_stray_hosts = True
2022-10-21T22:45:14.884950+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8679 : cephadm [DBG]  mgr option warn_on_stray_daemons = True
2022-10-21T22:45:14.884989+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8680 : cephadm [DBG]  mgr option warn_on_failed_host_check = True
2022-10-21T22:45:14.885028+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8681 : cephadm [DBG]  mgr option log_to_cluster = True
2022-10-21T22:45:14.885066+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8682 : cephadm [DBG]  mgr option allow_ptrace = False
2022-10-21T22:45:14.885108+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8683 : cephadm [DBG]  mgr option container_init = True
2022-10-21T22:45:14.885149+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8684 : cephadm [DBG]  mgr option prometheus_alerts_path = /etc/prometheus/ceph/ceph_default_alerts.yml
2022-10-21T22:45:14.885190+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8685 : cephadm [DBG]  mgr option migration_current = 5
2022-10-21T22:45:14.885230+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8686 : cephadm [DBG]  mgr option config_dashboard = True
2022-10-21T22:45:14.885270+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8687 : cephadm [DBG]  mgr option manage_etc_ceph_ceph_conf = False
2022-10-21T22:45:14.885309+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8688 : cephadm [DBG]  mgr option manage_etc_ceph_ceph_conf_hosts = *
2022-10-21T22:45:14.885349+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8689 : cephadm [DBG]  mgr option registry_url = None
2022-10-21T22:45:14.885389+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8690 : cephadm [DBG]  mgr option registry_username = None
2022-10-21T22:45:14.885428+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8691 : cephadm [DBG]  mgr option registry_password = None
2022-10-21T22:45:14.885467+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8692 : cephadm [DBG]  mgr option registry_insecure = False
2022-10-21T22:45:14.885506+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8693 : cephadm [DBG]  mgr option use_repo_digest = True
2022-10-21T22:45:14.885545+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8694 : cephadm [DBG]  mgr option config_checks_enabled = False
2022-10-21T22:45:14.885588+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8695 : cephadm [DBG]  mgr option default_registry = docker.io
2022-10-21T22:45:14.885629+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8696 : cephadm [DBG]  mgr option max_count_per_host = 10
2022-10-21T22:45:14.885673+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8697 : cephadm [DBG]  mgr option autotune_memory_target_ratio = 0.7
2022-10-21T22:45:14.885715+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8698 : cephadm [DBG]  mgr option autotune_interval = 600
2022-10-21T22:45:14.885756+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8699 : cephadm [DBG]  mgr option use_agent = False
2022-10-21T22:45:14.885795+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8700 : cephadm [DBG]  mgr option agent_refresh_rate = 20
2022-10-21T22:45:14.885834+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8701 : cephadm [DBG]  mgr option agent_starting_port = 4721
2022-10-21T22:45:14.885875+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8702 : cephadm [DBG]  mgr option agent_down_multiplier = 3.0
2022-10-21T22:45:14.885915+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8703 : cephadm [DBG]  mgr option max_osd_draining_count = 10
2022-10-21T22:45:14.885955+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8704 : cephadm [DBG]  mgr option log_level = 
2022-10-21T22:45:14.885995+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8705 : cephadm [DBG]  mgr option log_to_file = False
2022-10-21T22:45:14.886036+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8706 : cephadm [DBG]  mgr option log_to_cluster_level = debug
2022-10-21T22:51:09.306208+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 9031 : cephadm [INF] Upgrade: Stopped
2022-10-21T22:51:26.531006+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 9050 : cephadm [INF] Upgrade: Started with target quay.io/ceph/ceph:v17.2.4

Ideas? Those are the only entries since enabling debug logging.

#8 Updated by Adam King 3 months ago

This seems to imply the cephadm service loop just isn't running at all. Does the REFRESHED column in `ceph orch device ls` report a relatively recent refresh time? If not, something got stuck somewhere, the most common culprit being a hung `cephadm ceph-volume` process on one of the nodes. If it does report a recent refresh, perhaps the orchestrator is paused? What does `ceph orch status` spit out?
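To spot a stale refresh in `orch ps` / `orch device ls` output programmatically, the compact REFRESHED strings ("4m ago", "13d", "3d") can be converted to seconds and compared against a threshold. A sketch, assuming the compact age format shown in the tables above (`age_to_seconds` and `is_stale` are hypothetical helpers):

```python
# Convert compact age strings ("45s", "4m", "13h", "3d", "2w") to
# seconds so stale REFRESHED values can be flagged automatically.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def age_to_seconds(age):
    age = age.replace(" ago", "").strip()
    value, unit = int(age[:-1]), age[-1]
    return value * UNITS[unit]

def is_stale(age, threshold_s=1800):
    """Flag anything refreshed longer ago than the threshold (30 min default)."""
    return age_to_seconds(age) > threshold_s
```

With the default threshold, the "4m ago" entries in the `orch ps` output above would pass, while a 13-hour-old refresh (as reported after the reboot below) would be flagged.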

#9 Updated by Brian Woods 3 months ago

I rebooted last night, all items report a refreshed time of about 13 hours ago, when I rebooted.

# ceph orch status
Backend: cephadm
Available: Yes
Paused: No

Re-enabled debug (just in case something resets it).
Restarted the upgrade.

And though I see this:

# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v17.2.4",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "" 
}

It is NOT showing in the ceph -s screen!!! So I think you are onto something. I thought it did this the other day, but after running the upgrade command again, I just figured I didn't hit enter or something.

So I ran it again, but it still doesn't show up...

Also wrote a quick script to scrub my logs for easy posting, so I have attached a good long chunk of it.

#10 Updated by Brian Woods 3 months ago

I am getting ready to add another node to the cluster. Is there anything you can think of I can check, pre or post?

#11 Updated by Brian Woods 3 months ago

And suddenly the upgrade is happening!!!

Today I rebooted ceph02, a node that only had the MDS, and suddenly things started upgrading!

No idea why... Will report back when it finishes, and will try to capture logs (not tonight though...).
