Bug #57800
ceph orch upgrade does not appear to work with FQDNs.
Description
This is purely speculative on my part, but after attempting an upgrade from 17.2.3 to 17.2.4, the upgrade just sits there doing nothing. Checking the logs shows:
2022-10-09T00:03:51.821500+0000 mgr.ceph01.domain.local.miydsy (mgr.14186) 587419 : cephadm [ERR] check-host failed for 'ceph02.domain.local'
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1042, in check_host
    error_ok=True, no_fsid=True))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 590, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1273, in _run_cephadm
    await self.mgr.ssh._remote_connection(host, addr)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 66, in _remote_connection
    raise OrchestratorError("host address is empty")
orchestrator._interface.OrchestratorError: host address is empty
I assume something might not be escaped correctly and the periods in the FQDN are breaking something?
Updated by Adam King over 1 year ago
What does `ceph orch host ls` report for this host? This error should only be raised if we can't find any IP stored for the host. You could also look at "ceph config-key get mgr/cephadm/inventory", which should be a JSON struct that includes all the hosts with their names, addresses, etc., and see if it lists an actual address for that host (as opposed to just listing the hostname as the addr). If it does look like there is no address for the host, the `ceph orch host set-addr` command might be able to fix it.
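In command form, those checks and the possible fix would look roughly like this (the host name and IP in the last line are placeholders, not values confirmed from this cluster):

ceph orch host ls                           # check the ADDR column for every host
ceph config-key get mgr/cephadm/inventory   # stored JSON inventory with hostname/addr/labels
# If a host only has its hostname stored as the addr, set a real IP explicitly (placeholder values):
ceph orch host set-addr ceph02.domain.local 192.168.10.51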
Updated by Brian Woods over 1 year ago
So, I did notice that I had set the domain name on one of the nodes to "oldname.local" (when I was doing the find/replace to scrub this), but that shouldn't impact DNS. I confirmed that all names resolve from all hosts (DNS is provided by DHCP in this case). And it looks like it is seeing all the correct IPs.
ceph orch host ls:
HOST                  ADDR            LABELS                  STATUS
ceph03.domain.local   192.168.10.80   rbd
ceph01.domain.local   192.168.10.210  _admin rgw grafana mds
ceph02.oldname.local  192.168.10.51   mon mgr mds _admin
3 hosts in cluster
ceph config-key get mgr/cephadm/inventory:
{ "ceph01.domain.local":{ "hostname":"ceph01.domain.local", "addr":"192.168.10.210", "labels":[ "_admin", "rgw", "grafana", "mds" ], "status":"" }, "ceph03.domain.local":{ "hostname":"ceph03.domain.local", "addr":"192.168.10.80", "labels":[ "rbd" ], "status":"" }, "ceph02.oldname.local":{ "hostname":"ceph02.oldname.local", "addr":"192.168.10.51", "labels":[ "mon", "mgr", "mds", "_admin" ], "status":"" } }
I did health checks on both the host, as well as the cephadm shell container:
root@ceph01:/var/log# ceph cephadm check-host ceph03.domain.local
ceph03.domain.local (None) ok
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Hostname "ceph03.domain.local" matches what is expected.
Host looks OK

root@ceph01:/var/log# ceph cephadm check-host ceph01.domain.local
ceph01.domain.local (None) ok
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Hostname "ceph01.domain.local" matches what is expected.
Host looks OK

root@ceph01:/var/log# ceph cephadm check-host ceph02.oldname.local
ceph02.oldname.local (None) ok
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Hostname "ceph02.oldname.local" matches what is expected.
Host looks OK
Thoughts?
Updated by Adam King over 1 year ago
It's odd that the hostname it reports not having an address for, "ceph02.domain.local", isn't even a hostname it has stored. It seems at this point that it probably failed to find an address for "ceph02.domain.local" because it doesn't have any entry for it at all. The question is why it was trying to reach that hostname in the first place. Perhaps something was cached that shouldn't have been there anymore? Does stopping the upgrade, running "ceph mgr fail" and starting the upgrade up again make this happen again? It might also be worth checking in `orch ps` output that "ceph02.domain.local" isn't reported as the hostname for any of the daemons, and that no service spec placements explicitly reference "ceph02.domain.local" either (see the sketch below).
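A rough sketch of those checks; the grep pattern is just the stale hostname from the traceback, adjust as needed:

ceph orch upgrade stop
ceph mgr fail
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.4
# Look for the stale name in daemon placements and in service spec placements:
ceph orch ps | grep ceph02.domain.local
ceph orch ls --export | grep ceph02.domain.local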
Updated by Brian Woods over 1 year ago
I added DNS entries for all combinations. So both ceph02.oldname.local and ceph02.domain.local are now valid names, but the host is still configured as "oldname.local".
After confirming all combos worked, I then ran these, waiting about a minute between each command:
root@ceph01# ceph orch upgrade stop
Stopped upgrade to quay.io/ceph/ceph:v17.2.4
root@ceph01# ceph mgr fail
root@ceph01# ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.4
Initiating upgrade to quay.io/ceph/ceph:v17.2.4
root@ceph01# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v17.2.4",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": ""
}
No movement in the process after about a half hour.
Current `ceph orch ps`:
# ceph orch ps
NAME                            HOST                  PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
alertmanager.ceph01             ceph01.domain.local   *:9093,9094  running (3d)   4m ago     13d  29.0M    -                 ba2b418f427c  16eaa667487b
crash.ceph03                    ceph03.domain.local                running (3d)   4m ago     13d  8999k    -        17.2.3   0912465dcea5  60d65abff255
crash.ceph01                    ceph01.domain.local                running (3d)   4m ago     13d  8928k    -        17.2.3   0912465dcea5  2463cf388cd7
crash.ceph02                    ceph02.oldname.local               running (11d)  4m ago     11d  10.2M    -        17.2.3   0912465dcea5  a24e3c222e35
grafana.ceph01                  ceph01.domain.local   *:3000       running (3d)   4m ago     13d  86.0M    -        8.3.5    dad864ee21e9  50ed1c829566
mds.mds-default.ceph03.ptkjle   ceph03.domain.local                running (3d)   4m ago     13d  1994M    -        17.2.3   0912465dcea5  0db1f329b706
mds.mds-default.ceph01.zrrptd   ceph01.domain.local                running (3d)   4m ago     13d  30.3M    -        17.2.3   0912465dcea5  496c9f753bc7
mgr.ceph03.haayqy               ceph03.domain.local   *:8443,9283  running (3d)   4m ago     13d  111M     -        17.2.3   0912465dcea5  3fe9424055ee
mgr.ceph01.domain.local.miydsy  ceph01.domain.local   *:9283       running (3d)   4m ago     13d  469M     -        17.2.3   0912465dcea5  5bfed8d219fd
mon.ceph03                      ceph03.domain.local                running (3d)   4m ago     13d  476M     2048M    17.2.3   0912465dcea5  1119bcfc84af
mon.ceph01.domain.local         ceph01.domain.local                running (3d)   4m ago     13d  472M     2048M    17.2.3   0912465dcea5  3da27dc943f4
mon.ceph02                      ceph02.oldname.local               running (11d)  4m ago     11d  578M     2048M    17.2.3   0912465dcea5  88d7bfbcd9f5
node-exporter.ceph03            ceph03.domain.local   *:9100       running (3d)   4m ago     13d  20.2M    -                 1dbe0e931976  8e6130b088db
node-exporter.ceph01            ceph01.domain.local   *:9100       running (3d)   4m ago     13d  21.3M    -                 1dbe0e931976  8d01b76dda13
node-exporter.ceph02            ceph02.oldname.local  *:9100       running (11d)  4m ago     11d  4620k    -                 1dbe0e931976  fa2b46930880
osd.0                           ceph01.domain.local                running (3d)   4m ago     13d  3022M    4096M    17.2.3   0912465dcea5  d23b707f5f44
osd.1                           ceph01.domain.local                running (3d)   4m ago     11d  6731M    4096M    17.2.3   0912465dcea5  af7b429509e7
osd.2                           ceph01.domain.local                running (3d)   4m ago     11d  4897M    4096M    17.2.3   0912465dcea5  2bf8a273ffa9
osd.3                           ceph01.domain.local                running (3d)   4m ago     11d  4897M    4096M    17.2.3   0912465dcea5  57e198c87d82
osd.4                           ceph01.domain.local                running (3d)   4m ago     11d  4842M    4096M    17.2.3   0912465dcea5  90023164d14d
osd.5                           ceph01.domain.local                running (3d)   4m ago     11d  4460M    4096M    17.2.3   0912465dcea5  0c6a9a34ff72
osd.6                           ceph03.domain.local                running (3d)   4m ago     11d  2241M    4096M    17.2.3   0912465dcea5  537b839a31b7
osd.7                           ceph03.domain.local                running (3d)   4m ago     11d  3894M    4096M    17.2.3   0912465dcea5  8a30f14aa72c
osd.8                           ceph03.domain.local                running (3d)   4m ago     11d  3191M    4096M    17.2.3   0912465dcea5  5bcc089677a6
osd.9                           ceph03.domain.local                running (3d)   4m ago     11d  3717M    4096M    17.2.3   0912465dcea5  6e42ca8325d8
osd.10                          ceph03.domain.local                running (3d)   4m ago     11d  2406M    4096M    17.2.3   0912465dcea5  95858a805de8
osd.12                          ceph03.domain.local                running (3d)   4m ago     12d  3355M    4096M    17.2.3   0912465dcea5  3c8cc41e1dce
prometheus.ceph01               ceph01.domain.local   *:9095       running (3d)   4m ago     13d  138M     -                 514e6a882f6e  95e532fae898

Also saw this in the logs, not sure what I was doing at the time, but:

2022-10-11T04:23:59.481304+0000 mgr.ceph03.haayqy (mgr.5537070) 1331 : cephadm [ERR] check-host failed for '192.168.10.210'
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1042, in check_host
    error_ok=True, no_fsid=True))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 590, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1273, in _run_cephadm
    await self.mgr.ssh._remote_connection(host, addr)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 66, in _remote_connection
    raise OrchestratorError("host address is empty")
orchestrator._interface.OrchestratorError: host address is empty
Updated by Brian Woods over 1 year ago
Oh, by all combinations, I mean I created DNS entries for all hosts, not just ceph02.
Updated by Adam King over 1 year ago
Alright, looking back at the original traceback:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1042, in check_host
    error_ok=True, no_fsid=True))
that particular check_host function is the one that is called when directly running "ceph cephadm check-host". I hadn't read carefully earlier and confused it with the _check_host elsewhere that gets called internally.
I think the FQDNs and the hostnames are likely not the cause of the real issue here, which is that the upgrade stalled out. If you try another upgrade and wait for a bit, what does "ceph log last 100 info cephadm" say? If that doesn't give anything useful, it could be worth setting the log level to debug with "ceph config set mgr mgr/cephadm/log_to_cluster_level debug", then doing the same thing but instead running "ceph log last 200 debug cephadm" (see the commands collected below). I think we need to go back and try to generally diagnose why the upgrade is getting stuck rather than continuing to look at this hostname stuff.
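For reference, the exact commands being suggested above, collected in one place:

ceph log last 100 info cephadm
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
# retry the upgrade, give it a bit, then:
ceph log last 200 debug cephadm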
Updated by Brian Woods over 1 year ago
Seems I haven't seen the "host address is empty" error in about 10 days now.... Not sure if that is because of DNS, or what. So, good news?
The bad news: even with debug logging enabled and the upgrade restarted, there are zero new entries, even hours later:
2022-10-21T22:45:14.884185+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8661 : cephadm [DBG] mgr option ssh_config_file = None
2022-10-21T22:45:14.884260+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8662 : cephadm [DBG] mgr option device_cache_timeout = 1800
2022-10-21T22:45:14.884307+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8663 : cephadm [DBG] mgr option device_enhanced_scan = False
2022-10-21T22:45:14.884350+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8664 : cephadm [DBG] mgr option daemon_cache_timeout = 600
2022-10-21T22:45:14.884392+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8665 : cephadm [DBG] mgr option facts_cache_timeout = 60
2022-10-21T22:45:14.884434+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8666 : cephadm [DBG] mgr option host_check_interval = 600
2022-10-21T22:45:14.884474+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8667 : cephadm [DBG] mgr option mode = root
2022-10-21T22:45:14.884516+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8668 : cephadm [DBG] mgr option container_image_base = quay.io/ceph/ceph
2022-10-21T22:45:14.884557+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8669 : cephadm [DBG] mgr option container_image_prometheus = quay.io/prometheus/prometheus:v2.33.4
2022-10-21T22:45:14.884597+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8670 : cephadm [DBG] mgr option container_image_grafana = quay.io/ceph/ceph-grafana:8.3.5
2022-10-21T22:45:14.884636+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8671 : cephadm [DBG] mgr option container_image_alertmanager = quay.io/prometheus/alertmanager:v0.23.0
2022-10-21T22:45:14.884675+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8672 : cephadm [DBG] mgr option container_image_node_exporter = quay.io/prometheus/node-exporter:v1.3.1
2022-10-21T22:45:14.884714+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8673 : cephadm [DBG] mgr option container_image_loki = docker.io/grafana/loki:2.4.0
2022-10-21T22:45:14.884753+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8674 : cephadm [DBG] mgr option container_image_promtail = docker.io/grafana/promtail:2.4.0
2022-10-21T22:45:14.884793+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8675 : cephadm [DBG] mgr option container_image_haproxy = docker.io/library/haproxy:2.3
2022-10-21T22:45:14.884832+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8676 : cephadm [DBG] mgr option container_image_keepalived = docker.io/arcts/keepalived
2022-10-21T22:45:14.884871+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8677 : cephadm [DBG] mgr option container_image_snmp_gateway = docker.io/maxwo/snmp-notifier:v1.2.1
2022-10-21T22:45:14.884911+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8678 : cephadm [DBG] mgr option warn_on_stray_hosts = True
2022-10-21T22:45:14.884950+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8679 : cephadm [DBG] mgr option warn_on_stray_daemons = True
2022-10-21T22:45:14.884989+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8680 : cephadm [DBG] mgr option warn_on_failed_host_check = True
2022-10-21T22:45:14.885028+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8681 : cephadm [DBG] mgr option log_to_cluster = True
2022-10-21T22:45:14.885066+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8682 : cephadm [DBG] mgr option allow_ptrace = False
2022-10-21T22:45:14.885108+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8683 : cephadm [DBG] mgr option container_init = True
2022-10-21T22:45:14.885149+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8684 : cephadm [DBG] mgr option prometheus_alerts_path = /etc/prometheus/ceph/ceph_default_alerts.yml
2022-10-21T22:45:14.885190+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8685 : cephadm [DBG] mgr option migration_current = 5
2022-10-21T22:45:14.885230+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8686 : cephadm [DBG] mgr option config_dashboard = True
2022-10-21T22:45:14.885270+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8687 : cephadm [DBG] mgr option manage_etc_ceph_ceph_conf = False
2022-10-21T22:45:14.885309+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8688 : cephadm [DBG] mgr option manage_etc_ceph_ceph_conf_hosts = *
2022-10-21T22:45:14.885349+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8689 : cephadm [DBG] mgr option registry_url = None
2022-10-21T22:45:14.885389+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8690 : cephadm [DBG] mgr option registry_username = None
2022-10-21T22:45:14.885428+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8691 : cephadm [DBG] mgr option registry_password = None
2022-10-21T22:45:14.885467+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8692 : cephadm [DBG] mgr option registry_insecure = False
2022-10-21T22:45:14.885506+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8693 : cephadm [DBG] mgr option use_repo_digest = True
2022-10-21T22:45:14.885545+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8694 : cephadm [DBG] mgr option config_checks_enabled = False
2022-10-21T22:45:14.885588+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8695 : cephadm [DBG] mgr option default_registry = docker.io
2022-10-21T22:45:14.885629+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8696 : cephadm [DBG] mgr option max_count_per_host = 10
2022-10-21T22:45:14.885673+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8697 : cephadm [DBG] mgr option autotune_memory_target_ratio = 0.7
2022-10-21T22:45:14.885715+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8698 : cephadm [DBG] mgr option autotune_interval = 600
2022-10-21T22:45:14.885756+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8699 : cephadm [DBG] mgr option use_agent = False
2022-10-21T22:45:14.885795+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8700 : cephadm [DBG] mgr option agent_refresh_rate = 20
2022-10-21T22:45:14.885834+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8701 : cephadm [DBG] mgr option agent_starting_port = 4721
2022-10-21T22:45:14.885875+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8702 : cephadm [DBG] mgr option agent_down_multiplier = 3.0
2022-10-21T22:45:14.885915+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8703 : cephadm [DBG] mgr option max_osd_draining_count = 10
2022-10-21T22:45:14.885955+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8704 : cephadm [DBG] mgr option log_level =
2022-10-21T22:45:14.885995+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8705 : cephadm [DBG] mgr option log_to_file = False
2022-10-21T22:45:14.886036+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 8706 : cephadm [DBG] mgr option log_to_cluster_level = debug
2022-10-21T22:51:09.306208+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 9031 : cephadm [INF] Upgrade: Stopped
2022-10-21T22:51:26.531006+0000 mgr.ceph01.domain.local.miydsy (mgr.8880077) 9050 : cephadm [INF] Upgrade: Started with target quay.io/ceph/ceph:v17.2.4
Ideas? Those are the only entries since enabling debug logging.
Updated by Adam King over 1 year ago
This seems to imply the cephadm service loop just isn't running at all. Does the REFRESHED column in `ceph orch device ls` report a relatively recent refresh time? If not, something got stuck somewhere, the most common culprit being a hung `cephadm ceph-volume` process on one of the nodes. If it does report a recent refresh, perhaps the orchestrator is paused? What does `ceph orch status` spit out? (A sketch of these checks follows below.)
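A quick sketch of those checks; the last step (looking for a long-running process on each host) is only one way one might spot a hung cephadm/ceph-volume call, not an official procedure:

ceph orch device ls     # the REFRESHED column should show a recent time for every host
ceph orch status        # Backend / Available / Paused
# On each host (assumption, just one way to look for a hung call):
ps aux | grep -E 'cephadm|ceph-volume' | grep -v grep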
Updated by Brian Woods over 1 year ago
I rebooted last night; all items report a refreshed time of about 13 hours ago, i.e. when I rebooted.
# ceph orch status
Backend: cephadm
Available: Yes
Paused: No
Re-enabled debug (just in case something resets it).
Restarted the upgrade.
And though I see this:
# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v17.2.4",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": ""
}
It is NOT showing in the `ceph -s` output!!! So I think you are onto something. I thought it did this the other day, but after running the upgrade command again, I just figured I didn't hit enter or something...
So I ran it again, but it still doesn't show up...
Also wrote a quick script to scrub my logs for easy posting, so I have attached a good long chunk of it.
Updated by Brian Woods over 1 year ago
I am getting ready to add another node to the cluster. Is there anything you can think of that I can check, pre or post?
Updated by Brian Woods over 1 year ago
And suddenly the upgrade is happening!!!
Today I rebooted ceph02, a node that only had the MDS, and suddenly things started upgrading!
No idea why... Will report back when it finishes, and will try to capture logs (not tonight though...).
Updated by Brian Woods about 1 year ago
Something was stuck on one of the nodes. Can't debug further. This ticket can be canceled.