Bug #58242 (Closed): cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd
Description
If, after a host restart, the initial start of a container failed for whatever reason (perhaps because the systemd start timeout TimeoutStartSec was exceeded) then cephadm does not detect the container as running even if a subsequent automatic restart by systemd successfully started the container.
- Restart a host that runs a lot of services and look at the logs to see how long it takes to start the various service containers.
- Edit the ceph container systemd service file at /etc/systemd/system/ceph-<fsid>@.service and change TimeoutStartSec to be just a few seconds less than the previously recorded longest time to start the ceph services when all containers are starting at the same time. The idea is to have it long enough to allow some/most containers to start but cause a start timeout on others (see the sketch after this list).
- Restart the host.
- Since the TimeoutStartSec now makes some containers time out on start when all the containers are starting at the same time, systemd will terminate and later restart the containers that timed out starting.
- When systemd later retries the containers that timed out, the system is hopefully less busy, so they start faster and no longer exceed the start timeout.
- The containers that systemd restarted later, and that now did not exceed the start timeout because the system was less busy, are running.
- Do a ceph orch ps and witness the error in the "STATUS" field for those containers that did not start on the initial start attempt (but are now running just fine after systemd restarted them).
If there is some other way besides TimeoutStartSec to make the initial container start after boot fail but still allow the subsequent automatic restart by systemd to succeed, then use that for reproducing.
The only thing that seems to matter is that the container needs to fail the initial start and then start successfully on the later automatic restart by systemd. cephadm does not detect such containers as running in ceph orch ls, ceph orch ps, or other commands, even if ceph -s shows that all mons, mgrs, osds etc. are running just fine. Such containers seem to be invisible to cephadm.
cephadm needs to handle the case where a container failed the initial start attempt but successfully started on the later automatic restart attempt by systemd.
Files
Updated by Redouane Kachach Elhichou over 1 year ago
- Assignee set to Redouane Kachach Elhichou
Updated by Redouane Kachach Elhichou over 1 year ago
Thank you very much for reporting the issue and for providing the detailed instructions. I tried to reproduce the issue in a small cluster (3 nodes); however, cephadm eventually showed the right information in 'ceph orch ps'.
I said eventually because the information on the cephadm side is not reflected instantly; you have to wait a little bit for the daemon information to be updated. By default cephadm refreshes the daemon information every 10 minutes. You can force cephadm to refresh the information by adding the flag --refresh:
ceph orch ps --refresh
This behaviour is configurable through the variable daemon_cache_timeout (10 min by default); the following is the command to change this variable's value:
ceph config set mgr mgr/cephadm/daemon_cache_timeout XX (in seconds)
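As a quick check, the value currently in effect should be visible with the standard config query:

ceph config get mgr mgr/cephadm/daemon_cache_timeout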
Please, keep in mind that when you pass the --refresh option or when you change the variable above, cephadm will trigger an information refresh operation whose duration depends on how big the cluster is (the more nodes it has, the longer it takes). Normally the user can re-run the `ceph orch ps` command and check whether the info is up-to-date or not (by checking the REFRESHED column, which in this case is our source of truth).
Please, can you double-check (with this info in mind) whether the problem persists and the status is still not updated?
BTW: when the problem happens, it would be good to attach the output of 'ceph orch ps'.
Updated by Voja Molani over 1 year ago
The cluster in question is 3 nodes.
Some time after boot:
# ceph orch ps | egrep 'cep1|REFRES'
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
alertmanager.cep1 cep1.domain.com 192.0.2.11:9093,9094 error 6m ago 4M - - <unknown> <unknown> <unknown>
grafana.cep1 cep1.domain.com 192.0.2.11:3000 error 6m ago 4M - - 8.3.5 dad864ee21e9 048903dacf83
mgr.cep1.domain.com.asmugh cep1.domain.com *:9283 error 6m ago 4M 417M - 17.2.3 0912465dcea5 667e55c0d52d
mon.cep1.domain.com cep1.domain.com error 6m ago 4M - 2048M <unknown> <unknown> <unknown>
node-exporter.cep1 cep1.domain.com 192.0.2.11:9100 error 6m ago 4M - - 1dbe0e931976 8038262a4167
osd.0 cep1.domain.com error 6m ago 4M - 4096M 17.2.3 0912465dcea5 6084aee38804
osd.1 cep1.domain.com error 6m ago 4M - 4096M 17.2.3 0912465dcea5 359b8a6160ec
osd.2 cep1.domain.com error 6m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.3 cep1.domain.com error 6m ago 4M - 4096M 17.2.3 0912465dcea5 b01691462e1c
osd.4 cep1.domain.com error 6m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.5 cep1.domain.com error 6m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.6 cep1.domain.com error 6m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.7 cep1.domain.com error 6m ago 4M - 4096M 17.2.3 0912465dcea5 fe87322f6a62
osd.8 cep1.domain.com error 6m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.9 cep1.domain.com running (6m) 6m ago 4M - 4096M 17.2.3 0912465dcea5 e5dc85a68eab
osd.10 cep1.domain.com error 6m ago 4M - 4096M 17.2.3 0912465dcea5 788d7fd8f43a
osd.11 cep1.domain.com error 6m ago 4M - 4096M 17.2.3 0912465dcea5 69a36cd1aa83
prometheus.cep1 cep1.domain.com 192.0.2.11:9095 error 6m ago 4M - - <unknown> <unknown> <unknown>
rgw.pool.cep1.abmahl cep1.domain.com 192.0.2.11:8000 error 6m ago 4M 77.8M - 17.2.3 0912465dcea5 458f8405ee84
The refresh was 6m ago; only osd.9 is not in the "error" state. Meanwhile, on the podman side:
# podman ps --no-trunc
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
458f8405ee84ebe4c66464f852808b99d757acb616c77fbd8a8d9141b1bb65d6 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n client.rgw.pool.cep1.abmahl -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 14 minutes ago Up 11 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-rgw-pool-cep1-abmahl
667e55c0d52db4aaf293e3f76b3b00bfc32c2bd32e49b6f524265a101a971afb private-reg/ceph:latest -n mgr.cep1.domain.com.asmugh -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 14 minutes ago Up 13 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mgr-cep1-domain-com-asmugh
048903dacf8379dc44661fb68e7e3ab01fe80af5b543bff381493cdb9408e112 quay.io/ceph/ceph-grafana:8.3.5 /bin/bash 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-grafana-cep1
8038262a4167687f86e05b4a6fc5201e85e76a8002cf4b77ed1056a57084f2c7 quay.io/prometheus/node-exporter:v1.3.1 --no-collector.timex --web.listen-address=192.0.2.11:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-node-exporter-cep1
fe87322f6a62ceba8fc86d07068594fd398eaecac4943798a75a226bc948fbf0 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.7 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-7
359b8a6160ec6cf635cce6afa8b2c002877aa3c02a19a8ce8ef9131ad3c7604a private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.1 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-1
788d7fd8f43a39897e8d84f8c650d3a43c066190cd9d96c23e14acb671e531ee private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.10 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-10
b01691462e1c43e2c87deb3edcb5fe8afe82a7c4107623564bb8a01b80957560 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.3 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-3
69a36cd1aa834d65fbf3fde13a2618fce40e6be1efb8e316857ae6545db2feb6 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.11 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-11
6084aee38804040de0f99e51ebf288f33f86b60c91d45c84376bb5788b8d5541 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-0
40c66bb9109dcf2894282f2bb9cd5021ba0bd6eb2c390e741a45248d14e3dc97 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.2 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-2
bd0e468d3db61cb04eb7ecb1b2a0711712e09a8671c24eca8411b5a6947881b2 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.6 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-6
e5dc85a68eabd44f46ace54489b32dd450b237259322b2c594e3e86d4441ce3e private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.9 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-9
386590aeba681b374e718be72278b08e3ecd18833fccf3a05992eedf5f363bbc private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.5 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-5
f68b7db5debe5e26a4e01746717aa8fa86966286a5bcc8021294e5f3956daaad private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-4
aa6cef9888d9b9af9362c46f18f1fc39f5d9c88a173e329116cd0a815ec7c4c2 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.8 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 6 minutes ago Up 6 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-8
1b27873b18c0c0b9198a2a67eee38f186ce8e8790f059b56d95bb686b91fe36b private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 5 minutes ago Up 5 minutes ago gallant_mahavira
When ceph orch ps was last refreshed, podman was already seeing rgw and mgr as "up"; only 1 osd was listed as running in ceph orch ps (no "MEM USE" for it, by the way?!) while all 12 osd containers were showing in podman ps. Maybe there is a reason for the "error" state on rgw/mgr, I wouldn't know. I do note that "MEM USE" is listed only for rgw and mgr while podman shows many more containers - maybe they were launched after cephadm refreshed; the resolution of the "6 minutes ago" is not great.
Let's wind forward to the next refresh after the default 10 min (thanks, I wasn't aware of that; I think it should be documented, and when there's time I will look for a place and make a PR).
# ceph orch ps | egrep 'cep1|REFRES'
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
alertmanager.cep1 cep1.domain.com 192.0.2.11:9093,9094 error 13s ago 4M - - <unknown> <unknown> <unknown>
grafana.cep1 cep1.domain.com 192.0.2.11:3000 error 13s ago 4M 66.7M - 8.3.5 dad864ee21e9 048903dacf83
mgr.cep1.domain.com.asmugh cep1.domain.com *:9283 error 13s ago 4M 422M - 17.2.3 0912465dcea5 667e55c0d52d
mon.cep1.domain.com cep1.domain.com error 13s ago 4M - 2048M <unknown> <unknown> <unknown>
node-exporter.cep1 cep1.domain.com 192.0.2.11:9100 error 13s ago 4M 26.7M - 1dbe0e931976 8038262a4167
osd.0 cep1.domain.com error 13s ago 4M 10.4M 4096M 17.2.3 0912465dcea5 6084aee38804
osd.1 cep1.domain.com error 13s ago 4M 10.8M 4096M 17.2.3 0912465dcea5 359b8a6160ec
osd.2 cep1.domain.com running (11m) 13s ago 4M 11.5M 4096M 17.2.3 0912465dcea5 40c66bb9109d
osd.3 cep1.domain.com error 13s ago 4M 11.3M 4096M 17.2.3 0912465dcea5 b01691462e1c
osd.4 cep1.domain.com running (10m) 13s ago 4M 10.8M 4096M 17.2.3 0912465dcea5 f68b7db5debe
osd.5 cep1.domain.com running (11m) 13s ago 4M 10.6M 4096M 17.2.3 0912465dcea5 386590aeba68
osd.6 cep1.domain.com running (11m) 13s ago 4M 13.7M 4096M 17.2.3 0912465dcea5 bd0e468d3db6
osd.7 cep1.domain.com error 13s ago 4M 13.8M 4096M 17.2.3 0912465dcea5 fe87322f6a62
osd.8 cep1.domain.com running (10m) 13s ago 4M 13.9M 4096M 17.2.3 0912465dcea5 aa6cef9888d9
osd.9 cep1.domain.com running (11m) 13s ago 4M 11.5M 4096M 17.2.3 0912465dcea5 e5dc85a68eab
osd.10 cep1.domain.com error 13s ago 4M 13.7M 4096M 17.2.3 0912465dcea5 788d7fd8f43a
osd.11 cep1.domain.com error 13s ago 4M 14.0M 4096M 17.2.3 0912465dcea5 69a36cd1aa83
prometheus.cep1 cep1.domain.com 192.0.2.11:9095 error 13s ago 4M - - <unknown> <unknown> <unknown>
rgw.pool.cep1.abmahl cep1.domain.com 192.0.2.11:8000 error 13s ago 4M 86.9M - 17.2.3 0912465dcea5 458f8405ee84
# podman ps --no-trunc
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
458f8405ee84ebe4c66464f852808b99d757acb616c77fbd8a8d9141b1bb65d6 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n client.rgw.pool.cep1.abmahl -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 20 minutes ago Up 17 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-rgw-pool-cep1-abmahl
667e55c0d52db4aaf293e3f76b3b00bfc32c2bd32e49b6f524265a101a971afb private-reg/ceph:latest -n mgr.cep1.domain.com.asmugh -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 20 minutes ago Up 19 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mgr-cep1-domain-com-asmugh
048903dacf8379dc44661fb68e7e3ab01fe80af5b543bff381493cdb9408e112 quay.io/ceph/ceph-grafana:8.3.5 /bin/bash 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-grafana-cep1
8038262a4167687f86e05b4a6fc5201e85e76a8002cf4b77ed1056a57084f2c7 quay.io/prometheus/node-exporter:v1.3.1 --no-collector.timex --web.listen-address=192.0.2.11:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-node-exporter-cep1
fe87322f6a62ceba8fc86d07068594fd398eaecac4943798a75a226bc948fbf0 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.7 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-7
359b8a6160ec6cf635cce6afa8b2c002877aa3c02a19a8ce8ef9131ad3c7604a private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.1 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-1
788d7fd8f43a39897e8d84f8c650d3a43c066190cd9d96c23e14acb671e531ee private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.10 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-10
b01691462e1c43e2c87deb3edcb5fe8afe82a7c4107623564bb8a01b80957560 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.3 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-3
69a36cd1aa834d65fbf3fde13a2618fce40e6be1efb8e316857ae6545db2feb6 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.11 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-11
6084aee38804040de0f99e51ebf288f33f86b60c91d45c84376bb5788b8d5541 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-0
40c66bb9109dcf2894282f2bb9cd5021ba0bd6eb2c390e741a45248d14e3dc97 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.2 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-2
bd0e468d3db61cb04eb7ecb1b2a0711712e09a8671c24eca8411b5a6947881b2 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.6 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-6
e5dc85a68eabd44f46ace54489b32dd450b237259322b2c594e3e86d4441ce3e private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.9 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-9
386590aeba681b374e718be72278b08e3ecd18833fccf3a05992eedf5f363bbc private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.5 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-5
f68b7db5debe5e26a4e01746717aa8fa86966286a5bcc8021294e5f3956daaad private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-4
aa6cef9888d9b9af9362c46f18f1fc39f5d9c88a173e329116cd0a815ec7c4c2 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.8 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 12 minutes ago Up 12 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-8
1b27873b18c0c0b9198a2a67eee38f186ce8e8790f059b56d95bb686b91fe36b private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 11 minutes ago Up 11 minutes ago gallant_mahavira
This is the end result I had when I opened this issue. The containers never come out of the "error" state, with the exception of some (seemingly semi-random, with a few changing between reboots) containers, usually osds.
MEM USE is shown correctly for all containers. Let's take a look at ceph -s to see if it agrees with ceph orch ps:
# ceph -s
  cluster:
    id:     5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30
    health: HEALTH_WARN
            13 failed cephadm daemon(s)
            1/3 mons down, quorum cep2,cep3
            2 osds down
            1 host (12 osds) down
            Degraded data redundancy: 184387/553161 objects degraded (33.333%), 1099 pgs degraded, 1217 pgs undersized
  services:
    mon: 3 daemons, quorum cep2,cep3 (age 27m), out of quorum: cep1.domain.com
    mgr: cep2.atrumb(active, since 27m), standbys: cep1.domain.com.asmugh
    osd: 36 osds: 24 up (since 27m), 26 in (since 17m)
    rgw: 2 daemons active (2 hosts, 1 zones)
  data:
    pools:   8 pools, 1217 pgs
    objects: 184.39k objects, 684 GiB
    usage:   3.1 TiB used, 173 TiB / 176 TiB avail
    pgs:     184387/553161 objects degraded (33.333%)
             1099 active+undersized+degraded
             118  active+undersized
The health status shows that the mgr is up - as it indeed is. But ceph orch ps shows it with "error" status. It does agree that the mon is down, so let's wait more for that.
Meanwhile, note "2 osds down" so 10 OSD should be all good while ceph orch ps
had 6 osd in "error" status and only 6 osd as "running". It does see MEM USE for all OSD so let's just wait more then.
# ceph orch ps | egrep 'cep1|REFRES'
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
alertmanager.cep1 cep1.domain.com 192.0.2.11:9093,9094 error 8m ago 4M - - <unknown> <unknown> <unknown>
grafana.cep1 cep1.domain.com 192.0.2.11:3000 error 8m ago 4M 69.4M - 8.3.5 dad864ee21e9 048903dacf83
mgr.cep1.domain.com.asmugh cep1.domain.com *:9283 error 8m ago 4M 430M - 17.2.3 0912465dcea5 667e55c0d52d
mon.cep1.domain.com cep1.domain.com error 8m ago 4M - 2048M <unknown> <unknown> <unknown>
node-exporter.cep1 cep1.domain.com 192.0.2.11:9100 error 8m ago 4M 29.3M - 1dbe0e931976 8038262a4167
osd.0 cep1.domain.com error 8m ago 4M 10.6M 4096M 17.2.3 0912465dcea5 6084aee38804
osd.1 cep1.domain.com error 8m ago 4M 11.0M 4096M 17.2.3 0912465dcea5 359b8a6160ec
osd.2 cep1.domain.com running (51m) 8m ago 4M 11.7M 4096M 17.2.3 0912465dcea5 40c66bb9109d
osd.3 cep1.domain.com error 8m ago 4M 11.5M 4096M 17.2.3 0912465dcea5 b01691462e1c
osd.4 cep1.domain.com running (51m) 8m ago 4M 11.0M 4096M 17.2.3 0912465dcea5 f68b7db5debe
osd.5 cep1.domain.com running (51m) 8m ago 4M 10.8M 4096M 17.2.3 0912465dcea5 386590aeba68
osd.6 cep1.domain.com running (51m) 8m ago 4M 13.8M 4096M 17.2.3 0912465dcea5 bd0e468d3db6
osd.7 cep1.domain.com error 8m ago 4M 13.9M 4096M 17.2.3 0912465dcea5 fe87322f6a62
osd.8 cep1.domain.com running (51m) 8m ago 4M 14.1M 4096M 17.2.3 0912465dcea5 aa6cef9888d9
osd.9 cep1.domain.com running (51m) 8m ago 4M 11.7M 4096M 17.2.3 0912465dcea5 e5dc85a68eab
osd.10 cep1.domain.com error 8m ago 4M 13.9M 4096M 17.2.3 0912465dcea5 788d7fd8f43a
osd.11 cep1.domain.com error 8m ago 4M 14.2M 4096M 17.2.3 0912465dcea5 69a36cd1aa83
prometheus.cep1 cep1.domain.com 192.0.2.11:9095 error 8m ago 4M - - <unknown> <unknown> <unknown>
rgw.pool.cep1.abmahl cep1.domain.com 192.0.2.11:8000 error 8m ago 4M 96.4M - 17.2.3 0912465dcea5 458f8405ee84
Still the same, so let's just wait more and see if the 2 remaining osds come up in ceph -s - on previous tests I could get the cluster healthy (sans cephadm daemon errors) with ceph orch ps showing an output similar to the above.
# ceph orch ps | egrep 'cep1|REFRES'
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
alertmanager.cep1 cep1.domain.com 192.0.2.11:9093,9094 error 4m ago 4M - - <unknown> <unknown> <unknown>
grafana.cep1 cep1.domain.com 192.0.2.11:3000 error 4m ago 4M 70.3M - 8.3.5 dad864ee21e9 048903dacf83
mgr.cep1.domain.com.asmugh cep1.domain.com *:9283 error 4m ago 4M 433M - 17.2.3 0912465dcea5 667e55c0d52d
mon.cep1.domain.com cep1.domain.com error 4m ago 4M - 2048M <unknown> <unknown> <unknown>
node-exporter.cep1 cep1.domain.com 192.0.2.11:9100 error 4m ago 4M 25.3M - 1dbe0e931976 8038262a4167
osd.0 cep1.domain.com error 4m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.1 cep1.domain.com error 4m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.2 cep1.domain.com running (18m) 4m ago 4M 13.1M 4096M 17.2.3 0912465dcea5 5a33a3450bf6
osd.3 cep1.domain.com error 4m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.4 cep1.domain.com running (18m) 4m ago 4M 10.7M 4096M 17.2.3 0912465dcea5 e01c51260543
osd.5 cep1.domain.com running (18m) 4m ago 4M 10.6M 4096M 17.2.3 0912465dcea5 0affb2ba8889
osd.6 cep1.domain.com running (18m) 4m ago 4M 11.5M 4096M 17.2.3 0912465dcea5 a415cc1afd5c
osd.7 cep1.domain.com error 4m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.8 cep1.domain.com running (18m) 4m ago 4M 13.1M 4096M 17.2.3 0912465dcea5 64e6ddc4f22c
osd.9 cep1.domain.com running (18m) 4m ago 4M 14.0M 4096M 17.2.3 0912465dcea5 00117776d64e
osd.10 cep1.domain.com error 4m ago 4M - 4096M <unknown> <unknown> <unknown>
osd.11 cep1.domain.com error 4m ago 4M - 4096M <unknown> <unknown> <unknown>
prometheus.cep1 cep1.domain.com 192.0.2.11:9095 error 4m ago 4M - - <unknown> <unknown> <unknown>
rgw.pool.cep1.abmahl cep1.domain.com 192.0.2.11:8000 error 4m ago 4M 97.8M - 17.2.3 0912465dcea5 458f8405ee84
# podman ps --no-trunc
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
458f8405ee84ebe4c66464f852808b99d757acb616c77fbd8a8d9141b1bb65d6 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n client.rgw.pool.cep1.abmahl -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false About an hour ago Up About an hour ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-rgw-pool-cep1-abmahl
667e55c0d52db4aaf293e3f76b3b00bfc32c2bd32e49b6f524265a101a971afb private-reg/ceph:latest -n mgr.cep1.domain.com.asmugh -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false About an hour ago Up About an hour ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mgr-cep1-domain-com-asmugh
048903dacf8379dc44661fb68e7e3ab01fe80af5b543bff381493cdb9408e112 quay.io/ceph/ceph-grafana:8.3.5 /bin/bash About an hour ago Up About an hour ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-grafana-cep1
8038262a4167687f86e05b4a6fc5201e85e76a8002cf4b77ed1056a57084f2c7 quay.io/prometheus/node-exporter:v1.3.1 --no-collector.timex --web.listen-address=192.0.2.11:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs About an hour ago Up About an hour ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-node-exporter-cep1
1b27873b18c0c0b9198a2a67eee38f186ce8e8790f059b56d95bb686b91fe36b private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 About an hour ago Up About an hour ago gallant_mahavira
5a33a3450bf699ec9c51325f8124e8192c8ba977482c0bf2de114a8cd81416e8 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.2 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 18 minutes ago Up 18 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-2
a415cc1afd5c70e978d22205ebaab5a89a394b7b841ac75a699b2b4967d647c2 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.6 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 18 minutes ago Up 18 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-6
00117776d64e0df02163f9d96f1d949f014da5a9479ab8e9e10132d1addcf2b1 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.9 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 18 minutes ago Up 18 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-9
0affb2ba888921d323c19cd80b40f3dbe5870bd37da5713ea453d05bf9ed9755 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.5 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 18 minutes ago Up 18 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-5
e01c512605439911220b708cfcfd4bfc622d9cba6f507690fdec52c63b62f2b0 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 18 minutes ago Up 18 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-4
64e6ddc4f22c439db03a54f14bb36495ed62f84118e532b79fbc656f168333df private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.8 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false 18 minutes ago Up 18 minutes ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-8
Wait what? Only running for 18m? The system seemed to already be at a steady state.
Some containers were killed near the beginning (those that were missing from the penultimate podman listing above): for example prometheus started fine but eventually logged SIGTERM received, perhaps from systemd, and didn't get restarted after that.
I assume that to be the case with the other missing containers. The real reason for that is out of scope for this issue.
At least we can see the original issue on at least some containers: the mgr is definitely running and ceph -s sees it also, but cephadm thinks it is in "error" and only shows MEM USE for it. Additionally, 4 of the osds that cephadm thinks are in "error" are correctly seen by ceph -s.
But now all the containers were restarted? There is nothing special I could see in cephadm.log at the timestamp when all the containers were restarted. In general there is not much interesting in the log; lines similar to these do repeat sometimes:
2023-01-13 08:08:48,295 7f6ad225b740 DEBUG /usr/bin/podman: ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
2023-01-13 08:08:48,445 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:48,453 7f6ad225b740 DEBUG systemctl: failed
2023-01-13 08:08:48,502 7f6ad225b740 DEBUG /usr/bin/podman: 6084aee38804040de0f99e51ebf288f33f86b60c91d45c84376bb5788b8d5541,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:41.72503957 +0400 +04,
2023-01-13 08:08:48,604 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:48,612 7f6ad225b740 DEBUG systemctl: failed
2023-01-13 08:08:48,660 7f6ad225b740 DEBUG /usr/bin/podman: 359b8a6160ec6cf635cce6afa8b2c002877aa3c02a19a8ce8ef9131ad3c7604a,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:38.589799703 +0400 +04,
2023-01-13 08:08:48,763 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:48,771 7f6ad225b740 DEBUG systemctl: failed
2023-01-13 08:08:48,819 7f6ad225b740 DEBUG /usr/bin/podman: 788d7fd8f43a39897e8d84f8c650d3a43c066190cd9d96c23e14acb671e531ee,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:38.985777789 +0400 +04,
2023-01-13 08:08:48,920 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:48,929 7f6ad225b740 DEBUG systemctl: failed
2023-01-13 08:08:48,978 7f6ad225b740 DEBUG /usr/bin/podman: 69a36cd1aa834d65fbf3fde13a2618fce40e6be1efb8e316857ae6545db2feb6,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:41.548444374 +0400 +04,
2023-01-13 08:08:49,080 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:49,087 7f6ad225b740 DEBUG systemctl: active
2023-01-13 08:08:49,138 7f6ad225b740 DEBUG /usr/bin/podman: 40c66bb9109dcf2894282f2bb9cd5021ba0bd6eb2c390e741a45248d14e3dc97,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:45.537820031 +0400 +04,
and other normal-looking lines like sestatus output repeating a gazillion times. I will try to look at why everything was restarted at around 1 hour of uptime.
The cluster is indeed 17.2.3; I was planning to re-install with 17.2.5 after more tests. All unique identifiers in the logs were sanitized.
So the logs above show the persistent "error" state happening with the mgr and 4 osds.
Updated by Redouane Kachach Elhichou over 1 year ago
I see. I think the problem (the discrepancy) between the information you see in 'ceph orch ps' and in ceph -s is that they use different 'sources of truth':
- cephadm uses the information from the systemd status
- ceph -s uses information from the real state (in terms of which processes are running and connected to the cluster)
My guess is that, for some reason (which I have to investigate in more depth), the services associated with the daemons (you can list them with: systemctl | grep ceph-) are in error state because they cannot execute the corresponding unit commands. So from the systemd point of view (and hence cephadm's) these services are in error. You can double check that by copying cephadm to the node and running the following command (the output of this command is what later gets reflected in the status you see from the ceph orch ps command):
./cephadm ls | grep -w -e name -e state
However, the container process is in fact running (that's why you see that mon/mgr is there when you run ceph -s).
My theory is that something goes wrong (because of the short timeout) during the startup, so the systemd service fails but the process itself (the podman container run by the service) is launched and stays there running. So systemd cannot re-run the service and fails (because the container is running). If this is the issue, the work-around would be to stop the container manually by running:
podman stop <container-id>
And then restart the corresponding ceph daemon with:
ceph orch daemon restart <daemon-name-from-orch-ps-output>
Please, keep in mind that ceph services are configured as follows, in case you try to restart the systemd service manually:
StartLimitInterval=30min
StartLimitBurst=5
So if there have been more than 5 restart attempts in the last 30 min, systemd won't let you restart the service until the 30 min period has finished. 'ceph orch daemon restart <daemon-name>' takes care of this when restarting a daemon, as it issues a reset-failed before restarting the systemd service.
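For reference, a rough sketch of the equivalent manual steps on the affected host, in case the daemon is restarted through systemd directly (the fsid and daemon name are placeholders):

# clear the start-limit / failed state first, then restart the unit
systemctl reset-failed ceph-<fsid>@osd.0.service
systemctl restart ceph-<fsid>@osd.0.service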
It seems that you are able to reproduce the issue easily. Please, give the recommendations above a try just to see if my theory about what's going on is correct and whether the work-around works or not. This way we will confirm the issue; the next step would then be to find out why the containers keep running when the systemd service has failed.
BTW: please, next time, just after the startup (when the daemons go into the error state), it would be good to see the output of systemctl status for these services in error (at least some of them). That would shed some light on the exit code / error.
Note: containers are in general stateless, so don't be afraid of stopping their corresponding process and restarting their daemon.
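A minimal sketch of collecting the status and journal output requested above on the affected host (the unit glob and daemon name are illustrative; substitute the real fsid):

systemctl status 'ceph-<fsid>@*' --no-pager
journalctl -u 'ceph-<fsid>@osd.0.service' --since=-1h --no-pager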
Updated by Voja Molani over 1 year ago
Thank you for your reply. I might not have time to look deeper into this until later but I want to reply to some of your points.
My theory is that something goes wrong (because of the short timeout) during the startup so the systemd service fails but the process itself (podman container run by the service) is launched and stays there running.
I am pretty sure that systemd is supposed to kill the units that did not start within the start timeout period.
I will try to confirm this.
It seems that you are able to reproduce the issue easily.
Well, not exactly "easily" - I set the start timeout to slightly less than what the containers usually take to start; perhaps 33-50% of the time it gets to this state, and it doesn't get even worse with many more errors.
It takes quite a bit of time.
Updated by Redouane Kachach Elhichou over 1 year ago
Voja Molani wrote:
Thank you for your reply. I might not have time to look deeper into this until later but I want to reply to some of your points.
My theory is that something goes wrong (because of the short timeout) during the startup so the systemd service fails but the process itself (podman container run by the service) is launched and stays there running.
I am pretty sure that systemd is supposed to kill the units that did not start within the start timeout period.
I will try to confirm this.
It seems that you are able to reproduce the issue easily.
Well, not exactly "easily" - I set the start timeout to slightly less than what the containers usually take to start; perhaps 33-50% of the time it gets to this state, and it doesn't get even worse with many more errors.
It takes quite a bit of time.
I see, no rush, especially since the process of reproducing the issue is time consuming.
Well, related to the systemd behaviour: that depends on the configuration. Right now we are using KillMode=none, so I'm afraid that if the start timeout fires then the container will stay there running, as we are asking systemd not to kill the service unit.
KillMode was set to none a long time ago to avoid systemd killing the podman containers, as that may lead to some issues (we prefer podman stopping and cleaning up its environment over systemd killing the process).
A 'simple' fix to see if that's the issue is to change KillMode to mixed (or control-group) and see whether that solves the problem or not.
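For experimenting with that theory, a hedged sketch of overriding KillMode with a drop-in on one host rather than editing the unit file (the <fsid> placeholder is illustrative, and this is only a test change, not a recommendation):

mkdir -p /etc/systemd/system/ceph-<fsid>@.service.d
cat > /etc/systemd/system/ceph-<fsid>@.service.d/killmode.conf <<'EOF'
[Service]
# Let systemd clean up the unit's control group instead of leaving
# the container process behind when the start timeout fires.
KillMode=mixed
EOF
systemctl daemon-reload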
Thanks.
Updated by Redouane Kachach Elhichou over 1 year ago
- Related to Bug #51361: KillMode=none is deprecated added
Updated by Redouane Kachach Elhichou over 1 year ago
- Related to Bug #58241: Systemd unit file TimeoutStartSec of 120 seconds is too low added
Updated by Voja Molani over 1 year ago
Well, related to the systemd behaviour: that depends on the configuration. Right now we are using KillMode=none, so I'm afraid that if the start timeout fires then the container will stay there running, as we are asking systemd not to kill the service unit.
Doesn't this explain the problem? The sequence of events will be:
- Systemd starts a container.
- Container does not start within timeout period and systemd will start to restart it later.
- The container keeps running.
- Systemd tries to restart the container but it is still running.
It makes no sense to have "any" TimeoutStartSec if no action is taken when that timeout is reached - the only thing accomplished will be that systemd and the real system state will be out of sync, and systemd loses view of the service that's now running without systemd caring/knowing about it.
If no action will be taken on timeout, then the only correct configuration is to have a very high TimeoutStartSec or to disable timeouts.
I believe it is probably better to not try to kill containers that don't come up on time. I suspect in some problem cases containers might take a very, very long time to start and forcefully terminating them would only make matters worse.
Updated by Redouane Kachach Elhichou over 1 year ago
Doesn't this explain the problem? The sequence of events will be:
- Systemd starts a container.
- Container does not start within timeout period and systemd will start to restart it later.
- The container keeps running.
- Systemd tries to restart the container but it is still running.
Yes, I suspect this is what's going on. So in the end the container will remain running, but the service will be reported as being in an error state because it failed.
It makes no sense to have "any" TimeoutStartSec if no action is taken when that timeout is reached - the only thing accomplished will be that systemd and the real system state will be out of sync, and systemd loses view of the service that's now running without systemd caring/knowing about it.
Well, the timeout is still useful as it marks the service as being in the 'error' state so the user can go and try to fix it (in this case by manually stopping the container, which is safer than trying to kill it).
If no action will be taken on timeout, then the only correct configuration is to have a very high TimeoutStartSec or to disable timeouts.
I believe it is probably better to not try to kill containers that don't come up on time. I suspect in some problem cases containers might take a very, very long time to start and forcefully terminating them would only make matters worse.
Yeah, that's the current approach, and actually it was introduced in the past because there were some issues related to systemd killing the container instead of having podman stopping it gracefully.
Anyway, the goal right now is to identify and ensure that the problem is what we think it is. Once we confirm it, then we can see what's the best solution to fix it :)
Updated by Voja Molani over 1 year ago
- File systemctl-ceph.log systemctl-ceph.log added
In my opinion the "best" solution would be to differentiate between an "error" status and a still-starting-up status. I'm not sure if that is even possible or how much work it would involve.
Well, the timeout is still useful as it marks the service as being in the 'error' state so the user can go and try to fix it (in this case by manually stopping the container, which is safer than trying to kill it).
But it isn't really errored - it just took 5 (or whatever) seconds too long to start and is now running a-ok but without systemd knowing about it.
Since I had raised the TimeoutStartSec, this situation should not happen again under normal circumstances. But I could see that in some problem cases (and judging by reports from other users there are plenty of those situations) the container could take even dozens of minutes to hour(s) to complete startup.
In those non-normal cases, having ceph orch lose management of the container would just add to the stress on an admin who maybe does not know about these finer points of how the containers are managed.
Yeah, that's the current approach, and actually it was introduced in the past because there were some issues related to systemd killing the container instead of having podman stopping it gracefully.
I believe that this is how it should be but then again I am a noob. I would hope that containers that did not completely start within x seconds would not be killed, as I said earlier.
IMHO there would definitely be value in trying to solve this matter for those non-normal situations where startup takes long, but I am not sure what difference my input would make. Do I understand correctly that next I should check the systemd status (maybe systemctl status x and the journalctl log) and cephadm ls for those containers that went to the "error" state due to TimeoutStartSec being exceeded but eventually did start right and are OK now as witnessed by ceph -s? Anything else?
Currently most containers are in "error" state:
# ceph orch ps | egrep 'cep1|REFRE'
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
alertmanager.cep1 cep1.domain.com 192.0.2.11:9093,9094 running (4d) 23s ago 4M 34.3M - ba2b418f427c 96e9f81b3fb8
grafana.cep1 cep1.domain.com 192.0.2.11:3000 error 23s ago 4M 85.0M - 8.3.5 dad864ee21e9 aadcd354a612
mgr.cep1.domain.com.asmugh cep1.domain.com *:9283 running (4d) 23s ago 4M 451M - 17.2.3 0912465dcea5 8cc07c233522
mon.cep1.domain.com cep1.domain.com error 23s ago 4M 420M 2048M 17.2.3 0912465dcea5 7c9fe2039f56
node-exporter.cep1 cep1.domain.com 192.0.2.11:9100 error 23s ago 4M 28.5M - 1dbe0e931976 0f3900bb99ef
osd.0 cep1.domain.com error 23s ago 4M 1181M 4096M 17.2.3 0912465dcea5 4967c9a06c1d
osd.1 cep1.domain.com error 23s ago 4M 1110M 4096M 17.2.3 0912465dcea5 7f9edd875d32
osd.2 cep1.domain.com error 23s ago 4M 1077M 4096M 17.2.3 0912465dcea5 201fd4a9d42d
osd.3 cep1.domain.com error 23s ago 4M 1150M 4096M 17.2.3 0912465dcea5 d0704e4945de
osd.4 cep1.domain.com error 23s ago 4M 1045M 4096M 17.2.3 0912465dcea5 8a98581f2c11
osd.5 cep1.domain.com error 23s ago 4M 1074M 4096M 17.2.3 0912465dcea5 5d7452f25567
osd.6 cep1.domain.com error 23s ago 4M 996M 4096M 17.2.3 0912465dcea5 1fa0b8e4d937
osd.7 cep1.domain.com error 23s ago 4M 1088M 4096M 17.2.3 0912465dcea5 2813158cefb1
osd.8 cep1.domain.com error 23s ago 4M 1159M 4096M 17.2.3 0912465dcea5 923d0277c2da
osd.9 cep1.domain.com error 23s ago 4M 1153M 4096M 17.2.3 0912465dcea5 1a1eabd0eee2
osd.10 cep1.domain.com error 23s ago 4M 1129M 4096M 17.2.3 0912465dcea5 a5224ee22d33
osd.11 cep1.domain.com error 23s ago 4M 1073M 4096M 17.2.3 0912465dcea5 98303d0e1cca
prometheus.cep1 cep1.domain.com 192.0.2.11:9095 error 23s ago 4M 191M - 514e6a882f6e b356b8eb057b
rgw.pool.cep1.abmahl cep1.domain.com 192.0.2.11:8000 error 23s ago 4M 122M - 17.2.3 0912465dcea5 44faf1ecd556
As all containers list "MEM USE", I suspect they are all actually running OK. This is indeed what is shown in podman ps and ceph -s:
# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
aadcd354a612 quay.io/ceph/ceph-grafana:8.3.5 /bin/bash 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-grafana-cep1
b356b8eb057b quay.io/prometheus/prometheus:v2.33.4 --config.file=/et... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-prometheus-cep1
44faf1ecd556 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n client.rgw.poo... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-rgw-pool-cep1-abmahl
7c9fe2039f56 private-reg/ceph:latest -n mon.cep1.domai... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mon-cep1-domain-com
0f3900bb99ef quay.io/prometheus/node-exporter:v1.3.1 --no-collector.ti... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-node-exporter-cep1
98303d0e1cca private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.11 -f --se... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-11
5d7452f25567 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.5 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-5
7f9edd875d32 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.1 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-1
1a1eabd0eee2 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.9 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-9
d0704e4945de private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.3 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-3
201fd4a9d42d private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.2 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-2
2813158cefb1 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.7 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-7
1fa0b8e4d937 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.6 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-6
4967c9a06c1d private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.0 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-0
923d0277c2da private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.8 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-8
8a98581f2c11 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.4 -f --set... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-4
a5224ee22d33 private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 -n osd.10 -f --se... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-10
96e9f81b3fb8 quay.io/prometheus/alertmanager:v0.23.0 --cluster.listen-... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-alertmanager-cep1
8cc07c233522 private-reg/ceph:latest -n mgr.cep1.domai... 4 days ago Up 4 days ago ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mgr-cep1-domain-com-asmugh
d07c27bf3edc private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26 4 days ago Up 4 days ago gallant_mahavira
# ceph -s
  cluster:
    id:     5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30
    health: HEALTH_WARN
            17 failed cephadm daemon(s)
  services:
    mon: 3 daemons, quorum cep1.domain.com,cep2,cep3 (age 4d)
    mgr: cep2.atrumb(active, since 4d), standbys: cep1.domain.com.asmugh
    osd: 36 osds: 36 up (since 4d), 36 in (since 4d)
    rgw: 2 daemons active (2 hosts, 1 zones)
  data:
    pools:   8 pools, 1217 pgs
    objects: 184.38k objects, 684 GiB
    usage:   4.6 TiB used, 260 TiB / 265 TiB avail
    pgs:     1217 active+clean
  io:
    client: 4.0 KiB/s rd, 0 B/s wr, 3 op/s rd, 2 op/s wr
systemctl status ceph* output - which unfortunately is missing some log entries due to journal rotation last night. This is the same machine and the same reboot as in my previous update. I guess I need to reboot this to get the full journal for all containers, if that is useful.
But as can be seen, eventually all containers started correctly, even the mon that hadn't started yet in my previous update 5 days ago. 5 days ago all osds were restarted every 1 hour because they could not find a mon - strange, since there were 2 other mons in the cluster...
Anyway, now the cluster is fully functional and the only problem is that cephadm has lost all information about the containers beyond the "MEM USE" parameter.
[see attached log, redmine gives a server error trying to have this log inline]
Systemd seems to have a full and correct picture of all the containers, so I'm not sure what exactly makes cephadm think the containers are in "error". Maybe this is a bug?
I could not spot any new information in the cephadm ls output:
# cephadm ls |grep -w -e name -e state
"name": "mon.cep1.domain.com",
"state": "error",
"name": "mgr.cep1.domain.com.asmugh",
"state": "running",
"name": "osd.0",
"state": "error",
"name": "osd.1",
"state": "error",
"name": "osd.10",
"state": "error",
"name": "osd.11",
"state": "error",
"name": "osd.2",
"state": "error",
"name": "osd.3",
"state": "error",
"name": "osd.4",
"state": "error",
"name": "osd.5",
"state": "error",
"name": "osd.6",
"state": "error",
"name": "osd.7",
"state": "error",
"name": "osd.8",
"state": "error",
"name": "osd.9",
"state": "error",
"name": "rgw.pool.cep1.abmahl",
"state": "error",
"name": "node-exporter.cep1",
"state": "error",
"name": "alertmanager.cep1",
"state": "running",
"name": "grafana.cep1",
"state": "error",
"name": "prometheus.cep1",
"state": "error",
Updated by Redouane Kachach Elhichou over 1 year ago
Thank you for getting back with the results and for providing more info from your cluster.
Yesterday I was discussing this issue with other developers with expertise in the cephadm area, and as far as they are concerned this is the first time this issue has been reported. We have seen Ceph deployments in bigger clusters (with denser nodes and more OSDs) but we haven't experienced issues related to this TimeoutStartSec so far.
Please, can you provide more details about your cluster setup, such as:
- Number of nodes
- Maximum number of OSDs per node
- Disks model
- Are you using an external or internal registry to get ceph related images?
- Is your network download speed fast?
- Did you test pulling Ceph's container image manually? How long does it take?
I'm afraid that this slowness in the startup could be caused by some special circumstance in your cluster; otherwise bigger clusters (using slow disks) should have experienced a similar issue, but that's not the case.
As I understood from the thread, increasing TimeoutStartSec to 200 seconds solves the issue in your setup. Can you confirm please?
BTW: please, can you check whether the workaround I recommended solves your issue? For that, as I suggested, you have to (see the illustrative example after these steps):
1- Go to the node where daemons in error state are running
2- Stop the underlying podman containers manually with podman stop <container-id>
3- Then restart the daemon from cephadm by:
ceph orch daemon restart <daemon-name-from-orch-ps-output>
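As a concrete, purely illustrative example for a single daemon, using names from the listings above (the filter and daemon name are assumptions for demonstration):

podman ps --filter name=grafana --format '{{.ID}} {{.Names}}'
podman stop <container-id>
ceph orch daemon restart grafana.cep1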
Updated by Voja Molani over 1 year ago
- Number of nodes
3
- Maximum number of OSDs per node
12 HDD, 2 SSD for db
- Disks model
HGST HUS728T8TAL5200
- Are you using an external or internal registry to get ceph related images?
Internal for now but will be moving to stock registry on next reinstall.
- Is your network downloading fast?
Yes, downloading from the local container registry should be OK and not especially slow.
- did you test pulling Ceph's container image manually? how long does it take?
real 0m44.050s
user 0m0.120s
sys 0m0.093s
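For context, a timing like the above could be gathered with something along these lines (the image reference is illustrative):

time podman pull private-reg/ceph:latest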
I'm afraid that this slowness in the startup could be caused by some special circumstance in your cluster, otherwise bigger clusters (using slow disks) should have experienced a similar issue but it's not the case.
It could be. When I said earlier that the disks are slow what I meant was that the system disks are slow. They are Dell "disk on module" flash devices.
As I understood from the thread, increasing TimeoutStartSec to 200 seconds solves the issue in your setup. Can you confirm please?
Yes that is correct.
BTW: Please, can you check if the work around I recommended works to solve your issue, for that as I suggested you have to:
1- Go to the node where daemons in error state are running
2- Stop the underlying podman containers manually by a podman stop <container-id>
3- Then restart the daemon from cephadm by:
Yes, this works. I tested it on the grafana container. Within seconds the container is in the "running" state; I was not fast enough to see it in the "error" state again after doing orch daemon restart for it.
Updated by Redouane Kachach Elhichou over 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 49728
Updated by Adam King about 1 year ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to quincy, pacific
Updated by Backport Bot about 1 year ago
- Copied to Backport #58777: pacific: cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd added
Updated by Backport Bot about 1 year ago
- Copied to Backport #58778: quincy: cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd added
Updated by Adam King about 1 year ago
- Status changed from Pending Backport to Resolved