Bug #58242

closed

cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd

Added by Voja Molani over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
quincy, pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If, after a host restart, the initial start of a container failed for whatever reason (perhaps because the systemd start timeout TimeoutStartSec was exceeded), then cephadm does not detect the container as running, even if a subsequent automatic restart by systemd successfully started the container.

To reproduce:
  • Restart a host that runs a lot of services and check the log to see how long the various service containers take to start.
  • Edit the ceph container systemd unit file at /etc/systemd/system/ceph-<fsid>@.service and change TimeoutStartSec to just a few seconds less than the previously recorded longest time to start the ceph services when all containers are starting at the same time. The idea is to leave it long enough for some/most containers to start but to cause a start timeout on the others.
  • Restart the host.
  • Since TimeoutStartSec now makes some containers time out on start when all the containers are starting at the same time, systemd will later terminate and restart the containers that timed out while starting.
  • When systemd later tries to restart the containers that timed out, the system is hopefully less busy, so the containers start faster and no longer exceed the start timeout.
  • The containers that systemd restarted later, and that did not exceed the start timeout this time because the system was less busy, are now running.
  • Do a ceph orch ps and witness the error in the "STATUS" field for those containers that did not start on the initial start attempt (but are now running just fine after systemd restarted them); see the illustrative commands after this list.
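
For illustration only (the unit and container names below are placeholders following the ceph-<fsid>@<daemon> naming scheme; your daemon names will differ), the mismatch can be observed roughly like this after the reboot:

systemctl show -p TimeoutStartUSec ceph-<fsid>@osd.0.service
systemctl is-active ceph-<fsid>@osd.0.service   (reports "failed" after a start timeout)
podman ps --filter name=ceph-<fsid>-osd-0       (yet the container itself may still be "Up")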

If there is some other way besides TimeoutStartSec to make the initial container start after boot fail, while still allowing the subsequent automatic restart by systemd to succeed, then use that for reproducing.
The only thing that seems to matter is that the container fails its initial start and then starts successfully on the later automatic restart by systemd. cephadm does not detect such containers as running in ceph orch ls, ceph orch ps, or other commands, even if ceph -s shows that all mons, mgrs, OSDs etc. are running just fine. Such containers seem to be invisible to cephadm.

cephadm needs to handle the case where a container failed the initial start attempt but successfully started on the later automatic restart attempt by systemd.


Files

systemctl-ceph.log (96.1 KB) - Voja Molani, 01/18/2023 04:02 AM

Related issues 4 (1 open, 3 closed)

Related to Orchestrator - Bug #51361: KillMode=none is deprecated (New)

Related to Orchestrator - Bug #58241: Systemd unit file TimeoutStartSec of 120 seconds is too low (Resolved, Redouane Kachach Elhichou)

Copied to Orchestrator - Backport #58777: pacific: cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd (Resolved, Adam King)
Copied to Orchestrator - Backport #58778: quincy: cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd (Resolved, Adam King)
Actions #1

Updated by Redouane Kachach Elhichou over 1 year ago

  • Assignee set to Redouane Kachach Elhichou
Actions #2

Updated by Redouane Kachach Elhichou over 1 year ago

Thank you very much for reporting the issue and for providing the detailed instructions. I tried to reproduce the issue in a small cluster (3 nodes); however, eventually cephadm showed the right information in 'ceph orch ps'.

I said "eventually" because the information on the cephadm side is not reflected instantly; you have to wait a little bit for the daemon information to be updated. By default cephadm refreshes daemon information every 10 minutes. You can force cephadm to refresh the information by adding the --refresh flag:

ceph orch ps --refresh

This behaviour is configurable through the variable daemon_cache_timeout (10 min by default); the following is the command to change this variable's value:

ceph config set mgr mgr/cephadm/daemon_cache_timeout XX (in seconds)
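
For example, to lower the refresh interval to two minutes (the value is given in seconds; 120 is just an illustrative value):

ceph config set mgr mgr/cephadm/daemon_cache_timeout 120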

Please keep in mind that when you pass the --refresh option, or when you change the variable above, cephadm will trigger an information refresh operation whose duration depends on how big the cluster is (the more nodes it has, the longer it takes). Normally the user can re-run the `ceph orch ps` command and check whether the info is up to date or not (by checking the REFRESHED column, which is in this case our source of truth).

Please, can you double-check (with this info in mind) whether the problem persists and the status is not updated?

BTW: when the problem happens, it would be good to attach the output of 'ceph orch ps'.

Actions #3

Updated by Voja Molani over 1 year ago

The cluster in question is 3 nodes.

Some time after boot:

# ceph orch ps | egrep 'cep1|REFRES'
NAME                               HOST               PORTS                 STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION         IMAGE ID      CONTAINER ID
alertmanager.cep1                  cep1.domain.com    192.0.2.11:9093,9094  error            6m ago   4M        -        -  <unknown>       <unknown>     <unknown>
grafana.cep1                       cep1.domain.com    192.0.2.11:3000       error            6m ago   4M        -        -  8.3.5           dad864ee21e9  048903dacf83
mgr.cep1.domain.com.asmugh         cep1.domain.com    *:9283                error            6m ago   4M     417M        -  17.2.3          0912465dcea5  667e55c0d52d
mon.cep1.domain.com                cep1.domain.com                          error            6m ago   4M        -    2048M  <unknown>       <unknown>     <unknown>
node-exporter.cep1                 cep1.domain.com    192.0.2.11:9100       error            6m ago   4M        -        -                  1dbe0e931976  8038262a4167
osd.0                              cep1.domain.com                          error            6m ago   4M        -    4096M  17.2.3          0912465dcea5  6084aee38804
osd.1                              cep1.domain.com                          error            6m ago   4M        -    4096M  17.2.3          0912465dcea5  359b8a6160ec
osd.2                              cep1.domain.com                          error            6m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.3                              cep1.domain.com                          error            6m ago   4M        -    4096M  17.2.3          0912465dcea5  b01691462e1c
osd.4                              cep1.domain.com                          error            6m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.5                              cep1.domain.com                          error            6m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.6                              cep1.domain.com                          error            6m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.7                              cep1.domain.com                          error            6m ago   4M        -    4096M  17.2.3          0912465dcea5  fe87322f6a62
osd.8                              cep1.domain.com                          error            6m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.9                              cep1.domain.com                          running (6m)     6m ago   4M        -    4096M  17.2.3          0912465dcea5  e5dc85a68eab
osd.10                             cep1.domain.com                          error            6m ago   4M        -    4096M  17.2.3          0912465dcea5  788d7fd8f43a
osd.11                             cep1.domain.com                          error            6m ago   4M        -    4096M  17.2.3          0912465dcea5  69a36cd1aa83
prometheus.cep1                    cep1.domain.com    192.0.2.11:9095       error            6m ago   4M        -        -  <unknown>       <unknown>     <unknown>
rgw.pool.cep1.abmahl               cep1.domain.com    192.0.2.11:8000       error            6m ago   4M    77.8M        -  17.2.3          0912465dcea5  458f8405ee84

The refresh was 6m ago; only osd.9 is not in the "error" state. Meanwhile, on the podman side:

# podman ps --no-trunc
CONTAINER ID                                                      IMAGE                                                                                                                         COMMAND                                                                                                                                                    CREATED         STATUS             PORTS       NAMES
458f8405ee84ebe4c66464f852808b99d757acb616c77fbd8a8d9141b1bb65d6  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n client.rgw.pool.cep1.abmahl -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false  14 minutes ago  Up 11 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-rgw-pool-cep1-abmahl
667e55c0d52db4aaf293e3f76b3b00bfc32c2bd32e49b6f524265a101a971afb  private-reg/ceph:latest                                                                   -n mgr.cep1.domain.com.asmugh -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false   14 minutes ago  Up 13 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mgr-cep1-domain-com-asmugh
048903dacf8379dc44661fb68e7e3ab01fe80af5b543bff381493cdb9408e112  quay.io/ceph/ceph-grafana:8.3.5                                                                                               /bin/bash                                                                                                                                                  6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-grafana-cep1
8038262a4167687f86e05b4a6fc5201e85e76a8002cf4b77ed1056a57084f2c7  quay.io/prometheus/node-exporter:v1.3.1                                                                                       --no-collector.timex --web.listen-address=192.0.2.11:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs                            6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-node-exporter-cep1
fe87322f6a62ceba8fc86d07068594fd398eaecac4943798a75a226bc948fbf0  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.7 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-7
359b8a6160ec6cf635cce6afa8b2c002877aa3c02a19a8ce8ef9131ad3c7604a  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.1 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-1
788d7fd8f43a39897e8d84f8c650d3a43c066190cd9d96c23e14acb671e531ee  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.10 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                       6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-10
b01691462e1c43e2c87deb3edcb5fe8afe82a7c4107623564bb8a01b80957560  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.3 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-3
69a36cd1aa834d65fbf3fde13a2618fce40e6be1efb8e316857ae6545db2feb6  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.11 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                       6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-11
6084aee38804040de0f99e51ebf288f33f86b60c91d45c84376bb5788b8d5541  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-0
40c66bb9109dcf2894282f2bb9cd5021ba0bd6eb2c390e741a45248d14e3dc97  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.2 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-2
bd0e468d3db61cb04eb7ecb1b2a0711712e09a8671c24eca8411b5a6947881b2  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.6 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-6
e5dc85a68eabd44f46ace54489b32dd450b237259322b2c594e3e86d4441ce3e  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.9 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-9
386590aeba681b374e718be72278b08e3ecd18833fccf3a05992eedf5f363bbc  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.5 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-5
f68b7db5debe5e26a4e01746717aa8fa86966286a5bcc8021294e5f3956daaad  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-4
aa6cef9888d9b9af9362c46f18f1fc39f5d9c88a173e329116cd0a815ec7c4c2  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.8 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        6 minutes ago   Up 6 minutes ago               ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-8
1b27873b18c0c0b9198a2a67eee38f186ce8e8790f059b56d95bb686b91fe36b  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26                                                                                                                                                             5 minutes ago   Up 5 minutes ago               gallant_mahavira

When ceph orch ps was last refreshed, podman already saw rgw and mgr as "up"; only 1 osd was listed as running in ceph orch ps (with no "MEM USE" for it, by the way?!) while all 12 containers show in podman ps. Maybe there is a reason for the "error" state on rgw/mgr, I wouldn't know. I do note that "MEM USE" is listed only for rgw and mgr while podman shows many more containers - maybe they were launched after cephadm refreshed; the resolution of "6 minutes ago" is not great.

Let's wind forward to the next refresh after the default 10 min (thanks, I wasn't aware of that; I think it should be documented - when there's time I will look for a place to add it and make a PR).

# ceph orch ps | egrep 'cep1|REFRES'
NAME                               HOST               PORTS                 STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION         IMAGE ID      CONTAINER ID
alertmanager.cep1                  cep1.domain.com    192.0.2.11:9093,9094  error            13s ago   4M        -        -  <unknown>       <unknown>     <unknown>
grafana.cep1                       cep1.domain.com    192.0.2.11:3000       error            13s ago   4M    66.7M        -  8.3.5           dad864ee21e9  048903dacf83
mgr.cep1.domain.com.asmugh         cep1.domain.com    *:9283                error            13s ago   4M     422M        -  17.2.3          0912465dcea5  667e55c0d52d
mon.cep1.domain.com                cep1.domain.com                          error            13s ago   4M        -    2048M  <unknown>       <unknown>     <unknown>
node-exporter.cep1                 cep1.domain.com    192.0.2.11:9100       error            13s ago   4M    26.7M        -                  1dbe0e931976  8038262a4167
osd.0                              cep1.domain.com                          error            13s ago   4M    10.4M    4096M  17.2.3          0912465dcea5  6084aee38804
osd.1                              cep1.domain.com                          error            13s ago   4M    10.8M    4096M  17.2.3          0912465dcea5  359b8a6160ec
osd.2                              cep1.domain.com                          running (11m)    13s ago   4M    11.5M    4096M  17.2.3          0912465dcea5  40c66bb9109d
osd.3                              cep1.domain.com                          error            13s ago   4M    11.3M    4096M  17.2.3          0912465dcea5  b01691462e1c
osd.4                              cep1.domain.com                          running (10m)    13s ago   4M    10.8M    4096M  17.2.3          0912465dcea5  f68b7db5debe
osd.5                              cep1.domain.com                          running (11m)    13s ago   4M    10.6M    4096M  17.2.3          0912465dcea5  386590aeba68
osd.6                              cep1.domain.com                          running (11m)    13s ago   4M    13.7M    4096M  17.2.3          0912465dcea5  bd0e468d3db6
osd.7                              cep1.domain.com                          error            13s ago   4M    13.8M    4096M  17.2.3          0912465dcea5  fe87322f6a62
osd.8                              cep1.domain.com                          running (10m)    13s ago   4M    13.9M    4096M  17.2.3          0912465dcea5  aa6cef9888d9
osd.9                              cep1.domain.com                          running (11m)    13s ago   4M    11.5M    4096M  17.2.3          0912465dcea5  e5dc85a68eab
osd.10                             cep1.domain.com                          error            13s ago   4M    13.7M    4096M  17.2.3          0912465dcea5  788d7fd8f43a
osd.11                             cep1.domain.com                          error            13s ago   4M    14.0M    4096M  17.2.3          0912465dcea5  69a36cd1aa83
prometheus.cep1                    cep1.domain.com    192.0.2.11:9095       error            13s ago   4M        -        -  <unknown>       <unknown>     <unknown>
rgw.pool.cep1.abmahl               cep1.domain.com    192.0.2.11:8000       error            13s ago   4M    86.9M        -  17.2.3          0912465dcea5  458f8405ee84
# podman ps --no-trunc
CONTAINER ID                                                      IMAGE                                                                                                                         COMMAND                                                                                                                                                    CREATED         STATUS             PORTS       NAMES
458f8405ee84ebe4c66464f852808b99d757acb616c77fbd8a8d9141b1bb65d6  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n client.rgw.pool.cep1.abmahl -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false  20 minutes ago  Up 17 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-rgw-pool-cep1-abmahl
667e55c0d52db4aaf293e3f76b3b00bfc32c2bd32e49b6f524265a101a971afb  private-reg/ceph:latest                                                                   -n mgr.cep1.domain.com.asmugh -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false   20 minutes ago  Up 19 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mgr-cep1-domain-com-asmugh
048903dacf8379dc44661fb68e7e3ab01fe80af5b543bff381493cdb9408e112  quay.io/ceph/ceph-grafana:8.3.5                                                                                               /bin/bash                                                                                                                                                  12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-grafana-cep1
8038262a4167687f86e05b4a6fc5201e85e76a8002cf4b77ed1056a57084f2c7  quay.io/prometheus/node-exporter:v1.3.1                                                                                       --no-collector.timex --web.listen-address=192.0.2.11:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs                            12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-node-exporter-cep1
fe87322f6a62ceba8fc86d07068594fd398eaecac4943798a75a226bc948fbf0  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.7 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-7
359b8a6160ec6cf635cce6afa8b2c002877aa3c02a19a8ce8ef9131ad3c7604a  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.1 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-1
788d7fd8f43a39897e8d84f8c650d3a43c066190cd9d96c23e14acb671e531ee  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.10 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                       12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-10
b01691462e1c43e2c87deb3edcb5fe8afe82a7c4107623564bb8a01b80957560  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.3 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-3
69a36cd1aa834d65fbf3fde13a2618fce40e6be1efb8e316857ae6545db2feb6  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.11 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                       12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-11
6084aee38804040de0f99e51ebf288f33f86b60c91d45c84376bb5788b8d5541  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-0
40c66bb9109dcf2894282f2bb9cd5021ba0bd6eb2c390e741a45248d14e3dc97  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.2 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-2
bd0e468d3db61cb04eb7ecb1b2a0711712e09a8671c24eca8411b5a6947881b2  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.6 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-6
e5dc85a68eabd44f46ace54489b32dd450b237259322b2c594e3e86d4441ce3e  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.9 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-9
386590aeba681b374e718be72278b08e3ecd18833fccf3a05992eedf5f363bbc  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.5 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-5
f68b7db5debe5e26a4e01746717aa8fa86966286a5bcc8021294e5f3956daaad  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-4
aa6cef9888d9b9af9362c46f18f1fc39f5d9c88a173e329116cd0a815ec7c4c2  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.8 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        12 minutes ago  Up 12 minutes ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-8
1b27873b18c0c0b9198a2a67eee38f186ce8e8790f059b56d95bb686b91fe36b  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26                                                                                                                                                             11 minutes ago  Up 11 minutes ago              gallant_mahavira

This is the end result I had when I opened this issue. The containers never come out of the "error" state, with the exception of some (seemingly semi-random, with a few changing between reboots) containers, usually OSDs.

MEM USE is shown correctly for all containers. Let's take a look at ceph -s to see if it agrees with ceph orch ps:

# ceph -s
  cluster:
    id:     5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30
    health: HEALTH_WARN
            13 failed cephadm daemon(s)
            1/3 mons down, quorum cep2,cep3
            2 osds down
            1 host (12 osds) down
            Degraded data redundancy: 184387/553161 objects degraded (33.333%), 1099 pgs degraded, 1217 pgs undersized

  services:
    mon: 3 daemons, quorum cep2,cep3 (age 27m), out of quorum: cep1.domain.com
    mgr: cep2.atrumb(active, since 27m), standbys: cep1.domain.com.asmugh
    osd: 36 osds: 24 up (since 27m), 26 in (since 17m)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:                                                                                                                                                             
    pools:   8 pools, 1217 pgs
    objects: 184.39k objects, 684 GiB
    usage:   3.1 TiB used, 173 TiB / 176 TiB avail
    pgs:     184387/553161 objects degraded (33.333%)
             1099 active+undersized+degraded
             118  active+undersized

The health status shows that the mgr is up - as it indeed is. But ceph orch ps shows it with "error" status. It does agree on the mon being down, so let's wait more for that.
Meanwhile, note "2 osds down", so 10 OSDs should be all good, while ceph orch ps had 6 OSDs in "error" status and only 6 OSDs as "running". It does see MEM USE for all OSDs, so let's just wait more then.

# ceph orch ps | egrep 'cep1|REFRES'
NAME                               HOST               PORTS                 STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION         IMAGE ID      CONTAINER ID
alertmanager.cep1                  cep1.domain.com    192.0.2.11:9093,9094  error             8m ago   4M        -        -  <unknown>       <unknown>     <unknown>
grafana.cep1                       cep1.domain.com    192.0.2.11:3000       error             8m ago   4M    69.4M        -  8.3.5           dad864ee21e9  048903dacf83
mgr.cep1.domain.com.asmugh         cep1.domain.com    *:9283                error             8m ago   4M     430M        -  17.2.3          0912465dcea5  667e55c0d52d
mon.cep1.domain.com                cep1.domain.com                          error             8m ago   4M        -    2048M  <unknown>       <unknown>     <unknown>
node-exporter.cep1                 cep1.domain.com    192.0.2.11:9100       error             8m ago   4M    29.3M        -                  1dbe0e931976  8038262a4167
osd.0                              cep1.domain.com                          error             8m ago   4M    10.6M    4096M  17.2.3          0912465dcea5  6084aee38804
osd.1                              cep1.domain.com                          error             8m ago   4M    11.0M    4096M  17.2.3          0912465dcea5  359b8a6160ec
osd.2                              cep1.domain.com                          running (51m)     8m ago   4M    11.7M    4096M  17.2.3          0912465dcea5  40c66bb9109d
osd.3                              cep1.domain.com                          error             8m ago   4M    11.5M    4096M  17.2.3          0912465dcea5  b01691462e1c
osd.4                              cep1.domain.com                          running (51m)     8m ago   4M    11.0M    4096M  17.2.3          0912465dcea5  f68b7db5debe
osd.5                              cep1.domain.com                          running (51m)     8m ago   4M    10.8M    4096M  17.2.3          0912465dcea5  386590aeba68
osd.6                              cep1.domain.com                          running (51m)     8m ago   4M    13.8M    4096M  17.2.3          0912465dcea5  bd0e468d3db6
osd.7                              cep1.domain.com                          error             8m ago   4M    13.9M    4096M  17.2.3          0912465dcea5  fe87322f6a62
osd.8                              cep1.domain.com                          running (51m)     8m ago   4M    14.1M    4096M  17.2.3          0912465dcea5  aa6cef9888d9
osd.9                              cep1.domain.com                          running (51m)     8m ago   4M    11.7M    4096M  17.2.3          0912465dcea5  e5dc85a68eab
osd.10                             cep1.domain.com                          error             8m ago   4M    13.9M    4096M  17.2.3          0912465dcea5  788d7fd8f43a
osd.11                             cep1.domain.com                          error             8m ago   4M    14.2M    4096M  17.2.3          0912465dcea5  69a36cd1aa83
prometheus.cep1                    cep1.domain.com    192.0.2.11:9095       error             8m ago   4M        -        -  <unknown>       <unknown>     <unknown>
rgw.pool.cep1.abmahl               cep1.domain.com    192.0.2.11:8000       error             8m ago   4M    96.4M        -  17.2.3          0912465dcea5  458f8405ee84

Still the same, so let's just wait more to see if the 2 running OSDs show up in ceph -s - in previous tests I could get the cluster healthy (sans cephadm daemon errors) with ceph orch ps showing output similar to the above.

# ceph orch ps | egrep 'cep1|REFRES'
NAME                               HOST               PORTS                 STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION         IMAGE ID      CONTAINER ID
alertmanager.cep1                  cep1.domain.com    192.0.2.11:9093,9094  error             4m ago   4M        -        -  <unknown>       <unknown>     <unknown>
grafana.cep1                       cep1.domain.com    192.0.2.11:3000       error             4m ago   4M    70.3M        -  8.3.5           dad864ee21e9  048903dacf83
mgr.cep1.domain.com.asmugh         cep1.domain.com    *:9283                error             4m ago   4M     433M        -  17.2.3          0912465dcea5  667e55c0d52d
mon.cep1.domain.com                cep1.domain.com                          error             4m ago   4M        -    2048M  <unknown>       <unknown>     <unknown>
node-exporter.cep1                 cep1.domain.com    192.0.2.11:9100       error             4m ago   4M    25.3M        -                  1dbe0e931976  8038262a4167
osd.0                              cep1.domain.com                          error             4m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.1                              cep1.domain.com                          error             4m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.2                              cep1.domain.com                          running (18m)     4m ago   4M    13.1M    4096M  17.2.3          0912465dcea5  5a33a3450bf6
osd.3                              cep1.domain.com                          error             4m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.4                              cep1.domain.com                          running (18m)     4m ago   4M    10.7M    4096M  17.2.3          0912465dcea5  e01c51260543
osd.5                              cep1.domain.com                          running (18m)     4m ago   4M    10.6M    4096M  17.2.3          0912465dcea5  0affb2ba8889
osd.6                              cep1.domain.com                          running (18m)     4m ago   4M    11.5M    4096M  17.2.3          0912465dcea5  a415cc1afd5c
osd.7                              cep1.domain.com                          error             4m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.8                              cep1.domain.com                          running (18m)     4m ago   4M    13.1M    4096M  17.2.3          0912465dcea5  64e6ddc4f22c
osd.9                              cep1.domain.com                          running (18m)     4m ago   4M    14.0M    4096M  17.2.3          0912465dcea5  00117776d64e
osd.10                             cep1.domain.com                          error             4m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
osd.11                             cep1.domain.com                          error             4m ago   4M        -    4096M  <unknown>       <unknown>     <unknown>
prometheus.cep1                    cep1.domain.com    192.0.2.11:9095       error             4m ago   4M        -        -  <unknown>       <unknown>     <unknown>
rgw.pool.cep1.abmahl               cep1.domain.com    192.0.2.11:8000       error             4m ago   4M    97.8M        -  17.2.3          0912465dcea5  458f8405ee84
# podman ps --no-trunc
CONTAINER ID                                                      IMAGE                                                                                                                         COMMAND                                                                                                                                                    CREATED            STATUS                PORTS       NAMES
458f8405ee84ebe4c66464f852808b99d757acb616c77fbd8a8d9141b1bb65d6  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n client.rgw.pool.cep1.abmahl -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false  About an hour ago  Up About an hour ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-rgw-pool-cep1-abmahl
667e55c0d52db4aaf293e3f76b3b00bfc32c2bd32e49b6f524265a101a971afb  private-reg/ceph:latest                                                                   -n mgr.cep1.domain.com.asmugh -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false   About an hour ago  Up About an hour ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mgr-cep1-domain-com-asmugh
048903dacf8379dc44661fb68e7e3ab01fe80af5b543bff381493cdb9408e112  quay.io/ceph/ceph-grafana:8.3.5                                                                                               /bin/bash                                                                                                                                                  About an hour ago  Up About an hour ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-grafana-cep1
8038262a4167687f86e05b4a6fc5201e85e76a8002cf4b77ed1056a57084f2c7  quay.io/prometheus/node-exporter:v1.3.1                                                                                       --no-collector.timex --web.listen-address=192.0.2.11:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs                            About an hour ago  Up About an hour ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-node-exporter-cep1
1b27873b18c0c0b9198a2a67eee38f186ce8e8790f059b56d95bb686b91fe36b  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26                                                                                                                                                             About an hour ago  Up About an hour ago              gallant_mahavira
5a33a3450bf699ec9c51325f8124e8192c8ba977482c0bf2de114a8cd81416e8  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.2 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        18 minutes ago     Up 18 minutes ago                 ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-2
a415cc1afd5c70e978d22205ebaab5a89a394b7b841ac75a699b2b4967d647c2  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.6 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        18 minutes ago     Up 18 minutes ago                 ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-6
00117776d64e0df02163f9d96f1d949f014da5a9479ab8e9e10132d1addcf2b1  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.9 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        18 minutes ago     Up 18 minutes ago                 ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-9
0affb2ba888921d323c19cd80b40f3dbe5870bd37da5713ea453d05bf9ed9755  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.5 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        18 minutes ago     Up 18 minutes ago                 ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-5
e01c512605439911220b708cfcfd4bfc622d9cba6f507690fdec52c63b62f2b0  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        18 minutes ago     Up 18 minutes ago                 ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-4
64e6ddc4f22c439db03a54f14bb36495ed62f84118e532b79fbc656f168333df  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.8 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false                        18 minutes ago     Up 18 minutes ago                 ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-8

Wait, what? Only running for 18m? The system already seemed to be in a steady state.

Some containers were killed near the beginning (those missing from the penultimate podman listing above): for example, prometheus started fine but eventually logged that SIGTERM was received, perhaps from systemd, and it didn't get restarted after that.
I assume the same happened to the other missing containers. The real reason for that is out of scope for this issue.

At least we can see the original issue on at least some containers: the mgr is definitely running and ceph -s also sees it, but cephadm thinks the mgr is in "error" state and only shows MEM USE for it. Additionally, 4 of the OSDs that cephadm thinks are in "error" are correctly seen by ceph -s.

But now all the containers were restarted? There is nothing special that I could see in cephadm.log at the timestamp when all the containers were restarted. In general there is not much of interest in the log; lines similar to these do repeat sometimes:

2023-01-13 08:08:48,295 7f6ad225b740 DEBUG /usr/bin/podman: ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
2023-01-13 08:08:48,445 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:48,453 7f6ad225b740 DEBUG systemctl: failed
2023-01-13 08:08:48,502 7f6ad225b740 DEBUG /usr/bin/podman: 6084aee38804040de0f99e51ebf288f33f86b60c91d45c84376bb5788b8d5541,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:41.72503957 +0400 +04,
2023-01-13 08:08:48,604 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:48,612 7f6ad225b740 DEBUG systemctl: failed
2023-01-13 08:08:48,660 7f6ad225b740 DEBUG /usr/bin/podman: 359b8a6160ec6cf635cce6afa8b2c002877aa3c02a19a8ce8ef9131ad3c7604a,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:38.589799703 +0400 +04,
2023-01-13 08:08:48,763 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:48,771 7f6ad225b740 DEBUG systemctl: failed
2023-01-13 08:08:48,819 7f6ad225b740 DEBUG /usr/bin/podman: 788d7fd8f43a39897e8d84f8c650d3a43c066190cd9d96c23e14acb671e531ee,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:38.985777789 +0400 +04,
2023-01-13 08:08:48,920 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:48,929 7f6ad225b740 DEBUG systemctl: failed
2023-01-13 08:08:48,978 7f6ad225b740 DEBUG /usr/bin/podman: 69a36cd1aa834d65fbf3fde13a2618fce40e6be1efb8e316857ae6545db2feb6,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:41.548444374 +0400 +04,
2023-01-13 08:08:49,080 7f6ad225b740 DEBUG systemctl: enabled
2023-01-13 08:08:49,087 7f6ad225b740 DEBUG systemctl: active
2023-01-13 08:08:49,138 7f6ad225b740 DEBUG /usr/bin/podman: 40c66bb9109dcf2894282f2bb9cd5021ba0bd6eb2c390e741a45248d14e3dc97,private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26,0912465dcea5159f56c7d4ccdf1d15a28abfe1be8a56c94eb86cab129c069726,2023-01-13 07:25:45.537820031 +0400 +04,

and other normal-looking lines like sestatus output repeating a gazillion times. I will try to look into why everything was restarted at around 1h of uptime.

The cluster is indeed 17.2.3; I was planning to re-install with 17.2.5 after more tests. All unique identifiers in the logs have been sanitized.

So the logs above show the persistent "error" state happening with the mgr and 4 OSDs.

Actions #4

Updated by Redouane Kachach Elhichou over 1 year ago

I see. I think the problem (in terms of the discrepancy between the information you see in 'ceph orch ps' and in ceph -s) is that they use different 'sources of truth':

- cephadm uses the information from the systemd status
- ceph -s uses information from the real state (in terms of which processes are running and connected to the cluster). 

My guess is that, for some reason (which I have to investigate more in depth), the services associated with the daemons (you can list them with: systemctl | grep ceph-) are in an error state because they cannot execute the corresponding unit commands. So from the systemd point of view (and hence cephadm's), these services are in error. You can double-check that by copying cephadm to the node and running the following command (the output of this command is what later gets reflected in the status you list with the ceph orch ps command):

./cephadm ls | grep -w -e name -e state

However, the container process is in fact running (that's why you see that mon/mgr is there when you run ceph -s).

My theory is that something goes wrong (because of the short timeout) during the startup so the systemd service fails but the process itself (podman container run by the service) is launched and stays there running. So systemd cannot re-run the service and fails (because the container is already running). If this is the issue, the work-around would be to stop the container manually by running:

podman stop <container-id>

And then restart the corresponding ceph daemon with:

ceph orch daemon restart <daemon-name-from-orch-ps-output>

Please keep in mind that ceph services are configured as follows, in case you try to restart the systemd service manually:

StartLimitInterval=30min
StartLimitBurst=5

So if there have been more than 5 restart attempts in the last 30 min, systemd won't let you restart the service until the 30 min period has passed. 'ceph orch daemon restart <daemon-name>' takes care of this when restarting a daemon, as it issues a reset-failed before restarting the systemd service.
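
For reference, if you do restart a unit by hand, the manual equivalent would be roughly the following (the unit name is a placeholder following the ceph-<fsid>@<daemon> naming scheme):

systemctl reset-failed ceph-<fsid>@<daemon-name>.service
systemctl restart ceph-<fsid>@<daemon-name>.service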

It seems that you are able to reproduce the issue easily. Please give the recommendations above a try, just to see if my theory about what's going on is correct and whether the work-around works or not. This way we will confirm the issue; the next step would then be to find out why the containers keep running when the systemd service has failed.

BTW: next time, just after startup (when the daemons go into the error state), it would be good to see the systemctl status output of the services in error (at least some of them). That would shed some light on the error exit code.
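
Something along these lines (the unit name is a placeholder) would capture both the status and the journal since boot for one of the failed units:

systemctl status ceph-<fsid>@osd.0.service
journalctl -u ceph-<fsid>@osd.0.service -b --no-pager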

Note: containers are in general stateless, so don't be afraid of stopping their corresponding process and restarting their daemon.

Actions #5

Updated by Voja Molani over 1 year ago

Thank you for your reply. I might not have time to look deeper into this until later but I want to reply to some of your points.

My theory is that something goes wrong (because of the short timeout) during the startup so the systemd service fails but the process itself (podman container run by the service) is launched and stays there running.

I am pretty sure that systemd is supposed to kill the units that did not start within the start timeout period.
I will try to confirm this.

It seems that you are able to reproduce the issue easily.

Well, not exactly "easily" - I set the start timeout to slightly less than what the containers would usually take to start, and perhaps 33-50% of the time it gets into this state without getting even worse with many more errors.
It takes quite a bit of time.

Actions #6

Updated by Redouane Kachach Elhichou over 1 year ago

Voja Molani wrote:

Thank you for your reply. I might not have time to look deeper into this until later but I want to reply to some of your points.

My theory is that something goes wrong (because of the short timeout) during the startup so the systemd service fails but the process itself (podman container run by the service) is launched and stays there running.

I am pretty sure that systemd is supposed to kill the units that did not start within the start timeout period.
I will try to confirm this.

It seems that you are able to reproduce the issue easily.

Well, not exactly "easily" - I set the start timeout to slightly less than what the containers would usually take to start, and perhaps 33-50% of the time it gets into this state without getting even worse with many more errors.
It takes quite a bit of time.

I see, no rush, especially since the process of reproducing the issue is time consuming.

Well, regarding the systemd behaviour: that depends on the configuration. Right now we are using KillMode=none, so I'm afraid that if the start timeout fires, the container will stay there running, as we are asking systemd not to kill the service unit.

KillMode was set to none a long time ago to avoid systemd killing the podman containers, as that may lead to some issues (we prefer podman stopping and cleaning up its environment over systemd killing the process).

A 'simple' way to check whether that's the issue is to change KillMode to mixed (or control-group) and see if that solves the problem or not.
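
A quick way to test this locally (just a sketch of a systemd drop-in override, not the upstream change; the file path is an assumption) would be to create /etc/systemd/system/ceph-<fsid>@.service.d/killmode-test.conf containing:

[Service]
KillMode=mixed

and then run systemctl daemon-reload before rebooting the host.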

Thanks.

Actions #7

Updated by Redouane Kachach Elhichou over 1 year ago

  • Related to Bug #51361: KillMode=none is deprecated added
Actions #8

Updated by Redouane Kachach Elhichou over 1 year ago

  • Related to Bug #58241: Systemd unit file TimeoutStartSec of 120 seconds is too low added
Actions #9

Updated by Voja Molani over 1 year ago

Well, regarding the systemd behaviour: that depends on the configuration. Right now we are using KillMode=none, so I'm afraid that if the start timeout fires, the container will stay there running, as we are asking systemd not to kill the service unit.

Doesn't this explain the problem? The sequence of events will be:
  1. Systemd starts a container.
  2. The container does not start within the timeout period, so systemd will try to restart it later.
  3. The container keeps running.
  4. Systemd tries to restart the container, but it is still running.

It makes no sense to have "any" TimeoutStartSec if no action is taken when that timeout is reached - the only thing accomplished is that systemd and the real system state get out of sync, and systemd loses sight of a service that is now running without systemd caring or knowing about it.

If no action is taken on timeout, then the only correct configuration is to have a very high TimeoutStartSec or to disable timeouts.

I believe it is probably better to not try to kill containers that don't come up on time. I suspect in some problem cases containers might take a very, very long time to start and forcefully terminating them would only make matters worse.

Actions #10

Updated by Redouane Kachach Elhichou over 1 year ago

Doesn't this explain the problem? The sequence of events will be:
  1. Systemd starts a container.
  2. The container does not start within the timeout period, so systemd will try to restart it later.
  3. The container keeps running.
  4. Systemd tries to restart the container, but it is still running.

Yes, I suspect this is what's going on. So in the end the container remains running, but the service is reported as being in an error state because it failed.

It makes no sense to have "any" TimeoutStartSec if no action is taken when that timeout is reached - the only thing accomplished is that systemd and the real system state get out of sync, and systemd loses sight of a service that is now running without systemd caring or knowing about it.

Well, the timeout is still useful, as it marks the service as being 'in error' so the user can go and try to fix it (in this case by manually stopping the container, which is safer than trying to kill it).

If no action is taken on timeout, then the only correct configuration is to have a very high TimeoutStartSec or to disable timeouts.

I believe it is probably better to not try to kill containers that don't come up on time. I suspect in some problem cases containers might take a very, very long time to start and forcefully terminating them would only make matters worse.

Yeah, that's the current approach, and it was actually introduced in the past because there were some issues related to systemd killing the container instead of having podman stop it gracefully.

Anyway, the goal right now is to identify and ensure that the problem is what we think it is. Once we confirm it, we can see what the best solution to fix it is :)

Actions #11

Updated by Voja Molani over 1 year ago

In my opinion the "best" solution would be to differentiate between an "error" status and a still-starting-up status. I'm not sure if that is even possible or how much work it would involve.

Well, the timeout is still useful, as it marks the service as being 'in error' so the user can go and try to fix it (in this case by manually stopping the container, which is safer than trying to kill it).

But it isn't really in error - it just took 5 (or however many) seconds too long to start and is now running a-ok, just without systemd knowing about it.

Since I have raised TimeoutStartSec, this situation should not happen again under normal circumstances. But I can see that in some problem cases (and judging by reports from other users there are plenty of those situations) a container could take anywhere from dozens of minutes to hour(s) to complete its startup.
In those non-normal cases, having ceph orch lose management of the container would just add to the stress on an admin who maybe does not know about these finer points of how the containers are managed.

Yeah, that's the current approach, and it was actually introduced in the past because there were some issues related to systemd killing the container instead of having podman stop it gracefully.

I believe that this is how it should be but then again I am a noob. I would hope that containers that did not completely start within x seconds would not be killed, as I said earlier.

IMHO there would definitely be value in trying to solve this for those non-normal situations where startup takes long, but I am not sure what difference my input would make. Do I understand correctly that next I should check the systemd status (maybe systemctl status x and the journalctl log) and cephadm ls for those containers that went into the "error" state because TimeoutStartSec was exceeded but eventually did start correctly and are OK now, as witnessed by ceph -s? Anything else?

Currently most containers are in "error" state:

# ceph orch ps | egrep 'cep1|REFRE'
NAME                               HOST               PORTS                 STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION         IMAGE ID      CONTAINER ID
alertmanager.cep1                  cep1.domain.com    192.0.2.11:9093,9094  running (4d)    23s ago   4M    34.3M        -                  ba2b418f427c  96e9f81b3fb8
grafana.cep1                       cep1.domain.com    192.0.2.11:3000       error           23s ago   4M    85.0M        -  8.3.5           dad864ee21e9  aadcd354a612
mgr.cep1.domain.com.asmugh         cep1.domain.com    *:9283                running (4d)    23s ago   4M     451M        -  17.2.3          0912465dcea5  8cc07c233522
mon.cep1.domain.com                cep1.domain.com                          error           23s ago   4M     420M    2048M  17.2.3          0912465dcea5  7c9fe2039f56
node-exporter.cep1                 cep1.domain.com    192.0.2.11:9100       error           23s ago   4M    28.5M        -                  1dbe0e931976  0f3900bb99ef
osd.0                              cep1.domain.com                          error           23s ago   4M    1181M    4096M  17.2.3          0912465dcea5  4967c9a06c1d
osd.1                              cep1.domain.com                          error           23s ago   4M    1110M    4096M  17.2.3          0912465dcea5  7f9edd875d32
osd.2                              cep1.domain.com                          error           23s ago   4M    1077M    4096M  17.2.3          0912465dcea5  201fd4a9d42d
osd.3                              cep1.domain.com                          error           23s ago   4M    1150M    4096M  17.2.3          0912465dcea5  d0704e4945de
osd.4                              cep1.domain.com                          error           23s ago   4M    1045M    4096M  17.2.3          0912465dcea5  8a98581f2c11
osd.5                              cep1.domain.com                          error           23s ago   4M    1074M    4096M  17.2.3          0912465dcea5  5d7452f25567
osd.6                              cep1.domain.com                          error           23s ago   4M     996M    4096M  17.2.3          0912465dcea5  1fa0b8e4d937
osd.7                              cep1.domain.com                          error           23s ago   4M    1088M    4096M  17.2.3          0912465dcea5  2813158cefb1
osd.8                              cep1.domain.com                          error           23s ago   4M    1159M    4096M  17.2.3          0912465dcea5  923d0277c2da
osd.9                              cep1.domain.com                          error           23s ago   4M    1153M    4096M  17.2.3          0912465dcea5  1a1eabd0eee2
osd.10                             cep1.domain.com                          error           23s ago   4M    1129M    4096M  17.2.3          0912465dcea5  a5224ee22d33
osd.11                             cep1.domain.com                          error           23s ago   4M    1073M    4096M  17.2.3          0912465dcea5  98303d0e1cca
prometheus.cep1                    cep1.domain.com    192.0.2.11:9095       error           23s ago   4M     191M        -                  514e6a882f6e  b356b8eb057b
rgw.pool.cep1.abmahl               cep1.domain.com    192.0.2.11:8000       error           23s ago   4M     122M        -  17.2.3          0912465dcea5  44faf1ecd556

As all containers list a "MEM USE" value, I suspect they are all actually running OK. This is indeed what podman ps and ceph -s show:

# podman ps
CONTAINER ID  IMAGE                                                                                     COMMAND               CREATED     STATUS         PORTS       NAMES
aadcd354a612  quay.io/ceph/ceph-grafana:8.3.5                                                           /bin/bash             4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-grafana-cep1
b356b8eb057b  quay.io/prometheus/prometheus:v2.33.4                                                     --config.file=/et...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-prometheus-cep1
44faf1ecd556  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n client.rgw.poo...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-rgw-pool-cep1-abmahl
7c9fe2039f56  private-reg/ceph:latest                                                                   -n mon.cep1.domai...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mon-cep1-domain-com
0f3900bb99ef  quay.io/prometheus/node-exporter:v1.3.1                                                   --no-collector.ti...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-node-exporter-cep1
98303d0e1cca  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.11 -f --se...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-11
5d7452f25567  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.5 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-5
7f9edd875d32  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.1 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-1
1a1eabd0eee2  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.9 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-9
d0704e4945de  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.3 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-3
201fd4a9d42d  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.2 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-2
2813158cefb1  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.7 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-7
1fa0b8e4d937  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.6 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-6
4967c9a06c1d  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.0 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-0
923d0277c2da  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.8 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-8
8a98581f2c11  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.4 -f --set...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-4
a5224ee22d33  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26  -n osd.10 -f --se...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-osd-10
96e9f81b3fb8  quay.io/prometheus/alertmanager:v0.23.0                                                   --cluster.listen-...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-alertmanager-cep1
8cc07c233522  private-reg/ceph:latest                                                                   -n mgr.cep1.domai...  4 days ago  Up 4 days ago              ceph-5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30-mgr-cep1-domain-com-asmugh
d07c27bf3edc  private-reg/ceph@sha256:3ae68015b3947a9c71adcebf13209edbd8a9b6e69e7f3c25cb1fe9bcdd8f5d26                        4 days ago  Up 4 days ago              gallant_mahavira
# ceph -s
  cluster:
    id:     5c0752e5-73f2-411f-8c8f-e9c2d0fd7d30
    health: HEALTH_WARN
            17 failed cephadm daemon(s)

  services:
    mon: 3 daemons, quorum cep1.domain.com,cep2,cep3 (age 4d)
    mgr: cep2.atrumb(active, since 4d), standbys: cep1.domain.com.asmugh
    osd: 36 osds: 36 up (since 4d), 36 in (since 4d)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   8 pools, 1217 pgs
    objects: 184.38k objects, 684 GiB
    usage:   4.6 TiB used, 260 TiB / 265 TiB avail
    pgs:     1217 active+clean

  io:
    client:   4.0 KiB/s rd, 0 B/s wr, 3 op/s rd, 2 op/s wr

Attached is the systemctl status ceph* output - which unfortunately is missing some log entries due to journal rotation last night. This is the same machine and the same reboot as in my previous update. I guess I would need to reboot again to get the full journal for all containers, if that would be useful.

But as can be seen, eventually all containers started correctly, even the mon that had not yet started at the time of my previous update 5 days ago. Back then, all OSDs were being restarted every hour because they could not find a mon - strange, since there were 2 other mons in the cluster...

Anyway, the cluster is now fully functional and the only problem is that cephadm has lost all information about the containers beyond the "MEM USE" value.

[see attached log, redmine gives a server error trying to have this log inline]

Systemd seems to have a full and correct picture of all the containers, so I am not sure what exactly makes cephadm think the containers are in "error". Maybe this is a bug?

I could not spot any new information in the cephadm ls output:

# cephadm ls |grep -w -e name -e state
        "name": "mon.cep1.domain.com",
        "state": "error",
        "name": "mgr.cep1.domain.com.asmugh",
        "state": "running",
        "name": "osd.0",
        "state": "error",
        "name": "osd.1",
        "state": "error",
        "name": "osd.10",
        "state": "error",
        "name": "osd.11",
        "state": "error",
        "name": "osd.2",
        "state": "error",
        "name": "osd.3",
        "state": "error",
        "name": "osd.4",
        "state": "error",
        "name": "osd.5",
        "state": "error",
        "name": "osd.6",
        "state": "error",
        "name": "osd.7",
        "state": "error",
        "name": "osd.8",
        "state": "error",
        "name": "osd.9",
        "state": "error",
        "name": "rgw.pool.cep1.abmahl",
        "state": "error",
        "name": "node-exporter.cep1",
        "state": "error",
        "name": "alertmanager.cep1",
        "state": "running",
        "name": "grafana.cep1",
        "state": "error",
        "name": "prometheus.cep1",
        "state": "error",
Actions #12

Updated by Redouane Kachach Elhichou over 1 year ago

Thank you for getting back with the results and for providing more info from your cluster.

Yesterday I was discussing this issue with other developers with expertise in the cephadm area and, as far as they are aware, this is the first time this issue has been reported. We have seen Ceph deployments in bigger clusters (with denser nodes and more OSDs) but we haven't experienced issues related to this TimeoutStartSec so far.

Please, can you provide more details about your cluster setup, such as:
- Number of nodes
- Maximum number of OSDs per node
- Disk model
- Are you using an external or internal registry to get ceph related images?
- Is your network download speed fast?
- Did you test pulling Ceph's container image manually? How long does it take?

I'm afraid that this slowness in startup could be caused by some special circumstance in your cluster; otherwise, bigger clusters (using slow disks) should have experienced a similar issue, but that is not the case.

As I understood from the thread, increasing TimeoutStartSec to 200 seconds solves the issue in your setup. Can you confirm please?

BTW: Please, can you check whether the workaround I recommended solves your issue? As I suggested, for that you have to:

1- Go to the node where the daemons in error state are running
2- Stop the underlying podman containers manually with podman stop <container-id>
3- Then restart the daemon from cephadm with:

ceph orch daemon restart <daemon-name-from-orch-ps-output>
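For example, applied to the grafana daemon from the earlier orch ps output (container id taken from that output), the workaround would be roughly:

# podman stop aadcd354a612
# ceph orch daemon restart grafana.cep1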
Actions #13

Updated by Voja Molani over 1 year ago

- Number of nodes

3

- Maximum number of OSDs per node

12 HDD, 2 SSD for db

- Disk model

HGST HUS728T8TAL5200

- Are you using an external or internal registry to get ceph related images?

Internal for now, but we will be moving to the stock registry on the next reinstall.

- Is your network download speed fast?

Yes, downloading from the local container registry should be OK and not especially slow.

- Did you test pulling Ceph's container image manually? How long does it take?

real    0m44.050s
user    0m0.120s
sys     0m0.093s

I'm afraid that this slowness in startup could be caused by some special circumstance in your cluster; otherwise, bigger clusters (using slow disks) should have experienced a similar issue, but that is not the case.

It could be. When I said earlier that the disks are slow, what I meant was that the system disks are slow. They are Dell "disk on module" flash devices.

As I understood from the thread, increasing TimeoutStartSec to 200 seconds solves the issue in your setup. Can you confirm please?

Yes that is correct.

BTW: Please, can you check whether the workaround I recommended solves your issue? As I suggested, for that you have to:

1- Go to the node where the daemons in error state are running
2- Stop the underlying podman containers manually with podman stop <container-id>
3- Then restart the daemon from cephadm by:

Yes, this works. I tested it on the grafana container. Within seconds the container was back in "running" state; I was not even fast enough to catch it in "error" state again after doing the orch daemon restart for it.
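A quick way to re-check the status after the restart, using the daemon name as shown in orch ps, would be something like:

# ceph orch ps | grep grafana.cep1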

Actions #14

Updated by Redouane Kachach Elhichou over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49728
Actions #15

Updated by Adam King about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to quincy, pacific
Actions #16

Updated by Backport Bot about 1 year ago

  • Copied to Backport #58777: pacific: cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd added
Actions #17

Updated by Backport Bot about 1 year ago

  • Copied to Backport #58778: quincy: cephadm doesn't communicate with containers that failed initial start but were successfully restarted later by systemd added
Actions #18

Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed
Actions #19

Updated by Adam King about 1 year ago

  • Status changed from Pending Backport to Resolved