Bug #51806
opencephadm: stopped contains end up in error state
0%
Description
[ceph: root@sebastians-laptop /]# ceph orch stop node-exporter Scheduled to stop node-exporter.sebastians-laptop on host 'sebastians-laptop' [ceph: root@sebastians-laptop /]# ceph orch ps --service-name node-exporter NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID node-exporter.sebastians-laptop sebastians-laptop *:9100 error 3m ago 4m - - <unknown> <unknown>
➜ cephadm git:(cephadm-container-name-dashes) ✗ sudo systemctl status ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop | cat ● ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service - Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb Loaded: loaded (/etc/systemd/system/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@.service; enabled; vendor preset: disabled) Active: active (running) since Thu 2021-07-22 16:04:16 CEST; 14s ago Process: 2907757 ExecStartPre=/bin/rm -f /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-pid /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-cid (code=exited, status=0/SUCCESS) Process: 2907758 ExecStart=/bin/bash /var/lib/ceph/b2f78482-ead5-11eb-9ac0-482ae35a5fbb/node-exporter.sebastians-laptop/unit.run (code=exited, status=0/SUCCESS) Main PID: 2907955 (conmon) Tasks: 8 (limit: 38293) Memory: 2.7M CPU: 562ms CGroup: /system.slice/system-ceph\x2db2f78482\x2dead5\x2d11eb\x2d9ac0\x2d482ae35a5fbb.slice/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service ├─container │ ├─2907958 /dev/init -- /bin/node_exporter --no-collector.timex --web.listen-address=:9100 │ └─2907960 /bin/node_exporter --no-collector.timex --web.listen-address=:9100 └─supervisor └─2907955 /usr/bin/conmon --api-version 1 -c b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 -u b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1/userdata -p /run/containers/storage/overlay-containers/b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1/userdata/pidfile -n ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter-sebastians-laptop --exit-dir /run/libpod/exits --socket-dir-path /run/libpod/socket -l journald --log-level warning --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1/userdata/oci-log --conmon-pidfile /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - textfile" source="node_exporter.go:104" Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - time" source="node_exporter.go:104" Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - uname" source="node_exporter.go:104" Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - vmstat" source="node_exporter.go:104" Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - xfs" source="node_exporter.go:104" Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - zfs" source="node_exporter.go:104" Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg="Listening on :9100" source="node_exporter.go:170" Jul 22 16:04:16 sebastians-laptop podman[2907914]: 2021-07-22 16:04:16.490922626 +0200 CEST m=+0.269857979 container start b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 (image=docker.io/prom/node-exporter:v0.18.1, name=ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter-sebastians-laptop, maintainer=The Prometheus Authors <prometheus-developers@googlegroups.com>) Jul 22 16:04:16 sebastians-laptop bash[2907914]: b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 Jul 22 16:04:16 sebastians-laptop systemd[1]: Started Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb. ➜ cephadm git:(cephadm-container-name-dashes) ✗ sudo systemctl status ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop | cat ● ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service - Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb Loaded: loaded (/etc/systemd/system/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since Thu 2021-07-22 16:05:04 CEST; 5s ago Process: 2907757 ExecStartPre=/bin/rm -f /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-pid /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-cid (code=exited, status=0/SUCCESS) Process: 2907758 ExecStart=/bin/bash /var/lib/ceph/b2f78482-ead5-11eb-9ac0-482ae35a5fbb/node-exporter.sebastians-laptop/unit.run (code=exited, status=0/SUCCESS) Process: 2908848 ExecStop=/bin/bash -c /bin/podman stop ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter.sebastians-laptop ; bash /var/lib/ceph/b2f78482-ead5-11eb-9ac0-482ae35a5fbb/node-exporter.sebastians-laptop/unit.stop (code=exited, status=0/SUCCESS) Process: 2909007 ExecStopPost=/bin/bash /var/lib/ceph/b2f78482-ead5-11eb-9ac0-482ae35a5fbb/node-exporter.sebastians-laptop/unit.poststop (code=exited, status=0/SUCCESS) Process: 2909008 ExecStopPost=/bin/rm -f /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-pid /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-cid (code=exited, status=0/SUCCESS) Main PID: 2907955 (code=exited, status=143) CPU: 1.030s Jul 22 16:04:16 sebastians-laptop systemd[1]: Started Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb. Jul 22 16:05:03 sebastians-laptop systemd[1]: Stopping Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb... Jul 22 16:05:03 sebastians-laptop bash[2908849]: Error: no container with name or ID "ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter.sebastians-laptop" found: no such container Jul 22 16:05:04 sebastians-laptop podman[2908926]: 2021-07-22 16:05:04.364723428 +0200 CEST m=+0.120139014 container remove b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 (image=docker.io/prom/node-exporter:v0.18.1, name=ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter-sebastians-laptop, maintainer=The Prometheus Authors <prometheus-developers@googlegroups.com>) Jul 22 16:05:04 sebastians-laptop bash[2908886]: ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter-sebastians-laptop Jul 22 16:05:04 sebastians-laptop systemd[1]: ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service: Main process exited, code=exited, status=143/n/a Jul 22 16:05:04 sebastians-laptop bash[2908968]: Error: no container with name or ID "ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter.sebastians-laptop" found: no such container Jul 22 16:05:04 sebastians-laptop systemd[1]: ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service: Failed with result 'exit-code'. Jul 22 16:05:04 sebastians-laptop systemd[1]: Stopped Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb. Jul 22 16:05:04 sebastians-laptop systemd[1]: ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service: Consumed 1.030s CPU time. ➜ cephadm git:(cephadm-container-name-dashes) ✗
might be related to conmon is exiting with 143 instead of 0
Updated by Adam King over 2 years ago
Adding
SuccessExitStatus=143
to the [Service] section of the unit file technically fixes this. Not sure if we want something like that though or if we should try to stop the 143 error code from coming up to begin with.
Updated by Adam King over 2 years ago
I'm thinking this might not be a cephadm specific issue. For one thing, no matter what version I tested this with (I even went back and tested 15.2.4 which came out about a year ago) node-exporter always goes into error state, so I think there's an issue specific to stopping those daemons. Other than that I checked stopping crash and mon daemons. From latest master, node-exporter and crash went to error state, while mon stopped succseffully. The logs for the stopping event for the three daemons (in order crash, node-exporter, mon) were
Aug 02 14:43:06 vm-00 systemd[1]: Stopping Ceph crash.vm-00 for de98bed2-f39d-11eb-9ca6-525400cfc295... -- Subject: Unit ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@crash.vm-00.service has begun shutting down -- Defined-By: systemd -- Support: https://access.redhat.com/support -- -- Unit ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@crash.vm-00.service has begun shutting down. Aug 02 14:43:07 vm-00 conmon[25470]: Traceback (most recent call last): Aug 02 14:43:07 vm-00 conmon[25470]: File "/usr/bin/ceph-crash", line 113, in <module> Aug 02 14:43:07 vm-00 conmon[25470]: main() Aug 02 14:43:07 vm-00 conmon[25470]: File "/usr/bin/ceph-crash", line 109, in main Aug 02 14:43:07 vm-00 conmon[25470]: time.sleep(args.delay * 60) Aug 02 14:43:07 vm-00 conmon[25470]: TypeError: handler() takes 1 positional argument but 2 were given Aug 02 14:43:07 vm-00 podman[38457]: c152a1fb9b7f1b9e8bbf5c316c821488b9305f46ea62dcec5148da002f5aad5d Aug 02 14:43:07 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@crash.vm-00.service: Main process exited, code=exited, status=1/FAILURE Aug 02 14:43:07 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@crash.vm-00.service: Failed with result 'exit-code'. -- Subject: Unit failed -- Unit ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@node-exporter.vm-00.service has begun shutting down. Aug 02 15:26:47 vm-00 podman[48481]: a8d1605d86b4ea4f7b8e72c17efbe1fd433627c88dbfd61ad18b5f8034941dca Aug 02 15:26:47 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@node-exporter.vm-00.service: Main process exited, code=exited, status=143/n/a Aug 02 15:26:47 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@node-exporter.vm-00.service: Failed with result 'exit-code'. -- Subject: Unit failed -- Unit ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@mon.vm-00.service has begun shutting down. Aug 02 15:21:48 vm-00 ceph-mon[17339]: received signal: Terminated from /dev/init -- /usr/bin/ceph-mon -n mon.vm-00 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --defa> Aug 02 15:21:48 vm-00 ceph-mon[17339]: mon.vm-00@0(leader) e3 *** Got Signal Terminated *** Aug 02 15:21:48 vm-00 ceph-mon[17339]: mon.vm-00@0(leader) e3 shutdown Aug 02 15:21:48 vm-00 conmon[17317]: 2021-08-02T15:21:48.873+0000 7fb8b1854700 -1 received signal: Terminated from /dev/init -- /usr/bin/ceph-mon -n mon.vm-00 -f --setuser ceph --setgroup ceph --default-log-to-file=> Aug 02 15:21:48 vm-00 conmon[17317]: 2021-08-02T15:21:48.873+0000 7fb8b1854700 -1 mon.vm-00@0(leader) e3 *** Got Signal Terminated *** Aug 02 15:21:48 vm-00 ceph-mon[17339]: rocksdb: close waiting for compaction thread to stop Aug 02 15:21:48 vm-00 ceph-mon[17339]: rocksdb: close compaction thread to stopped Aug 02 15:21:48 vm-00 ceph-mon[17339]: rocksdb: [db/db_impl/db_impl.cc:447] Shutdown: canceling all background work Aug 02 15:21:48 vm-00 ceph-mon[17339]: rocksdb: [db/db_impl/db_impl.cc:625] Shutdown complete Aug 02 15:21:49 vm-00 podman[45870]: 547198d768b2ed8d81a154ed02f2ab723b15af8f783bba4fd774ef307de15e0f Aug 02 15:21:49 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@mon.vm-00.service: Succeeded. -- Subject: Unit succeeded
When testing with 16.2.4, the latest Pacific release, node-exporter still went into error state but crash went into stopped as expected. Mon also still stopped with no issue.
-- Unit ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@crash.vm-00.service has begun shutting down. Aug 02 16:45:09 vm-00 conmon[25204]: *** Interrupted with signal 15 *** Aug 02 16:45:09 vm-00 podman[39680]: 0063f0751110acc4a536775a02dc5572ca00a8ba6ee141ecea3aadbaa7082f43 Aug 02 16:45:09 vm-00 systemd[1]: ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@crash.vm-00.service: Succeeded. -- Subject: Unit succeeded -- Unit ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@node-exporter.vm-00.service has begun shutting down. Aug 02 16:45:15 vm-00 podman[40582]: e0b974db0616e8d64d1519f37cafdd1ffc22a631b47eab11a4e4be2d8cdd39a1 Aug 02 16:45:15 vm-00 systemd[1]: ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@node-exporter.vm-00.service: Main process exited, code=exited, status=143/n/a Aug 02 16:45:15 vm-00 systemd[1]: ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@node-exporter.vm-00.service: Failed with result 'exit-code'. -- Subject: Unit failed -- Unit ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@mon.vm-00.service has begun shutting down. Aug 02 16:46:32 vm-00 conmon[17272]: cluster 2021-08-02T16:46:31.206031+0000 mgr.vm-00.qeiblt (mgr.14164) 1556 : cluster [DBG] pgmap v1499: 0 pgs: ; 0 B data, 0 B used, 0 B / 0 B avail Aug 02 16:46:32 vm-00 conmon[17272]: audit 2021-08-02T16:46:31.857425+0000 mon.vm-00 (mon.0) 510 : audit [DBG] from='mgr.14164 192.168.122.89:0/3202536674' entity='mgr.vm-00.qeiblt' cmd=[{"prefix": "config dump", "fo> Aug 02 16:46:32 vm-00 conmon[17272]: audit 2021-08-02T16:46:31.879757+0000 mon.vm-00 (mon.0) 511 : audit [INF] from='mgr.14164 192.168.122.89:0/3202536674' entity='mgr.vm-00.qeiblt' Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.521+0000 7fe2cbc7e700 -1 received signal: Terminated from /dev/init -- /usr/bin/ceph-mon -n mon.vm-00 -f --setuser ceph --setgroup ceph --default-log-to> Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.521+0000 7fe2cbc7e700 -1 mon.vm-00@0(leader) e3 *** Got Signal Terminated *** Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.521+0000 7fe2cbc7e700 1 mon.vm-00@0(leader) e3 shutdown Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.532+0000 7fe2dbe54700 1 rocksdb: close waiting for compaction thread to stop Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.532+0000 7fe2dbe54700 1 rocksdb: close compaction thread to stopped Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.532+0000 7fe2dbe54700 4 rocksdb: [db_impl/db_impl.cc:397] Shutdown: canceling all background work Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.533+0000 7fe2dbe54700 4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete Aug 02 16:46:32 vm-00 podman[42737]: 831727ddc2203120a2e88a2b817af64709a843facb61a550b61f6fa843139322 Aug 02 16:46:32 vm-00 systemd[1]: ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@mon.vm-00.service: Succeeded. -- Subject: Unit succeeded
Given only some containers are having errors stopping, with different error code (crash had code 1, node-exporter 143) and it varies between ceph version despite the way we do the stopping being the same (my testing for latest master doesn't include https://github.com/ceph/ceph/pull/42242) it seems that either the way have done stopping daemons up until now isn't very good, or what we're doing is fine and it's the fault of daemons we're stopping for not handling sigterm and returning the exit code we'd like.
Updated by Redouane Kachach Elhichou almost 2 years ago
- Status changed from New to In Progress
- Assignee set to Redouane Kachach Elhichou
Updated by Redouane Kachach Elhichou almost 2 years ago
It seems that this issue (in case of node-exporter) is caused by node-exporter not handling correctly the SIGTERM signal (sent by docker stop or podman stop). This can be easily observed by running the container:
podman run -d --net="host" --pid="host" -v "/:/host:ro,rslave" 'quay.io/prometheus/node-exporter:latest' --path.rootfs=/host
Stopping it:
podman stop <container_id>
And checking its exit code:
podman ps -a | grep node
In the case of node-exporter we always get 143 code:
ee40291790d4 quay.io/prometheus/node-exporter:latest --path.rootfs=/ho... 14 minutes ago Exited (143) 14 minutes ago
Normally 143 is the exit code generated when the underlying Operating system terminates the container. In this case it's docker/podman stopping the container gracefully so it should exit with 0 instead of 143.
An issue has been opened to report this error:
https://github.com/prometheus/node_exporter/issues/2356
Updated by Redouane Kachach Elhichou almost 2 years ago
- Status changed from In Progress to New
Updated by Redouane Kachach Elhichou almost 2 years ago
- Status changed from New to Need More Info
Updated by Redouane Kachach Elhichou 7 months ago
- Assignee deleted (
Redouane Kachach Elhichou)