Bug #51806: cephadm: stopped contains end up in error state - Orchestrator - Ceph

Actions

Copy link

Bug #51806

open

cephadm: stopped contains end up in error state

Added by Sebastian Wagner over 2 years ago. Updated 7 months ago.

Status:

Need More Info

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

[ceph: root@sebastians-laptop /]# ceph orch stop node-exporter
Scheduled to stop node-exporter.sebastians-laptop on host 'sebastians-laptop'
[ceph: root@sebastians-laptop /]# ceph orch ps --service-name node-exporter
NAME                             HOST               PORTS   STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID   
node-exporter.sebastians-laptop  sebastians-laptop  *:9100  error      3m ago   4m        -        -  <unknown>  <unknown>

➜  cephadm git:(cephadm-container-name-dashes) ✗ sudo systemctl status ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop | cat
● ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service - Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb
     Loaded: loaded (/etc/systemd/system/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@.service; enabled; vendor preset: disabled)
     Active: active (running) since Thu 2021-07-22 16:04:16 CEST; 14s ago
    Process: 2907757 ExecStartPre=/bin/rm -f /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-pid /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-cid (code=exited, status=0/SUCCESS)
    Process: 2907758 ExecStart=/bin/bash /var/lib/ceph/b2f78482-ead5-11eb-9ac0-482ae35a5fbb/node-exporter.sebastians-laptop/unit.run (code=exited, status=0/SUCCESS)
   Main PID: 2907955 (conmon)
      Tasks: 8 (limit: 38293)
     Memory: 2.7M
        CPU: 562ms
     CGroup: /system.slice/system-ceph\x2db2f78482\x2dead5\x2d11eb\x2d9ac0\x2d482ae35a5fbb.slice/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service
             ├─container
             │ ├─2907958 /dev/init -- /bin/node_exporter --no-collector.timex --web.listen-address=:9100
             │ └─2907960 /bin/node_exporter --no-collector.timex --web.listen-address=:9100
             └─supervisor
               └─2907955 /usr/bin/conmon --api-version 1 -c b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 -u b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1/userdata -p /run/containers/storage/overlay-containers/b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1/userdata/pidfile -n ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter-sebastians-laptop --exit-dir /run/libpod/exits --socket-dir-path /run/libpod/socket -l journald --log-level warning --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1/userdata/oci-log --conmon-pidfile /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1

Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - textfile" source="node_exporter.go:104" 
Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - time" source="node_exporter.go:104" 
Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - uname" source="node_exporter.go:104" 
Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - vmstat" source="node_exporter.go:104" 
Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - xfs" source="node_exporter.go:104" 
Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg=" - zfs" source="node_exporter.go:104" 
Jul 22 16:04:16 sebastians-laptop conmon[2907955]: time="2021-07-22T14:04:16Z" level=info msg="Listening on :9100" source="node_exporter.go:170" 
Jul 22 16:04:16 sebastians-laptop podman[2907914]: 2021-07-22 16:04:16.490922626 +0200 CEST m=+0.269857979 container start b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 (image=docker.io/prom/node-exporter:v0.18.1, name=ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter-sebastians-laptop, maintainer=The Prometheus Authors <prometheus-developers@googlegroups.com>)
Jul 22 16:04:16 sebastians-laptop bash[2907914]: b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1
Jul 22 16:04:16 sebastians-laptop systemd[1]: Started Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb.
➜  cephadm git:(cephadm-container-name-dashes) ✗ sudo systemctl status ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop | cat
● ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service - Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb
     Loaded: loaded (/etc/systemd/system/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Thu 2021-07-22 16:05:04 CEST; 5s ago
    Process: 2907757 ExecStartPre=/bin/rm -f /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-pid /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-cid (code=exited, status=0/SUCCESS)
    Process: 2907758 ExecStart=/bin/bash /var/lib/ceph/b2f78482-ead5-11eb-9ac0-482ae35a5fbb/node-exporter.sebastians-laptop/unit.run (code=exited, status=0/SUCCESS)
    Process: 2908848 ExecStop=/bin/bash -c /bin/podman stop ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter.sebastians-laptop ; bash /var/lib/ceph/b2f78482-ead5-11eb-9ac0-482ae35a5fbb/node-exporter.sebastians-laptop/unit.stop (code=exited, status=0/SUCCESS)
    Process: 2909007 ExecStopPost=/bin/bash /var/lib/ceph/b2f78482-ead5-11eb-9ac0-482ae35a5fbb/node-exporter.sebastians-laptop/unit.poststop (code=exited, status=0/SUCCESS)
    Process: 2909008 ExecStopPost=/bin/rm -f /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-pid /run/ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service-cid (code=exited, status=0/SUCCESS)
   Main PID: 2907955 (code=exited, status=143)
        CPU: 1.030s

Jul 22 16:04:16 sebastians-laptop systemd[1]: Started Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb.
Jul 22 16:05:03 sebastians-laptop systemd[1]: Stopping Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb...
Jul 22 16:05:03 sebastians-laptop bash[2908849]: Error: no container with name or ID "ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter.sebastians-laptop" found: no such container
Jul 22 16:05:04 sebastians-laptop podman[2908926]: 2021-07-22 16:05:04.364723428 +0200 CEST m=+0.120139014 container remove b771c1403b134d57e8378aa979b297257a14880c249ce901263e1e771725c1d1 (image=docker.io/prom/node-exporter:v0.18.1, name=ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter-sebastians-laptop, maintainer=The Prometheus Authors <prometheus-developers@googlegroups.com>)
Jul 22 16:05:04 sebastians-laptop bash[2908886]: ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter-sebastians-laptop
Jul 22 16:05:04 sebastians-laptop systemd[1]: ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service: Main process exited, code=exited, status=143/n/a
Jul 22 16:05:04 sebastians-laptop bash[2908968]: Error: no container with name or ID "ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb-node-exporter.sebastians-laptop" found: no such container
Jul 22 16:05:04 sebastians-laptop systemd[1]: ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service: Failed with result 'exit-code'.
Jul 22 16:05:04 sebastians-laptop systemd[1]: Stopped Ceph node-exporter.sebastians-laptop for b2f78482-ead5-11eb-9ac0-482ae35a5fbb.
Jul 22 16:05:04 sebastians-laptop systemd[1]: ceph-b2f78482-ead5-11eb-9ac0-482ae35a5fbb@node-exporter.sebastians-laptop.service: Consumed 1.030s CPU time.
➜  cephadm git:(cephadm-container-name-dashes) ✗

might be related to conmon is exiting with 143 instead of 0

Actions

Copy link

Updated by Adam King over 2 years ago

Adding

SuccessExitStatus=143

to the [Service] section of the unit file technically fixes this. Not sure if we want something like that though or if we should try to stop the 143 error code from coming up to begin with.

Actions

Copy link

Updated by Adam King over 2 years ago

I'm thinking this might not be a cephadm specific issue. For one thing, no matter what version I tested this with (I even went back and tested 15.2.4 which came out about a year ago) node-exporter always goes into error state, so I think there's an issue specific to stopping those daemons. Other than that I checked stopping crash and mon daemons. From latest master, node-exporter and crash went to error state, while mon stopped succseffully. The logs for the stopping event for the three daemons (in order crash, node-exporter, mon) were

Aug 02 14:43:06 vm-00 systemd[1]: Stopping Ceph crash.vm-00 for de98bed2-f39d-11eb-9ca6-525400cfc295...
-- Subject: Unit ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@crash.vm-00.service has begun shutting down
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@crash.vm-00.service has begun shutting down.
Aug 02 14:43:07 vm-00 conmon[25470]: Traceback (most recent call last):
Aug 02 14:43:07 vm-00 conmon[25470]:   File "/usr/bin/ceph-crash", line 113, in <module>
Aug 02 14:43:07 vm-00 conmon[25470]:     main()
Aug 02 14:43:07 vm-00 conmon[25470]:   File "/usr/bin/ceph-crash", line 109, in main
Aug 02 14:43:07 vm-00 conmon[25470]:     time.sleep(args.delay * 60)
Aug 02 14:43:07 vm-00 conmon[25470]: TypeError: handler() takes 1 positional argument but 2 were given
Aug 02 14:43:07 vm-00 podman[38457]: c152a1fb9b7f1b9e8bbf5c316c821488b9305f46ea62dcec5148da002f5aad5d
Aug 02 14:43:07 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@crash.vm-00.service: Main process exited, code=exited, status=1/FAILURE
Aug 02 14:43:07 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@crash.vm-00.service: Failed with result 'exit-code'.
-- Subject: Unit failed

-- Unit ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@node-exporter.vm-00.service has begun shutting down.
Aug 02 15:26:47 vm-00 podman[48481]: a8d1605d86b4ea4f7b8e72c17efbe1fd433627c88dbfd61ad18b5f8034941dca
Aug 02 15:26:47 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@node-exporter.vm-00.service: Main process exited, code=exited, status=143/n/a
Aug 02 15:26:47 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@node-exporter.vm-00.service: Failed with result 'exit-code'.
-- Subject: Unit failed

-- Unit ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@mon.vm-00.service has begun shutting down.
Aug 02 15:21:48 vm-00 ceph-mon[17339]: received  signal: Terminated from /dev/init -- /usr/bin/ceph-mon -n mon.vm-00 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --defa>
Aug 02 15:21:48 vm-00 ceph-mon[17339]: mon.vm-00@0(leader) e3 *** Got Signal Terminated ***
Aug 02 15:21:48 vm-00 ceph-mon[17339]: mon.vm-00@0(leader) e3 shutdown
Aug 02 15:21:48 vm-00 conmon[17317]: 2021-08-02T15:21:48.873+0000 7fb8b1854700 -1 received  signal: Terminated from /dev/init -- /usr/bin/ceph-mon -n mon.vm-00 -f --setuser ceph --setgroup ceph --default-log-to-file=>
Aug 02 15:21:48 vm-00 conmon[17317]: 2021-08-02T15:21:48.873+0000 7fb8b1854700 -1 mon.vm-00@0(leader) e3 *** Got Signal Terminated ***
Aug 02 15:21:48 vm-00 ceph-mon[17339]: rocksdb: close waiting for compaction thread to stop
Aug 02 15:21:48 vm-00 ceph-mon[17339]: rocksdb: close compaction thread to stopped
Aug 02 15:21:48 vm-00 ceph-mon[17339]: rocksdb: [db/db_impl/db_impl.cc:447] Shutdown: canceling all background work
Aug 02 15:21:48 vm-00 ceph-mon[17339]: rocksdb: [db/db_impl/db_impl.cc:625] Shutdown complete
Aug 02 15:21:49 vm-00 podman[45870]: 547198d768b2ed8d81a154ed02f2ab723b15af8f783bba4fd774ef307de15e0f
Aug 02 15:21:49 vm-00 systemd[1]: ceph-de98bed2-f39d-11eb-9ca6-525400cfc295@mon.vm-00.service: Succeeded.
-- Subject: Unit succeeded

When testing with 16.2.4, the latest Pacific release, node-exporter still went into error state but crash went into stopped as expected. Mon also still stopped with no issue.

-- Unit ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@crash.vm-00.service has begun shutting down.
Aug 02 16:45:09 vm-00 conmon[25204]: *** Interrupted with signal 15 ***
Aug 02 16:45:09 vm-00 podman[39680]: 0063f0751110acc4a536775a02dc5572ca00a8ba6ee141ecea3aadbaa7082f43
Aug 02 16:45:09 vm-00 systemd[1]: ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@crash.vm-00.service: Succeeded.
-- Subject: Unit succeeded

-- Unit ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@node-exporter.vm-00.service has begun shutting down.
Aug 02 16:45:15 vm-00 podman[40582]: e0b974db0616e8d64d1519f37cafdd1ffc22a631b47eab11a4e4be2d8cdd39a1
Aug 02 16:45:15 vm-00 systemd[1]: ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@node-exporter.vm-00.service: Main process exited, code=exited, status=143/n/a
Aug 02 16:45:15 vm-00 systemd[1]: ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@node-exporter.vm-00.service: Failed with result 'exit-code'.
-- Subject: Unit failed

-- Unit ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@mon.vm-00.service has begun shutting down.
Aug 02 16:46:32 vm-00 conmon[17272]: cluster 2021-08-02T16:46:31.206031+0000 mgr.vm-00.qeiblt (mgr.14164) 1556 : cluster [DBG] pgmap v1499: 0 pgs: ; 0 B data, 0 B used, 0 B / 0 B avail
Aug 02 16:46:32 vm-00 conmon[17272]: audit 2021-08-02T16:46:31.857425+0000 mon.vm-00 (mon.0) 510 : audit [DBG] from='mgr.14164 192.168.122.89:0/3202536674' entity='mgr.vm-00.qeiblt' cmd=[{"prefix": "config dump", "fo>
Aug 02 16:46:32 vm-00 conmon[17272]: audit 2021-08-02T16:46:31.879757+0000 mon.vm-00 (mon.0) 511 : audit [INF] from='mgr.14164 192.168.122.89:0/3202536674' entity='mgr.vm-00.qeiblt' 
Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.521+0000 7fe2cbc7e700 -1 received  signal: Terminated from /dev/init -- /usr/bin/ceph-mon -n mon.vm-00 -f --setuser ceph --setgroup ceph --default-log-to>
Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.521+0000 7fe2cbc7e700 -1 mon.vm-00@0(leader) e3 *** Got Signal Terminated ***
Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.521+0000 7fe2cbc7e700  1 mon.vm-00@0(leader) e3 shutdown
Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.532+0000 7fe2dbe54700  1 rocksdb: close waiting for compaction thread to stop
Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.532+0000 7fe2dbe54700  1 rocksdb: close compaction thread to stopped
Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.532+0000 7fe2dbe54700  4 rocksdb: [db_impl/db_impl.cc:397] Shutdown: canceling all background work
Aug 02 16:46:32 vm-00 conmon[17272]: debug 2021-08-02T16:46:32.533+0000 7fe2dbe54700  4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete
Aug 02 16:46:32 vm-00 podman[42737]: 831727ddc2203120a2e88a2b817af64709a843facb61a550b61f6fa843139322
Aug 02 16:46:32 vm-00 systemd[1]: ceph-f6fbfd70-f3a9-11eb-9596-5254001e1abe@mon.vm-00.service: Succeeded.
-- Subject: Unit succeeded

Given only some containers are having errors stopping, with different error code (crash had code 1, node-exporter 143) and it varies between ceph version despite the way we do the stopping being the same (my testing for latest master doesn't include https://github.com/ceph/ceph/pull/42242) it seems that either the way have done stopping daemons up until now isn't very good, or what we're doing is fine and it's the fault of daemons we're stopping for not handling sigterm and returning the exit code we'd like.

Actions

Copy link

Updated by Redouane Kachach Elhichou almost 2 years ago

Status changed from New to In Progress
Assignee set to Redouane Kachach Elhichou

Actions

Copy link

Updated by Redouane Kachach Elhichou almost 2 years ago

It seems that this issue (in case of node-exporter) is caused by node-exporter not handling correctly the SIGTERM signal (sent by docker stop or podman stop). This can be easily observed by running the container:

podman run -d --net="host" --pid="host" -v "/:/host:ro,rslave" 'quay.io/prometheus/node-exporter:latest' --path.rootfs=/host

Stopping it:

podman stop <container_id>

And checking its exit code:

podman ps -a | grep node

In the case of node-exporter we always get 143 code:

ee40291790d4  quay.io/prometheus/node-exporter:latest   --path.rootfs=/ho...  14 minutes ago     Exited (143) 14 minutes ago

Normally 143 is the exit code generated when the underlying Operating system terminates the container. In this case it's docker/podman stopping the container gracefully so it should exit with 0 instead of 143.

An issue has been opened to report this error:
https://github.com/prometheus/node_exporter/issues/2356

Actions

Copy link