Bug #46036 (closed)
cephadm: killmode=none: systemd units failed, but containers still running
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Description
# ceph orch ps
NAME   HOST         STATUS  REFRESHED  AGE  VERSION     IMAGE NAME  IMAGE ID      CONTAINER ID
osd.0  hostXXXXX-4  error   6m ago     92m  15.2.3.252  ceph/ceph   33194941836f  c5dd2b0cc77d
osd.1  hostXXXXX-4  error   6m ago     90m  15.2.3.252  ceph/ceph   33194941836f  b65dc56c76a2
It turns out the systemd unit failed:
● ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.2.service - Ceph osd.2 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc
   Loaded: loaded (/etc/systemd/system/ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-06-16 12:05:49 UTC; 1h 32min ago
  Process: 3861 ExecStopPost=/bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.2/unit.poststop (code=exited, status=0/SUCCESS)
  Process: 3693 ExecStart=/bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.2/unit.run (code=exited, status=125)
  Process: 3676 ExecStartPre=/usr/bin/podman rm ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.2 (code=exited, status=2)
 Main PID: 3693 (code=exited, status=125)
    Tasks: 34
   CGroup: /system.slice/system-ceph\x2d92d2d4c0\x2daf05\x2d11ea\x2d9578\x2d0cc47aaa2edc.slice/ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.2.service
           ├─28935 /bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.2/unit.run
           ├─29335 /usr/bin/podman run --rm --net=host --ipc=host --privileged --group-add=disk --name ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.2 -e CONTAINER_IMAGE=ceph/ceph -e NODE_NAME=hostXXXXX-4 -v /var/run/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc>
           └─29396 /usr/bin/conmon --api-version 1 -s -c 2f88b58cb64519fc90842f6a473703da44c5612d2686b5beae86b0ff2a7d50bb -u 2f88b58cb64519fc90842f6a473703da44c5612d2686b5beae86b0ff2a7d50bb -r /usr/sbin/runc -b /var/lib/containers/storage/btrfs-containers/2f88b58cb64519fc90842f6a473703da44c5612d2686b5beae86b0ff2a7d5>

Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Uptime(secs): 5400.0 total, 0.0 interval
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Flush(GB): cumulative 0.000, interval 0.000
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: AddFile(GB): cumulative 0.000, interval 0.000
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: AddFile(Total Files): cumulative 0, interval 0
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: AddFile(L0 Files): cumulative 0, interval 0
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: AddFile(Keys): cumulative 0, interval 0
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Jun 16 13:32:08 hostXXXXX-4 bash[28935]: ** File Read Latency Histogram By Level [default] **
The journal log shows something like:
-- Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has begun starting up.
Jun 16 12:03:06 hostXXXXX-4 podman[31032]: Error: cannot remove container b65dc56c76a247e9178fa81005a93cf44e502a06fb46bcc79b9bf484128b2907 as it is running
-> Jun 16 12:03:06 hostXXXXX-4 systemd[1]: Started Ceph osd.1 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc.
-- Subject: Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has finished starting up.
--
-- The start-up result is done.
Jun 16 12:03:06 hostXXXXX-4 bash[31047]: WARNING: The same type, major and minor should not be used for multiple devices.
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-78635646-a4a6-474e->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/ln -snf /dev/ceph-block-78635646-a4a6-474e-8832-c3a3e668cf9d/osd-block-f645cf27-857b-48ae->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--block--78635646--a4a6--474e--8832--c3a3e668cf9d-osd-->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/ln -snf /dev/ceph-block-dbs-83a1ab17-f232-4f60-887f-111b89f3f655/osd-block-db-3c9563a7-9ab>
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -h ceph:ceph /dev/ceph-block-dbs-83a1ab17-f232-4f60-887f-111b89f3f655/osd-block-db-3>
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--block--dbs--83a1ab17--f232--4f60--887f--111b89f3f655->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block.db
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--block--dbs--83a1ab17--f232--4f60--887f--111b89f3f655->
Jun 16 12:03:07 hostXXXXX-4 bash[31047]: --> ceph-volume lvm activate successful for osd ID: 1
Jun 16 12:03:08 hostXXXXX-4 bash[31047]: WARNING: The same type, major and minor should not be used for multiple devices.
Jun 16 12:03:08 hostXXXXX-4 bash[31047]: Error: error creating container storage: the container name "ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1" is alr>
Jun 16 12:03:08 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Main process exited, code=exited, status=125/n/a
Jun 16 12:03:08 hostXXXXX-4 bash[31227]: WARNING: The same type, major and minor should not be used for multiple devices.
Jun 16 12:03:09 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Unit entered failed state.
Jun 16 12:03:09 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Failed with result 'exit-code'.
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Service RestartSec=10s expired, scheduling restart.
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: Stopped Ceph osd.1 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc.
-- Subject: Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has finished shutting down.
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Start request repeated too quickly.
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: Failed to start Ceph osd.1 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc.
-- Subject: Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service has failed.
--
-- The result is failed.
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Unit entered failed state.
Jun 16 12:03:19 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Failed with result 'exit-code'.
Adding a set -e to /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.run changes the output to:
hostXXXXX-4:~ # systemctl status ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1
● ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service - Ceph osd.1 for 92d2d4c0-af05-11ea-9578-0cc47aaa2edc
   Loaded: loaded (/etc/systemd/system/ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2020-06-16 14:07:27 UTC; 4s ago
  Process: 10391 ExecStopPost=/bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.poststop (code=exited, status=0/SUCCESS)
  Process: 10216 ExecStart=/bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.run (code=exited, status=125)
  Process: 10201 ExecStartPre=/usr/bin/podman rm ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1 (code=exited, status=2)
 Main PID: 10216 (code=exited, status=125)
    Tasks: 29
   CGroup: /system.slice/system-ceph\x2d92d2d4c0\x2daf05\x2d11ea\x2d9578\x2d0cc47aaa2edc.slice/ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service
           ├─25971 /bin/bash /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.run
           ├─26395 /usr/bin/podman run --rm --net=host --ipc=host --privileged --group-add=disk --name ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1 -e CON>
           └─26452 /usr/bin/conmon --api-version 1 -s -c b65dc56c76a247e9178fa81005a93cf44e502a06fb46bcc79b9bf484128b2907 -u b65dc56c76a247e9178fa81005a93cf4>

Jun 16 14:07:27 hostXXXXX-4 systemd[1]: ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc@osd.1.service: Failed with result 'exit-code'.
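The reason set -e matters here: a bash script's exit status is that of its last command only, so without set -e an earlier failing step inside unit.run is silently swallowed and systemd never sees it. A minimal, generic demonstration of the difference (plain bash, not the actual cephadm-generated script):

```shell
#!/bin/bash
# Without `set -e`, only the last command's status is reported:
# the failing `false` (standing in for a failing setup step) is ignored.
bash -c 'false; true'
echo "without set -e: exit=$?"

# With `set -e`, the script aborts at the first failing command,
# so the unit's ExecStart reports a non-zero status immediately.
bash -c 'set -e; false; true'
echo "with set -e: exit=$?"
```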
Now, let's stop the container:
hostXXXXX-4:~ # podman stop ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1
b65dc56c76a247e9178fa81005a93cf44e502a06fb46bcc79b9bf484128b2907
Adding this line to /var/lib/ceph/92d2d4c0-af05-11ea-9578-0cc47aaa2edc/osd.1/unit.run:

! /usr/bin/podman rm --storage ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1

brings the service back up.
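Note the leading ! in the added line: under set -e a failing command aborts the script, but a ! prefix inverts the exit status, so this best-effort cleanup cannot kill unit.run when there is nothing left to remove. A generic sketch of that pattern (false stands in for a cleanup command such as podman rm --storage):

```shell
#!/bin/bash
set -e
# Best-effort cleanup: the leading `!` inverts the failing status,
# so `set -e` does not abort the script when the cleanup fails.
! false
echo "cleanup failure tolerated, continuing"
```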
Updated by Sebastian Wagner almost 4 years ago
https://github.com/ceph/ceph/pull/35524 is part of the solution. The other part is adding a set -e to the unit.run script.
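Taken together, the two parts amount to a unit.run that begins roughly like this (a hypothetical sketch for illustration only; the file actually generated by cephadm differs):

```
#!/bin/bash
set -e   # abort the unit as soon as any setup step fails

# best-effort removal of leftover container storage; the leading `!`
# keeps a harmless failure from aborting the script under `set -e`
! /usr/bin/podman rm --storage ceph-92d2d4c0-af05-11ea-9578-0cc47aaa2edc-osd.1

# ... ceph-volume activation and the final `podman run` follow ...
```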
Updated by Sebastian Wagner almost 4 years ago
- Related to Bug #44990: cephadm: exec: "/usr/bin/ceph-mon": stat /usr/bin/ceph-mon: no such file or directory added
Updated by Sebastian Wagner almost 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 35651
Updated by Sebastian Wagner almost 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Sebastian Wagner almost 4 years ago
- Status changed from Pending Backport to Resolved
- Target version set to v15.2.5
Updated by Sebastian Wagner over 3 years ago
- Related to Bug #46654: Unsupported podman container configuration via systemd added