Bug #44272: on SUSE, crash daemon starts but then always stops a couple minutes later
Status: Closed
Description
Recently, cephadm/orchestrator started deploying a crash daemon on all cluster nodes.
On SUSE (at least), the crash daemon does not stay up for long: after a few minutes, it always stops. journalctl has this to say about it:
# journalctl -u "ceph-899b6a04-5715-11ea-9d8c-525400f299cb@crash.admin.service" | head
-- Logs begin at Mon 2020-02-24 15:47:30 CET, end at Mon 2020-02-24 16:18:38 CET. --
Feb 24 15:54:29 admin systemd[1]: Starting Ceph crash.admin for 899b6a04-5715-11ea-9d8c-525400f299cb...
Feb 24 15:54:29 admin podman[15929]: Error: no container with name or ID ceph-899b6a04-5715-11ea-9d8c-525400f299cb-crash.admin found: no such container
Feb 24 15:54:29 admin systemd[1]: Started Ceph crash.admin for 899b6a04-5715-11ea-9d8c-525400f299cb.
Feb 24 15:54:30 admin bash[15941]: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
Feb 24 16:00:16 admin systemd[1]: Stopping Ceph crash.admin for 899b6a04-5715-11ea-9d8c-525400f299cb...
Feb 24 16:00:16 admin podman[20703]: time="2020-02-24T16:00:16+01:00" level=error msg="container_linux.go:389: signaling init process caused \"permission denied\""
Feb 24 16:00:16 admin podman[20703]: container_linux.go:389: signaling init process caused "permission denied"
Feb 24 16:00:16 admin podman[20703]: Error: permission denied
Feb 24 16:00:31 admin systemd[1]: ceph-899b6a04-5715-11ea-9d8c-525400f299cb@crash.admin.service: State 'stop-post' timed out. Terminating.
Updated by Nathan Cutler about 4 years ago
- Subject changed from "crash daemon not managed by cephadm on SUSE" to "on SUSE, crash daemon starts but then always stops a couple minutes later"
Updated by Sebastian Wagner about 4 years ago
related: https://github.com/opencontainers/runc/issues/2236
After reading the code at container_linux.go:389, the podman error does not seem to be the cause of this. The systemd "Stopping" message seems to be the first real indication of the crash daemon shutting down.
Updated by Sebastian Wagner about 4 years ago
Rethinking. I think this is an AppArmor problem. Adding the output of dmesg here would be helpful.
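The AppArmor hypothesis can be checked directly on the affected node. A minimal diagnostic sketch (assumes a SUSE-like host; the grep pattern matches the kernel audit-log format for AppArmor denials):

```shell
# Check whether the AppArmor kernel module is active and surface any recent
# "DENIED" audit records from the kernel ring buffer.
# Diagnostic sketch only; dmesg may require root on some systems.
if [ -d /sys/module/apparmor ]; then
    echo "AppArmor module loaded"
    # Denials look like: apparmor="DENIED" operation="signal" profile=...
    dmesg 2>/dev/null | grep 'apparmor="DENIED"' | tail -n 20
else
    echo "AppArmor module not loaded"
fi
```

On hosts with apparmor-utils installed, `aa-status` gives a summary of loaded profiles as well.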
Updated by Sebastian Wagner about 4 years ago
- Status changed from New to Triaged
Updated by Nathan Cutler about 4 years ago
OK, I will reproduce, obtain dmesg output, and post here.
One thing I did notice is that, with the upstream container, "crash" is not listed in "ceph orch ps". With the downstream container, it is listed.
Updated by Nathan Cutler about 4 years ago
OK, some more information:
admin:~ # ceph orch ps
NAME              HOST   STATUS   REFRESHED  VERSION      IMAGE NAME                                                            IMAGE ID      CONTAINER ID
crash.admin       admin  error    3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  853ab695fdb4
mgr.admin.xdltoy  admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  cb3b6e3ada75
mon.admin         admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  b59f5006b9c0
osd.0             admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  0ca3d9ea7824
osd.1             admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  0272b2aceb59
osd.2             admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  7a04ebd44a49
osd.3             admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  4e9a833bba0c
admin:~ # cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"
admin:~ # ceph --version
ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)
admin:~ # ceph versions
{
    "mon": {
        "ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)": 1
    },
    "mgr": {
        "ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)": 1
    },
    "osd": {
        "ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)": 4
    },
    "mds": {},
    "overall": {
        "ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)": 6
    }
}
And dmesg output is attached!
Updated by Sebastian Wagner about 4 years ago
from dmesg:
[  525.062394] audit: type=1400 audit(1583421345.488:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libpod-default-1.4.4" pid=14802 comm="apparmor_parser"
[  529.245377] audit: type=1400 audit(1583421349.672:3): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"
[  529.246334] audit: type=1400 audit(1583421349.672:4): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="receive" denied_mask="receive" signal=rtmin+1 peer="libpod-default-1.4.4"
[  529.248060] audit: type=1400 audit(1583421349.676:5): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"
[  529.249204] audit: type=1400 audit(1583421349.676:6): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="receive" denied_mask="receive" signal=rtmin+1 peer="libpod-default-1.4.4"
[  535.358734] audit: type=1400 audit(1583421355.788:7): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=16794 comm="ceph-mgr" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"
[  535.359912] audit: type=1400 audit(1583421355.788:8): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=16794 comm="ceph-mgr" requested_mask="receive" denied_mask="receive" signal=rtmin+1 peer="libpod-default-1.4.4"
[  535.361136] audit: type=1400 audit(1583421355.788:9): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=16794 comm="ceph-mgr" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"
[  535.362404] audit: type=1400 audit(1583421355.788:10): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=16794 comm="ceph-mgr" requested_mask="receive" denied_mask="receive" signal=rtmin+1 peer="libpod-default-1.4.4"
[  594.823456] audit: type=1400 audit(1583421415.251:11): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=21699 comm="podman" requested_mask="receive" denied_mask="receive" signal=exists peer="unconfined"
[  594.832888] audit: type=1400 audit(1583421415.259:12): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=21709 comm="runc" requested_mask="receive" denied_mask="receive" signal=term peer="unconfined"
[  594.836021] audit: type=1400 audit(1583421415.263:13): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=21699 comm="podman" requested_mask="receive" denied_mask="receive" signal=exists peer="unconfined"
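The denials above show the container's AppArmor profile (libpod-default-1.4.4) blocking signal delivery both between confined processes and from unconfined podman/runc, which matches the "permission denied" podman reports when systemd stops the container. A small sketch for summarizing which processes and signals are being denied; the sample audit record below is copied from the dmesg output above, and on a live system you would pipe `dmesg` in instead:

```shell
# Extract the process name (comm=) and denied signal (signal=) from
# AppArmor "DENIED" audit records. Sample line embedded for illustration.
sample='[  529.245377] audit: type=1400 audit(1583421349.672:3): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"'
printf '%s\n' "$sample" |
  grep 'apparmor="DENIED"' |
  sed -n 's/.*comm="\([^"]*\)".*signal=\([^ ]*\).*/\1 \2/p'
# prints: ceph-mon rtmin+1
```

Given these denials, a profile rule using AppArmor's signal-rule syntax, e.g. `signal (send, receive) peer=unconfined,` plus a matching rule for the profile's own peers, would be one way to permit these operations; the actual resolution is tracked in the pull request referenced below.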
Updated by Sebastian Wagner about 4 years ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 33850
Updated by Sage Weil about 4 years ago
- Status changed from Fix Under Review to Resolved