Bug #44272: on SUSE, crash daemon starts but then always stops a couple minutes later
Status: Closed
Description
Recently, cephadm/orchestrator started deploying a crash daemon on all cluster nodes.
On SUSE (at least), the crash daemon does not stay up for long: after a few minutes, it always stops. journalctl has this to say about it:
# journalctl -u "ceph-899b6a04-5715-11ea-9d8c-525400f299cb@crash.admin.service" | head
-- Logs begin at Mon 2020-02-24 15:47:30 CET, end at Mon 2020-02-24 16:18:38 CET. --
Feb 24 15:54:29 admin systemd[1]: Starting Ceph crash.admin for 899b6a04-5715-11ea-9d8c-525400f299cb...
Feb 24 15:54:29 admin podman[15929]: Error: no container with name or ID ceph-899b6a04-5715-11ea-9d8c-525400f299cb-crash.admin found: no such container
Feb 24 15:54:29 admin systemd[1]: Started Ceph crash.admin for 899b6a04-5715-11ea-9d8c-525400f299cb.
Feb 24 15:54:30 admin bash[15941]: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
Feb 24 16:00:16 admin systemd[1]: Stopping Ceph crash.admin for 899b6a04-5715-11ea-9d8c-525400f299cb...
Feb 24 16:00:16 admin podman[20703]: time="2020-02-24T16:00:16+01:00" level=error msg="container_linux.go:389: signaling init process caused \"permission denied\""
Feb 24 16:00:16 admin podman[20703]: container_linux.go:389: signaling init process caused "permission denied"
Feb 24 16:00:16 admin podman[20703]: Error: permission denied
Feb 24 16:00:31 admin systemd[1]: ceph-899b6a04-5715-11ea-9d8c-525400f299cb@crash.admin.service: State 'stop-post' timed out. Terminating.
Updated by Nathan Cutler about 4 years ago
- Subject changed from "crash daemon not managed by cephadm on SUSE" to "on SUSE, crash daemon starts but then always stops a couple minutes later"
Updated by Sebastian Wagner about 4 years ago
related: https://github.com/opencontainers/runc/issues/2236
After reading the code at container_linux.go:389, the podman error does not seem to be the cause of this. The systemd "Stopping" message seems to be the first real indication of the crash daemon shutting down.
Updated by Sebastian Wagner about 4 years ago
Rethinking. I think this is an AppArmor problem. Adding the output of dmesg here would be helpful.
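The AppArmor hypothesis can be checked directly on the affected node. A minimal diagnostic sketch (assumes a SUSE-like host; the grep pattern matches the kernel audit-log format for AppArmor denials):

```shell
# Check whether the AppArmor kernel module is active and surface any recent
# "DENIED" audit records from the kernel ring buffer.
# Diagnostic sketch only; dmesg may require root on some systems.
if [ -d /sys/module/apparmor ]; then
    echo "AppArmor module loaded"
    # Denials look like: apparmor="DENIED" operation="signal" profile=...
    dmesg 2>/dev/null | grep 'apparmor="DENIED"' | tail -n 20
else
    echo "AppArmor module not loaded"
fi
```

On hosts with apparmor-utils installed, `aa-status` gives a summary of loaded profiles as well.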
Updated by Sebastian Wagner about 4 years ago
- Status changed from New to Triaged
Updated by Nathan Cutler about 4 years ago
OK, I will reproduce, obtain dmesg output, and post here.
One thing I did notice is that, with the upstream container, "crash" is not listed in "ceph orch ps". With the downstream container, it is listed.
Updated by Nathan Cutler about 4 years ago
OK, some more information:
admin:~ # ceph orch ps
NAME              HOST   STATUS   REFRESHED  VERSION      IMAGE NAME                                                            IMAGE ID      CONTAINER ID
crash.admin       admin  error    3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  853ab695fdb4
mgr.admin.xdltoy  admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  cb3b6e3ada75
mon.admin         admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  b59f5006b9c0
osd.0             admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  0ca3d9ea7824
osd.1             admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  0272b2aceb59
osd.2             admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  7a04ebd44a49
osd.3             admin  running  3m ago     15.1.0.1521  registry.suse.de/devel/storage/7.0/containers/ses/7/ceph/ceph:latest  09e408f3e7f6  4e9a833bba0c
admin:~ # cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"
admin:~ # ceph --version
ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)
admin:~ # ceph versions
{
    "mon": {
        "ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)": 1
    },
    "mgr": {
        "ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)": 1
    },
    "osd": {
        "ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)": 4
    },
    "mds": {},
    "overall": {
        "ceph version 15.1.0-1521-gcdf35413a0 (cdf35413a036bd1aa59a8c718bb177839c45cab1) octopus (rc)": 6
    }
}
And dmesg output is attached!
Updated by Sebastian Wagner about 4 years ago
from dmesg:
[  525.062394] audit: type=1400 audit(1583421345.488:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libpod-default-1.4.4" pid=14802 comm="apparmor_parser"
[  529.245377] audit: type=1400 audit(1583421349.672:3): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"
[  529.246334] audit: type=1400 audit(1583421349.672:4): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="receive" denied_mask="receive" signal=rtmin+1 peer="libpod-default-1.4.4"
[  529.248060] audit: type=1400 audit(1583421349.676:5): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"
[  529.249204] audit: type=1400 audit(1583421349.676:6): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="receive" denied_mask="receive" signal=rtmin+1 peer="libpod-default-1.4.4"
[  535.358734] audit: type=1400 audit(1583421355.788:7): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=16794 comm="ceph-mgr" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"
[  535.359912] audit: type=1400 audit(1583421355.788:8): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=16794 comm="ceph-mgr" requested_mask="receive" denied_mask="receive" signal=rtmin+1 peer="libpod-default-1.4.4"
[  535.361136] audit: type=1400 audit(1583421355.788:9): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=16794 comm="ceph-mgr" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"
[  535.362404] audit: type=1400 audit(1583421355.788:10): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=16794 comm="ceph-mgr" requested_mask="receive" denied_mask="receive" signal=rtmin+1 peer="libpod-default-1.4.4"
[  594.823456] audit: type=1400 audit(1583421415.251:11): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=21699 comm="podman" requested_mask="receive" denied_mask="receive" signal=exists peer="unconfined"
[  594.832888] audit: type=1400 audit(1583421415.259:12): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=21709 comm="runc" requested_mask="receive" denied_mask="receive" signal=term peer="unconfined"
[  594.836021] audit: type=1400 audit(1583421415.263:13): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=21699 comm="podman" requested_mask="receive" denied_mask="receive" signal=exists peer="unconfined"
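The denials above show the container's AppArmor profile (libpod-default-1.4.4) blocking signal delivery both between confined processes and from unconfined podman/runc, which matches the "permission denied" podman reports when systemd stops the container. A small sketch for summarizing which processes and signals are being denied; the sample audit record below is copied from the dmesg output above, and on a live system you would pipe `dmesg` in instead:

```shell
# Extract the process name (comm=) and denied signal (signal=) from
# AppArmor "DENIED" audit records. Sample line embedded for illustration.
sample='[  529.245377] audit: type=1400 audit(1583421349.672:3): apparmor="DENIED" operation="signal" profile="libpod-default-1.4.4" pid=15648 comm="ceph-mon" requested_mask="send" denied_mask="send" signal=rtmin+1 peer="libpod-default-1.4.4"'
printf '%s\n' "$sample" |
  grep 'apparmor="DENIED"' |
  sed -n 's/.*comm="\([^"]*\)".*signal=\([^ ]*\).*/\1 \2/p'
# prints: ceph-mon rtmin+1
```

Given these denials, a profile rule using AppArmor's signal-rule syntax, e.g. `signal (send, receive) peer=unconfined,` plus a matching rule for the profile's own peers, would be one way to permit these operations; the actual resolution is tracked in the pull request referenced below.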
Updated by Sebastian Wagner about 4 years ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 33850
Updated by Sage Weil about 4 years ago
- Status changed from Fix Under Review to Resolved