Bug #51974 (Closed)

mgr/devicehealth: module fails with I/O error

Added by Dimitri Savineau almost 3 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description of problem

It looks like something introduced in Quincy makes the devicehealth mgr module crash as part of the ceph-container demo test.

Environment

  • ceph version 17.0.0-6531-gea33bc4f (ea33bc4fad5d30ef383f3f90876c9cdfb21dc53b) quincy (dev)
  • Platform (OS/distro/release): Debian Buster 10.10 (host) and CentOS 8.4 (container)
  • Cluster details (nodes, monitors, OSDs): 1 container running all ceph services as part of the ceph-container demo project [1] (see the process tree in the additional information section below)

How reproducible

Every time with the ceph@master container. I don't see the same with the stable releases (pacific, octopus, nautilus).

Actual results

# ceph crash ls
ID                                                                ENTITY            NEW  
2021-07-30T20:17:12.313855Z_faa33dc7-42fd-4aae-b41a-9e2ac41f08a5  mgr.a7f23c65d48e   *
# ceph crash info 2021-07-30T20:17:12.313855Z_faa33dc7-42fd-4aae-b41a-9e2ac41f08a5
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 338, in serve\n    if self.db_ready() and self.enable_monitoring:",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1132, in db_ready\n    return self.db is not None",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1144, in db\n    self._db = self.open_db()",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1125, in open_db\n    db = sqlite3.connect(uri, check_same_thread=False, uri=True)",
        "sqlite3.OperationalError: disk I/O error" 
    ],
    "ceph_version": "17.0.0-6531-gea33bc4f",
    "crash_id": "2021-07-30T20:17:12.313855Z_faa33dc7-42fd-4aae-b41a-9e2ac41f08a5",
    "entity_name": "mgr.a7f23c65d48e",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "OperationalError",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "6a9db3345c5202aea65e3e052878299878c0e1be3dec92a7034a4e0a0efb13fb",
    "timestamp": "2021-07-30T20:17:12.313855Z",
    "utsname_hostname": "a7f23c65d48e",
    "utsname_machine": "x86_64",
    "utsname_release": "5.10.0-0.bpo.7-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 5.10.40-1~bpo10+1 (2021-06-04)" 
}

And the cluster status is HEALTH_ERR

# ceph -s
  cluster:
    id:     de70f84f-4a04-4cf7-87fe-9387385f99c2
    health: HEALTH_ERR
            Module 'devicehealth' has failed: disk I/O error
            1 mgr modules have recently crashed

  services:
    mon:        1 daemons, quorum a7f23c65d48e (age 9m)
    mgr:        a7f23c65d48e(active, since 9m)
    mds:        1/1 daemons up
    osd:        2 osds: 2 up (since 9m), 2 in (since 9m)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
    rgw-nfs:    1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 400 pgs
    objects: 249 objects, 8.8 KiB
    usage:   34 MiB used, 200 GiB / 200 GiB avail
    pgs:     400 active+clean

Expected results

The devicehealth mgr module should not fail and the cluster status should be HEALTH_OK.

Additional information

/usr/sbin/dockerd -H fd://
 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml --log-level info
     \_ docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/a7f23c65d48e617d3054b6df3624fb60fa7141258ed89adbcea64e62c1639a5e -address /var/run/docker/containerd/containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
         \_ /bin/bash /opt/ceph-container/bin/entrypoint.sh demo
             \_ /usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i a7f23c65d48e --mon-data /var/lib/ceph/mon/ceph-a7f23c65d48e --public-addr 127.0.0.1
             \_ ceph-mgr --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i a7f23c65d48e 
             \_ ceph-osd --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i 0
             \_ ceph-osd --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i 1
             \_ ceph-mds --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i demo
             \_ radosgw --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -n client.rgw.a7f23c65d48e -k /var/lib/ceph/radosgw/ceph-rgw.a7f23c65d48e/keyring
             \_ python3 app.py
             \_ dbus-daemon --system
             \_ rpcbind
             \_ rpc.statd -L
             \_ ganesha.nfsd  -L STDOUT
             \_ rbd-mirror --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false
             \_ /usr/libexec/platform-python -s /usr/bin/ceph --cluster ceph -w

[1] https://github.com/ceph/ceph-container/blob/master/src/daemon/demo.sh

#1 - Updated by Dimitri Savineau almost 3 years ago

Steps to reproduce:

Pull the latest ceph@master container image for the demo (replace docker with podman if needed)

# docker pull docker.io/ceph/daemon:latest-master

Create a local directory for the demo container bind mount

# mkdir ceph

Run the demo container (replace docker with podman if needed)

# docker run --rm -d --privileged --name ceph-demo -v $(pwd)/ceph:/var/lib/ceph -e RGW_FRONTEND_TYPE=beast -e DEBUG=verbose -e RGW_FRONTEND_PORT=8000 -e MON_IP=127.0.0.1 -e CEPH_PUBLIC_NETWORK=0.0.0.0/0 -e CLUSTER=ceph -e CEPH_DEMO_UID=demo -e CEPH_DEMO_ACCESS_KEY=G1EZ5R4K6IJ7XUQKMAED -e CEPH_DEMO_SECRET_KEY=cNmUrqpBKjCMzcfqG8fg4Qk07Xkoyau52OmvnSsz -e CEPH_DEMO_BUCKET=foobar -e SREE_PORT=5001 -e DATA_TO_SYNC=/etc/modprobe.d -e DATA_TO_SYNC_BUCKET=github -e OSD_COUNT=2 docker.io/ceph/daemon:latest-master demo

Enter the container and check the status (replace docker with podman if needed)

# docker exec -it ceph-demo bash
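
Once inside the container, the symptom can be confirmed with the same checks used in the description above (the crash ID will of course differ on each run):

# ceph -s
# ceph crash ls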

#2 - Updated by Patrick Donnelly almost 3 years ago

The problem looks to be that the ceph-mgr caps are too restrictive:

[root@d9cf10892e40 /]# ceph auth get mgr.d9cf10892e40
[mgr.d9cf10892e40]
        key = AQAYdQRhbZzUBBAAG1FIweQUpLH9IOeeFOGjOg==
        caps mon = "allow *" 
exported keyring for mgr.d9cf10892e40

It needs full osd/mds permissions (and probably others too).
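
For example, on a running demo cluster the caps could be widened in place with something along these lines (the exact cap set here is an assumption, modelled on the vstart perms shown in the next comment):

# ceph auth caps mgr.d9cf10892e40 mon 'allow profile mgr' osd 'allow *' mds 'allow *'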

#3 - Updated by Patrick Donnelly almost 3 years ago

vstart uses these perms:

mgr.x
        key: AQDjegRh1FMzHhAAN9kNqnn66EcH39qz7a0TCw==
        caps: [mds] allow *
        caps: [mon] allow profile mgr
        caps: [osd] allow *

I'd also check what cephadm does.
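
For comparison, one way to dump whatever mgr caps a cephadm (or any other) deployment ends up with, run on such a cluster:

# ceph auth ls | grep -A 4 '^mgr\.'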

#4 - Updated by Dimitri Savineau almost 3 years ago

  • Status changed from New to Closed

Thanks for the debugging, Patrick.

I've sent a PR to fix this in the ceph-container project.

https://github.com/ceph/ceph-container/pull/1921/commits/4bdc2b190e2c85aff245821abf229b999c0d01f9
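
In essence the change amounts to creating the demo mgr keyring with wider caps, roughly along these lines (a sketch based on the vstart perms above, not the literal PR content; $MGR_NAME stands in for the container hostname):

# ceph auth get-or-create mgr.$MGR_NAME mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
    -o /var/lib/ceph/mgr/ceph-$MGR_NAME/keyring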

Let's close this issue.
