Bug #51974
mgr/devicehealth: module fails with I/O error (Closed)
Description
Description of problem
Looks like something introduced in Quincy makes the devicehealth mgr module crash as part of the ceph-container demo test.
Environment
- ceph version 17.0.0-6531-gea33bc4f (ea33bc4fad5d30ef383f3f90876c9cdfb21dc53b) quincy (dev)
- Platform (OS/distro/release): Debian Buster 10.10 (host) and CentOS 8.4 (container)
- Cluster details (nodes, monitors, OSDs): 1 container running all ceph services as part of the ceph-container demo project [1] (see the process tree in the additional information section)
How reproducible
Every time with the ceph@master container. I don't see the same with the stable releases (pacific, octopus, nautilus).
Actual results
# ceph crash ls
ID                                                                ENTITY            NEW
2021-07-30T20:17:12.313855Z_faa33dc7-42fd-4aae-b41a-9e2ac41f08a5  mgr.a7f23c65d48e   *
# ceph crash info 2021-07-30T20:17:12.313855Z_faa33dc7-42fd-4aae-b41a-9e2ac41f08a5
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 338, in serve\n    if self.db_ready() and self.enable_monitoring:",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1132, in db_ready\n    return self.db is not None",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1144, in db\n    self._db = self.open_db()",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1125, in open_db\n    db = sqlite3.connect(uri, check_same_thread=False, uri=True)",
        "sqlite3.OperationalError: disk I/O error"
    ],
    "ceph_version": "17.0.0-6531-gea33bc4f",
    "crash_id": "2021-07-30T20:17:12.313855Z_faa33dc7-42fd-4aae-b41a-9e2ac41f08a5",
    "entity_name": "mgr.a7f23c65d48e",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "OperationalError",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "6a9db3345c5202aea65e3e052878299878c0e1be3dec92a7034a4e0a0efb13fb",
    "timestamp": "2021-07-30T20:17:12.313855Z",
    "utsname_hostname": "a7f23c65d48e",
    "utsname_machine": "x86_64",
    "utsname_release": "5.10.0-0.bpo.7-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 5.10.40-1~bpo10+1 (2021-06-04)"
}
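SQLite reports a generic "disk I/O error" whenever the underlying VFS fails. Here mgr_module.py opens the per-module database through libcephsqlite (SQLite over RADOS), so a denied RADOS operation can surface as this opaque disk error rather than a permission error. Two quick checks from inside the container (standard CLI; the '.mgr' pool name is an assumption for this dev build):

# ceph auth get mgr.a7f23c65d48e   # the caps the failing mgr is running with
# ceph osd pool ls                 # look for the pool backing the module DB (assumed '.mgr')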
And the cluster status is HEALTH_ERR:
# ceph -s
  cluster:
    id:     de70f84f-4a04-4cf7-87fe-9387385f99c2
    health: HEALTH_ERR
            Module 'devicehealth' has failed: disk I/O error
            1 mgr modules have recently crashed

  services:
    mon:        1 daemons, quorum a7f23c65d48e (age 9m)
    mgr:        a7f23c65d48e(active, since 9m)
    mds:        1/1 daemons up
    osd:        2 osds: 2 up (since 9m), 2 in (since 9m)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
    rgw-nfs:    1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 400 pgs
    objects: 249 objects, 8.8 KiB
    usage:   34 MiB used, 200 GiB / 200 GiB avail
    pgs:     400 active+clean
Expected results
The devicehealth mgr module should not fail, and the cluster status should be HEALTH_OK.
Additional information
/usr/sbin/dockerd -H fd://
 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml --log-level info
     \_ docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/a7f23c65d48e617d3054b6df3624fb60fa7141258ed89adbcea64e62c1639a5e -address /var/run/docker/containerd/containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
         \_ /bin/bash /opt/ceph-container/bin/entrypoint.sh demo
             \_ /usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i a7f23c65d48e --mon-data /var/lib/ceph/mon/ceph-a7f23c65d48e --public-addr 127.0.0.1
             \_ ceph-mgr --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i a7f23c65d48e
             \_ ceph-osd --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i 0
             \_ ceph-osd --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i 1
             \_ ceph-mds --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -i demo
             \_ radosgw --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false -n client.rgw.a7f23c65d48e -k /var/lib/ceph/radosgw/ceph-rgw.a7f23c65d48e/keyring
             \_ python3 app.py
             \_ dbus-daemon --system
             \_ rpcbind
             \_ rpc.statd -L
             \_ ganesha.nfsd -L STDOUT
             \_ rbd-mirror --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false
             \_ /usr/libexec/platform-python -s /usr/bin/ceph --cluster ceph -w
[1] https://github.com/ceph/ceph-container/blob/master/src/daemon/demo.sh
Updated by Dimitri Savineau almost 3 years ago
Steps to reproduce:
Pull the latest ceph@master container image for the demo (replace docker with podman if needed):
# docker pull docker.io/ceph/daemon:latest-master
Create a local directory for the demo container bind mount:
# mkdir ceph
Run the demo container (replace docker with podman if needed):
# docker run --rm -d --privileged --name ceph-demo -v $(pwd)/ceph:/var/lib/ceph -e RGW_FRONTEND_TYPE=beast -e DEBUG=verbose -e RGW_FRONTEND_PORT=8000 -e MON_IP=127.0.0.1 -e CEPH_PUBLIC_NETWORK=0.0.0.0/0 -e CLUSTER=ceph -e CEPH_DEMO_UID=demo -e CEPH_DEMO_ACCESS_KEY=G1EZ5R4K6IJ7XUQKMAED -e CEPH_DEMO_SECRET_KEY=cNmUrqpBKjCMzcfqG8fg4Qk07Xkoyau52OmvnSsz -e CEPH_DEMO_BUCKET=foobar -e SREE_PORT=5001 -e DATA_TO_SYNC=/etc/modprobe.d -e DATA_TO_SYNC_BUCKET=github -e OSD_COUNT=2 docker.io/ceph/daemon:latest-master demo
Enter the container and check the status (replace docker with podman if needed):
# docker exec -it ceph-demo bash
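Once inside, the failure is immediately visible with the same commands quoted earlier in this report:

# ceph -s
# ceph crash ls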
Updated by Patrick Donnelly almost 3 years ago
The problem looks to be that the ceph-mgr caps are too restrictive:
[root@d9cf10892e40 /]# ceph auth get mgr.d9cf10892e40
[mgr.d9cf10892e40]
        key = AQAYdQRhbZzUBBAAG1FIweQUpLH9IOeeFOGjOg==
        caps mon = "allow *"
exported keyring for mgr.d9cf10892e40
It needs full osd/mds permissions (and probably others too).
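A hedged sketch of a manual workaround from inside the container, using the mgr id from the output above and the standard mgr cap set; the active mgr needs to be respawned afterwards so devicehealth retries opening its DB:

# ceph auth caps mgr.d9cf10892e40 mon 'allow profile mgr' osd 'allow *' mds 'allow *'
# ceph mgr fail d9cf10892e40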
Updated by Patrick Donnelly almost 3 years ago
vstart uses these perms:
mgr.x
        key: AQDjegRh1FMzHhAAN9kNqnn66EcH39qz7a0TCw==
        caps: [mds] allow *
        caps: [mon] allow profile mgr
        caps: [osd] allow *
I'd also check what cephadm does.
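For comparison, the mgr caps on any running cluster can be dumped with the standard CLI:

# ceph auth ls | grep -A 4 '^mgr\.'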
Updated by Dimitri Savineau almost 3 years ago
- Status changed from New to Closed
Thanks for the debugging, Patrick.
I've sent a PR to fix this in the ceph-container project:
https://github.com/ceph/ceph-container/pull/1921/commits/4bdc2b190e2c85aff245821abf229b999c0d01f9
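The change boils down to bootstrapping the mgr keyring with the full cap set instead of mon-only; roughly (hedged sketch, $MGR_NAME standing in for the hostname-derived id the demo script uses):

# ceph auth get-or-create mgr."$MGR_NAME" mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-"$MGR_NAME"/keyring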
Let's close this issue.