Bug #49674
monitoring unit.run files don't remove container first
Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
The ceph daemon unit.run files look like:
set -e
/usr/bin/install -d -m0770 -o 167 -g 167 /var/run/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb
# mon.eutow
! /usr/bin/docker rm -f ceph-11a72f1d-1752-43ba-bc6a-84b856e3c1bb-mon.eutow
/usr/bin/docker run --rm --net=host --ipc=host --privileged --group-add=disk --name ceph-11a72f1d-1752-43ba-bc6a-84b856e3c1bb-mon.eutow -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.6 -e NODE_NAME=eutow -v /var/run/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb:/var/run/ceph:z -v /var/log/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb:/var/log/ceph:z -v /var/lib/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb/crash:/var/lib/ceph/crash:z -v /var/lib/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb/mon.eutow:/var/lib/ceph/mon/ceph-eutow:z -v /var/lib/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb/mon.eutow/config:/etc/ceph/ceph.conf:z -v /dev:/dev -v /run/udev:/run/udev --entrypoint /usr/bin/ceph-mon docker.io/ceph/ceph:v15.2.6 -n mon.eutow -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix="debug " --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true
Notably, there is that docker rm -f ... line.
The monitoring unit.run files, however, look like:
/usr/bin/docker run --rm --net=host --user 65534 --name ceph-11a72f1d-1752-43ba-bc6a-84b856e3c1bb-node-exporter.eutow -e CONTAINER_IMAGE=prom/node-exporter -e NODE_NAME=eutow -v /proc:/host/proc:ro -v /sys:/host/sys:ro -v /:/rootfs:ro prom/node-exporter --no-collector.timex
There is no docker rm -f here, which can be problematic if the unit doesn't shut down cleanly for whatever reason.
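The fix amounts to generating monitoring unit.run files with the same pre-start cleanup the ceph daemon files already have. A minimal sketch of that pattern (the helper name and parameters are hypothetical, not cephadm's actual code):

```python
def monitoring_unit_run(fsid: str, daemon: str, image: str, extra_args: str) -> str:
    """Hypothetical sketch: build a monitoring unit.run that, like the ceph
    daemon files above, force-removes any stale container before starting."""
    name = "ceph-{}-{}".format(fsid, daemon)
    return "\n".join([
        "set -e",
        # The leading '!' negates the exit status so 'set -e' does not abort
        # when there is no stale container to remove (same pattern as mon.eutow).
        "! /usr/bin/docker rm -f {}".format(name),
        "/usr/bin/docker run --rm --net=host --name {} {} {}".format(
            name, extra_args, image),
    ])
```

Without that rm -f line, a container left behind by an unclean shutdown still owns the --name, and the next docker run fails.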
Updated by Sebastian Wagner about 3 years ago
When was the node exporter deployed?
ceph orch redeploy node-exporter
might do the trick!
Updated by Sebastian Wagner about 3 years ago
- Status changed from New to Need More Info
Updated by Sage Weil about 3 years ago
- Status changed from Need More Info to Rejected
Yep, that fixed it!
Updated by Sage Weil about 3 years ago
- Status changed from Rejected to In Progress
Hmm, this is something we should probably deal with on upgrade (which is where I ran into it).
2021-03-16T13:42:51.356808+0000 mgr.reesi004.tplfrt [ERR] cephadm exited with an error code: 1, stderr:Reconfig daemon prometheus.reesi001 ...
Non-zero exit code 1 from systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001
systemctl: stderr Job for ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "<stdin>", line 7874, in <module>
  File "<stdin>", line 7863, in main
  File "<stdin>", line 1720, in _default_image
  File "<stdin>", line 4150, in command_deploy
  File "<stdin>", line 2586, in deploy_daemon
  File "<stdin>", line 1414, in call_throws
RuntimeError: Failed command: systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1101, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1029, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Reconfig daemon prometheus.reesi001 ...
Non-zero exit code 1 from systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001
systemctl: stderr Job for ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "<stdin>", line 7874, in <module>
  File "<stdin>", line 7863, in main
  File "<stdin>", line 1720, in _default_image
  File "<stdin>", line 4150, in command_deploy
  File "<stdin>", line 2586, in deploy_daemon
  File "<stdin>", line 1414, in call_throws
RuntimeError: Failed command: systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001
This is on the lab cluster.
Updated by Sage Weil about 3 years ago
- Status changed from In Progress to Resolved
This is a matter of redeploying to regenerate the unit.run files.
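As a rough illustration of what "regenerate" addresses, here is a minimal, hypothetical check (not cephadm code) for whether an existing unit.run still lacks the pre-start cleanup and therefore needs a redeploy:

```python
def needs_redeploy(unit_run_text: str) -> bool:
    """Sketch: affected files start the container directly; fixed files first
    run 'docker rm -f ceph-<fsid>-<daemon>' (prefixed with '!') to clear any
    stale container of the same name."""
    lines = unit_run_text.splitlines()
    has_run = any("docker run" in ln or "podman run" in ln for ln in lines)
    has_rm = any("rm -f ceph-" in ln for ln in lines)
    return has_run and not has_rm
```

Run against the two files quoted in the description, this flags the node-exporter file but not the mon file.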