Bug #49674 (closed): monitoring unit.run files don't remove container first

Added by Sage Weil about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The ceph daemon unit.run files look like this:

set -e
/usr/bin/install -d -m0770 -o 167 -g 167 /var/run/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb
# mon.eutow
! /usr/bin/docker rm -f ceph-11a72f1d-1752-43ba-bc6a-84b856e3c1bb-mon.eutow
/usr/bin/docker run --rm --net=host --ipc=host --privileged --group-add=disk --name ceph-11a72f1d-1752-43ba-bc6a-84b856e3c1bb-mon.eutow -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.6 -e NODE_NAME=eutow -v /var/run/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb:/var/run/ceph:z -v /var/log/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb:/var/log/ceph:z -v /var/lib/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb/crash:/var/lib/ceph/crash:z -v /var/lib/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb/mon.eutow:/var/lib/ceph/mon/ceph-eutow:z -v /var/lib/ceph/11a72f1d-1752-43ba-bc6a-84b856e3c1bb/mon.eutow/config:/etc/ceph/ceph.conf:z -v /dev:/dev -v /run/udev:/run/udev --entrypoint /usr/bin/ceph-mon docker.io/ceph/ceph:v15.2.6 -n mon.eutow -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix="debug " --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true

Notably, there is the docker rm -f ... line.

The monitoring unit.run files, however, look like this:

/usr/bin/docker run --rm --net=host --user 65534 --name ceph-11a72f1d-1752-43ba-bc6a-84b856e3c1bb-node-exporter.eutow -e CONTAINER_IMAGE=prom/node-exporter -e NODE_NAME=eutow -v /proc:/host/proc:ro -v /sys:/host/sys:ro -v /:/rootfs:ro prom/node-exporter --no-collector.timex

which can be problematic if the unit doesn't shut down cleanly for whatever reason: the previous container is left behind under the same name, and the next docker run fails because that name is already in use.
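
For comparison, a minimal sketch of what the node-exporter unit.run could look like with the same cleanup step the ceph daemons get. This is illustrative only, mirroring the pattern above; the exact content cephadm generates may differ:

# node-exporter.eutow
! /usr/bin/docker rm -f ceph-11a72f1d-1752-43ba-bc6a-84b856e3c1bb-node-exporter.eutow
/usr/bin/docker run --rm --net=host --user 65534 --name ceph-11a72f1d-1752-43ba-bc6a-84b856e3c1bb-node-exporter.eutow -e CONTAINER_IMAGE=prom/node-exporter -e NODE_NAME=eutow -v /proc:/host/proc:ro -v /sys:/host/sys:ro -v /:/rootfs:ro prom/node-exporter --no-collector.timex

The leading ! mirrors the ceph daemon files: it keeps a failing rm (nothing to remove) from aborting the script when set -e is in effect.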

#1

Updated by Sebastian Wagner about 3 years ago

When was the node-exporter deployed?

ceph orch redeploy node-exporter

might do the trick!
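
If the rest of the monitoring stack was deployed by the same cephadm version, its unit.run files presumably lack the rm line as well, so the other monitoring daemons may need the same treatment. A sketch, assuming the default monitoring service names:

ceph orch redeploy prometheus
ceph orch redeploy alertmanager
ceph orch redeploy grafana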

#2

Updated by Sebastian Wagner about 3 years ago

  • Status changed from New to Need More Info
#3

Updated by Sage Weil about 3 years ago

  • Status changed from Need More Info to Rejected

yep, that fixed it!

#4

Updated by Sage Weil about 3 years ago

  • Status changed from Rejected to In Progress

hmm, this is something we should probably deal with on upgrade (which is where I ran into it).

2021-03-16T13:42:51.356808+0000 mgr.reesi004.tplfrt [ERR] cephadm exited with an error code: 1, stderr:Reconfig daemon prometheus.reesi001 ...
Non-zero exit code 1 from systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001
systemctl: stderr Job for ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "<stdin>", line 7874, in <module>
  File "<stdin>", line 7863, in main
  File "<stdin>", line 1720, in _default_image
  File "<stdin>", line 4150, in command_deploy
  File "<stdin>", line 2586, in deploy_daemon
  File "<stdin>", line 1414, in call_throws
RuntimeError: Failed command: systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1101, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1029, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Reconfig daemon prometheus.reesi001 ...
Non-zero exit code 1 from systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001
systemctl: stderr Job for ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "<stdin>", line 7874, in <module>
  File "<stdin>", line 7863, in main
  File "<stdin>", line 1720, in _default_image
  File "<stdin>", line 4150, in command_deploy
  File "<stdin>", line 2586, in deploy_daemon
  File "<stdin>", line 1414, in call_throws
RuntimeError: Failed command: systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001

This is on the lab cluster.
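
For reference, a sketch of the manual cleanup that should unblock a unit stuck like this, assuming the restart fails because a leftover container with the daemon's name still exists (fsid and daemon name taken from the log above; the container name is assumed to follow the ceph-<fsid>-<daemon> pattern used in the unit.run files):

# remove the stale container left over from the previous run
/usr/bin/docker rm -f ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a-prometheus.reesi001
# retry the restart that failed above
systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@prometheus.reesi001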

#5

Updated by Sage Weil about 3 years ago

  • Status changed from In Progress to Resolved

This is a matter of redeploying the affected daemons to regenerate their unit.run files.
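
A quick way to confirm the regenerated files now include the cleanup step, assuming the standard cephadm layout of /var/lib/ceph/<fsid>/<daemon>/unit.run:

# each regenerated unit.run should now remove any stale container before docker run
grep 'docker rm -f' /var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/*/unit.run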
