Support #64570: prometheus couldnt start daemon - Orchestrator - Ceph


Added by ruben blanco 2 months ago.

Status: New
Priority: Normal
Category: cephadm/services
Affected Versions: Ceph - v18.2.1

Description

I need some help, and I don't know whether this is a bug.
I'm using Ceph v18.2.1 on 3 bare-metal nodes.
I deployed Ceph with cephadm.
Sometimes I need to redeploy containers such as prometheus, grafana, and alertmanager from the dashboard (Services section), because when the active mgr moves to another node I have to redefine the hosts for those services to get alerts, Prometheus, and Grafana working again.
One day I was moving the Prometheus service to another node, the one that had the active mgr. I don't know what happened, but the daemon process of the Prometheus service was deleted. I think it happened while Prometheus was being redeployed.
I looked into it, but all I can see is an infinite loop of attempts to redeploy the Prometheus daemon.
I deleted the service and recreated it as the documentation describes, with no results. Now all monitoring in the cluster is disabled, but every second I still get notifications that prometheus is not present on Xnode.

I looked into what could have happened and checked the log. All I can see is that Podman looks up the container by name to check its status (with podman container inspect), the container does not exist, and that produces the error.
So since the container is not running and no longer exists, I can't redeploy it from the dashboard.
I was thinking of redeploying the container manually with podman, but I'm asking here first so that I don't do something wrong.
I ran "podman ps -a" to list containers in any state, and only running containers are listed.
I also tried to start the service and the Prometheus daemon with "ceph orch", with the same error.
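The container check described above can be reproduced by hand. The following is a minimal sketch, not cephadm's own code: it builds the two container-name spellings that appear in the error log further down (one with a dash, one with a dot before the host name) and runs the same podman container inspect command against each. The helper names candidate_container_names and container_status are hypothetical, and the podman path is the one shown in the log.

```python
import subprocess
from typing import List, Optional


def candidate_container_names(fsid: str, daemon_type: str, daemon_id: str) -> List[str]:
    """Return both container-name spellings that appear in the error log."""
    return [
        f"ceph-{fsid}-{daemon_type}-{daemon_id}",  # dash-separated form
        f"ceph-{fsid}-{daemon_type}.{daemon_id}",  # dot-separated form
    ]


def container_status(name: str, podman: str = "/usr/bin/podman") -> Optional[str]:
    """Return the container's state string, or None if podman cannot find it."""
    try:
        result = subprocess.run(
            [podman, "container", "inspect", "--format", "{{.State.Status}}", name],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None  # podman binary not present at the given path
    if result.returncode != 0:
        return None  # e.g. exit code 125, "no such container"
    return result.stdout.strip()


if __name__ == "__main__":
    fsid = "b9721a76-bf7a-11ee-abaf-1402ec32409b"  # cluster fsid taken from the log
    for name in candidate_container_names(fsid, "prometheus", "cephtest02"):
        print(name, "->", container_status(name))
```

If both names print None, that matches what "podman ps -a" shows: the container simply does not exist on the node.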

I searched the Ceph tracker for something similar and found this, but I don't believe it is the same issue: https://tracker.ceph.com/issues/64491

This is my error log, in case it helps (copied from the error logs section of the dashboard):

2/21/24 10:02:21 PM [WRN]
Health check failed: Failed to place 1 daemon(s) (CEPHADM_DAEMON_PLACE_FAIL)

2/21/24 10:02:20 PM [ERR]
Failed while placing prometheus.cephtest02 on cephtest02: cephadm exited with an error code: 1, stderr:
Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-b9721a76-bf7a-11ee-abaf-1402ec32409b-prometheus-cephtest02
/usr/bin/podman: stderr Error: inspecting object: no such container ceph-b9721a76-bf7a-11ee-abaf-1402ec32409b-prometheus-cephtest02
Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-b9721a76-bf7a-11ee-abaf-1402ec32409b-prometheus.cephtest02
/usr/bin/podman: stderr Error: inspecting object: no such container ceph-b9721a76-bf7a-11ee-abaf-1402ec32409b-prometheus.cephtest02
Deploy daemon prometheus.cephtest02 ...
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 10700, in
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 10688, in main
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 6620, in command_deploy_from
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 6638, in _common_deploy
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 6689, in _dispatch_deploy
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 6520, in get_deployment_container
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 3689, in get_container
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 3013, in get_daemon_args
  File "/var/lib/ceph/b9721a76-bf7a-11ee-abaf-1402ec32409b/cephadm.8c89112927b45a1984d03fb02785df709234bdb856619c217e1ad5d54aebef2b/__main__.py", line 2291, in get_ip_addresses
  File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -2] Name or service not known
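
The traceback ends in socket.gaierror, raised by getaddrinfo when called from get_ip_addresses. That suggests a host name lookup failed on cephtest02 while the daemon's arguments were being built, so the podman inspect messages before it may not be the root cause. The log does not show which name was looked up. The sketch below assumes it is the node's own host name or FQDN and tests whether those resolve. resolve_host is a hypothetical helper, not part of cephadm.

```python
import socket
from typing import List


def resolve_host(name: str) -> List[str]:
    """Return the sorted, de-duplicated IP addresses for name, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(name, None, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Same exception type as in the traceback, e.g. [Errno -2] Name or service not known
        print(f"{name}: lookup failed: {exc}")
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Assumption: the failing lookup is for this node's own name
    for name in (socket.gethostname(), socket.getfqdn()):
        print(name, "->", resolve_host(name))
```

Run on the node where placement fails, an empty list for either name would point at /etc/hosts or DNS rather than at podman.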

I can enable the monitoring services again and run some tests if needed. If someone can help, it would be much appreciated.
