Bug #51616


Updating node-exporter deployment progress stuck

Added by Tobias Fischer almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading to 16.2.5 via "ceph orch upgrade start --ceph-version 16.2.5", I get more and more instances of

    Updating node-exporter deployment (+9 -1 -> 9) (0s)
      [............................]

in ceph status.
Rebooting the active mgr clears the list, but it reappears after some time and starts growing again.
Removing all node-exporter daemons did not help.
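
A minimal sketch of commands that can help confirm whether the upgrade itself has finished and inspect the stuck progress events (standard mgr/orchestrator commands; "ceph progress clear" may not be available on every release):

    # confirm whether the upgrade itself has completed
    ceph orch upgrade status

    # dump the mgr progress module's view of the stuck events
    ceph progress
    ceph progress json

    # if supported by your release, drop all progress events;
    # they will be recreated if the underlying bug is still present
    ceph progress clear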


Related issues: 1 (0 open, 1 closed)

Related to Orchestrator - Bug #51961: Stuck progress indicators in ceph status output (Resolved, assignee: Cory Snyder)

Actions #1

Updated by Harry Coin almost 3 years ago

Confirmed.
  cluster:
    id:     4067126d-01cb-40af-824a-881c130140f8
    health: HEALTH_OK
            (muted: AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED(2w))

  services:
    mon: 5 daemons, quorum noc3,noc2,sysmon1,noc4,noc1 (age 2m)
    mgr: noc4.tvhgac(active, since 8h), standbys: noc1.zvkerj, noc2.nyefxi, noc3.afxrza
    mds: 5/5 daemons up, 3 standby
    osd: 18 osds: 18 up (since 16h), 18 in (since 23h)

  data:
    volumes: 5/5 healthy
    pools:   19 pools, 1809 pgs
    objects: 10.40M objects, 15 TiB
    usage:   40 TiB used, 28 TiB / 68 TiB avail
    pgs:     1809 active+clean

  io:
    client: 1.6 KiB/s rd, 1.3 MiB/s wr, 0 op/s rd, 34 op/s wr

  progress:
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s
    ....

And from the ceph users list:

...
2021-07-08T22:01:55.356953+0000 mgr.excalibur.kuumco [ERR] Failed to apply alertmanager spec AlertManagerSpec({'placement': PlacementSpec(count=1), 'service_type': 'alertmanager', 'service_id': None, 'unmanaged': False, 'preview_only': False, 'networks': [], 'config': None, 'user_data': {}, 'port': None}): name alertmanager.aladdin already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 582, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 743, in _apply_service
    rank_generation=slot.rank_generation,
  File "/usr/share/ceph/mgr/cephadm/module.py", line 613, in get_unique_name
    f'name {daemon_type}.{name} already in use')
orchestrator._interface.OrchestratorValidationError: name alertmanager.aladdin already in use
2021-07-08T22:01:55.372569+0000 mgr.excalibur.kuumco [ERR] Failed to apply node-exporter spec MonitoringSpec({'placement': PlacementSpec(host_pattern='*'), 'service_type': 'node-exporter', 'service_id': None, 'unmanaged': False, 'preview_only': False, 'networks': [], 'config': None, 'port': None}): name node-exporter.aladdin already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 582, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 743, in _apply_service
    rank_generation=slot.rank_generation,
  File "/usr/share/ceph/mgr/cephadm/module.py", line 613, in get_unique_name
    f'name {daemon_type}.{name} already in use')
orchestrator._interface.OrchestratorValidationError: name node-exporter.aladdin already in use

Also my 'ceph -s' output keeps getting longer and longer (currently 517 lines) with messages like these:

    Updating node-exporter deployment (+6 -6 -> 13) (0s)
      [............................]
    Updating alertmanager deployment (+1 -1 -> 1) (0s)
      [............................]

What's the best way to go about fixing this? I've tried using 'ceph orch daemon redeploy alertmanager.aladdin' and the same for node-exporter, but it doesn't seem to help.

Thanks,
Bryan
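
To see what the orchestrator currently believes is deployed under the conflicting names, a hedged starting point (only standard "ceph orch" commands, nothing specific to this bug) is:

    # list deployed daemons and look for duplicate or stale entries
    ceph orch ps | grep -E 'node-exporter|alertmanager'

    # export the service specs the orchestrator is trying to apply
    ceph orch ls --export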

Actions #2

Updated by Enrico Kern almost 3 years ago

I have exactly the same issue. In addition, the dashboard is not working anymore: the mgr is listening on port 8443, but no connection is possible.

 progress:
    Updating alertmanager deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating prometheus deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating grafana deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating prometheus deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating node-exporter deployment (+5 -5 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+5 -5 -> 5) (0s)
      [............................]
    Updating alertmanager deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating grafana deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating prometheus deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating node-exporter deployment (+5 -5 -> 5) (0s)
      [............................]
    Updating grafana deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating alertmanager deployment (+1 -1 -> 1) (0s)

# lsof -i :8443
lsof: no pwd entry for UID 167
COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
lsof: no pwd entry for UID 167
ceph-mgr 2788      167   49u  IPv4 100533      0t0  TCP stor01-dem:8443 (LISTEN)
# curl stor01-dem:8443
curl: (56) Recv failure: Connection reset by peer
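
To narrow down the dashboard problem independently of the stuck progress items, the usual first checks are (a sketch, not specific to this bug; on older releases "ceph mgr fail" needs the active mgr's name as an argument):

    # confirm the dashboard module is enabled and which URL it advertises
    ceph mgr module ls | grep -i dashboard
    ceph mgr services

    # failing over to a standby mgr often restores the dashboard
    ceph mgr fail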
Actions #3

Updated by Harry Coin almost 3 years ago

Workaround (caution: temporarily disruptive), assuming this is the only reported problem remaining after the upgrade otherwise completes (see the consolidated shell sketch after the steps):

1. ceph orch rm node-exporter

Wait 30+ seconds.

2. Stop all managers.

3. Start all managers.

4. ceph orch apply node-exporter '*'
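
The same steps as a shell sketch. The cluster fsid and mgr daemon names in the systemctl units are placeholders for your environment; with all mgrs stopped, "ceph orch" is unavailable, which is why the stop/start is done with systemctl on each mgr host:

    # 1. remove the node-exporter service and give the daemons time to go away
    ceph orch rm node-exporter
    sleep 30

    # 2./3. stop, then start, every mgr daemon on each host that runs one
    #       (<fsid> and <name> are placeholders for the cluster id and mgr name)
    systemctl stop  ceph-<fsid>@mgr.<name>.service
    systemctl start ceph-<fsid>@mgr.<name>.service

    # 4. redeploy node-exporter on all hosts
    ceph orch apply node-exporter '*'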

Actions #4

Updated by Cory Snyder over 2 years ago

  • Related to Bug #51961: Stuck progress indicators in ceph status output added
Actions #5

Updated by Burkhard Obergoeker over 2 years ago

Thanks, Harry, this workaround worked for me, though in my case some different daemons were stuck during the update process.
I had already started the same update again before I noticed this bug report, which got rid of the other messages except the one concerning the node-exporter.
Fortunately, that one could be healed by the workaround provided.

Best regards
Burkhard

Actions #6

Updated by Sebastian Wagner over 2 years ago

  • Status changed from New to Pending Backport
Actions #7

Updated by Sebastian Wagner over 2 years ago

  • Status changed from Pending Backport to Resolved