Bug #51616


Updating node-exporter deployment progress stuck

Added by Tobias Fischer almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading to 16.2.5 via "ceph orch upgrade start --ceph-version 16.2.5", I get more and more instances of

    Updating node-exporter deployment (+9 -1 -> 9) (0s)
      [............................]

in ceph status.
Rebooting the active mgr clears the list, but it reappears after some time and starts growing again.
Removing all node-exporter daemons did not help.
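
A minimal sketch of commands that can help confirm whether the upgrade itself has finished and inspect the stuck progress events (standard mgr/orchestrator commands; "ceph progress clear" may not be available on every release):

    # confirm whether the upgrade itself has completed
    ceph orch upgrade status

    # dump the mgr progress module's view of the stuck events
    ceph progress
    ceph progress json

    # if supported by your release, drop all progress events;
    # they will be recreated if the underlying bug is still present
    ceph progress clear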


Related issues: 1 (0 open, 1 closed)

Related to Orchestrator - Bug #51961: Stuck progress indicators in ceph status output (Resolved, assignee: Cory Snyder)

Actions #1

Updated by Harry Coin almost 3 years ago

Confirmed.
  cluster:
    id:     4067126d-01cb-40af-824a-881c130140f8
    health: HEALTH_OK
            (muted: AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED(2w))

  services:
    mon: 5 daemons, quorum noc3,noc2,sysmon1,noc4,noc1 (age 2m)
    mgr: noc4.tvhgac(active, since 8h), standbys: noc1.zvkerj, noc2.nyefxi, noc3.afxrza
    mds: 5/5 daemons up, 3 standby
    osd: 18 osds: 18 up (since 16h), 18 in (since 23h)

  data:
    volumes: 5/5 healthy
    pools:   19 pools, 1809 pgs
    objects: 10.40M objects, 15 TiB
    usage:   40 TiB used, 28 TiB / 68 TiB avail
    pgs:     1809 active+clean

  io:
    client: 1.6 KiB/s rd, 1.3 MiB/s wr, 0 op/s rd, 34 op/s wr

  progress:
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+4 -4 -> 5) (0s
    ....

And from the ceph users list:

...
2021-07-08T22:01:55.356953+0000 mgr.excalibur.kuumco [ERR] Failed to apply alertmanager spec AlertManagerSpec({'placement': PlacementSpec(count=1), 'service_type': 'alertmanager', 'service_id': None, 'unmanaged': False, 'preview_only': False, 'networks': [], 'config': None, 'user_data': {}, 'port': None}): name alertmanager.aladdin already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 582, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 743, in _apply_service
    rank_generation=slot.rank_generation,
  File "/usr/share/ceph/mgr/cephadm/module.py", line 613, in get_unique_name
    f'name {daemon_type}.{name} already in use')
orchestrator._interface.OrchestratorValidationError: name alertmanager.aladdin already in use
2021-07-08T22:01:55.372569+0000 mgr.excalibur.kuumco [ERR] Failed to apply node-exporter spec MonitoringSpec({'placement': PlacementSpec(host_pattern='*'), 'service_type': 'node-exporter', 'service_id': None, 'unmanaged': False, 'preview_only': False, 'networks': [], 'config': None, 'port': None}): name node-exporter.aladdin already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 582, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 743, in _apply_service
    rank_generation=slot.rank_generation,
  File "/usr/share/ceph/mgr/cephadm/module.py", line 613, in get_unique_name
    f'name {daemon_type}.{name} already in use')
orchestrator._interface.OrchestratorValidationError: name node-exporter.aladdin already in use

Also my 'ceph -s' output keeps getting longer and longer (currently 517 lines) with messages like these:

    Updating node-exporter deployment (+6 -6 -> 13) (0s)
      [............................]
    Updating alertmanager deployment (+1 -1 -> 1) (0s)
      [............................]

What's the best way to go about fixing this? I've tried using 'ceph orch daemon redeploy alertmanager.aladdin' and the same for node-exporter, but it doesn't seem to help.

Thanks,
Bryan
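
To see what the orchestrator currently believes is deployed under the conflicting names, a hedged starting point (only standard "ceph orch" commands, nothing specific to this bug) is:

    # list deployed daemons and look for duplicate or stale entries
    ceph orch ps | grep -E 'node-exporter|alertmanager'

    # export the service specs the orchestrator is trying to apply
    ceph orch ls --export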

Actions #2

Updated by Enrico Kern almost 3 years ago

I have exactly the same issue. In addition, the dashboard is not working anymore: the mgr is listening on port 8443, but no connection is possible.

 progress:
    Updating alertmanager deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating prometheus deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating grafana deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating prometheus deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating node-exporter deployment (+5 -5 -> 5) (0s)
      [............................]
    Updating node-exporter deployment (+5 -5 -> 5) (0s)
      [............................]
    Updating alertmanager deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating grafana deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating prometheus deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating node-exporter deployment (+5 -5 -> 5) (0s)
      [............................]
    Updating grafana deployment (+1 -1 -> 1) (0s)
      [............................]
    Updating alertmanager deployment (+1 -1 -> 1) (0s)

# lsof -i :8443
lsof: no pwd entry for UID 167
COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
lsof: no pwd entry for UID 167
ceph-mgr 2788      167   49u  IPv4 100533      0t0  TCP stor01-dem:8443 (LISTEN)
# curl stor01-dem:8443
curl: (56) Recv failure: Connection reset by peer
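
To narrow down the dashboard problem independently of the stuck progress items, the usual first checks are (a sketch, not specific to this bug; on older releases "ceph mgr fail" needs the active mgr's name as an argument):

    # confirm the dashboard module is enabled and which URL it advertises
    ceph mgr module ls | grep -i dashboard
    ceph mgr services

    # failing over to a standby mgr often restores the dashboard
    ceph mgr fail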
Actions #3

Updated by Harry Coin almost 3 years ago

Workaround (caution: temporarily disruptive), assuming this is the only reported problem remaining after the upgrade otherwise completes (see the consolidated shell sketch after the steps):

1. ceph orch rm node-exporter

Wait 30+ seconds.

2. Stop all managers.

3. Start all managers.

4. ceph orch apply node-exporter '*'
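
The same steps as a shell sketch. The cluster fsid and mgr daemon names in the systemctl units are placeholders for your environment; with all mgrs stopped, "ceph orch" is unavailable, which is why the stop/start is done with systemctl on each mgr host:

    # 1. remove the node-exporter service and give the daemons time to go away
    ceph orch rm node-exporter
    sleep 30

    # 2./3. stop, then start, every mgr daemon on each host that runs one
    #       (<fsid> and <name> are placeholders for the cluster id and mgr name)
    systemctl stop  ceph-<fsid>@mgr.<name>.service
    systemctl start ceph-<fsid>@mgr.<name>.service

    # 4. redeploy node-exporter on all hosts
    ceph orch apply node-exporter '*'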

Actions #4

Updated by Cory Snyder over 2 years ago

  • Related to Bug #51961: Stuck progress indicators in ceph status output added
Actions #5

Updated by Burkhard Obergoeker over 2 years ago

Thanks, Harry, this workaround worked for me, though in my case some different daemons were stuck during the update process.
I had already started the same update again before I noticed this bug report, which got rid of the other messages except the one concerning the node-exporter.
Fortunately, that one could be healed by the workaround provided.

Best regards
Burkhard

Actions #6

Updated by Sebastian Wagner over 2 years ago

  • Status changed from New to Pending Backport
Actions #7

Updated by Sebastian Wagner over 2 years ago

  • Status changed from Pending Backport to Resolved