Bug #54267: Progress module keeps reporting Unhandled exception and put the cluster into an error state. - Ceph - Ceph

Bug #54267

open

Progress module keeps reporting Unhandled exception and put the cluster into an error state.

Added by Daniel Persson over 2 years ago. Updated over 2 years ago.

Status:

New

Priority:

Normal

Assignee:

Category:

common

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

v16.2.7

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Hi.

We are currently doing a lot of recovery and moving of misplaced data, since we have removed machines and OSDs from the cluster and added new ones. During this process, the progress module keeps raising an error and puts the cluster into an error state.

HEALTH_ERROR : MGR_MODULE_ERROR: Module 'progress' has failed: ('9f1a152c-2f09-4d01-bd81-95c0e2873d9d',)

I looked around in the logs and found this segment around when the error occurred.

--------------------------------------------
2022-02-13T01:08:31.848+0100 7ff4fb582700 1 log_channel(cluster) log [ERR] : Unhandled exception from module 'progress' while running on mgr.ceph-mgr3: ('9f1a152c-2f09-4d01-bd81-95c0e2873d9d',)
2022-02-13T01:08:31.852+0100 7ff4fb582700 -1 progress.serve:
2022-02-13T01:08:31.852+0100 7ff4fb582700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/progress/module.py", line 716, in serve
    self._process_pg_summary()
  File "/usr/share/ceph/mgr/progress/module.py", line 629, in _process_pg_summary
    ev = self._events[ev_id]
KeyError: '9f1a152c-2f09-4d01-bd81-95c0e2873d9d'
-------------------------------------------

Best regards
Daniel

Updated by Daniel Persson over 2 years ago

Seems to be resolved by https://github.com/ceph/ceph/commit/b70d4a9caae0eb859e10b68f93573d507625d267
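
For anyone reading the traceback: the failing statement is a plain dictionary lookup, `self._events[ev_id]`, and the odd-looking `('9f1a152c-...',)` in the health message looks like the argument tuple of the resulting `KeyError`. Below is a minimal sketch of that failure mode, assuming the event id was collected earlier and the event was removed before the lookup ran. The `events` dict and both helper functions are hypothetical stand-ins, not the progress module's code, and the defensive variant is only one possible way to tolerate the missing key, not necessarily what the linked commit does.

```python
# Hypothetical stand-in for the module's event table; this is NOT Ceph code.
events = {"ev-a": {"progress": 0.4}, "ev-b": {"progress": 0.9}}


def process_unsafe(event_ids, events):
    """Index the dict directly, like `self._events[ev_id]` in the traceback."""
    return [events[ev_id]["progress"] for ev_id in event_ids]


def process_defensive(event_ids, events):
    """Skip ids whose event disappeared between the snapshot and the lookup."""
    seen = []
    for ev_id in event_ids:
        ev = events.get(ev_id)  # returns None instead of raising KeyError
        if ev is None:
            continue
        seen.append(ev["progress"])
    return seen


snapshot = list(events)  # ids collected at an earlier point in time
del events["ev-a"]       # assumption: the event was removed in the meantime

try:
    process_unsafe(snapshot, events)
    unsafe_error = None
except KeyError as exc:
    unsafe_error = exc.args  # a one-element tuple holding the missing id

defensive_result = process_defensive(snapshot, events)
print(unsafe_error, defensive_result)
```

The direct lookup fails on the stale id, while the `.get()` variant simply skips it and continues with the remaining events.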
