Bug #54267

Progress module keeps reporting an unhandled exception and puts the cluster into an error state.

Added by Daniel Persson over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: common
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi.

We are currently recovering and moving a lot of misplaced data, as we have removed machines and OSDs from the cluster and added new ones. During this process, the progress module repeatedly reports an error and puts the cluster into an error state.

HEALTH_ERROR : MGR_MODULE_ERROR: Module 'progress' has failed: ('9f1a152c-2f09-4d01-bd81-95c0e2873d9d',)

I looked through the logs and found this segment from around the time the error occurred.

--------------------------------------------
2022-02-13T01:08:31.848+0100 7ff4fb582700 1 log_channel(cluster) log [ERR] : Unhandled exception from module 'progress' while running on mgr.ceph-mgr3: ('9f1a152c-2f09-4d01-bd81-95c0e2873d9d',)
2022-02-13T01:08:31.852+0100 7ff4fb582700 -1 progress.serve:
2022-02-13T01:08:31.852+0100 7ff4fb582700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/progress/module.py", line 716, in serve
    self._process_pg_summary()
  File "/usr/share/ceph/mgr/progress/module.py", line 629, in _process_pg_summary
    ev = self._events[ev_id]
KeyError: '9f1a152c-2f09-4d01-bd81-95c0e2873d9d'
-------------------------------------------

Best regards
Daniel
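
For anyone reading the traceback above: the failing statement is a plain dictionary lookup, `self._events[ev_id]`, so the KeyError means the event ID was no longer in the dictionary when it was looked up. The tuple in the health message looks like the `args` of that exception. Below is a minimal sketch of one way this kind of failure can arise, and of a tolerant lookup that skips missing IDs. It is not the Ceph code and not the change made in any upstream commit. `Event`, `ProgressSketch`, `process_strict` and `process_tolerant` are hypothetical names made up for illustration, and the assumption that the ID list was collected before the event was removed is only a guess at the cause.

```python
import uuid
from typing import Dict, List


class Event:
    """Hypothetical stand-in for a progress event (not the Ceph class)."""

    def __init__(self, message: str) -> None:
        self.id = str(uuid.uuid4())
        self.message = message
        self.progress = 0.0


class ProgressSketch:
    """Hypothetical tracker that keeps events in a dict keyed by event ID."""

    def __init__(self) -> None:
        self._events: Dict[str, Event] = {}

    def add(self, message: str) -> str:
        ev = Event(message)
        self._events[ev.id] = ev
        return ev.id

    def complete(self, ev_id: str) -> None:
        # Completing an event removes it from the dict.
        self._events.pop(ev_id, None)

    def process_strict(self, ev_ids: List[str]) -> None:
        # Same lookup shape as in the traceback: raises KeyError
        # if an ID in ev_ids has been removed in the meantime.
        for ev_id in ev_ids:
            ev = self._events[ev_id]
            ev.progress = 1.0

    def process_tolerant(self, ev_ids: List[str]) -> List[str]:
        # Skip IDs that are gone instead of raising; return the skipped IDs.
        skipped = []
        for ev_id in ev_ids:
            ev = self._events.get(ev_id)
            if ev is None:
                skipped.append(ev_id)
                continue
            ev.progress = 1.0
        return skipped


if __name__ == "__main__":
    tracker = ProgressSketch()
    first = tracker.add("rebalancing after OSD removal")
    second = tracker.add("rebalancing after OSD addition")
    stale_ids = [first, second]   # ID list collected earlier
    tracker.complete(first)       # event removed before processing

    try:
        tracker.process_strict(stale_ids)
    except KeyError as exc:
        print("strict lookup failed:", exc.args)

    print("tolerant lookup skipped:", tracker.process_tolerant(stale_ids))
```

The tolerant variant trades a hard failure for silently skipping stale IDs, which may or may not be the right behaviour for a real module. It is shown only to make the failure mode concrete.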

Actions #1

Updated by Daniel Persson over 2 years ago

Seems to be resolved by https://github.com/ceph/ceph/commit/b70d4a9caae0eb859e10b68f93573d507625d267
