Bug #63735
Module 'pg_autoscaler' has failed: float division by zero
Status: New
Priority: Normal
Assignee:
Category: pg_autoscaler module
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: Yes
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Dear maintainer,
After the latest upgrade from 17.2.6 to 17.2.7, my Ceph cluster is stuck in HEALTH_ERR with the following message:
Module 'pg_autoscaler' has failed: float division by zero
I'm unable to stop the module to get a healthy cluster again. Setting the autoscaler to off on every pool (ceph osd pool set <pool-name> pg_autoscale_mode off), changing the default autoscale mode (ceph config set global osd_pool_default_pg_autoscale_mode off), and setting the global noautoscale flag (ceph osd pool set noautoscale) all change nothing.
When starting the mgr, I got the following error:
2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'pg_autoscaler' while running on mgr.sapsrvmon12.mnmszl: float division by zero
2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 pg_autoscaler.serve:
2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 325, in serve
    self._maybe_adjust()
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 691, in _maybe_adjust
    ps, root_map = self._get_pool_status(osdmap, pools)
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 653, in _get_pool_status
    pool_stats, ret, threshold, 'first', overlapped_roots)
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 610, in _get_pool_pg_targets
    'logical_used': float(actual_raw_used)/raw_used_rate,
ZeroDivisionError: float division by zero
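For illustration only: the last frame of the traceback divides actual_raw_used by raw_used_rate, so the exception means raw_used_rate evaluated to 0 for at least one pool. The sketch below is not the Ceph source. The function names are hypothetical, and returning 0.0 for a zero rate is an assumed fallback, not a confirmed fix. It only reproduces the failing expression next to a guarded variant.

```python
def logical_used_unguarded(actual_raw_used, raw_used_rate):
    """Mirror of the expression in the traceback (hypothetical wrapper)."""
    return float(actual_raw_used) / raw_used_rate


def logical_used_guarded(actual_raw_used, raw_used_rate):
    """Same computation, but tolerate a zero rate.

    Assumption: reporting 0.0 is an acceptable placeholder when the
    rate is 0; the real module may need different handling.
    """
    if raw_used_rate == 0:
        return 0.0
    return float(actual_raw_used) / raw_used_rate


if __name__ == "__main__":
    # A normal pool: raw usage 300 with a rate of 3 gives 100.0.
    print(logical_used_guarded(300, 3))
    # A pool whose rate is 0 no longer raises.
    print(logical_used_guarded(300, 0))
    try:
        logical_used_unguarded(300, 0)
    except ZeroDivisionError as exc:
        # Same message as in the mgr log above.
        print(exc)
```

With the unguarded form, a single pool with a zero rate raises the exception, which matches the mgr log above.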
Every "get" command related to the autoscaler throws an error. For example:
$ ceph osd pool autoscale-status
Error EIO: Module 'pg_autoscaler' has experienced an error and cannot handle commands: float division by zero
And inside the logs:
2023-12-06T10:41:33.889+0000 7f1e9ff86700 0 log_channel(audit) log [DBG] : from='client.6798270353 -' entity='client.admin' cmd=[{"prefix": "osd pool autoscale-status", "target": ["mon-mgr", ""]}]: dispatch
2023-12-06T10:41:33.889+0000 7f1e90767700 -1 mgr.server reply reply (5) Input/output error Module 'pg_autoscaler' has experienced an error and cannot handle commands: float division by zero
Thanks for your help.
Benjamin