Bug #63735

open

Module 'pg_autoscaler' has failed: float division by zero

Added by Benjamin Mare 5 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Category: pg_autoscaler module
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: Yes
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Dear maintainer,

After the last upgrade from 17.2.6 to 17.2.7, my Ceph cluster has been stuck in HEALTH_ERR with the following message:

Module 'pg_autoscaler' has failed: float division by zero

I'm unable to stop the module to get a healthy cluster again. None of the following changes anything: setting the autoscaler to off on every pool (ceph osd pool set <pool-name> pg_autoscale_mode off), changing the default autoscale mode (ceph config set global osd_pool_default_pg_autoscale_mode off), or setting the global noautoscale flag (ceph osd pool set noautoscale).

When starting the mgr, I got the following error:

2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'pg_autoscaler' while running on mgr.sapsrvmon12.mnmszl: float division by zero
2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 pg_autoscaler.serve:
2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 325, in serve
    self._maybe_adjust()
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 691, in _maybe_adjust
    ps, root_map = self._get_pool_status(osdmap, pools)
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 653, in _get_pool_status
    pool_stats, ret, threshold, 'first', overlapped_roots)
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 610, in _get_pool_pg_targets
    'logical_used': float(actual_raw_used)/raw_used_rate,
ZeroDivisionError: float division by zero
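
The traceback points at a single expression, float(actual_raw_used)/raw_used_rate, which can only fail this way when raw_used_rate is 0. The sketch below reproduces that failure in isolation and shows one possible guard. The helper name logical_used and the 0.0 fallback are assumptions made for illustration; this is not the upstream fix, and the report does not establish why raw_used_rate ends up as zero for a pool.

```python
def logical_used(actual_raw_used: float, raw_used_rate: float) -> float:
    """Hypothetical helper mirroring the failing expression from the traceback.

    Returns 0.0 instead of raising when raw_used_rate is zero. The fallback
    value is an assumption chosen for illustration only.
    """
    if raw_used_rate == 0:
        return 0.0
    return float(actual_raw_used) / raw_used_rate


def unguarded(actual_raw_used: float, raw_used_rate: float) -> float:
    """The expression as it appears in the traceback, with no guard."""
    return float(actual_raw_used) / raw_used_rate


if __name__ == "__main__":
    # Normal case: both versions agree.
    print(logical_used(300.0, 3.0))

    # Zero rate: the unguarded version raises, the guarded one does not.
    try:
        unguarded(300.0, 0.0)
    except ZeroDivisionError as exc:
        print("unguarded raised:", exc)
    print(logical_used(300.0, 0.0))
```

A guard like this would only hide the symptom. Whatever produces a zero rate for a pool would still need to be identified.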

Every "get" command related to the autoscaler throws an error. For example:

$ ceph osd pool autoscale-status
Error EIO: Module 'pg_autoscaler' has experienced an error and cannot handle commands: float division by zero

And inside the logs:

2023-12-06T10:41:33.889+0000 7f1e9ff86700  0 log_channel(audit) log [DBG] : from='client.6798270353 -' entity='client.admin' cmd=[{"prefix": "osd pool autoscale-status", "target": ["mon-mgr", ""]}]: dispatch
2023-12-06T10:41:33.889+0000 7f1e90767700 -1 mgr.server reply reply (5) Input/output error Module 'pg_autoscaler' has experienced an error and cannot handle commands: float division by zero

Thanks for your help.
Benjamin

Actions #1

Updated by Ilya Dryomov 4 months ago

  • Target version deleted (v17.2.7)