Bug #63735: Module 'pg_autoscaler' has failed: float division by zero - mgr - Ceph

Added by Benjamin Mare 5 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assignee: Kamoltat (Junior) Sirivadhna
Category: pg_autoscaler module
Target version:
% Done:
Source: Community (user)
Tags:
Backport:
Regression: Yes
Severity: 1 - critical
Reviewed:
Affected Versions: Ceph - v17.2.7
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Dear maintainer,

After the latest upgrade from 17.2.6 to 17.2.7, my Ceph cluster is stuck in HEALTH_ERR with the following message:

Module 'pg_autoscaler' has failed: float division by zero

I'm unable to stop the module to get a healthy cluster again. None of the following changes anything: setting the autoscaler to off on every pool (ceph osd pool set <pool-name> pg_autoscale_mode off), changing the default autoscale mode (ceph config set global osd_pool_default_pg_autoscale_mode off), or setting the global noautoscale flag (ceph osd pool set noautoscale).
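
For anyone trying to reproduce the attempts listed above, here is a minimal sketch that only builds the three command lines quoted in this report. The helper name `build_disable_commands` and the example pool names are hypothetical. The sketch does not run anything, and in this report none of these commands cleared the error.

```python
from typing import List


def build_disable_commands(pool_names: List[str]) -> List[List[str]]:
    """Return argv lists for the three attempts described in this report."""
    # Attempt 1: turn the autoscaler off on every pool.
    commands = [
        ["ceph", "osd", "pool", "set", name, "pg_autoscale_mode", "off"]
        for name in pool_names
    ]
    # Attempt 2: change the default autoscale mode for new pools.
    commands.append(
        ["ceph", "config", "set", "global",
         "osd_pool_default_pg_autoscale_mode", "off"]
    )
    # Attempt 3: set the global noautoscale flag.
    commands.append(["ceph", "osd", "pool", "set", "noautoscale"])
    return commands


if __name__ == "__main__":
    # "pool-a" and "pool-b" are placeholder pool names, not from the report.
    for argv in build_disable_commands(["pool-a", "pool-b"]):
        print(" ".join(argv))
```

The commands are kept as argv lists so they could be passed to `subprocess.run` unchanged if desired.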

When starting the mgr, I got the following error:

2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'pg_autoscaler' while running on mgr.sapsrvmon12.mnmszl: float division by zero
2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 pg_autoscaler.serve:
2023-12-06T10:39:34.470+0000 7f1e90f68700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 325, in serve
    self._maybe_adjust()
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 691, in _maybe_adjust
    ps, root_map = self._get_pool_status(osdmap, pools)
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 653, in _get_pool_status
    pool_stats, ret, threshold, 'first', overlapped_roots)
  File "/usr/share/ceph/mgr/pg_autoscaler/module.py", line 610, in _get_pool_pg_targets
    'logical_used': float(actual_raw_used)/raw_used_rate,
ZeroDivisionError: float division by zero
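
The last frame of the traceback divides by `raw_used_rate`, so the exception implies that this value was zero for at least one pool. Below is a minimal sketch of the failing expression next to a guarded variant. The function names and the choice to return 0.0 are assumptions for illustration only. This is not the module's actual code or an upstream fix.

```python
def logical_used(actual_raw_used: float, raw_used_rate: float) -> float:
    """Same expression as in the traceback: raises when the rate is 0."""
    return float(actual_raw_used) / raw_used_rate


def logical_used_guarded(actual_raw_used: float, raw_used_rate: float) -> float:
    """Hypothetical guard: return 0.0 instead of raising when the rate is 0."""
    if raw_used_rate == 0:
        return 0.0
    return float(actual_raw_used) / raw_used_rate


if __name__ == "__main__":
    try:
        logical_used(1024, 0.0)
    except ZeroDivisionError as exc:
        print(f"unguarded call raised: {exc}")
    print(f"guarded call returned: {logical_used_guarded(1024, 0.0)}")
```

Whether a zero rate should be reported as 0.0, skipped, or treated as a configuration error is a design decision for the maintainers. The sketch only shows where a guard would sit.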

Every "get" command related to the autoscaler throws an error. For example:

$ ceph osd pool autoscale-status
Error EIO: Module 'pg_autoscaler' has experienced an error and cannot handle commands: float division by zero

And inside the logs:

2023-12-06T10:41:33.889+0000 7f1e9ff86700  0 log_channel(audit) log [DBG] : from='client.6798270353 -' entity='client.admin' cmd=[{"prefix": "osd pool autoscale-status", "target": ["mon-mgr", ""]}]: dispatch
2023-12-06T10:41:33.889+0000 7f1e90767700 -1 mgr.server reply reply (5) Input/output error Module 'pg_autoscaler' has experienced an error and cannot handle commands: float division by zero

Thanks for your help.
Benjamin

Updated by Ilya Dryomov 4 months ago

Target version deleted (v17.2.7)
