Bug #61570

pg_autoscaler warns that a pool has too many pgs when it has the exact right amount

Added by Laura Flores 11 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Category: pg_autoscaler module
Target version: -
% Done: 100%
Source: Q/A
Tags: backport_processed
Backport: reef, quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Looks like a possible bug in the autoscaler logic: the autoscaler raises POOL_TOO_MANY_PGS for a pool even though its pg_num exactly matches the number of placement groups the autoscaler itself says the pool should have.

/a/yuriw-2023-05-30_20:25:48-rados-wip-yuri5-testing-2023-05-30-0828-quincy-distro-default-smithi/7290292

2023-05-31T05:16:23.742 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.813 INFO:teuthology.orchestra.run.smithi148.stdout:2023-05-31T05:02:56.604176+0000 mon.a (mon.0) 2048 : cluster [WRN] Health check failed: 1 pools have too many placement groups (POOL_TOO_MANY_PGS)
2023-05-31T05:16:23.813 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-05-31T05:16:23.814 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.886 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.956 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:24.036 INFO:teuthology.orchestra.run.smithi148.stdout:2023-05-31T05:02:56.604176+0000 mon.a (mon.0) 2048 : cluster [WRN] Health check failed: 1 pools have too many placement groups (POOL_TOO_MANY_PGS)

From the mon log:

2023-05-31T05:02:57.616+0000 7f63836ea700 20 mon.a@0(leader).mgrstat health checks:
{
    "POOL_TOO_MANY_PGS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "1 pools have too many placement groups",
            "count": 1
        },
        "detail": [
            {
                "message": "Pool modewarn has 32 placement groups, should have 32" 
            }
        ]
    }
}
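
The detail message is self-contradictory ("has 32 placement groups, should have 32"), which suggests the condition that raises the warning is evaluated against a different quantity than the one printed in the message. A minimal Python sketch of that failure pattern follows; it is purely illustrative and not the module's actual code, and the names pg_num, pg_num_target and final_pg_target are assumptions used only for the example.

# Hypothetical illustration (NOT the actual pg_autoscaler code): shows how a
# POOL_TOO_MANY_PGS detail like the one above can be produced when the warning
# condition compares a different quantity (e.g. a stale pg_num_target) than
# the one printed in the message (pg_num).

def too_many_pgs_detail(pool_name, pg_num, pg_num_target, final_pg_target):
    # Suspect pattern: the check uses pg_num_target, but the message prints
    # pg_num, so the two can disagree and the warning looks self-contradictory.
    if pg_num_target > final_pg_target:
        return 'Pool %s has %d placement groups, should have %d' % (
            pool_name, pg_num, final_pg_target)
    return None

# pg_num already equals the computed target (32), but a stale pg_num_target
# of 64 still trips the check and reproduces the contradictory message:
print(too_many_pgs_detail('modewarn', 32, 64, 32))
# -> Pool modewarn has 32 placement groups, should have 32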


Related issues: 3 (0 open, 3 closed)

Copied to mgr - Backport #62985: reef: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount [Resolved, assigned to Kamoltat (Junior) Sirivadhna]
Copied to mgr - Backport #62986: pacific: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount [Rejected, assigned to Kamoltat (Junior) Sirivadhna]
Copied to mgr - Backport #62987: quincy: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount [Resolved, assigned to Kamoltat (Junior) Sirivadhna]