Bug #61570: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount
Status:
Resolved
Priority:
Normal
Assignee:
Category:
pg_autoscaler module
Target version:
-
% Done:
100%
Source:
Q/A
Tags:
backport_processed
Backport:
reef, quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Description
This looks like a possible bug in the pg_autoscaler logic: the autoscaler warns that a pool has too many PGs even when the pool has exactly the right amount.
/a/yuriw-2023-05-30_20:25:48-rados-wip-yuri5-testing-2023-05-30-0828-quincy-distro-default-smithi/7290292
2023-05-31T05:16:23.742 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.813 INFO:teuthology.orchestra.run.smithi148.stdout:2023-05-31T05:02:56.604176+0000 mon.a (mon.0) 2048 : cluster [WRN] Health check failed: 1 pools have too many placement groups (POOL_TOO_MANY_PGS)
2023-05-31T05:16:23.813 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-05-31T05:16:23.814 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.886 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.956 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:24.036 INFO:teuthology.orchestra.run.smithi148.stdout:2023-05-31T05:02:56.604176+0000 mon.a (mon.0) 2048 : cluster [WRN] Health check failed: 1 pools have too many placement groups (POOL_TOO_MANY_PGS)
From the mon log:
2023-05-31T05:02:57.616+0000 7f63836ea700 20 mon.a@0(leader).mgrstat health checks:
{
  "POOL_TOO_MANY_PGS": {
    "severity": "HEALTH_WARN",
    "summary": {
      "message": "1 pools have too many placement groups",
      "count": 1
    },
    "detail": [
      {
        "message": "Pool modewarn has 32 placement groups, should have 32"
      }
    ]
  }
}
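The health detail above is self-contradictory ("has 32 placement groups, should have 32"), which suggests the warning condition fires on equality rather than strict excess. A minimal Python sketch of that suspected comparison bug, with a hypothetical check function and pool fields that are illustrative only and not Ceph's actual pg_autoscaler code:

```python
# Hypothetical illustration of the suspected bug: a health check comparing
# with >= instead of > would flag a pool whose pg_num already equals the
# autoscaler's target, yielding "has 32 placement groups, should have 32".
# Function name and dict keys are assumptions, not Ceph's real API.

def check_too_many_pgs(pools):
    """Return POOL_TOO_MANY_PGS-style messages for pools with excess PGs."""
    warnings = []
    for pool in pools:
        # Buggy form would be: pool["pg_num"] >= pool["target"]
        # The strict '>' below does not warn when pg_num equals the target.
        if pool["pg_num"] > pool["target"]:
            warnings.append(
                f"Pool {pool['name']} has {pool['pg_num']} placement groups, "
                f"should have {pool['target']}"
            )
    return warnings

pools = [{"name": "modewarn", "pg_num": 32, "target": 32}]
print(check_too_many_pgs(pools))  # [] -- no warning at exactly the target
```

With the strict comparison, the pool from the log (32 PGs, target 32) produces no warning; only a pool genuinely above its target would trigger POOL_TOO_MANY_PGS.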