Bug #61570 (closed): pg_autoscaler warns that a pool has too many pgs when it has the exact right amount

Added by Laura Flores 11 months ago. Updated 28 days ago.

Status: Resolved
Priority: Normal
Category: pg_autoscaler module
Target version: -
% Done: 100%
Source: Q/A
Tags: backport_processed
Backport: reef, quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Looks like a possible bug in the autoscaler logic. The autoscaler warns that a pool has too many pgs when it has the exact right amount.

/a/yuriw-2023-05-30_20:25:48-rados-wip-yuri5-testing-2023-05-30-0828-quincy-distro-default-smithi/7290292

2023-05-31T05:16:23.742 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.813 INFO:teuthology.orchestra.run.smithi148.stdout:2023-05-31T05:02:56.604176+0000 mon.a (mon.0) 2048 : cluster [WRN] Health check failed: 1 pools have too many placement groups (POOL_TOO_MANY_PGS)
2023-05-31T05:16:23.813 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-05-31T05:16:23.814 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.886 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:23.956 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v 'but it is still running' | egrep -v 'had wrong client addr' | egrep -v 'had wrong cluster addr' | egrep -v 'must scrub before tier agent can activate' | egrep -v 'failsafe engaged, dropping updates' | egrep -v 'failsafe disengaged, no longer dropping updates' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(SMALLER_PG_NUM\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(CACHE_POOL_NO_HIT_SET\)' | egrep -v '\(CACHE_POOL_NEAR_FULL\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(POOL_BACKFILLFULL\)' | egrep -v '\(POOL_FULL\)' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(POOL_NEARFULL\)' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(AUTH_BAD_CAPS\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v '\(MON_DOWN\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-05-31T05:16:24.036 INFO:teuthology.orchestra.run.smithi148.stdout:2023-05-31T05:02:56.604176+0000 mon.a (mon.0) 2048 : cluster [WRN] Health check failed: 1 pools have too many placement groups (POOL_TOO_MANY_PGS)

From the mon log:

2023-05-31T05:02:57.616+0000 7f63836ea700 20 mon.a@0(leader).mgrstat health checks:
{
    "POOL_TOO_MANY_PGS": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "1 pools have too many placement groups",
            "count": 1
        },
        "detail": [
            {
                "message": "Pool modewarn has 32 placement groups, should have 32" 
            }
        ]
    }
}


Related issues (3 total: 0 open, 3 closed)

Copied to mgr - Backport #62985: reef: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount (Resolved, Kamoltat (Junior) Sirivadhna)
Copied to mgr - Backport #62986: pacific: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount (Rejected, Kamoltat (Junior) Sirivadhna)
Copied to mgr - Backport #62987: quincy: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount (Resolved, Kamoltat (Junior) Sirivadhna)
Actions #1

Updated by Laura Flores 11 months ago

Current logic:

            if p['pg_autoscale_mode'] == 'warn':
                msg = 'Pool %s has %d placement groups, should have %d' % (
                    p['pool_name'],
                    p['pg_num_target'],
                    p['pg_num_final'])
                if p['pg_num_final'] > p['pg_num_target']:
                    too_few.append(msg)
                else:
                    too_many.append(msg)

Maybe this logic is correct, but it's worth investigating: the else branch also catches the case where pg_num_final == pg_num_target, so a pool whose pg count exactly matches the recommendation ends up in too_many.
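
For illustration, a minimal sketch of how the comparison could skip the equality case (the p, too_few, and too_many names come from the snippet above; this is only a sketch, not necessarily how the merged fix handles it):

            # Sketch only: emit no warning when the pool already has the
            # recommended number of PGs; otherwise classify it as before.
            if p['pg_autoscale_mode'] == 'warn':
                if p['pg_num_final'] != p['pg_num_target']:
                    msg = 'Pool %s has %d placement groups, should have %d' % (
                        p['pool_name'],
                        p['pg_num_target'],
                        p['pg_num_final'])
                    if p['pg_num_final'] > p['pg_num_target']:
                        too_few.append(msg)
                    else:
                        too_many.append(msg)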

Actions #2

Updated by Kamoltat (Junior) Sirivadhna 11 months ago

  • Pull request ID set to 51923
Actions #3

Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from New to Fix Under Review
Actions #4

Updated by Aishwarya Mathuria 9 months ago

/a/yuriw-2023-07-28_14:25:29-rados-wip-yuri7-testing-2023-07-27-1336-quincy-distro-default-smithi/7355501

Actions #5

Updated by Laura Flores 9 months ago

/a/yuriw-2023-07-26_15:58:31-rados-wip-yuri8-testing-2023-07-24-0819-quincy-distro-default-smithi/7353602

Actions #6

Updated by Jan-Philipp Litza 8 months ago

This was introduced while fixing #58894, when the conditional on would_adjust was moved below the block for pg_autoscale_mode == 'warn'.

Even with the proposed change merged, the health check is now much more sensitive than the output of ceph osd pool autoscale-status, since it triggers immediately rather than only after the threshold is exceeded (see the "NEW PG_NUM" paragraph in the placement groups docs).
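
For context, a rough illustration of the threshold behaviour mentioned above (the function and variable names here are illustrative, not the module's own; the pg_autoscaler threshold option defaults to 3.0):

    # Rough sketch: autoscale-status only proposes a NEW PG_NUM when the
    # current and ideal pg counts differ by at least the configured
    # threshold, whereas the health check fires on any mismatch.
    def exceeds_threshold(pg_num_current, pg_num_ideal, threshold=3.0):
        if pg_num_current <= 0 or pg_num_ideal <= 0:
            return False
        ratio = max(pg_num_current / pg_num_ideal,
                    pg_num_ideal / pg_num_current)
        return ratio >= threshold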

Actions #7

Updated by Dan van der Ster 8 months ago

Seeing this in 16.2.14 now.

Actions #8

Updated by Kamoltat (Junior) Sirivadhna 7 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from quincy to reef, quincy, pacific
Actions #9

Updated by Backport Bot 7 months ago

  • Copied to Backport #62985: reef: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount added
Actions #10

Updated by Backport Bot 7 months ago

  • Copied to Backport #62986: pacific: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount added
Actions #11

Updated by Backport Bot 7 months ago

  • Copied to Backport #62987: quincy: pg_autoscaler warns that a pool has too many pgs when it has the exact right amount added
Actions #12

Updated by Backport Bot 7 months ago

  • Tags set to backport_processed
Actions #14

Updated by Konstantin Shalygin 28 days ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100