Bug #64938

Pool created with a single PG splits into many PGs on a single OSD, causing the OSD to hit max_pgs_per_osd

Added by Prashant D about 2 months ago. Updated 25 days ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With autoscale mode on, if a new pool is created without specifying pg_num/pgp_num, the pool is created with a single PG, and the pg_autoscaler scales it up later toward the target PG count. With the autoscaler off, the pool is created with 32 PGs (osd_pool_default_pg_num).

The problem with a new pool that has 1 PG is that the PG splits on a single OSD, because the pg_autoscaler does not scale the pool gradually. The OSD then hits the max_pgs_per_osd limit, causing PGs from other pools to get stuck in the activating/creating state.
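A back-of-the-envelope sketch (Python, not Ceph code) of why this concentrates PGs: child PGs created by a split stay on the parent PG's OSDs until pgp_num catches up, so a 1-PG pool split toward a large pg_num_min lands every child on the same handful of OSDs. The limits below are the stock Ceph defaults; the osd.27 log further down shows a configured limit of 1200 on this particular cluster.

```python
# Illustrative numbers only: child PGs from a split initially map to the
# parent PG's acting set, so they all pile onto the same OSDs.

mon_max_pg_per_osd = 250                        # Ceph default
hard_ratio = 3                                  # osd_max_pg_per_osd_hard_ratio default
hard_limit = mon_max_pg_per_osd * hard_ratio    # OSD withholds PG creation above this

parent_pg_num = 1
pg_num_min = 4096
children_per_parent_osd = pg_num_min // parent_pg_num

print(children_per_parent_osd, hard_limit)      # 4096 vs a hard limit of 750
assert children_per_parent_osd > hard_limit     # maybe_wait_for_max_pg kicks in
```

Once the limit is exceeded, the OSD logs `maybe_wait_for_max_pg withhold creation`, exactly as osd.27 does below.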

For example, a 4+2 EC pool created without specifying pg_num and pgp_num values:

# ceph osd pool create default.rgw.buckets.data erasure myprofile --pg_num_min 4096

# osdmaptool --print /tmp/osdmap.376 |grep default.rgw.buckets.data
osdmaptool: osdmap file '/tmp/osdmap.376'
pool 7 'default.rgw.buckets.data' erasure profile myprofile size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 376 flags hashpspool,creating stripe_width 16384 pg_num_min 4096

PG stuck in the activating state:

[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 6.8b is stuck inactive for 11h, current state activating+undersized+degraded+remapped, last acting [46,15]
[WRN] PG_DEGRADED: Degraded data redundancy: 14/12387 objects degraded (0.113%), 1 pg degraded, 1 pg undersized
    pg 6.8b is stuck undersized for 11h, current state activating+undersized+degraded+remapped, last acting [46,15]

OSD logs:

---- osd.16 ----
2024-02-29T02:15:56.562+0000 7f54ecbd7640  1 osd.16 pg_epoch: 664 pg[6.8b( empty local-lis/les=0/0 n=0 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=664) [16,15,27] r=0 lpr=664 pi=[489,664)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2024-02-29T02:15:57.569+0000 7f54ecbd7640  1 osd.16 pg_epoch: 665 pg[6.8b( empty local-lis/les=0/0 n=0 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=665) [16,15,27]/[46,15] r=-1 lpr=665 pi=[489,665)/1 crt=0'0 mlcod 0'0 remapped mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [16,15,27] -> [46,15], acting_primary 16 -> 46, up_primary 16 -> 16, role 0 -> -1, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T02:15:57.569+0000 7f54ecbd7640  1 osd.16 pg_epoch: 665 pg[6.8b( empty local-lis/les=0/0 n=0 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=665) [16,15,27]/[46,15] r=-1 lpr=665 pi=[489,665)/1 crt=0'0 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2024-02-29T13:58:10.245+0000 7f54ecbd7640  1 osd.16 pg_epoch: 681 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=0/0 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=681) [16,15,27]/[15] r=-1 lpr=681 pi=[489,681)/2 luod=0'0 crt=468'3081 mlcod 0'0 active+remapped m=7 mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [46,15] -> [15], acting_primary 46 -> 15, up_primary 16 -> 16, role -1 -> -1, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:10.245+0000 7f54ecbd7640  1 osd.16 pg_epoch: 681 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=0/0 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=681) [16,15,27]/[15] r=-1 lpr=681 pi=[489,681)/2 crt=468'3081 mlcod 0'0 remapped NOTIFY m=7 mbc={}] state<Start>: transitioning to Stray
2024-02-29T13:58:11.325+0000 7f54ecbd7640  1 osd.16 pg_epoch: 682 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=0/0 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=682) [16,15,27]/[16,15] r=0 lpr=682 pi=[489,682)/2 crt=468'3081 mlcod 0'0 remapped NOTIFY m=7 mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [15] -> [16,15], acting_primary 15 -> 16, up_primary 16 -> 16, role -1 -> 0, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:11.325+0000 7f54ecbd7640  1 osd.16 pg_epoch: 682 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=0/0 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=682) [16,15,27]/[16,15] r=0 lpr=682 pi=[489,682)/2 crt=468'3081 mlcod 0'0 remapped m=7 mbc={}] state<Start>: transitioning to Primary
2024-02-29T13:58:12.340+0000 7f54ecbd7640  1 osd.16 pg_epoch: 683 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=682/683 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=682) [16,15,27]/[16,15] async=[27] r=0 lpr=682 pi=[489,682)/2 crt=468'3081 mlcod 0'0 active+undersized+degraded+remapped m=7 mbc={255={(0+3)=7}}] state<Started/Primary/Active>: react AllReplicasActivated Activating complete
2024-02-29T13:58:13.336+0000 7f54ecbd7640  1 osd.16 pg_epoch: 684 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=682/683 n=13 ec=489/393 lis/c=682/489 les/c/f=683/490/0 sis=684 pruub=15.003605843s) [16,15,27] async=[27] r=0 lpr=684 pi=[489,684)/3 crt=468'3081 mlcod 468'3081 active pruub 42908.445312500s@ mbc={255={}}] start_peering_interval up [16,15,27] -> [16,15,27], acting [16,15] -> [16,15,27], acting_primary 16 -> 16, up_primary 16 -> 16, role 0 -> 0, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:13.336+0000 7f54ecbd7640  1 osd.16 pg_epoch: 684 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=682/683 n=13 ec=489/393 lis/c=682/489 les/c/f=683/490/0 sis=684 pruub=15.003605843s) [16,15,27] r=0 lpr=684 pi=[489,684)/3 crt=468'3081 mlcod 0'0 unknown pruub 42908.445312500s@ mbc={}] state<Start>: transitioning to Primary
2024-02-29T13:58:14.375+0000 7f54ecbd7640  1 osd.16 pg_epoch: 685 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=684/685 n=13 ec=489/393 lis/c=682/489 les/c/f=683/490/0 sis=684) [16,15,27] r=0 lpr=684 pi=[489,684)/3 crt=468'3081 mlcod 0'0 active mbc={}] state<Started/Primary/Active>: react AllReplicasActivated Activating complete
---- osd.15 ----
2024-02-29T02:15:58.569+0000 7fbeca99b640  1 osd.15 pg_epoch: 665 pg[6.8b( empty local-lis/les=0/0 n=0 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=665) [16,15,27]/[46,15] r=1 lpr=665 pi=[489,665)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Stray
2024-02-29T13:58:10.243+0000 7fbeca99b640  1 osd.15 pg_epoch: 681 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=681) [16,15,27]/[15] r=0 lpr=681 pi=[489,681)/2 luod=0'0 crt=468'3081 mlcod 0'0 active+remapped m=7 mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [46,15] -> [15], acting_primary 46 -> 15, up_primary 16 -> 16, role 1 -> 0, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:10.243+0000 7fbeca99b640  1 osd.15 pg_epoch: 681 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=681) [16,15,27]/[15] r=0 lpr=681 pi=[489,681)/2 crt=468'3081 mlcod 0'0 remapped m=7 mbc={}] state<Start>: transitioning to Primary
2024-02-29T13:58:11.320+0000 7fbeca99b640  1 osd.15 pg_epoch: 682 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=682) [16,15,27]/[16,15] r=1 lpr=682 pi=[489,682)/2 crt=468'3081 mlcod 0'0 remapped m=7 mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [15] -> [16,15], acting_primary 15 -> 16, up_primary 16 -> 16, role 0 -> 1, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:11.320+0000 7fbeca99b640  1 osd.15 pg_epoch: 682 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=682) [16,15,27]/[16,15] r=1 lpr=682 pi=[489,682)/2 crt=468'3081 mlcod 0'0 remapped NOTIFY m=7 mbc={}] state<Start>: transitioning to Stray
2024-02-29T13:58:13.333+0000 7fbeca99b640  1 osd.15 pg_epoch: 684 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=682/683 n=13 ec=489/393 lis/c=682/489 les/c/f=683/490/0 sis=684 pruub=15.000221252s) [16,15,27] r=1 lpr=684 pi=[489,684)/3 luod=0'0 crt=468'3081 mlcod 0'0 active pruub 42911.820312500s@ mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [16,15] -> [16,15,27], acting_primary 16 -> 16, up_primary 16 -> 16, role 1 -> 1, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:13.333+0000 7fbeca99b640  1 osd.15 pg_epoch: 684 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=682/683 n=13 ec=489/393 lis/c=682/489 les/c/f=683/490/0 sis=684 pruub=15.000137329s) [16,15,27] r=1 lpr=684 pi=[489,684)/3 crt=468'3081 mlcod 0'0 unknown NOTIFY pruub 42911.820312500s@ mbc={}] state<Start>: transitioning to Stray
---- osd.27 ----
2024-02-29T02:15:58.585+0000 7f49fbf1f640  1 osd.27 666 maybe_wait_for_max_pg withhold creation of pg 6.8b: 1217 >= 1200
2024-02-29T13:58:12.340+0000 7f49fbf1f640  1 osd.27 pg_epoch: 682 pg[6.8b( empty local-lis/les=0/0 n=0 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=682) [16,15,27]/[16,15] r=-1 lpr=682 pi=[489,682)/2 crt=0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Stray
2024-02-29T13:58:13.341+0000 7f49fbf1f640  1 osd.27 pg_epoch: 684 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=0/0 n=13 ec=489/393 lis/c=682/489 les/c/f=683/490/0 sis=684) [16,15,27] r=2 lpr=684 pi=[489,684)/3 luod=0'0 crt=468'3081 mlcod 0'0 active mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [16,15] -> [16,15,27], acting_primary 16 -> 16, up_primary 16 -> 16, role -1 -> 2, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:13.341+0000 7f49fbf1f640  1 osd.27 pg_epoch: 684 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=0/0 n=13 ec=489/393 lis/c=682/489 les/c/f=683/490/0 sis=684) [16,15,27] r=2 lpr=684 pi=[489,684)/3 crt=468'3081 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
---- osd.46 ----
2024-02-29T02:08:22.415+0000 7fa0a3c76640  1 osd.46 pg_epoch: 490 pg[6.8b( v 468'3081 lc 0'0 (0'0,468'3081] local-lis/les=486/487 n=13 ec=489/393 lis/c=486/486 les/c/f=487/487/0 sis=489) [135,169,46] r=2 lpr=489 pi=[486,489)/1 crt=468'3081 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2024-02-29T02:15:56.554+0000 7fa0a3c76640  1 osd.46 pg_epoch: 664 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=489/490 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=664 pruub=9.879022598s) [16,15,27] r=-1 lpr=664 pi=[489,664)/1 luod=0'0 crt=468'3081 lcod 0'0 mlcod 0'0 active pruub 695.198425293s@ mbc={}] start_peering_interval up [135,169,46] -> [16,15,27], acting [135,169,46] -> [16,15,27], acting_primary 135 -> 16, up_primary 135 -> 16, role 2 -> -1, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T02:15:56.555+0000 7fa0a3c76640  1 osd.46 pg_epoch: 664 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=489/490 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=664 pruub=9.878875732s) [16,15,27] r=-1 lpr=664 pi=[489,664)/1 crt=468'3081 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 695.198425293s@ mbc={}] state<Start>: transitioning to Stray
2024-02-29T02:15:57.561+0000 7fa0a3c76640  1 osd.46 pg_epoch: 665 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=489/490 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=665) [16,15,27]/[46,15] r=0 lpr=665 pi=[489,665)/1 crt=468'3081 lcod 0'0 mlcod 0'0 remapped NOTIFY mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [16,15,27] -> [46,15], acting_primary 16 -> 46, up_primary 16 -> 16, role -1 -> 0, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T02:15:57.561+0000 7fa0a3c76640  1 osd.46 pg_epoch: 665 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=489/490 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=665) [16,15,27]/[46,15] r=0 lpr=665 pi=[489,665)/1 crt=468'3081 lcod 0'0 mlcod 0'0 remapped mbc={}] state<Start>: transitioning to Primary
2024-02-29T13:58:10.322+0000 7fa0a3c76640  1 osd.46 pg_epoch: 681 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=681) [16,15,27]/[15] async=[16,27] r=-1 lpr=681 pi=[489,681)/2 crt=468'3081 lcod 0'0 mlcod 0'0 remapped MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB mbc={255={(0+3)=7}}] start_peering_interval up [16,15,27] -> [16,15,27], acting [46,15] -> [15], acting_primary 46 -> 15, up_primary 16 -> 16, role 0 -> -1, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:10.323+0000 7fa0a3c76640  1 osd.46 pg_epoch: 681 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=681) [16,15,27]/[15] r=-1 lpr=681 pi=[489,681)/2 crt=468'3081 lcod 0'0 mlcod 0'0 remapped NOTIFY MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB mbc={}] state<Start>: transitioning to Stray
2024-02-29T13:58:11.322+0000 7fa0a3c76640  1 osd.46 pg_epoch: 682 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=682) [16,15,27]/[16,15] r=-1 lpr=682 pi=[489,682)/2 crt=468'3081 lcod 0'0 mlcod 0'0 remapped NOTIFY MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [15] -> [16,15], acting_primary 15 -> 16, up_primary 16 -> 16, role -1 -> -1, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:11.322+0000 7fa0a3c76640  1 osd.46 pg_epoch: 682 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=682) [16,15,27]/[16,15] r=-1 lpr=682 pi=[489,682)/2 crt=468'3081 lcod 0'0 mlcod 0'0 remapped NOTIFY MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB mbc={}] state<Start>: transitioning to Stray
2024-02-29T13:58:13.333+0000 7fa0a3c76640  1 osd.46 pg_epoch: 684 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=684) [16,15,27] r=-1 lpr=684 pi=[489,684)/3 luod=0'0 crt=468'3081 lcod 0'0 mlcod 0'0 active MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB mbc={}] start_peering_interval up [16,15,27] -> [16,15,27], acting [16,15] -> [16,15,27], acting_primary 16 -> 16, up_primary 16 -> 16, role -1 -> -1, features acting 4540138322906710015 upacting 4540138322906710015
2024-02-29T13:58:13.333+0000 7fa0a3c76640  1 osd.46 pg_epoch: 684 pg[6.8b( v 468'3081 (0'0,468'3081] local-lis/les=665/666 n=13 ec=489/393 lis/c=489/489 les/c/f=490/490/0 sis=684) [16,15,27] r=-1 lpr=684 pi=[489,684)/3 crt=468'3081 lcod 0'0 mlcod 0'0 unknown NOTIFY MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB mbc={}] state<Start>: transitioning to Stray

The workaround for this issue is either to bump up mon_max_pg_per_osd or to specify pg_num/pgp_num values greater than 1 at pool-creation time.
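To put the bump-up workaround in context, a small illustrative helper (not part of Ceph; the 200-OSD cluster size below is a hypothetical example) shows that the steady-state PG count per OSD is usually well under the default limit — it is only the transient concentration during splits that trips it, which is why either raising the limit or avoiding a 1-PG starting point works:

```python
# Illustrative helper (not part of Ceph): average PG replicas per OSD once
# placement has spread out, for a pool with a given target pg_num and width.
import math

def pgs_per_osd(pg_num: int, pool_size: int, num_osds: int) -> int:
    """Average PG replicas each OSD carries after splits and remaps settle."""
    return math.ceil(pg_num * pool_size / num_osds)

# Example: the 4+2 EC pool from the description (size 6, pg_num_min 4096)
# on a hypothetical 200-OSD cluster.
print(pgs_per_osd(4096, 6, 200))   # well under the default 250 per OSD
```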

Actions #1

Updated by Prashant D about 2 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56205
Actions #2

Updated by Radoslaw Zarzynski about 2 months ago

Reviewed.

Actions #3

Updated by Prashant D about 1 month ago

In this scenario the pg_autoscaler adjusts pg_num through _maybe_adjust based on pg_num_min. The DaemonServer then increases pg_num in steps of std::min(left, mgr_max_pg_num_change - pg_gap); mgr_max_pg_num_change defaults to 128 PGs.

void DaemonServer::adjust_pgs()
...
          } else if (p.get_pg_num_target() > p.get_pg_num()) {
            // pg_num increase (split)
            bool active = true;
            auto q = pg_map.num_pg_by_pool_state.find(i.first);
            if (q != pg_map.num_pg_by_pool_state.end()) {
              for (auto& j : q->second) {
                if ((j.first & (PG_STATE_ACTIVE|PG_STATE_PEERED)) == 0) {
                  dout(20) << "pool " << i.first << " has " << j.second
                           << " pgs in " << pg_state_string(j.first)
                           << dendl;
                  active = false;
                  break;
                }
              }
            } else {
              active = false;
            }
            unsigned pg_gap = p.get_pg_num() - p.get_pgp_num();
            unsigned max_jump = cct->_conf->mgr_max_pg_num_change;
            if (!active) {
              dout(10) << "pool " << i.first
                       << " pg_num_target " << p.get_pg_num_target()
                       << " pg_num " << p.get_pg_num()
                       << " - not all pgs active" 
                       << dendl;
            } else if (pg_gap >= max_jump) {
              dout(10) << "pool " << i.first
                       << " pg_num " << p.get_pg_num()
                       << " - pgp_num " << p.get_pgp_num()
                       << " gap >= max_pg_num_change " << max_jump
                       << " - must scale pgp_num first" 
                       << dendl;
            } else {
              unsigned add = std::min(
                std::min(left, max_jump - pg_gap),
                p.get_pg_num_target() - p.get_pg_num());
              unsigned target = p.get_pg_num() + add;
              left -= add;
              dout(10) << "pool " << i.first
                       << " pg_num_target " << p.get_pg_num_target()
                       << " pg_num " << p.get_pg_num()
                       << " -> " << target << dendl;
              pg_num_to_set[osdmap.get_pool_name(i.first)] = target;
            }
          }
        }
...

We should consider lowering mgr_max_pg_num_change to 32 PGs to avoid splitting a single PG (when a pool is created with 1 PG) in jumps of up to 128 PGs.
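A minimal model (Python, not Ceph code) of the stepping behaviour in adjust_pgs makes the proposal concrete. It assumes all PGs go active and pgp_num fully catches up between iterations (so pg_gap is 0), and that `left` is not the binding constraint:

```python
# Simplified model of DaemonServer::adjust_pgs stepping, assuming pg_gap = 0
# and `left` large, so add = min(max_jump, target - pg_num) each iteration.

def split_steps(pg_num: int, target: int, max_jump: int) -> list[int]:
    steps = []
    while pg_num < target:
        add = min(max_jump, target - pg_num)
        pg_num += add
        steps.append(pg_num)
    return steps

default_steps = split_steps(1, 4096, 128)    # mgr_max_pg_num_change = 128
proposed_steps = split_steps(1, 4096, 32)    # proposed lower cap of 32

print(default_steps[0], len(default_steps))     # first jump 1 -> 129, 32 steps
print(proposed_steps[0], len(proposed_steps))   # first jump 1 -> 33, 128 steps
```

With the default cap, the very first adjustment multiplies a 1-PG pool by 129 in one shot; a cap of 32 still converges but takes far gentler jumps.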

Actions #4

Updated by Radoslaw Zarzynski 25 days ago

Bump up. Prashant, let's talk about it.
