Bug #41735

pg_autoscaler throws HEALTH_WARN with auto_scale on for all pools

Added by Vasu Kulkarni about 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

Existing pools all have autoscaling on, yet ceph health still reports HEALTH_WARN for too few PGs per OSD (20 < min 30).

sh-4.2# ceph osd pool autoscale-status
 POOL                                 SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
 rook-ceph-cephfilesystem-data0         0                 3.0        297.0G  0.0000                 1.0       4              on        
 .rgw.root                              0                 3.0        297.0G  0.0000                 1.0       8              on        
 rook-ceph-cephblockpool            576.9k                3.0        297.0G  0.0000                 1.0       4              on        
 rook-ceph-cephfilesystem-metadata   1549k                3.0        297.0G  0.0000                 1.0       4              on        

sh-4.2# ceph version
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)

sh-4.2# ceph pg ls
PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED  UP        ACTING    SCRUB_STAMP                DEEP_SCRUB_STAMP           
1.0       2        0         0       0     0           0          0   0 active+clean   90m    42'4   401:849 [0,2,1]p0 [0,2,1]p0 2019-09-09 23:25:58.767705 2019-09-06 23:04:48.764917 
1.1       1        0         0       0     0           0          0   0 active+clean   91m    39'2  401:1180 [2,1,0]p2 [2,1,0]p2 2019-09-09 23:22:17.830001 2019-09-09 23:15:15.200786 
1.2       4        0         0       0    36         345         27   0 active+clean   91m    42'8  401:6455 [0,1,2]p0 [0,1,2]p0 2019-09-09 23:25:20.698986 2019-09-09 23:25:20.698986 
1.3       2        0         0       0    17           0          0   0 active+clean   91m    42'2  401:4809 [1,0,2]p1 [1,0,2]p1 2019-09-09 23:25:09.661832 2019-09-09 23:15:21.986925 
2.0       9        0         0       0  1526           0          0   0 active+clean   90m    40'8 402:11312 [0,2,1]p0 [0,2,1]p0 2019-09-09 23:26:01.722927 2019-09-06 23:04:51.466968 
2.1       4        0         0       0   160           0          0   0 active+clean   91m    40'6   401:462 [1,2,0]p1 [1,2,0]p1 2019-09-09 23:25:54.683444 2019-09-06 23:04:51.466968 
2.2       2        0         0       0     0           0          0   0 active+clean   91m    42'4   401:464 [0,1,2]p0 [0,1,2]p0 2019-09-09 23:25:23.719986 2019-09-06 23:04:51.466968 
2.3       7        0         0       0   600       13860         30   0 active+clean   91m    40'7   401:468 [0,1,2]p0 [0,1,2]p0 2019-09-09 23:24:06.789026 2019-09-09 23:24:06.789026 
3.0       0        0         0       0     0           0          0   0 active+clean   91m     0'0   401:455 [2,0,1]p2 [2,0,1]p2 2019-09-09 23:15:33.106745 2019-09-09 23:15:33.106745 
3.1       0        0         0       0     0           0          0   0 active+clean   91m     0'0   401:458 [1,0,2]p1 [1,0,2]p1 2019-09-09 23:15:36.899952 2019-09-06 23:04:55.747801 
3.2       0        0         0       0     0           0          0   0 active+clean   91m     0'0   401:461 [1,2,0]p1 [1,2,0]p1 2019-09-09 23:21:59.619765 2019-09-09 23:21:59.619765 
3.3       0        0         0       0     0           0          0   0 active+clean   91m     0'0   401:455 [0,1,2]p0 [0,1,2]p0 2019-09-09 23:15:58.340066 2019-09-06 23:04:55.747801 
4.0       0        0         0       0     0           0          0   0 active+clean   17h     0'0   401:395 [1,0,2]p1 [1,0,2]p1 2019-09-09 07:08:50.651610 2019-09-06 23:09:43.671929 
4.1       0        0         0       0     0           0          0   0 active+clean   16h     0'0   401:395 [2,0,1]p2 [2,0,1]p2 2019-09-09 08:47:42.989819 2019-09-06 23:09:43.671929 
4.2       0        0         0       0     0           0          0   0 active+clean   15h     0'0   401:395 [2,0,1]p2 [2,0,1]p2 2019-09-09 09:02:26.803052 2019-09-09 09:02:26.803052 
4.3       0        0         0       0     0           0          0   0 active+clean    7h     0'0   401:395 [2,0,1]p2 [2,0,1]p2 2019-09-09 16:57:00.042792 2019-09-09 16:57:00.042792 
4.4       0        0         0       0     0           0          0   0 active+clean   14h     0'0   401:395 [1,2,0]p1 [1,2,0]p1 2019-09-09 10:45:22.251071 2019-09-06 23:09:43.671929 
4.5       0        0         0       0     0           0          0   0 active+clean    9h     0'0   401:395 [0,2,1]p0 [0,2,1]p0 2019-09-09 14:57:47.728906 2019-09-06 23:09:43.671929 
4.6       0        0         0       0     0           0          0   0 active+clean    4h     0'0   401:395 [2,1,0]p2 [2,1,0]p2 2019-09-09 20:05:52.983982 2019-09-06 23:09:43.671929 
4.7       0        0         0       0     0           0          0   0 active+clean   12h     0'0   401:395 [1,2,0]p1 [1,2,0]p1 2019-09-09 12:14:59.441725 2019-09-06 23:09:43.671929 

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.

Related issues

Related to Ceph - Bug #45135: nautilus: "too few PGs per OSD (2 < min 30) (TOO_FEW_PGS)" in smoke (all suites seem broken) Resolved
Copied to RADOS - Backport #45231: nautilus: pg_autoscaler throws HEALTH_WARN with auto_scale on for all pools Resolved

History

#1 Updated by Sage Weil about 1 year ago

  • Description updated (diff)

#2 Updated by Sage Weil about 1 year ago

  • Status changed from New to Need More Info

Can you attach the 'ceph health detail' output so I can see which warning it's throwing?

#3 Updated by Sage Weil about 1 year ago

  • Status changed from Need More Info to Fix Under Review
  • Pull request ID set to 30352

Rook should probably set this option explicitly, since it is working with nautilus and we won't backport this (or the change that enables the pg_autoscaler by default).

#4 Updated by Vasu Kulkarni about 1 year ago

Sorry, I missed that.


sh-4.2# ceph health detail
HEALTH_WARN too few PGs per OSD (20 < min 30)
TOO_FEW_PGS too few PGs per OSD (20 < min 30)

sh-4.2# ceph -s
  cluster:
    id:     f7ad6fb6-05ad-4a32-9f2d-b9c75a8bfdc5
    health: HEALTH_WARN
            too few PGs per OSD (20 < min 30)

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 5d)
    mds: rook-ceph-cephfilesystem:1 {0=rook-ceph-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 5d), 3 in (since 5d)

  data:
    pools:   5 pools, 20 pgs
    objects: 31 objects, 2.3 KiB
    usage:   3.2 GiB used, 294 GiB / 297 GiB avail
    pgs:     20 active+clean

  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s wr

sh-4.2# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                          STATUS REWEIGHT PRI-AFF 
-1       0.29008 root default                                               
-7       0.09669     host example-deviceset-0-p7js6                         
 2   hdd 0.09669         osd.2                          up  1.00000 1.00000 
-3       0.09669     host example-deviceset-1-vwqwk                         
 1   hdd 0.09669         osd.1                          up  1.00000 1.00000 
-5       0.09669     host example-deviceset-2-2h9w8                         
 0   hdd 0.09669         osd.0                          up  1.00000 1.00000 
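
For reference, the 20 in the warning can be reproduced from the output in this report. This is only a sketch: it assumes the check averages PG replicas (the size of each PG's up set) over the OSDs that are in, which matches the numbers here but is not confirmed anywhere in this issue. The function name and variables are illustrative, not Ceph code.

```python
def average_pgs_per_osd(pg_counts, replicas, num_in_osds):
    """Average number of PG replicas each OSD holds (assumed formula)."""
    total_pg_replicas = sum(pg_counts.values()) * replicas
    return total_pg_replicas // num_in_osds


# PGs per pool id, counted from the 'ceph pg ls' listing in the description.
pg_counts = {1: 4, 2: 4, 3: 4, 4: 8}
replicas = 3      # every up set in 'ceph pg ls' lists three OSDs
num_in_osds = 3   # 'ceph -s' reports 3 osds: 3 up, 3 in
warn_min = 30     # the "min 30" shown in 'ceph health detail'

average = average_pgs_per_osd(pg_counts, replicas, num_in_osds)
if average < warn_min:
    print(f"too few PGs per OSD ({average} < min {warn_min})")
```

With 20 PGs at three replicas spread over three OSDs, each OSD holds 20 PG replicas, so a cluster this small stays under the threshold regardless of whether the autoscaler is on.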

#6 Updated by Lenz Grimmer 8 months ago

  • Status changed from Resolved to Pending Backport
  • Target version set to v15.0.0
  • Backport set to nautilus

This change needs to be backported into Nautilus to fix a regression (#45135)

#7 Updated by Lenz Grimmer 8 months ago

  • Related to Bug #45135: nautilus: "too few PGs per OSD (2 < min 30) (TOO_FEW_PGS)" in smoke (all suites seem broken) added

#8 Updated by Neha Ojha 8 months ago

#9 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #45231: nautilus: pg_autoscaler throws HEALTH_WARN with auto_scale on for all pools added

#10 Updated by Nathan Cutler 7 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
