Bug #41735
pg_autoscaler throws HEALTH_WARN with auto_scale on for all pools
% Done: 0%
Backport: nautilus
Regression: No
Severity: 3 - minor
Pull request ID: 30352
Description
Existing pools have pg_autoscale_mode set to on, yet ceph health still reports HEALTH_WARN for too few PGs per OSD (20 < min 30).
sh-4.2# ceph osd pool autoscale-status
POOL                                 SIZE     TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
rook-ceph-cephfilesystem-data0       0                     3.0   297.0G        0.0000                1.0   4                   on
.rgw.root                            0                     3.0   297.0G        0.0000                1.0   8                   on
rook-ceph-cephblockpool              576.9k                3.0   297.0G        0.0000                1.0   4                   on
rook-ceph-cephfilesystem-metadata    1549k                 3.0   297.0G        0.0000                1.0   4                   on

sh-4.2# ceph version
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)

sh-4.2# ceph pg ls
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
1.0 2 0 0 0 0 0 0 0 active+clean 90m 42'4 401:849 [0,2,1]p0 [0,2,1]p0 2019-09-09 23:25:58.767705 2019-09-06 23:04:48.764917
1.1 1 0 0 0 0 0 0 0 active+clean 91m 39'2 401:1180 [2,1,0]p2 [2,1,0]p2 2019-09-09 23:22:17.830001 2019-09-09 23:15:15.200786
1.2 4 0 0 0 36 345 27 0 active+clean 91m 42'8 401:6455 [0,1,2]p0 [0,1,2]p0 2019-09-09 23:25:20.698986 2019-09-09 23:25:20.698986
1.3 2 0 0 0 17 0 0 0 active+clean 91m 42'2 401:4809 [1,0,2]p1 [1,0,2]p1 2019-09-09 23:25:09.661832 2019-09-09 23:15:21.986925
2.0 9 0 0 0 1526 0 0 0 active+clean 90m 40'8 402:11312 [0,2,1]p0 [0,2,1]p0 2019-09-09 23:26:01.722927 2019-09-06 23:04:51.466968
2.1 4 0 0 0 160 0 0 0 active+clean 91m 40'6 401:462 [1,2,0]p1 [1,2,0]p1 2019-09-09 23:25:54.683444 2019-09-06 23:04:51.466968
2.2 2 0 0 0 0 0 0 0 active+clean 91m 42'4 401:464 [0,1,2]p0 [0,1,2]p0 2019-09-09 23:25:23.719986 2019-09-06 23:04:51.466968
2.3 7 0 0 0 600 13860 30 0 active+clean 91m 40'7 401:468 [0,1,2]p0 [0,1,2]p0 2019-09-09 23:24:06.789026 2019-09-09 23:24:06.789026
3.0 0 0 0 0 0 0 0 0 active+clean 91m 0'0 401:455 [2,0,1]p2 [2,0,1]p2 2019-09-09 23:15:33.106745 2019-09-09 23:15:33.106745
3.1 0 0 0 0 0 0 0 0 active+clean 91m 0'0 401:458 [1,0,2]p1 [1,0,2]p1 2019-09-09 23:15:36.899952 2019-09-06 23:04:55.747801
3.2 0 0 0 0 0 0 0 0 active+clean 91m 0'0 401:461 [1,2,0]p1 [1,2,0]p1 2019-09-09 23:21:59.619765 2019-09-09 23:21:59.619765
3.3 0 0 0 0 0 0 0 0 active+clean 91m 0'0 401:455 [0,1,2]p0 [0,1,2]p0 2019-09-09 23:15:58.340066 2019-09-06 23:04:55.747801
4.0 0 0 0 0 0 0 0 0 active+clean 17h 0'0 401:395 [1,0,2]p1 [1,0,2]p1 2019-09-09 07:08:50.651610 2019-09-06 23:09:43.671929
4.1 0 0 0 0 0 0 0 0 active+clean 16h 0'0 401:395 [2,0,1]p2 [2,0,1]p2 2019-09-09 08:47:42.989819 2019-09-06 23:09:43.671929
4.2 0 0 0 0 0 0 0 0 active+clean 15h 0'0 401:395 [2,0,1]p2 [2,0,1]p2 2019-09-09 09:02:26.803052 2019-09-09 09:02:26.803052
4.3 0 0 0 0 0 0 0 0 active+clean 7h 0'0 401:395 [2,0,1]p2 [2,0,1]p2 2019-09-09 16:57:00.042792 2019-09-09 16:57:00.042792
4.4 0 0 0 0 0 0 0 0 active+clean 14h 0'0 401:395 [1,2,0]p1 [1,2,0]p1 2019-09-09 10:45:22.251071 2019-09-06 23:09:43.671929
4.5 0 0 0 0 0 0 0 0 active+clean 9h 0'0 401:395 [0,2,1]p0 [0,2,1]p0 2019-09-09 14:57:47.728906 2019-09-06 23:09:43.671929
4.6 0 0 0 0 0 0 0 0 active+clean 4h 0'0 401:395 [2,1,0]p2 [2,1,0]p2 2019-09-09 20:05:52.983982 2019-09-06 23:09:43.671929
4.7 0 0 0 0 0 0 0 0 active+clean 12h 0'0 401:395 [1,2,0]p1 [1,2,0]p1 2019-09-09 12:14:59.441725 2019-09-06 23:09:43.671929

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
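A note on why the autoscaler leaves these pools alone: every pool shows RATIO 0.0000 and has no TARGET SIZE or TARGET RATIO set, so the autoscaler has no reason to raise PG_NUM above 4/8, which is what keeps the per-OSD PG count under the health threshold. A minimal sketch of how an operator could hint expected usage so the autoscaler scales a pool up (the 0.8 ratio below is purely illustrative, not taken from this cluster):

sh-4.2# ceph osd pool set rook-ceph-cephblockpool target_size_ratio 0.8   # hint: pool expected to consume ~80% of raw capacity
sh-4.2# ceph osd pool autoscale-status                                    # NEW PG_NUM should now show a higher target for that pool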
Related issues
Related to Bug #45135: nautilus: "too few PGs per OSD (2 < min 30) (TOO_FEW_PGS)" in smoke (all suites seem broken)
Copied to Backport #45231: nautilus: pg_autoscaler throws HEALTH_WARN with auto_scale on for all pools
History
#1 Updated by Sage Weil over 4 years ago
- Description updated (diff)
#2 Updated by Sage Weil over 4 years ago
- Status changed from New to Need More Info
Can you attach the 'ceph health detail' output so I can see which warning it's throwing?
#3 Updated by Sage Weil over 4 years ago
- Status changed from Need More Info to Fix Under Review
- Pull request ID set to 30352
Rook should probably set this option explicitly, since it is working with nautilus and we won't backport this (or the change that enables the pg_autoscaler by default).
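Assuming the option in question is mon_pg_warn_min_per_osd (the threshold behind the TOO_FEW_PGS warning, default 30 in Nautilus), a minimal sketch of how Rook or an operator could set it explicitly on a Nautilus cluster until the autoscaler grows the pools:

sh-4.2# ceph config set global mon_pg_warn_min_per_osd 0   # a value of 0 disables the too-few-PGs-per-OSD check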
#4 Updated by Vasu Kulkarni over 4 years ago
Sorry, I missed that.
sh-4.2# ceph health detail
HEALTH_WARN too few PGs per OSD (20 < min 30)
TOO_FEW_PGS too few PGs per OSD (20 < min 30)

sh-4.2# ceph -s
  cluster:
    id:     f7ad6fb6-05ad-4a32-9f2d-b9c75a8bfdc5
    health: HEALTH_WARN
            too few PGs per OSD (20 < min 30)

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 5d)
    mds: rook-ceph-cephfilesystem:1 {0=rook-ceph-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 5d), 3 in (since 5d)

  data:
    pools:   5 pools, 20 pgs
    objects: 31 objects, 2.3 KiB
    usage:   3.2 GiB used, 294 GiB / 297 GiB avail
    pgs:     20 active+clean

  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s wr

sh-4.2# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                          STATUS  REWEIGHT  PRI-AFF
-1         0.29008  root default
-7         0.09669      host example-deviceset-0-p7js6
 2    hdd  0.09669          osd.2                      up       1.00000  1.00000
-3         0.09669      host example-deviceset-1-vwqwk
 1    hdd  0.09669          osd.1                      up       1.00000  1.00000
-5         0.09669      host example-deviceset-2-2h9w8
 0    hdd  0.09669          osd.0                      up       1.00000  1.00000
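For reference, the arithmetic behind the warning, assuming the check divides the total number of PG replicas by the number of "in" OSDs and compares the result against mon_pg_warn_min_per_osd (default 30):

# 5 pools, 20 PGs total, replicated size 3, 3 OSDs in:
#   20 PGs x 3 replicas      = 60 PG instances
#   60 PG instances / 3 OSDs = 20 PGs per OSD
#   20 < min 30              => TOO_FEW_PGS -> HEALTH_WARN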
#5 Updated by Neha Ojha over 4 years ago
- Status changed from Fix Under Review to Resolved
#6 Updated by Lenz Grimmer almost 4 years ago
- Status changed from Resolved to Pending Backport
- Target version set to v15.0.0
- Backport set to nautilus
This change needs to be backported to Nautilus to fix a regression (#45135).
#7 Updated by Lenz Grimmer almost 4 years ago
- Related to Bug #45135: nautilus: "too few PGs per OSD (2 < min 30) (TOO_FEW_PGS)" in smoke (all suites seem broken) added
#8 Updated by Neha Ojha almost 4 years ago
nautilus backport: https://github.com/ceph/ceph/pull/34618
#9 Updated by Nathan Cutler almost 4 years ago
- Copied to Backport #45231: nautilus: pg_autoscaler throws HEALTH_WARN with auto_scale on for all pools added
#10 Updated by Nathan Cutler almost 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".