Bug #58366
Crimson: unable to initialize pool for rbd due to inactive pgs
Status: Closed
Description
While testing crimson-osd on the DSAL lab cluster, it was observed that after the OSDs were brought up and an OSD pool was created, 31 pgs remained inactive and eventually progressed into the unknown state. Because these pgs are unavailable, the pool cannot be associated with the rbd application.
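For reference, the reproduction boils down to a handful of ceph CLI calls. The sketch below only builds the command lines; the pool name "rbdpool" and pg count 32 are assumptions for illustration (the report does not give the exact pool parameters, though 31 unknown pgs out of 33 is consistent with a 32-pg pool plus the built-in .mgr pool).

```python
# Hypothetical reproduction sketch -- pool name and pg_num are assumptions,
# not taken from the report. Commands are built but not executed here.
def repro_commands(pool="rbdpool", pg_num=32):
    """Return the ceph CLI invocations that reproduce the report:
    create a pool, associate the rbd application, then check pg state."""
    return [
        ["ceph", "osd", "pool", "create", pool, str(pg_num)],
        ["ceph", "osd", "pool", "application", "enable", pool, "rbd"],
        ["ceph", "pg", "stat"],  # per the report, 31 pgs stay unknown
    ]

if __name__ == "__main__":
    for cmd in repro_commands():
        # On a live cluster: subprocess.run(cmd, check=True)
        print(" ".join(cmd))
```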
The issue is not reproducible with Quincy build on the same lab setup.
Crimson image - https://shaman.ceph.com/repos/ceph/main/aa49dee4e60f69d68f1c8252eef8f1c6cd991c08/crimson/267610/
Cephadm shell -
[root@dell-r730-043 /]# cephadm shell
Inferring fsid 129128f4-816f-11ed-ae0e-801844e02b40
Inferring config /var/lib/ceph/129128f4-816f-11ed-ae0e-801844e02b40/mon.dell-r730-043.dsal.lab.eng.rdu2.redhat.com/config
Using ceph image with id 'd92233276102' and tag 'aa49dee4e60f69d68f1c8252eef8f1c6cd991c08-crimson' created on 2022-12-13 16:56:05 +0000 UTC
quay.ceph.io/ceph-ci/ceph@sha256:7b703795d72ebf9fb6e9c28a88f6b50d10161225951107951541631dd2640a1b
[ceph: root@dell-r730-043 /]# ceph -v
ceph version 18.0.0-1417-gaa49dee4 (aa49dee4e60f69d68f1c8252eef8f1c6cd991c08) reef (dev)
ceph status shows a health warning (Reduced data availability: 31 pgs inactive) and reports the pgs as unknown:
[ceph: root@dell-r730-043 /]# ceph -s
  cluster:
    id:     129128f4-816f-11ed-ae0e-801844e02b40
    health: HEALTH_WARN
            Reduced data availability: 31 pgs inactive

  services:
    mon: 3 daemons, quorum dell-r730-043.dsal.lab.eng.rdu2.redhat.com,dell-r730-006,dell-r730-026 (age 5d)
    mgr: dell-r730-043.dsal.lab.eng.rdu2.redhat.com.bhyhhe(active, since 5d), standbys: dell-r730-006.efjfln
    osd: 9 osds: 9 up (since 5d), 9 in (since 5d)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   53 MiB used, 900 GiB / 900 GiB avail
    pgs:     93.939% pgs unknown
             31 unknown
             2  active+clean

  progress:
    Global Recovery Event (0s)
      [............................]
DSAL Lab setup -
HOST                                        ADDR       LABELS                                                                            STATUS
dell-r730-006.dsal.lab.eng.rdu2.redhat.com  10.1.8.16  crash alertmanager mon mgr osd node-exporter
dell-r730-026.dsal.lab.eng.rdu2.redhat.com  10.1.8.36  crash osd mon node-exporter
dell-r730-043.dsal.lab.eng.rdu2.redhat.com  10.1.8.53  _admin crash alertmanager mon mgr prometheus osd grafana installer node-exporter
3 hosts in cluster
Ceph OSDs up and running -
[ceph: root@dell-r730-043 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
-1         0.87918  root default
-2         0.29306      host dell-r730-006
 2         0.09769          osd.2               up   1.00000  1.00000
 3         0.09769          osd.3               up   1.00000  1.00000
 6         0.09769          osd.6               up   1.00000  1.00000
-4         0.29306      host dell-r730-026
 0         0.09769          osd.0               up   1.00000  1.00000
 5         0.09769          osd.5               up   1.00000  1.00000
 7         0.09769          osd.7               up   1.00000  1.00000
-3         0.29306      host dell-r730-043
 1         0.09769          osd.1               up   1.00000  1.00000
 4         0.09769          osd.4               up   1.00000  1.00000
 8         0.09769          osd.8               up   1.00000  1.00000
Ceph Health Details -
[ceph: root@dell-r730-043 /]# ceph health detail
HEALTH_WARN Reduced data availability: 31 pgs inactive
[WRN] PG_AVAILABILITY: Reduced data availability: 31 pgs inactive
    pg 2.1 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.2 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.3 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.4 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.5 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.6 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.7 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.8 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.9 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.a is stuck inactive for 5d, current state unknown, last acting []
    pg 2.b is stuck inactive for 5d, current state unknown, last acting []
    pg 2.c is stuck inactive for 5d, current state unknown, last acting []
    pg 2.d is stuck inactive for 5d, current state unknown, last acting []
    pg 2.e is stuck inactive for 5d, current state unknown, last acting []
    pg 2.f is stuck inactive for 5d, current state unknown, last acting []
    pg 2.10 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.11 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.12 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.13 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.14 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.15 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.16 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.17 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.18 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.19 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1a is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1b is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1c is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1d is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1e is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1f is stuck inactive for 5d, current state unknown, last acting []
Ceph pg stats -
[ceph: root@dell-r730-043 /]# ceph pg stat
33 pgs: 31 unknown, 2 active+clean; 0 B data, 53 MiB used, 900 GiB / 900 GiB avail
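When watching a cluster for this symptom, the one-line summary from `ceph pg stat` can be checked in a script. A minimal sketch, assuming the summary format captured above ("N pgs: COUNT state, ...; ..."):

```python
import re

def parse_pg_stat(line):
    """Parse the `ceph pg stat` summary line into (total, {state: count})."""
    total = int(re.match(r"(\d+) pgs:", line).group(1))
    # State counts are lowercase words joined by '+', each terminated by , or ;
    states = {s: int(n) for n, s in re.findall(r"(\d+) ([a-z+]+)[,;]", line)}
    return total, states

# The exact line captured in this report:
line = "33 pgs: 31 unknown, 2 active+clean; 0 B data, 53 MiB used, 900 GiB / 900 GiB avail"
total, states = parse_pg_stat(line)
print(total, states)  # 33 {'unknown': 31, 'active+clean': 2}
```

A health check could then alert whenever `states.get("unknown", 0)` is nonzero, which is exactly the condition reported here.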
Ceph pg dump stats -
[ceph: root@dell-r730-043 /]# ceph pg dump_stuck
PG_STAT  STATE    UP  UP_PRIMARY  ACTING  ACTING_PRIMARY
2.e      unknown  []  -1          []      -1
2.d      unknown  []  -1          []      -1
2.c      unknown  []  -1          []      -1
2.b      unknown  []  -1          []      -1
2.a      unknown  []  -1          []      -1
2.9      unknown  []  -1          []      -1
2.8      unknown  []  -1          []      -1
2.7      unknown  []  -1          []      -1
2.6      unknown  []  -1          []      -1
2.5      unknown  []  -1          []      -1
2.3      unknown  []  -1          []      -1
2.1      unknown  []  -1          []      -1
2.2      unknown  []  -1          []      -1
2.4      unknown  []  -1          []      -1
2.f      unknown  []  -1          []      -1
2.1b     unknown  []  -1          []      -1
2.1c     unknown  []  -1          []      -1
2.1a     unknown  []  -1          []      -1
2.1d     unknown  []  -1          []      -1
2.1f     unknown  []  -1          []      -1
2.1e     unknown  []  -1          []      -1
2.19     unknown  []  -1          []      -1
2.18     unknown  []  -1          []      -1
2.17     unknown  []  -1          []      -1
2.16     unknown  []  -1          []      -1
2.15     unknown  []  -1          []      -1
2.14     unknown  []  -1          []      -1
2.13     unknown  []  -1          []      -1
2.12     unknown  []  -1          []      -1
2.11     unknown  []  -1          []      -1
2.10     unknown  []  -1          []      -1
ok
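The notable detail in this dump is that every stuck pg has an empty acting set and no primary (-1), i.e. the OSDs never reported these pgs at all. A small sketch, assuming the whitespace-separated six-column layout shown above, that isolates those pgs from the table:

```python
def stuck_without_acting(dump):
    """From `ceph pg dump_stuck` text, list pg ids that are in the
    unknown state with an empty acting set (no OSD has claimed them)."""
    pgs = []
    for row in dump.strip().splitlines()[1:]:   # skip the header row
        pgid, state, up, up_primary, acting, acting_primary = row.split()
        if state == "unknown" and acting == "[]":
            pgs.append(pgid)
    return pgs

# First two rows of the dump above, as a self-contained sample:
sample = """PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
2.e unknown [] -1 [] -1
2.d unknown [] -1 [] -1"""
print(stuck_without_acting(sample))  # ['2.e', '2.d']
```

Run over the full dump, this returns all 31 pool-2 pgs, matching the health warning.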
Ceph Config Dump -
[ceph: root@dell-r730-043 /]# ceph config dump
WHO     MASK                LEVEL     OPTION                                 VALUE                                                                                     RO
global                      basic     container_image                        quay.ceph.io/ceph-ci/ceph@sha256:7b703795d72ebf9fb6e9c28a88f6b50d10161225951107951541631dd2640a1b  *
global                      basic     log_to_file                            true
mon                         advanced  auth_allow_insecure_global_id_reclaim  false
mon                         advanced  public_network                         10.1.8.0/22                                                                               *
mgr                         advanced  mgr/cephadm/container_init             True                                                                                      *
mgr                         advanced  mgr/cephadm/migration_current          5                                                                                         *
mgr                         advanced  mgr/dashboard/ALERTMANAGER_API_HOST    http://host.containers.internal:9093                                                      *
mgr                         advanced  mgr/dashboard/GRAFANA_API_SSL_VERIFY   false                                                                                     *
mgr                         advanced  mgr/dashboard/GRAFANA_API_URL          https://host.containers.internal:3000                                                     *
mgr                         advanced  mgr/dashboard/PROMETHEUS_API_HOST      http://host.containers.internal:9095                                                      *
mgr                         advanced  mgr/dashboard/ssl_server_port          8443                                                                                      *
mgr                         advanced  mgr/orchestrator/orchestrator          cephadm
osd     host:dell-r730-006  basic     osd_memory_target                      29262227456
osd     host:dell-r730-026  basic     osd_memory_target                      30693042176
osd     host:dell-r730-043  basic     osd_memory_target                      28187644586
osd                         advanced  osd_memory_target_autotune             true
Ceph pg dump attached
Ceph /var/logs - https://drive.google.com/file/d/1bVTcnDwnVHZCtvuz1V55Hisz0-MTGzuz/view?usp=share_link