Bug #58366


Crimson: unable to initialize pool for rbd due to inactive pgs

Added by Harsh Kumar over 1 year ago. Updated 11 months ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
crimson
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rbd
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While testing crimson-osd on the DSAL lab cluster, it was observed that once the OSDs were brought up and an OSD pool was created, 31 PGs remained inactive and eventually progressed to the unknown state. Because the PGs are unavailable, the pool cannot be associated with the rbd application.

The issue is not reproducible with a Quincy build on the same lab setup.

Crimson image - https://shaman.ceph.com/repos/ceph/main/aa49dee4e60f69d68f1c8252eef8f1c6cd991c08/crimson/267610/

Cephadm shell -

[root@dell-r730-043 /]# cephadm shell
Inferring fsid 129128f4-816f-11ed-ae0e-801844e02b40
Inferring config /var/lib/ceph/129128f4-816f-11ed-ae0e-801844e02b40/mon.dell-r730-043.dsal.lab.eng.rdu2.redhat.com/config
Using ceph image with id 'd92233276102' and tag 'aa49dee4e60f69d68f1c8252eef8f1c6cd991c08-crimson' created on 2022-12-13 16:56:05 +0000 UTC
quay.ceph.io/ceph-ci/ceph@sha256:7b703795d72ebf9fb6e9c28a88f6b50d10161225951107951541631dd2640a1b

[ceph: root@dell-r730-043 /]# ceph -v
ceph version 18.0.0-1417-gaa49dee4 (aa49dee4e60f69d68f1c8252eef8f1c6cd991c08) reef (dev)

ceph status shows a health warning (Reduced data availability: 31 pgs inactive) and reports the PGs as unknown/inactive:

[ceph: root@dell-r730-043 /]# ceph -s
  cluster:
    id:     129128f4-816f-11ed-ae0e-801844e02b40
    health: HEALTH_WARN
            Reduced data availability: 31 pgs inactive

  services:
    mon: 3 daemons, quorum dell-r730-043.dsal.lab.eng.rdu2.redhat.com,dell-r730-006,dell-r730-026 (age 5d)
    mgr: dell-r730-043.dsal.lab.eng.rdu2.redhat.com.bhyhhe(active, since 5d), standbys: dell-r730-006.efjfln
    osd: 9 osds: 9 up (since 5d), 9 in (since 5d)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   53 MiB used, 900 GiB / 900 GiB avail
    pgs:     93.939% pgs unknown
             31 unknown
             2  active+clean

  progress:
    Global Recovery Event (0s)
      [............................]
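
For reference, the 93.939% figure in the status output above is simply the share of PGs in the unknown state (a quick check in plain Python; the counts are taken from the report):

```python
# 31 of the 33 total PGs (32 in the rbd pool + 1 in .mgr) are unknown.
unknown_pgs = 31
total_pgs = 33
pct = round(unknown_pgs / total_pgs * 100, 3)
print(f"{pct}% pgs unknown")  # matches the "93.939% pgs unknown" line above
```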

DSAL Lab setup -

HOST                                        ADDR       LABELS                                                                            STATUS  
dell-r730-006.dsal.lab.eng.rdu2.redhat.com  10.1.8.16  crash alertmanager mon mgr osd node-exporter                                              
dell-r730-026.dsal.lab.eng.rdu2.redhat.com  10.1.8.36  crash osd mon node-exporter                                                               
dell-r730-043.dsal.lab.eng.rdu2.redhat.com  10.1.8.53  _admin crash alertmanager mon mgr prometheus osd grafana installer node-exporter          
3 hosts in cluster

Ceph OSDs up and running -

[ceph: root@dell-r730-043 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
-1         0.87918  root default                                     
-2         0.29306      host dell-r730-006                           
 2         0.09769          osd.2               up   1.00000  1.00000
 3         0.09769          osd.3               up   1.00000  1.00000
 6         0.09769          osd.6               up   1.00000  1.00000
-4         0.29306      host dell-r730-026                           
 0         0.09769          osd.0               up   1.00000  1.00000
 5         0.09769          osd.5               up   1.00000  1.00000
 7         0.09769          osd.7               up   1.00000  1.00000
-3         0.29306      host dell-r730-043                           
 1         0.09769          osd.1               up   1.00000  1.00000
 4         0.09769          osd.4               up   1.00000  1.00000
 8         0.09769          osd.8               up   1.00000  1.00000

Ceph Health Details -

[ceph: root@dell-r730-043 /]# ceph health detail
HEALTH_WARN Reduced data availability: 31 pgs inactive
[WRN] PG_AVAILABILITY: Reduced data availability: 31 pgs inactive
    pg 2.1 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.2 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.3 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.4 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.5 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.6 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.7 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.8 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.9 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.a is stuck inactive for 5d, current state unknown, last acting []
    pg 2.b is stuck inactive for 5d, current state unknown, last acting []
    pg 2.c is stuck inactive for 5d, current state unknown, last acting []
    pg 2.d is stuck inactive for 5d, current state unknown, last acting []
    pg 2.e is stuck inactive for 5d, current state unknown, last acting []
    pg 2.f is stuck inactive for 5d, current state unknown, last acting []
    pg 2.10 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.11 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.12 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.13 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.14 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.15 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.16 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.17 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.18 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.19 is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1a is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1b is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1c is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1d is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1e is stuck inactive for 5d, current state unknown, last acting []
    pg 2.1f is stuck inactive for 5d, current state unknown, last acting []
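
PG ids in the listing above are of the form pool-id.hex-index. A small sketch (plain Python, using pool id 2 and pg_num 32 from the osd dump in this report) of enumerating the expected PG ids for the pool; note that 2.0 does not appear in the stuck list, consistent with 31 of the 32 pool PGs being inactive:

```python
# Enumerate PG ids as <pool_id>.<pg_index in hex>, the naming used
# by `ceph health detail` and `ceph pg dump_stuck`.
pool_id = 2
pg_num = 32  # from the pool 2 'rbd' line in `ceph osd dump`
pg_ids = [f"{pool_id}.{i:x}" for i in range(pg_num)]
print(pg_ids[0], "...", pg_ids[-1])  # 2.0 ... 2.1f
```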

Ceph pg stats -

[ceph: root@dell-r730-043 /]# ceph pg stat 
33 pgs: 31 unknown, 2 active+clean; 0 B data, 53 MiB used, 900 GiB / 900 GiB avail

Ceph pg dump_stuck output -

[ceph: root@dell-r730-043 /]# ceph pg dump_stuck  
PG_STAT  STATE    UP  UP_PRIMARY  ACTING  ACTING_PRIMARY
2.e      unknown  []          -1      []              -1
2.d      unknown  []          -1      []              -1
2.c      unknown  []          -1      []              -1
2.b      unknown  []          -1      []              -1
2.a      unknown  []          -1      []              -1
2.9      unknown  []          -1      []              -1
2.8      unknown  []          -1      []              -1
2.7      unknown  []          -1      []              -1
2.6      unknown  []          -1      []              -1
2.5      unknown  []          -1      []              -1
2.3      unknown  []          -1      []              -1
2.1      unknown  []          -1      []              -1
2.2      unknown  []          -1      []              -1
2.4      unknown  []          -1      []              -1
2.f      unknown  []          -1      []              -1
2.1b     unknown  []          -1      []              -1
2.1c     unknown  []          -1      []              -1
2.1a     unknown  []          -1      []              -1
2.1d     unknown  []          -1      []              -1
2.1f     unknown  []          -1      []              -1
2.1e     unknown  []          -1      []              -1
2.19     unknown  []          -1      []              -1
2.18     unknown  []          -1      []              -1
2.17     unknown  []          -1      []              -1
2.16     unknown  []          -1      []              -1
2.15     unknown  []          -1      []              -1
2.14     unknown  []          -1      []              -1
2.13     unknown  []          -1      []              -1
2.12     unknown  []          -1      []              -1
2.11     unknown  []          -1      []              -1
2.10     unknown  []          -1      []              -1
ok

Ceph Config Dump -

[ceph: root@dell-r730-043 /]# ceph config dump
WHO     MASK                LEVEL     OPTION                                 VALUE                                                                                              RO
global                      basic     container_image                        quay.ceph.io/ceph-ci/ceph@sha256:7b703795d72ebf9fb6e9c28a88f6b50d10161225951107951541631dd2640a1b  * 
global                      basic     log_to_file                            true                                                                                                 
mon                         advanced  auth_allow_insecure_global_id_reclaim  false                                                                                                
mon                         advanced  public_network                         10.1.8.0/22                                                                                        * 
mgr                         advanced  mgr/cephadm/container_init             True                                                                                               * 
mgr                         advanced  mgr/cephadm/migration_current          5                                                                                                  * 
mgr                         advanced  mgr/dashboard/ALERTMANAGER_API_HOST    http://host.containers.internal:9093                                                               * 
mgr                         advanced  mgr/dashboard/GRAFANA_API_SSL_VERIFY   false                                                                                              * 
mgr                         advanced  mgr/dashboard/GRAFANA_API_URL          https://host.containers.internal:3000                                                              * 
mgr                         advanced  mgr/dashboard/PROMETHEUS_API_HOST      http://host.containers.internal:9095                                                               * 
mgr                         advanced  mgr/dashboard/ssl_server_port          8443                                                                                               * 
mgr                         advanced  mgr/orchestrator/orchestrator          cephadm                                                                                              
osd     host:dell-r730-006  basic     osd_memory_target                      29262227456                                                                                          
osd     host:dell-r730-026  basic     osd_memory_target                      30693042176                                                                                          
osd     host:dell-r730-043  basic     osd_memory_target                      28187644586                                                                                          
osd                         advanced  osd_memory_target_autotune             true

Full ceph pg dump attached (pg_dump_all.txt)
Ceph /var/logs - https://drive.google.com/file/d/1bVTcnDwnVHZCtvuz1V55Hisz0-MTGzuz/view?usp=share_link


Files

pg_dump_all.txt (15.4 KB) pg_dump_all.txt Harsh Kumar, 12/27/2022 09:25 PM
Actions #1

Updated by Samuel Just over 1 year ago

It doesn't look like pool id 2 actually maps to any osds. Can you attach the osdmap?

Actions #2

Updated by Nitzan Mordechai over 1 year ago

  • Assignee set to Nitzan Mordechai

@Harsh Kumar, can you also attach logs with DEBUG messages?
As Sam mentioned, it doesn't look like any of the PGs from pool id 2 are mapped.

Actions #3

Updated by Harsh Kumar over 1 year ago

Hey Nitzan, Sam,

Apologies for the delay; I had a hard time restoring my DSAL lab setup to the above Ceph cluster configuration.
I couldn't find a recent crimson image based on the reef build, so I instead used a crimson image based on the quincy build. However, I was able to reproduce the issue with this build as well.


[ceph: root@dell-r730-043 /]# ceph -s
  cluster:
    id:     7a245c9c-9016-11ed-a78e-801844e02b40
    health: HEALTH_WARN
            Failed to apply 1 service(s): prometheus
            Reduced data availability: 31 pgs inactive

  services:
    mon: 3 daemons, quorum dell-r730-043.dsal.lab.eng.rdu2.redhat.com,dell-r730-026,dell-r730-006 (age 44h)
    mgr: dell-r730-043.dsal.lab.eng.rdu2.redhat.com.nbllxz(active, since 2d), standbys: dell-r730-006.jxqbih
    osd: 9 osds: 9 up (since 44h), 9 in (since 44h)

  data:
    pools:   2 pools, 33 pgs
    objects: 2 objects, 449 KiB
    usage:   100 MiB used, 900 GiB / 900 GiB avail
    pgs:     93.939% pgs unknown
             31 unknown
             2  active+clean

  progress:
    Global Recovery Event (20h)
      [=...........................] (remaining: 4w)

[ceph: root@dell-r730-043 /]# ceph -v
ceph version 17.2.5-624-ga112c1c7 (a112c1c71fab41e0cb0b3bb2cc58e9792caa32cb) quincy (stable)

OSD map summary -

[ceph: root@dell-r730-043 /]# ceph osd dump
epoch 37
fsid 7a245c9c-9016-11ed-a78e-801844e02b40
created 2023-01-09T12:10:28.248287+0000
modified 2023-01-10T12:11:28.073432+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 8
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release quincy
stretch_mode_enabled false
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 16 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 1 pgp_num_target 32 autoscale_mode on last_change 34 lfor 0/0/34 flags hashpspool stripe_width 0
max_osd 9
osd.0 up   in  weight 1 up_from 11 up_thru 0 down_at 0 last_clean_interval [0,0) v2:10.1.8.53:6802/2115189942 v2:10.1.8.53:6803/2115189942 exists,up de67cb44-edfd-4082-83ea-e6a87aef4c3c
osd.1 up   in  weight 1 up_from 11 up_thru 24 down_at 0 last_clean_interval [0,0) v2:10.1.8.36:6801/1664737338 v2:10.1.8.36:6800/1664737338 exists,up c725c525-dd09-4fad-a577-8a18489c48de
osd.2 up   in  weight 1 up_from 13 up_thru 34 down_at 0 last_clean_interval [0,0) v2:10.1.8.53:6807/441164187 v2:10.1.8.53:6806/441164187 exists,up ee9fb0ae-4603-4640-b17d-b058ce65c3c4
osd.3 up   in  weight 1 up_from 13 up_thru 0 down_at 0 last_clean_interval [0,0) v2:10.1.8.36:6804/729704151 v2:10.1.8.36:6805/729704151 exists,up dd1f2bcf-6ba3-4188-a447-4f05e69f7e8c
osd.4 up   in  weight 1 up_from 19 up_thru 0 down_at 0 last_clean_interval [0,0) v2:10.1.8.53:6811/3933002731 v2:10.1.8.53:6810/3933002731 exists,up eb1d84eb-1caf-4cb5-a00c-703a085eda2c
osd.5 up   in  weight 1 up_from 18 up_thru 0 down_at 0 last_clean_interval [0,0) v2:10.1.8.36:6808/410057933 v2:10.1.8.36:6809/410057933 exists,up 872e488e-1cfa-4fbb-9865-e5b2e53912df
osd.6 up   in  weight 1 up_from 24 up_thru 0 down_at 0 last_clean_interval [0,0) v2:10.1.8.16:6801/2634977743 v2:10.1.8.16:6800/2634977743 exists,up 006169f7-e79b-4d3f-ab0d-5bc82529c614
osd.7 up   in  weight 1 up_from 27 up_thru 0 down_at 0 last_clean_interval [0,0) v2:10.1.8.16:6804/3089609507 v2:10.1.8.16:6805/3089609507 exists,up e33a1271-6975-4eb3-a44a-7054a0528ba1
osd.8 up   in  weight 1 up_from 29 up_thru 0 down_at 0 last_clean_interval [0,0) v2:10.1.8.16:6808/3765740467 v2:10.1.8.16:6809/3765740467 exists,up c9741a00-acd2-43aa-a7ef-086d1f07f6fc
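
One detail worth noting in the dump above: pool 2 shows pg_num 32 but pgp_num 1 with pgp_num_target 32, i.e. placement-group splitting had not completed. A small sketch (plain Python, parsing the exact pool line from the dump) that pulls out those fields:

```python
import re

# The pool 2 line from `ceph osd dump` above. pgp_num is still 1 while
# pgp_num_target is 32, so the PGs had not yet been split across
# placement groups when the pool went inactive.
line = ("pool 2 'rbd' replicated size 3 min_size 2 crush_rule 0 "
        "object_hash rjenkins pg_num 32 pgp_num 1 pgp_num_target 32 "
        "autoscale_mode on last_change 34 lfor 0/0/34 flags hashpspool "
        "stripe_width 0")
fields = dict(re.findall(r"(pg_num|pgp_num|pgp_num_target)\s+(\d+)", line))
print(fields)
```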

ceph osd tree -

[ceph: root@dell-r730-043 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
-1         0.87918  root default                                     
-4         0.29306      host dell-r730-006                           
 6         0.09769          osd.6               up   1.00000  1.00000
 7         0.09769          osd.7               up   1.00000  1.00000
 8         0.09769          osd.8               up   1.00000  1.00000
-3         0.29306      host dell-r730-026                           
 1         0.09769          osd.1               up   1.00000  1.00000
 3         0.09769          osd.3               up   1.00000  1.00000
 5         0.09769          osd.5               up   1.00000  1.00000
-2         0.29306      host dell-r730-043                           
 0         0.09769          osd.0               up   1.00000  1.00000
 2         0.09769          osd.2               up   1.00000  1.00000
 4         0.09769          osd.4               up   1.00000  1.00000

Please let me know if you need access to the cluster for further investigation.
I will attach the requested logs shortly.

Actions #4

Updated by Nitzan Mordechai over 1 year ago

Harsh, please attach the logs with debug level

Actions #5

Updated by Harsh Kumar over 1 year ago

Nitzan Mordechai wrote:

Harsh, please attach the logs with debug level

Hi Nitzan,

Please find the logs for the current cluster setup here - https://drive.google.com/file/d/1x2MLvSRlK1AWOYHAuRreTAMmCK0zn8xU/view?usp=share_link

Actions #6

Updated by Nitzan Mordechai over 1 year ago

It appears that the autoscaler is on, which is what causes the issue.
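
A hedged workaround sketch, assuming the autoscaler interaction is indeed the trigger (these are standard Ceph CLI commands; exact behavior on a crimson build may differ):

```shell
# Disable PG autoscaling on the affected pool:
ceph osd pool set rbd pg_autoscale_mode off

# Or disable it by default for newly created pools:
ceph config set global osd_pool_default_pg_autoscale_mode off

# Then verify that pgp_num catches up with pg_num:
ceph osd pool get rbd pg_num
ceph osd pool get rbd pgp_num
```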

Actions #7

Updated by Matan Breizman 11 months ago

  • Status changed from New to Closed