Bug #52408 (closed): osds not peering correctly after startup

Added by Jeff Layton over 2 years ago. Updated over 2 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I might not have the right terminology here. I have a host on which I run 3 VMs that act as cephadm cluster nodes (mostly as a target for kclient testing).

Recently, I upgraded to a newer version of the bleeding-edge quincy build I'm using, and now when I start up the cluster nodes, the cluster goes into this state and never recovers on its own:

# ceph -s
  cluster:
    id:     251b9faa-ff79-11eb-b671-52540031ba78
    health: HEALTH_WARN
            2 filesystems are degraded
            6 MDSs report slow metadata IOs
            Reduced data availability: 208 pgs inactive, 58 pgs peering
            12 slow ops, oldest one blocked for 344 sec, daemons [osd.2,mon.cephadm1] have slow ops.

  services:
    mon: 3 daemons, quorum cephadm1,cephadm2,cephadm3 (age 5m)
    mgr: cephadm1.julnog(active, since 5m), standbys: cephadm2.sjuknm
    mds: 6/6 daemons up, 2 standby
    osd: 3 osds: 3 up (since 5m), 3 in (since 7d)

  data:
    volumes: 0/2 healthy, 2 recovering
    pools:   5 pools, 208 pgs
    objects: 690 objects, 633 MiB
    usage:   2.2 GiB used, 929 GiB / 932 GiB avail
    pgs:     72.115% pgs unknown
             27.885% pgs not active
             150 unknown
             58  peering

It looks like osd.2 in this case is waiting on the other OSDs for certain ops, but those ops never seem to arrive. If I then log into each host and do a "systemctl restart" on the OSD daemons, they'll recover and come back:

# ceph -s
  cluster:
    id:     251b9faa-ff79-11eb-b671-52540031ba78
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cephadm1,cephadm2,cephadm3 (age 9m)
    mgr: cephadm1.julnog(active, since 9m), standbys: cephadm2.sjuknm
    mds: 6/6 daemons up, 2 standby
    osd: 3 osds: 3 up (since 11s), 3 in (since 7d)

  data:
    volumes: 2/2 healthy
    pools:   5 pools, 208 pgs
    objects: 2.02k objects, 2.1 GiB
    usage:   6.7 GiB used, 2.7 TiB / 2.7 TiB avail
    pgs:     130 active+clean
             78  active+clean+wait

  io:
    client:   0 B/s wr, 0 op/s rd, 5 op/s wr
    recovery: 7.5 KiB/s, 1 objects/s

I've looked in the logs and don't see anything obvious (though I'm not an expert on the osd, so I could be missing it). This is very reproducible, so let me know if I need to collect any info to help with debugging.
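With cephadm the OSD systemd units are named per cluster fsid, so the restart I run on each host is roughly the following (the fsid and OSD id here are placeholders):

# systemctl restart ceph-<fsid>@osd.<id>.service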


Files

cephadm_logs.tar.gz (895 KB): logs from cephadm hosts after cranking up debugging (Jeff Layton, 08/25/2021 06:53 PM)
osd_logs_20210826.tar.gz (129 KB): logs from cluster build (Jeff Layton, 08/26/2021 10:59 AM)
20210830-osdlogs.tar.gz (471 KB) (Jeff Layton, 08/30/2021 04:55 PM)
Actions #1

Updated by Jeff Layton over 2 years ago

My current build is based on upstream commit a49f10e760b4. It has some MDS patches on top, but nothing that should affect osd behavior. I've been seeing this for a few weeks over several builds though, so I don't think it's a very recent regression.

Actions #2

Updated by Neha Ojha over 2 years ago

1. Can you try to reproduce this with 1 pool containing a few pgs?
2. Turn the autoscaler off (ceph osd pool set foo pg_autoscale_mode off) and check whether the issue still reproduces.
3. Provide the following debug data (see the sketch after this list):
- ceph health detail
- ceph osd dump
- pg query from one of the pgs stuck in peering
- osd logs with debug_osd=20, debug_ms=1 from when the pgs are stuck
4. Does this reproduce on pacific?
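For reference, collecting the first three items from a cephadm shell might look like this; the <pgid> placeholder stands for whichever stuck pg the grep turns up:

# ceph health detail
# ceph osd dump
# ceph pg dump pgs | grep peering
# ceph pg <pgid> query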

Actions #3

Updated by Jeff Layton over 2 years ago

This time when I brought it up, one osd didn't go "up". Here are the first two bits of info you asked for:

[ceph: root@cephadm1 /]# ceph -s      
  cluster:
    id:     251b9faa-ff79-11eb-b671-52540031ba78
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 2041/6123 objects degraded (33.333%), 159 pgs degraded, 208 pgs undersized

  services:
    mon: 3 daemons, quorum cephadm1,cephadm2,cephadm3 (age 8m)
    mgr: cephadm2.sjuknm(active, since 8m), standbys: cephadm1.julnog
    mds: 6/6 daemons up, 2 standby
    osd: 3 osds: 2 up (since 6m), 3 in (since 8d)

  data:
    volumes: 2/2 healthy
    pools:   5 pools, 208 pgs
    objects: 2.04k objects, 2.1 GiB
    usage:   7.2 GiB used, 2.7 TiB / 2.7 TiB avail
    pgs:     2041/6123 objects degraded (33.333%)
             159 active+undersized+degraded
             49  active+undersized

[ceph: root@cephadm1 /]# ceph health detail
HEALTH_WARN 1 osds down; 1 host (1 osds) down; Degraded data redundancy: 2041/6123 objects degraded (33.333%), 159 pgs degraded, 208 pgs undersized
[WRN] OSD_DOWN: 1 osds down
    osd.0 (root=default,host=cephadm2) is down
[WRN] OSD_HOST_DOWN: 1 host (1 osds) down
    host cephadm2 (root=default) (1 osds) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 2041/6123 objects degraded (33.333%), 159 pgs degraded, 208 pgs undersized
    pg 2.24 is stuck undersized for 118s, current state active+undersized, last acting [1,2]
    pg 2.25 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.26 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.27 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.28 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 2.29 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.2a is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.2b is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 2.2c is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 2.2d is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.2e is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.2f is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.30 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.31 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 2.32 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.33 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.34 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.35 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 2.36 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 2.37 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.38 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.39 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.3a is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 2.3b is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.3c is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 2.3d is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.20 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.21 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.23 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.28 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.29 is stuck undersized for 118s, current state active+undersized, last acting [2,1]
    pg 4.2a is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.2b is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.2c is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.2d is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.2e is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.2f is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.30 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.31 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.32 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.33 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.34 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.35 is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.36 is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.37 is stuck undersized for 118s, current state active+undersized, last acting [2,1]
    pg 4.3a is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.3b is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.3c is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.3d is stuck undersized for 118s, current state active+undersized+degraded, last acting [1,2]
    pg 4.3e is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
    pg 4.3f is stuck undersized for 118s, current state active+undersized+degraded, last acting [2,1]
[ceph: root@cephadm1 /]# ceph osd dump
epoch 339
fsid 251b9faa-ff79-11eb-b671-52540031ba78
created 2021-08-17T16:36:06.806675+0000
modified 2021-08-25T18:05:00.918772+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 49
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release quincy
stretch_mode_enabled false
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 31 lfor 0/0/26 flags hashpspool stripe_width 0 pg_num_min 1 application mgr
pool 2 'cephfs.test.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 31 lfor 0/0/26 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.test.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 31 lfor 0/0/26 flags hashpspool stripe_width 0 application cephfs
pool 4 'cephfs.scratch.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 31 lfor 0/0/28 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 5 'cephfs.scratch.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 31 lfor 0/0/29 flags hashpspool stripe_width 0 application cephfs
max_osd 3
osd.0 down in  weight 1 up_from 307 up_thru 310 down_at 338 last_clean_interval [282,284) [v2:192.168.1.82:6802/479218218,v1:192.168.1.82:6803/479218218] [v2:192.168.1.82:6806/479218218,v1:192.168.1.82:6807/479218218] exists 2bf640b5-3f5d-42b7-9540-be312626eeb5
osd.1 up   in  weight 1 up_from 336 up_thru 338 down_at 335 last_clean_interval [310,319) [v2:192.168.1.83:6804/3731032313,v1:192.168.1.83:6805/3731032313] [v2:192.168.1.83:6806/3731032313,v1:192.168.1.83:6807/3731032313] exists,up 29c5656f-2568-4afc-a3c6-e99312e6dc63
osd.2 up   in  weight 1 up_from 331 up_thru 338 down_at 330 last_clean_interval [302,319) [v2:192.168.1.81:6802/1750854532,v1:192.168.1.81:6803/1750854532] [v2:192.168.1.81:6804/1750854532,v1:192.168.1.81:6805/1750854532] exists,up c3ad9be8-8713-44f5-b145-71ed6f0e2163
blocklist 192.168.1.81:0/3782342364 expires 2021-08-26T18:02:40.214265+0000
blocklist 192.168.1.81:0/3471956381 expires 2021-08-26T18:02:40.214265+0000
blocklist 192.168.1.81:6817/346880914 expires 2021-08-26T18:02:40.214265+0000
blocklist 192.168.1.81:6816/346880914 expires 2021-08-26T18:02:40.214265+0000
blocklist 192.168.1.81:6805/3938546991 expires 2021-08-26T18:02:31.141081+0000
blocklist 192.168.1.81:6806/2443975889 expires 2021-08-26T18:02:30.980017+0000
blocklist 192.168.1.82:6805/3230328380 expires 2021-08-26T18:02:30.937792+0000
blocklist 192.168.1.82:6804/3230328380 expires 2021-08-26T18:02:30.937792+0000
blocklist 192.168.1.82:6801/2302659088 expires 2021-08-26T18:02:30.909602+0000
blocklist 192.168.1.82:6800/2302659088 expires 2021-08-26T18:02:30.909602+0000
blocklist 192.168.1.83:6802/401740188 expires 2021-08-26T18:02:30.846952+0000
blocklist 192.168.1.83:6801/1672504140 expires 2021-08-26T18:02:30.809979+0000
blocklist 192.168.1.83:6800/1672504140 expires 2021-08-26T18:02:30.809979+0000
blocklist 192.168.1.81:6817/3821187006 expires 2021-08-26T18:02:30.198476+0000
blocklist 192.168.1.81:6816/3821187006 expires 2021-08-26T18:02:30.198476+0000
blocklist 192.168.1.81:6802/1204954916 expires 2021-08-26T18:02:27.969167+0000
blocklist 192.168.1.81:6801/3146019248 expires 2021-08-26T18:02:27.581398+0000
blocklist 192.168.1.81:6800/3146019248 expires 2021-08-26T18:02:27.581398+0000
blocklist 192.168.1.82:0/3002287722 expires 2021-08-26T13:34:42.153084+0000
blocklist 192.168.1.82:0/3251156383 expires 2021-08-25T20:52:13.681744+0000
blocklist 192.168.1.82:6800/1015443557 expires 2021-08-26T13:34:38.617603+0000
blocklist 192.168.1.81:0/2246032227 expires 2021-08-25T20:52:18.683113+0000
blocklist 192.168.1.82:0/1631074194 expires 2021-08-26T13:15:13.357659+0000
blocklist 192.168.1.81:6802/2049541985 expires 2021-08-26T13:15:04.753007+0000
blocklist 192.168.1.81:6800/1039979870 expires 2021-08-25T20:52:01.575492+0000
blocklist 192.168.1.82:6803/2845339467 expires 2021-08-26T13:15:10.182985+0000
blocklist 192.168.1.82:6812/888354374 expires 2021-08-26T13:34:42.153084+0000
blocklist 192.168.1.81:6802/3114529334 expires 2021-08-26T13:34:37.363967+0000
blocklist 192.168.1.82:6812/3969933283 expires 2021-08-25T20:52:08.660348+0000
blocklist 192.168.1.81:6817/3039084464 expires 2021-08-25T20:52:18.683113+0000
blocklist 192.168.1.82:6813/888354374 expires 2021-08-26T13:34:42.153084+0000
blocklist 192.168.1.82:6800/3506976858 expires 2021-08-26T13:15:10.258137+0000
blocklist 192.168.1.82:0/2532729966 expires 2021-08-26T13:34:52.178326+0000
blocklist 192.168.1.81:6806/2359441640 expires 2021-08-26T13:34:37.939603+0000
blocklist 192.168.1.82:6813/1866043889 expires 2021-08-26T13:34:52.178326+0000
blocklist 192.168.1.81:6803/3253291804 expires 2021-08-25T20:52:00.468050+0000
blocklist 192.168.1.82:6802/175006601 expires 2021-08-25T20:52:04.496200+0000
blocklist 192.168.1.81:0/2113884414 expires 2021-08-26T18:02:30.198476+0000
blocklist 192.168.1.82:6812/1758673359 expires 2021-08-25T20:52:13.681744+0000
blocklist 192.168.1.82:0/3878118157 expires 2021-08-26T13:34:42.153084+0000
blocklist 192.168.1.81:6807/4059157461 expires 2021-08-25T20:52:04.433026+0000
blocklist 192.168.1.82:0/428012362 expires 2021-08-26T13:34:52.178326+0000
blocklist 192.168.1.81:6804/3938546991 expires 2021-08-26T18:02:31.141081+0000
blocklist 192.168.1.81:6816/853332031 expires 2021-08-26T13:15:19.511009+0000
blocklist 192.168.1.82:0/3414884925 expires 2021-08-25T20:52:13.681744+0000
blocklist 192.168.1.81:6807/2359441640 expires 2021-08-26T13:34:37.939603+0000
blocklist 192.168.1.82:0/2509371689 expires 2021-08-26T13:15:14.504275+0000
blocklist 192.168.1.82:6801/1015443557 expires 2021-08-26T13:34:38.617603+0000
blocklist 192.168.1.81:0/3853103761 expires 2021-08-26T13:15:19.511009+0000
blocklist 192.168.1.82:6802/33126925 expires 2021-08-26T13:34:38.704412+0000
blocklist 192.168.1.81:0/2363123193 expires 2021-08-26T18:02:30.198476+0000
blocklist 192.168.1.82:0/2155073570 expires 2021-08-25T20:52:08.660348+0000
blocklist 192.168.1.81:6802/3253291804 expires 2021-08-25T20:52:00.468050+0000
blocklist 192.168.1.82:6813/287702748 expires 2021-08-26T13:15:14.504275+0000
blocklist 192.168.1.82:0/664363868 expires 2021-08-26T13:15:13.357659+0000
blocklist 192.168.1.81:6803/3114529334 expires 2021-08-26T13:34:37.363967+0000
blocklist 192.168.1.81:6804/3000324990 expires 2021-08-26T13:34:37.981064+0000
blocklist 192.168.1.82:6813/207728757 expires 2021-08-26T13:15:13.357659+0000
blocklist 192.168.1.82:6802/2845339467 expires 2021-08-26T13:15:10.182985+0000
blocklist 192.168.1.82:0/378803931 expires 2021-08-25T20:52:08.660348+0000
blocklist 192.168.1.81:6817/853332031 expires 2021-08-26T13:15:19.511009+0000
blocklist 192.168.1.81:0/4109606007 expires 2021-08-26T13:15:19.511009+0000
blocklist 192.168.1.83:6803/401740188 expires 2021-08-26T18:02:30.846952+0000
blocklist 192.168.1.82:0/3173462331 expires 2021-08-26T13:34:52.178326+0000
blocklist 192.168.1.81:0/3473913313 expires 2021-08-26T18:02:40.214265+0000
blocklist 192.168.1.81:0/3836021228 expires 2021-08-25T20:52:18.683113+0000
blocklist 192.168.1.81:6800/2733477518 expires 2021-08-26T13:15:04.875893+0000
blocklist 192.168.1.81:6803/1204954916 expires 2021-08-26T18:02:27.969167+0000
blocklist 192.168.1.81:6801/1039979870 expires 2021-08-25T20:52:01.575492+0000
blocklist 192.168.1.81:0/2561200057 expires 2021-08-25T20:52:18.683113+0000
blocklist 192.168.1.81:6803/2049541985 expires 2021-08-26T13:15:04.753007+0000
blocklist 192.168.1.82:0/1097481021 expires 2021-08-26T13:34:42.153084+0000
blocklist 192.168.1.82:0/2616982610 expires 2021-08-26T13:15:13.357659+0000
blocklist 192.168.1.82:0/2252382332 expires 2021-08-25T20:52:13.681744+0000
blocklist 192.168.1.81:6805/3000324990 expires 2021-08-26T13:34:37.981064+0000
blocklist 192.168.1.83:6800/3915355784 expires 2021-08-25T20:52:04.585279+0000
blocklist 192.168.1.82:6813/3969933283 expires 2021-08-25T20:52:08.660348+0000
blocklist 192.168.1.83:6801/3915355784 expires 2021-08-25T20:52:04.585279+0000
blocklist 192.168.1.81:6800/1116413352 expires 2021-08-26T13:34:37.316137+0000
blocklist 192.168.1.82:6812/1866043889 expires 2021-08-26T13:34:52.178326+0000
blocklist 192.168.1.82:6801/3506976858 expires 2021-08-26T13:15:10.258137+0000
blocklist 192.168.1.82:6803/175006601 expires 2021-08-25T20:52:04.496200+0000
blocklist 192.168.1.81:6807/2443975889 expires 2021-08-26T18:02:30.980017+0000
blocklist 192.168.1.82:6812/207728757 expires 2021-08-26T13:15:13.357659+0000
blocklist 192.168.1.83:6802/4292161508 expires 2021-08-25T20:52:04.534949+0000
blocklist 192.168.1.81:6801/2733477518 expires 2021-08-26T13:15:04.875893+0000
blocklist 192.168.1.83:6801/2991738018 expires 2021-08-26T13:15:10.318889+0000
blocklist 192.168.1.83:6800/2991738018 expires 2021-08-26T13:15:10.318889+0000
blocklist 192.168.1.81:6801/1116413352 expires 2021-08-26T13:34:37.316137+0000
blocklist 192.168.1.82:6812/287702748 expires 2021-08-26T13:15:14.504275+0000
blocklist 192.168.1.81:6816/3039084464 expires 2021-08-25T20:52:18.683113+0000
blocklist 192.168.1.82:0/3038358969 expires 2021-08-25T20:52:08.660348+0000
blocklist 192.168.1.81:6805/3861618632 expires 2021-08-25T20:52:01.305248+0000
blocklist 192.168.1.82:0/3969954756 expires 2021-08-26T13:34:42.153084+0000
blocklist 192.168.1.81:0/207683970 expires 2021-08-26T18:02:30.198476+0000
blocklist 192.168.1.81:0/1172344888 expires 2021-08-26T13:15:19.511009+0000
blocklist 192.168.1.81:6804/3861618632 expires 2021-08-25T20:52:01.305248+0000
blocklist 192.168.1.82:0/2782946587 expires 2021-08-26T13:15:13.357659+0000
blocklist 192.168.1.83:6802/2202142153 expires 2021-08-26T13:15:09.591493+0000
blocklist 192.168.1.83:6803/2202142153 expires 2021-08-26T13:15:09.591493+0000
blocklist 192.168.1.82:6813/1758673359 expires 2021-08-25T20:52:13.681744+0000
blocklist 192.168.1.81:6806/4059157461 expires 2021-08-25T20:52:04.433026+0000
blocklist 192.168.1.81:0/2343935914 expires 2021-08-26T18:02:30.198476+0000
blocklist 192.168.1.83:6803/4292161508 expires 2021-08-25T20:52:04.534949+0000
blocklist 192.168.1.82:6803/33126925 expires 2021-08-26T13:34:38.704412+0000
blocklist 192.168.1.82:0/2895126510 expires 2021-08-25T20:52:08.660348+0000

I'm not sure what you're asking for with this:

- pg query from one of the pgs stuck in peering

I'll also have to look at how to rev up the debug_osd value cluster-wide. The rest of the info will take some time as I'll need to rebuild the cluster from scratch.

Actions #4

Updated by Neha Ojha over 2 years ago

Jeff Layton wrote:

This time when I brought it up, one osd didn't go "up". Here are the first two bits of info you asked for:

1. I meant using just 1 pool instead of 5 pools. This time the PGs are not in unknown or peering; they are in active+undersized and active+undersized+degraded, which is expected when OSDs go down. It looks like osd.0 went down; is there anything in the logs that indicates why that happened?

This is not the same issue you saw earlier.

2. I meant setting "ceph osd pool set <pool-name> pg_autoscale_mode off" for all of these pools; currently they have "autoscale_mode on" (see the one-liner after the pool list):

pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 31 lfor 0/0/26 flags hashpspool stripe_width 0 pg_num_min 1 application mgr
pool 2 'cephfs.test.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 31 lfor 0/0/26 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.test.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 31 lfor 0/0/26 flags hashpspool stripe_width 0 application cephfs
pool 4 'cephfs.scratch.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 31 lfor 0/0/28 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 5 'cephfs.scratch.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 31 lfor 0/0/29 flags hashpspool stripe_width 0 application cephfs
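That can be done for all five pools in one shell loop (pool names copied from your osd dump above):

# for p in .mgr cephfs.test.meta cephfs.test.data cephfs.scratch.meta cephfs.scratch.data; do ceph osd pool set "$p" pg_autoscale_mode off; done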

[...]

I'm not sure what you're asking for with this:

- pg query from one of the pgs stuck in peering

If you run "ceph pg dump pgs | grep peering", it will list the pgs that are in peering; just pick one of them and run "ceph pg <pg-id> query".

I'll also have to look at how to rev up the debug_osd value cluster-wide. The rest of the info will take some time as I'll need to rebuild the cluster from scratch.

Running "ceph config set osd debug_osd 20" should work.

Actions #5

Updated by Jeff Layton over 2 years ago

Nothing in the logs for the crashed osd.0. I think the last thing in the log was a rocksdb dump. coredumpctl also didn't have anything.

Most of what I'm doing is cephfs testing, and a single fs requires 2 pools. I can rebuild the cluster from scratch and create a single pool in it, but I don't really have a great way to load any data into it, and it'll blow away this cluster. Maybe that's ok. I'll try it tomorrow.
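One way to put some data into a bare pool, assuming rados bench is acceptable for this, might be (pool name and sizes are just an example):

# ceph osd pool create test 32
# rados bench -p test 60 write --no-cleanup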

I turned off the autoscaler and rebooted the machine that houses all the VMs and it didn't seem to help. Still seeing them come up non-functional. This time:

[ceph: root@cephadm1 /]# ceph -s
  cluster:
    id:     251b9faa-ff79-11eb-b671-52540031ba78
    health: HEALTH_WARN
            2 filesystems are degraded
            6 MDSs report slow metadata IOs
            Reduced data availability: 208 pgs inactive, 58 pgs peering
            12 slow ops, oldest one blocked for 84 sec, daemons [osd.2,mon.cephadm1] have slow ops.

  services:
    mon: 3 daemons, quorum cephadm1,cephadm2,cephadm3 (age 98s)
    mgr: cephadm1.julnog(active, since 80s), standbys: cephadm2.sjuknm
    mds: 6/6 daemons up, 2 standby
    osd: 3 osds: 3 up (since 89s), 3 in (since 8d)

  data:
    volumes: 0/2 healthy, 2 recovering
    pools:   5 pools, 208 pgs
    objects: 699 objects, 666 MiB
    usage:   2.2 GiB used, 929 GiB / 932 GiB avail
    pgs:     72.115% pgs unknown
             27.885% pgs not active
             150 unknown
             58  peering

[ceph: root@cephadm1 /]# ceph osd status
ID  HOST       USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE      
 0               0      0       0        0       0        0   exists,up  
 1               0      0       0        0       0        0   exists,up  
 2  cephadm1  2289M   929G      0        0       0        0   exists,up  
Actions #6

Updated by Jeff Layton over 2 years ago

peering info:

2.3d           2                   0         0          0        0   4194304         4960          10  1454      1454  peering  2021-08-25T19:42:34.921895+0000   316'12954   390:13390  [2,0,1]           2  [2,0,1]               2   283'12895  2021-08-25T13:24:27.623246+0000        283'12895  2021-08-25T13:24:27.623246+0000

# ceph pg 2.3d query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "peering",
    "epoch": 391,
    "up": [
        2,
        0,
        1
    ],
    "acting": [
        2,
        0,
        1
    ],
    "info": {
        "pgid": "2.3d",
        "last_update": "316'12954",
        "last_complete": "316'12954",
        "log_tail": "229'11500",
        "last_user_version": 12899,
        "last_backfill": "MAX",
        "purged_snaps": [],
        "history": {
            "epoch_created": 26,
            "epoch_pool_created": 15,
            "last_epoch_started": 339,
            "last_interval_started": 338,
            "last_epoch_clean": 311,
            "last_interval_clean": 310,
            "last_epoch_split": 26,
            "last_epoch_marked_full": 0,
            "same_up_since": 389,
            "same_interval_since": 389,
            "same_primary_since": 389,
            "last_scrub": "283'12895",
            "last_scrub_stamp": "2021-08-25T13:24:27.623246+0000",
            "last_deep_scrub": "283'12895",
            "last_deep_scrub_stamp": "2021-08-25T13:24:27.623246+0000",
            "last_clean_scrub_stamp": "2021-08-25T13:24:27.623246+0000",
            "prior_readable_until_ub": 0
        },
        "stats": {
            "version": "316'12954",
            "reported_seq": 13391,
            "reported_epoch": 391,
            "state": "peering",
            "last_fresh": "2021-08-25T19:42:39.861242+0000",
            "last_change": "2021-08-25T19:42:34.921895+0000",
            "last_active": "2021-08-25T18:14:39.220353+0000",
            "last_peered": "2021-08-25T18:05:01.001374+0000",
            "last_clean": "2021-08-25T15:04:17.540522+0000",
            "last_became_active": "2021-08-25T18:05:01.001370+0000",
            "last_became_peered": "2021-08-25T18:05:01.001370+0000",
            "last_unstale": "2021-08-25T19:42:39.861242+0000",
            "last_undegraded": "2021-08-25T19:42:39.861242+0000",
            "last_fullsized": "2021-08-25T19:42:39.861242+0000",
            "mapping_epoch": 389,
            "log_start": "229'11500",
            "ondisk_log_start": "229'11500",
            "created": 26,
            "last_epoch_clean": 311,
            "parent": "0.0",
            "parent_split_bits": 6,
            "last_scrub": "283'12895",
            "last_scrub_stamp": "2021-08-25T13:24:27.623246+0000",
            "last_deep_scrub": "283'12895",
            "last_deep_scrub_stamp": "2021-08-25T13:24:27.623246+0000",
            "last_clean_scrub_stamp": "2021-08-25T13:24:27.623246+0000",
            "log_size": 1454,
            "ondisk_log_size": 1454,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": false,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 4194304,
                "num_objects": 2,
                "num_object_clones": 0,
                "num_object_copies": 6,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 2,
                "num_whiteouts": 0,
                "num_read": 32,
                "num_read_kb": 57386,
                "num_write": 12235,
                "num_write_kb": 88430,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 0,
                "num_bytes_recovered": 0,
                "num_keys_recovered": 0,
                "num_objects_omap": 1,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0,
                "num_omap_bytes": 4960,
                "num_omap_keys": 10,
                "num_objects_repaired": 0
            },
            "up": [
                2,
                0,
                1
            ],
            "acting": [
                2,
                0,
                1
            ],
            "avail_no_missing": [],
            "object_location_counts": [],
            "blocked_by": [
                0,
                1
            ],
            "up_primary": 2,
            "acting_primary": 2,
            "purged_snaps": []
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 339,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/GetInfo",
            "enter_time": "2021-08-25T19:42:34.921882+0000",
            "requested_info_from": [
                {
                    "osd": "0" 
                },
                {
                    "osd": "1" 
                }
            ]
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2021-08-25T19:42:34.921877+0000",
            "past_intervals": [
                {
                    "first": "310",
                    "last": "388",
                    "all_participants": [
                        {
                            "osd": 0
                        },
                        {
                            "osd": 1
                        },
                        {
                            "osd": 2
                        }
                    ],
                    "intervals": [
                        {
                            "first": "338",
                            "last": "349",
                            "acting": "1,2" 
                        },
                        {
                            "first": "373",
                            "last": "387",
                            "acting": "0,1,2" 
                        }
                    ]
                }
            ],
            "probing_osds": [
                "0",
                "1",
                "2" 
            ],
            "down_osds_we_would_probe": [],
            "peering_blocked_by": []
        },
        {
            "name": "Started",
            "enter_time": "2021-08-25T19:42:34.921835+0000" 
        }
    ],
    "agent_state": {}
}
Actions #7

Updated by Jeff Layton over 2 years ago

Tore down the old cluster and built a Pacific one (v16.2.5). That one doesn't have the same issue. I'll do a clean teardown and rebuild tomorrow with an updated cluster image, and try some of the other things you mentioned.

Actions #8

Updated by Jeff Layton over 2 years ago

Tore down and rebuilt the cluster again using my quincy-based image. This time, I didn't create any filesystems; cephadm created one pool for the mgr automatically. When I rebooted it after setting up the OSDs:

[ceph: root@cephadm1 /]# ceph -s
  cluster:
    id:     28c5bdca-0659-11ec-83f7-52540031ba78
    health: HEALTH_WARN
            Reduced data availability: 128 pgs inactive, 45 pgs peering
            6 slow ops, oldest one blocked for 283 sec, daemons [osd.0,mon.cephadm1] have slow ops.

  services:
    mon: 3 daemons, quorum cephadm1,cephadm2,cephadm3 (age 4m)
    mgr: cephadm2.lifbko(active, since 4m), standbys: cephadm1.ppexdf
    osd: 3 osds: 3 up (since 4m), 3 in (since 6m)

  data:
    pools:   1 pools, 128 pgs
    objects: 0 objects, 133 KiB
    usage:   21 MiB used, 931 GiB / 932 GiB avail
    pgs:     64.844% pgs unknown
             35.156% pgs not active
             83 unknown
             45 peering

Turning off the autoscaler didn't help.

[ceph: root@cephadm1 /]# ceph health detail
HEALTH_WARN Reduced data availability: 128 pgs inactive, 45 pgs peering; 6 slow ops, oldest one blocked for 128 sec, daemons [osd.0,mon.cephadm1] have slow ops.
[WRN] PG_AVAILABILITY: Reduced data availability: 128 pgs inactive, 45 pgs peering
    pg 1.23 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.4e is stuck inactive for 2m, current state unknown, last acting []
    pg 1.4f is stuck inactive for 2m, current state unknown, last acting []
    pg 1.50 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.51 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.52 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.53 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.54 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.55 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.56 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.57 is stuck peering for 11m, current state peering, last acting [0,1,2]
    pg 1.58 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.59 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.5a is stuck inactive for 2m, current state unknown, last acting []
    pg 1.5b is stuck inactive for 2m, current state unknown, last acting []
    pg 1.5c is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.5d is stuck inactive for 2m, current state unknown, last acting []
    pg 1.5e is stuck inactive for 2m, current state unknown, last acting []
    pg 1.5f is stuck inactive for 2m, current state unknown, last acting []
    pg 1.60 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.61 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.62 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.63 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.64 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.65 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.66 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.67 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.68 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.69 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.6a is stuck inactive for 2m, current state unknown, last acting []
    pg 1.6b is stuck peering for 11m, current state peering, last acting [0,1,2]
    pg 1.6c is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.6d is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.6e is stuck inactive for 2m, current state unknown, last acting []
    pg 1.6f is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.70 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.71 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.72 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.73 is stuck peering for 11m, current state peering, last acting [0,1,2]
    pg 1.74 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.75 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.76 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.77 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.78 is stuck inactive for 2m, current state unknown, last acting []
    pg 1.79 is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.7a is stuck inactive for 2m, current state unknown, last acting []
    pg 1.7b is stuck inactive for 2m, current state unknown, last acting []
    pg 1.7c is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.7d is stuck peering for 11m, current state peering, last acting [0,2,1]
    pg 1.7e is stuck inactive for 2m, current state unknown, last acting []
    pg 1.7f is stuck inactive for 2m, current state unknown, last acting []
[WRN] SLOW_OPS: 6 slow ops, oldest one blocked for 128 sec, daemons [osd.0,mon.cephadm1] have slow ops.

[ceph: root@cephadm1 /]# ceph osd dump
epoch 40
fsid 28c5bdca-0659-11ec-83f7-52540031ba78
created 2021-08-26T10:34:19.286122+0000
modified 2021-08-26T10:48:36.401141+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 9
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release quincy
stretch_mode_enabled false
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 33 lfor 0/0/21 flags hashpspool stripe_width 0 pg_num_min 1 application mgr
max_osd 3
osd.0 up   in  weight 1 up_from 36 up_thru 36 down_at 35 last_clean_interval [28,33) [v2:192.168.1.81:6800/3122039392,v1:192.168.1.81:6801/3122039392] [v2:192.168.1.81:6802/3122039392,v1:192.168.1.81:6803/3122039392] exists,up e3133cb7-5d9b-413b-ab65-93fedc376f60
osd.1 up   in  weight 1 up_from 11 up_thru 23 down_at 0 last_clean_interval [0,0) [v2:192.168.1.82:6800/1519779875,v1:192.168.1.82:6801/1519779875] [v2:192.168.1.82:6802/1519779875,v1:192.168.1.82:6803/1519779875] exists,up bd17b066-797d-4206-839b-319411d40773
osd.2 up   in  weight 1 up_from 16 up_thru 23 down_at 0 last_clean_interval [0,0) [v2:192.168.1.83:6800/848375691,v1:192.168.1.83:6801/848375691] [v2:192.168.1.83:6802/848375691,v1:192.168.1.83:6803/848375691] exists,up e08144ea-b6e1-42d8-9387-b03bae8ba3db
blocklist 192.168.1.81:0/844329453 expires 2021-08-27T10:48:36.401112+0000
blocklist 192.168.1.81:0/4250703294 expires 2021-08-27T10:48:36.401112+0000
blocklist 192.168.1.81:0/547920859 expires 2021-08-27T10:48:26.379833+0000
blocklist 192.168.1.81:6809/2584104857 expires 2021-08-27T10:48:26.379833+0000
blocklist 192.168.1.81:6808/2584104857 expires 2021-08-27T10:48:26.379833+0000
blocklist 192.168.1.81:0/1017451114 expires 2021-08-27T10:48:36.401112+0000
blocklist 192.168.1.81:0/798518791 expires 2021-08-27T10:48:26.379833+0000
blocklist 192.168.1.82:0/528335870 expires 2021-08-27T10:48:21.378008+0000
blocklist 192.168.1.82:6809/3270807017 expires 2021-08-27T10:48:21.378008+0000
blocklist 192.168.1.82:6808/3270807017 expires 2021-08-27T10:48:21.378008+0000
blocklist 192.168.1.82:0/3073917861 expires 2021-08-27T10:48:21.378008+0000
blocklist 192.168.1.81:0/3104129647 expires 2021-08-27T10:34:38.280473+0000
blocklist 192.168.1.81:6800/943978261 expires 2021-08-27T10:34:38.280473+0000
blocklist 192.168.1.81:6808/367342793 expires 2021-08-27T10:40:26.570177+0000
blocklist 192.168.1.82:0/1822256338 expires 2021-08-27T10:48:31.393294+0000
blocklist 192.168.1.81:0/625022588 expires 2021-08-27T10:40:11.557816+0000
blocklist 192.168.1.81:0/239295845 expires 2021-08-27T10:40:16.560014+0000
blocklist 192.168.1.81:0/150736215 expires 2021-08-27T10:34:58.320328+0000
blocklist 192.168.1.82:0/4161546944 expires 2021-08-27T10:48:31.393294+0000
blocklist 192.168.1.82:0/3380815887 expires 2021-08-27T10:40:21.565594+0000
blocklist 192.168.1.81:0/710847600 expires 2021-08-27T10:40:11.557816+0000
blocklist 192.168.1.81:0/2917746 expires 2021-08-27T10:40:16.560014+0000
blocklist 192.168.1.82:6809/857605359 expires 2021-08-27T10:48:31.393294+0000
blocklist 192.168.1.81:0/2841767184 expires 2021-08-27T10:40:16.560014+0000
blocklist 192.168.1.81:0/1953531468 expires 2021-08-27T10:40:11.557816+0000
blocklist 192.168.1.81:6809/631663844 expires 2021-08-27T10:40:16.560014+0000
blocklist 192.168.1.82:0/2491331392 expires 2021-08-27T10:40:21.565594+0000
blocklist 192.168.1.81:0/2743754457 expires 2021-08-27T10:35:40.550845+0000
blocklist 192.168.1.81:0/3758899206 expires 2021-08-27T10:34:38.280473+0000
blocklist 192.168.1.81:6801/943978261 expires 2021-08-27T10:34:38.280473+0000
blocklist 192.168.1.81:6809/718548886 expires 2021-08-27T10:48:36.401112+0000
blocklist 192.168.1.82:0/3529153627 expires 2021-08-27T10:48:31.393294+0000
blocklist 192.168.1.81:0/559997549 expires 2021-08-27T10:40:26.570177+0000
blocklist 192.168.1.82:6809/110830539 expires 2021-08-27T10:40:21.565594+0000
blocklist 192.168.1.81:6801/3992800725 expires 2021-08-27T10:40:11.557816+0000
blocklist 192.168.1.81:6808/718548886 expires 2021-08-27T10:48:36.401112+0000
blocklist 192.168.1.82:0/2740955439 expires 2021-08-27T10:40:21.565594+0000
blocklist 192.168.1.81:6800/2245843916 expires 2021-08-27T10:35:40.550845+0000
blocklist 192.168.1.81:0/1041901179 expires 2021-08-27T10:40:26.570177+0000
blocklist 192.168.1.81:0/3413551308 expires 2021-08-27T10:35:40.550845+0000
blocklist 192.168.1.81:6800/3989691267 expires 2021-08-27T10:34:58.320328+0000
blocklist 192.168.1.81:0/3340321852 expires 2021-08-27T10:48:26.379833+0000
blocklist 192.168.1.81:0/3147562416 expires 2021-08-27T10:34:38.280473+0000
blocklist 192.168.1.81:0/1494757314 expires 2021-08-27T10:40:26.570177+0000
blocklist 192.168.1.82:0/1642338075 expires 2021-08-27T10:48:21.378008+0000
blocklist 192.168.1.81:6809/367342793 expires 2021-08-27T10:40:26.570177+0000
blocklist 192.168.1.81:0/1655568960 expires 2021-08-27T10:34:58.320328+0000
blocklist 192.168.1.81:6808/631663844 expires 2021-08-27T10:40:16.560014+0000
blocklist 192.168.1.82:6808/110830539 expires 2021-08-27T10:40:21.565594+0000
blocklist 192.168.1.81:0/2655941731 expires 2021-08-27T10:34:58.320328+0000
blocklist 192.168.1.81:0/1901477329 expires 2021-08-27T10:35:40.550845+0000
blocklist 192.168.1.82:6808/857605359 expires 2021-08-27T10:48:31.393294+0000
blocklist 192.168.1.81:0/1330855016 expires 2021-08-27T10:40:11.557816+0000
blocklist 192.168.1.81:6801/2245843916 expires 2021-08-27T10:35:40.550845+0000
blocklist 192.168.1.81:6801/3989691267 expires 2021-08-27T10:34:58.320328+0000
blocklist 192.168.1.81:6800/3992800725 expires 2021-08-27T10:40:11.557816+0000

[ceph: root@cephadm1 /]# ceph pg 1.7c query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "peering",
    "epoch": 40,
    "up": [
        0,
        2,
        1
    ],
    "acting": [
        0,
        2,
        1
    ],
    "info": {
        "pgid": "1.7c",
        "last_update": "19'76",
        "last_complete": "19'76",
        "log_tail": "0'0",
        "last_user_version": 76,
        "last_backfill": "MAX",
        "purged_snaps": [],
        "history": {
            "epoch_created": 21,
            "epoch_pool_created": 17,
            "last_epoch_started": 24,
            "last_interval_started": 23,
            "last_epoch_clean": 24,
            "last_interval_clean": 23,
            "last_epoch_split": 21,
            "last_epoch_marked_full": 0,
            "same_up_since": 36,
            "same_interval_since": 36,
            "same_primary_since": 36,
            "last_scrub": "0'0",
            "last_scrub_stamp": "2021-08-26T10:38:27.178940+0000",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2021-08-26T10:38:27.178940+0000",
            "last_clean_scrub_stamp": "2021-08-26T10:38:27.178940+0000",
            "prior_readable_until_ub": 0
        },
        "stats": {
            "version": "19'76",
            "reported_seq": 134,
            "reported_epoch": 40,
            "state": "peering",
            "last_fresh": "2021-08-26T10:48:38.352679+0000",
            "last_change": "2021-08-26T10:48:21.440289+0000",
            "last_active": "2021-08-26T10:40:11.379982+0000",
            "last_peered": "2021-08-26T10:38:50.067124+0000",
            "last_clean": "2021-08-26T10:38:50.067124+0000",
            "last_became_active": "2021-08-26T10:38:50.063372+0000",
            "last_became_peered": "2021-08-26T10:38:50.063372+0000",
            "last_unstale": "2021-08-26T10:48:38.352679+0000",
            "last_undegraded": "2021-08-26T10:48:38.352679+0000",
            "last_fullsized": "2021-08-26T10:48:38.352679+0000",
            "mapping_epoch": 36,
            "log_start": "0'0",
            "ondisk_log_start": "0'0",
            "created": 21,
            "last_epoch_clean": 24,
            "parent": "0.0",
            "parent_split_bits": 7,
            "last_scrub": "0'0",
            "last_scrub_stamp": "2021-08-26T10:38:27.178940+0000",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2021-08-26T10:38:27.178940+0000",
            "last_clean_scrub_stamp": "2021-08-26T10:38:27.178940+0000",
            "log_size": 76,
            "ondisk_log_size": 76,
            "stats_invalid": true,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": false,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 3588,
                "num_objects": 0,
                "num_object_clones": 0,
                "num_object_copies": 0,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 0,
                "num_whiteouts": 0,
                "num_read": 0,
                "num_read_kb": 0,
                "num_write": 0,
                "num_write_kb": 10,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 0,
                "num_bytes_recovered": 0,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0,
                "num_omap_bytes": 0,
                "num_omap_keys": 0,
                "num_objects_repaired": 0
            },
            "up": [
                0,
                2,
                1
            ],
            "acting": [
                0,
                2,
                1
            ],
            "avail_no_missing": [],
            "object_location_counts": [],
            "blocked_by": [
                1,
                2
            ],
            "up_primary": 0,
            "acting_primary": 0,
            "purged_snaps": []
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 24,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/GetInfo",
            "enter_time": "2021-08-26T10:48:21.440273+0000",
            "requested_info_from": [
                {
                    "osd": "1" 
                },
                {
                    "osd": "2" 
                }
            ]
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2021-08-26T10:48:21.440268+0000",
            "past_intervals": [
                {
                    "first": "23",
                    "last": "35",
                    "all_participants": [
                        {
                            "osd": 0
                        },
                        {
                            "osd": 1
                        },
                        {
                            "osd": 2
                        }
                    ],
                    "intervals": [
                        {
                            "first": "28",
                            "last": "34",
                            "acting": "0,1,2" 
                        }
                    ]
                }
            ],
            "probing_osds": [
                "0",
                "1",
                "2" 
            ],
            "down_osds_we_would_probe": [],
            "peering_blocked_by": []
        },
        {
            "name": "Started",
            "enter_time": "2021-08-26T10:48:21.440067+0000" 
        }
    ],
    "agent_state": {}
}
Actions #10

Updated by Neha Ojha over 2 years ago

Thanks for providing these logs, but they don't have debug_osd=20 (we need it on all the osds). The pg query for 1.7c tells us that the primary OSD is waiting for some info from its peers.

    "recovery_state": [
        {
            "name": "Started/Primary/Peering/GetInfo",
            "enter_time": "2021-08-26T10:48:21.440273+0000",
            "requested_info_from": [
                {
                    "osd": "1" 
                },
                {
                    "osd": "2" 
                }
            ]
        },

What is interesting from osd.0's log, which has debug_ms=1, is that osd.1 and osd.2 are not responding to its heartbeat checks.

osd.0

Aug 26 06:55:54 cephadm1 ceph-osd[3797]: -- 192.168.1.81:0/3122039392 --> [v2:192.168.1.82:6804/1519779875,v1:192.168.1.82:6805/1519779875] -- osd_ping(ping e40 up_from 36 ping_stamp 2021-08-26T10:55:54.177487+0000/454.206329346s send_stamp 454.206329346s) v5 -- 0x56127d47d980 con 0x56127d146800
Aug 26 06:55:54 cephadm1 ceph-osd[3797]: osd.0 40 heartbeat_check: no reply from 192.168.1.82:6804 osd.1 ever on either front or back, first ping sent 2021-08-26T10:48:24.739504+0000 (oldest deadline 2021-08-26T10:48:44.739504+0000)
Aug 26 06:55:54 cephadm1 conmon[3773]: 2021-08-26T10:55:54.821+0000 7fb4a9f61700 -1 osd.0 40 heartbeat_check: no reply from 192.168.1.82:6804 osd.1 ever on either front or back, first ping sent 2021-08-26T10:48:24.739504+0000 (oldest deadline 2021-08-26T10:48:44.739504+0000)

osd.2

Aug 26 06:55:54 cephadm1 ceph-osd[3797]: -- 192.168.1.81:0/3122039392 --> [v2:192.168.1.83:6804/848375691,v1:192.168.1.83:6805/848375691] -- osd_ping(ping e40 up_from 36 ping_stamp 2021-08-26T10:55:54.177487+0000/454.206329346s send_stamp 454.206329346s) v5 -- 0x56127d47dc80 con 0x561285b56800
Aug 26 06:55:54 cephadm1 ceph-osd[3797]: osd.0 40 heartbeat_check: no reply from 192.168.1.83:6804 osd.2 ever on either front or back, first ping sent 2021-08-26T10:48:24.739504+0000 (oldest deadline 2021-08-26T10:48:44.739504+0000)
Aug 26 06:55:54 cephadm1 conmon[3773]: 2021-08-26T10:55:54.821+0000 7fb4a9f61700 -1 osd.0 40 heartbeat_check: no reply from 192.168.1.83:6804 osd.2 ever on either front or back, first ping sent 2021-08-26T10:48:24.739504+0000 (oldest deadline 2021-08-26T10:48:44.739504+0000)

In fact, there are no messages that osd.0 has received from osd.1 or osd.2, which explains why osd.0 is waiting for info in the pg query output.

grep "<==" osd.0.log |grep "osd\." returns nothing

By the way, why are there so many blocklist entries in the osd dump output? Comparing with the osd logs, it looks like the "expires" time is in the future.
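For reference, the current entries can be listed directly with:

# ceph osd blocklist ls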

Actions #11

Updated by Jeff Layton over 2 years ago

OK, I wasn't clear on whether I needed to run "ceph config set osd debug_osd 20" on all the hosts or just one. I ran it on all of them this time, so hopefully this will have what you need.

As far as the blocklist entries go, I have no idea why there are so many. I rebuilt the cluster from scratch before I did this, so I definitely didn't do any manual blocklisting.

Actions #12

Updated by Jeff Layton over 2 years ago

Other requested info from this rebuild of the cluster:

[ceph: root@cephadm1 /]# ceph health detail
HEALTH_WARN Reduced data availability: 128 pgs inactive, 41 pgs peering; 2 slow ops, oldest one blocked for 183 sec, mon.cephadm1 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 128 pgs inactive, 41 pgs peering
    pg 1.23 is stuck inactive for 14m, current state peering, last acting [2,0,1]
    pg 1.4e is stuck inactive for 13m, current state unknown, last acting []
    pg 1.4f is stuck peering for 46m, current state peering, last acting [2,0,1]
    pg 1.50 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.51 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.52 is stuck peering for 46m, current state peering, last acting [2,0,1]
    pg 1.53 is stuck peering for 45m, current state peering, last acting [2,1,0]
    pg 1.54 is stuck peering for 45m, current state peering, last acting [2,0,1]
    pg 1.55 is stuck peering for 45m, current state peering, last acting [2,1,0]
    pg 1.56 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.57 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.58 is stuck peering for 46m, current state peering, last acting [2,0,1]
    pg 1.59 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.5a is stuck inactive for 13m, current state unknown, last acting []
    pg 1.5b is stuck peering for 45m, current state peering, last acting [2,0,1]
    pg 1.5c is stuck inactive for 13m, current state unknown, last acting []
    pg 1.5d is stuck peering for 45m, current state peering, last acting [2,0,1]
    pg 1.5e is stuck inactive for 13m, current state unknown, last acting []
    pg 1.5f is stuck peering for 45m, current state peering, last acting [2,0,1]
    pg 1.60 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.61 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.62 is stuck peering for 46m, current state peering, last acting [2,0,1]
    pg 1.63 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.64 is stuck peering for 46m, current state peering, last acting [2,1,0]
    pg 1.65 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.66 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.67 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.68 is stuck peering for 46m, current state peering, last acting [2,0,1]
    pg 1.69 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.6a is stuck inactive for 13m, current state unknown, last acting []
    pg 1.6b is stuck inactive for 13m, current state unknown, last acting []
    pg 1.6c is stuck inactive for 13m, current state unknown, last acting []
    pg 1.6d is stuck inactive for 13m, current state unknown, last acting []
    pg 1.6e is stuck peering for 46m, current state peering, last acting [2,1,0]
    pg 1.6f is stuck inactive for 13m, current state unknown, last acting []
    pg 1.70 is stuck peering for 46m, current state peering, last acting [2,1,0]
    pg 1.71 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.72 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.73 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.74 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.75 is stuck peering for 45m, current state peering, last acting [2,0,1]
    pg 1.76 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.77 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.78 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.79 is stuck inactive for 13m, current state unknown, last acting []
    pg 1.7a is stuck inactive for 13m, current state unknown, last acting []
    pg 1.7b is stuck inactive for 13m, current state unknown, last acting []
    pg 1.7c is stuck inactive for 13m, current state unknown, last acting []
    pg 1.7d is stuck inactive for 13m, current state unknown, last acting []
    pg 1.7e is stuck inactive for 13m, current state unknown, last acting []
    pg 1.7f is stuck peering for 45m, current state peering, last acting [2,0,1]
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 183 sec, mon.cephadm1 has slow ops

[ceph: root@cephadm1 /]# ceph osd dump
epoch 35
fsid 1d11c63a-09ac-11ec-83e1-52540031ba78
created 2021-08-30T16:06:17.924486+0000
modified 2021-08-30T16:44:02.503884+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 9
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release quincy
stretch_mode_enabled false
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 18 lfor 0/0/16 flags hashpspool stripe_width 0 pg_num_min 1 application mgr
max_osd 3
osd.0 up   in  weight 1 up_from 11 up_thru 18 down_at 0 last_clean_interval [0,0) [v2:192.168.1.82:6800/1472739640,v1:192.168.1.82:6801/1472739640] [v2:192.168.1.82:6802/1472739640,v1:192.168.1.82:6803/1472739640] exists,up 730e413e-49ab-458a-83b0-34c24cb66938
osd.1 up   in  weight 1 up_from 11 up_thru 18 down_at 0 last_clean_interval [0,0) [v2:192.168.1.83:6800/1816343666,v1:192.168.1.83:6801/1816343666] [v2:192.168.1.83:6802/1816343666,v1:192.168.1.83:6803/1816343666] exists,up 548b53fd-b4b3-4e61-86ff-9f82e55ca6e4
osd.2 up   in  weight 1 up_from 31 up_thru 31 down_at 30 last_clean_interval [10,19) [v2:192.168.1.81:6800/72088798,v1:192.168.1.81:6801/72088798] [v2:192.168.1.81:6802/72088798,v1:192.168.1.81:6803/72088798] exists,up 6afffa04-4f42-4cdb-861c-bebb181b76c3
blocklist 192.168.1.81:6809/337892551 expires 2021-08-31T16:44:02.503859+0000
blocklist 192.168.1.81:6808/337892551 expires 2021-08-31T16:44:02.503859+0000
blocklist 192.168.1.81:0/8221569 expires 2021-08-31T16:44:02.503859+0000
blocklist 192.168.1.82:0/1586070014 expires 2021-08-31T16:43:57.498724+0000
blocklist 192.168.1.82:0/4207781163 expires 2021-08-31T16:43:57.498724+0000
blocklist 192.168.1.82:6809/2189466650 expires 2021-08-31T16:43:57.498724+0000
blocklist 192.168.1.82:6808/2189466650 expires 2021-08-31T16:43:57.498724+0000
blocklist 192.168.1.81:0/3897135071 expires 2021-08-31T16:43:52.492951+0000
blocklist 192.168.1.81:0/2750379948 expires 2021-08-31T16:44:02.503859+0000
blocklist 192.168.1.81:6808/1432673690 expires 2021-08-31T16:43:52.492951+0000
blocklist 192.168.1.81:0/1443542911 expires 2021-08-31T16:43:47.491241+0000
blocklist 192.168.1.81:6808/1942461460 expires 2021-08-31T16:43:47.491241+0000
blocklist 192.168.1.81:0/2028798737 expires 2021-08-31T16:43:47.491241+0000
blocklist 192.168.1.82:0/2293395088 expires 2021-08-31T16:43:57.498724+0000
blocklist 192.168.1.81:0/1819881066 expires 2021-08-31T16:15:17.357643+0000
blocklist 192.168.1.81:0/2449999456 expires 2021-08-31T16:15:17.357643+0000
blocklist 192.168.1.81:0/508510127 expires 2021-08-31T16:44:02.503859+0000
blocklist 192.168.1.81:6801/1863706383 expires 2021-08-31T16:06:56.896151+0000
blocklist 192.168.1.82:0/425587864 expires 2021-08-31T16:15:22.363478+0000
blocklist 192.168.1.81:0/1794485041 expires 2021-08-31T16:15:07.349315+0000
blocklist 192.168.1.81:0/2443605730 expires 2021-08-31T16:43:52.492951+0000
blocklist 192.168.1.81:6809/114631411 expires 2021-08-31T16:15:17.357643+0000
blocklist 192.168.1.81:6801/1884939811 expires 2021-08-31T16:07:41.554995+0000
blocklist 192.168.1.81:0/1214794039 expires 2021-08-31T16:43:47.491241+0000
blocklist 192.168.1.82:0/3926102788 expires 2021-08-31T16:15:22.363478+0000
blocklist 192.168.1.81:6800/1884939811 expires 2021-08-31T16:07:41.554995+0000
blocklist 192.168.1.81:0/2362077335 expires 2021-08-31T16:15:07.349315+0000
blocklist 192.168.1.81:6800/522015385 expires 2021-08-31T16:06:36.907232+0000
blocklist 192.168.1.82:0/3183819242 expires 2021-08-31T16:15:12.351056+0000
blocklist 192.168.1.81:0/4252510019 expires 2021-08-31T16:07:41.554995+0000
blocklist 192.168.1.81:0/1414647395 expires 2021-08-31T16:15:07.349315+0000
blocklist 192.168.1.82:0/3281747658 expires 2021-08-31T16:15:12.351056+0000
blocklist 192.168.1.81:6801/2142336589 expires 2021-08-31T16:15:07.349315+0000
blocklist 192.168.1.81:0/330713352 expires 2021-08-31T16:07:41.554995+0000
blocklist 192.168.1.81:0/1175724933 expires 2021-08-31T16:15:17.357643+0000
blocklist 192.168.1.82:0/1977810235 expires 2021-08-31T16:15:12.351056+0000
blocklist 192.168.1.81:0/2239717237 expires 2021-08-31T16:06:36.907232+0000
blocklist 192.168.1.81:0/2188107794 expires 2021-08-31T16:06:56.896151+0000
blocklist 192.168.1.82:6808/3525423328 expires 2021-08-31T16:15:22.363478+0000
blocklist 192.168.1.81:6809/1432673690 expires 2021-08-31T16:43:52.492951+0000
blocklist 192.168.1.81:0/3740038005 expires 2021-08-31T16:15:07.349315+0000
blocklist 192.168.1.81:6809/1942461460 expires 2021-08-31T16:43:47.491241+0000
blocklist 192.168.1.82:6808/1356675448 expires 2021-08-31T16:15:12.351056+0000
blocklist 192.168.1.82:6809/1356675448 expires 2021-08-31T16:15:12.351056+0000
blocklist 192.168.1.81:6808/114631411 expires 2021-08-31T16:15:17.357643+0000
blocklist 192.168.1.81:0/1640337537 expires 2021-08-31T16:06:56.896151+0000
blocklist 192.168.1.81:0/2740342954 expires 2021-08-31T16:07:41.554995+0000
blocklist 192.168.1.81:6800/1863706383 expires 2021-08-31T16:06:56.896151+0000
blocklist 192.168.1.81:6801/522015385 expires 2021-08-31T16:06:36.907232+0000
blocklist 192.168.1.81:0/1585336734 expires 2021-08-31T16:06:56.896151+0000
blocklist 192.168.1.81:6800/2142336589 expires 2021-08-31T16:15:07.349315+0000
blocklist 192.168.1.81:0/3544481835 expires 2021-08-31T16:06:36.907232+0000
blocklist 192.168.1.81:0/896854713 expires 2021-08-31T16:06:36.907232+0000
blocklist 192.168.1.81:0/3017539052 expires 2021-08-31T16:43:52.492951+0000
blocklist 192.168.1.82:6809/3525423328 expires 2021-08-31T16:15:22.363478+0000
blocklist 192.168.1.82:0/3001280241 expires 2021-08-31T16:15:22.363478+0000
[ceph: root@cephadm1 /]# 

[ceph: root@cephadm1 /]# ceph pg 1.23 query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "peering",
    "epoch": 35,
    "up": [
        2,
        0,
        1
    ],
    "acting": [
        2,
        0,
        1
    ],
    "info": {
        "pgid": "1.23",
        "last_update": "13'76",
        "last_complete": "13'76",
        "log_tail": "0'0",
        "last_user_version": 76,
        "last_backfill": "MAX",
        "purged_snaps": [],
        "history": {
            "epoch_created": 16,
            "epoch_pool_created": 12,
            "last_epoch_started": 19,
            "last_interval_started": 18,
            "last_epoch_clean": 19,
            "last_interval_clean": 18,
            "last_epoch_split": 16,
            "last_epoch_marked_full": 0,
            "same_up_since": 31,
            "same_interval_since": 31,
            "same_primary_since": 31,
            "last_scrub": "13'76",
            "last_scrub_stamp": "2021-08-30T16:11:30.219856+0000",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2021-08-30T16:10:16.575078+0000",
            "last_clean_scrub_stamp": "2021-08-30T16:11:30.219856+0000",
            "prior_readable_until_ub": 0
        },
        "stats": {
            "version": "13'76",
            "reported_seq": 133,
            "reported_epoch": 35,
            "state": "peering",
            "last_fresh": "2021-08-30T16:44:02.531701+0000",
            "last_change": "2021-08-30T16:43:50.002719+0000",
            "last_active": "2021-08-30T16:43:48.592258+0000",
            "last_peered": "2021-08-30T16:11:30.219856+0000",
            "last_clean": "2021-08-30T16:11:30.219856+0000",
            "last_became_active": "2021-08-30T16:10:49.108728+0000",
            "last_became_peered": "2021-08-30T16:10:49.108728+0000",
            "last_unstale": "2021-08-30T16:44:02.531701+0000",
            "last_undegraded": "2021-08-30T16:44:02.531701+0000",
            "last_fullsized": "2021-08-30T16:44:02.531701+0000",
            "mapping_epoch": 31,
            "log_start": "0'0",
            "ondisk_log_start": "0'0",
            "created": 16,
            "last_epoch_clean": 19,
            "parent": "0.0",
            "parent_split_bits": 7,
            "last_scrub": "13'76",
            "last_scrub_stamp": "2021-08-30T16:11:30.219856+0000",
            "last_deep_scrub": "0'0",
            "last_deep_scrub_stamp": "2021-08-30T16:10:16.575078+0000",
            "last_clean_scrub_stamp": "2021-08-30T16:11:30.219856+0000",
            "log_size": 76,
            "ondisk_log_size": 76,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": false,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 0,
                "num_objects": 0,
                "num_object_clones": 0,
                "num_object_copies": 0,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 0,
                "num_whiteouts": 0,
                "num_read": 0,
                "num_read_kb": 0,
                "num_write": 0,
                "num_write_kb": 0,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 0,
                "num_bytes_recovered": 0,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0,
                "num_omap_bytes": 0,
                "num_omap_keys": 0,
                "num_objects_repaired": 0
            },
            "up": [
                2,
                0,
                1
            ],
            "acting": [
                2,
                0,
                1
            ],
            "avail_no_missing": [],
            "object_location_counts": [],
            "blocked_by": [
                0,
                1
            ],
            "up_primary": 2,
            "acting_primary": 2,
            "purged_snaps": []
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 19,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/GetInfo",
            "enter_time": "2021-08-30T16:43:50.002687+0000",
            "requested_info_from": [
                {
                    "osd": "0" 
                },
                {
                    "osd": "1" 
                }
            ]
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2021-08-30T16:43:50.002684+0000",
            "past_intervals": [
                {
                    "first": "18",
                    "last": "30",
                    "all_participants": [
                        {
                            "osd": 0
                        },
                        {
                            "osd": 1
                        },
                        {
                            "osd": 2
                        }
                    ],
                    "intervals": [
                        {
                            "first": "18",
                            "last": "29",
                            "acting": "0,1,2" 
                        }
                    ]
                }
            ],
            "probing_osds": [
                "0",
                "1",
                "2" 
            ],
            "down_osds_we_would_probe": [],
            "peering_blocked_by": []
        },
        {
            "name": "Started",
            "enter_time": "2021-08-30T16:43:50.002661+0000" 
        }
    ],
    "agent_state": {}
}
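The relevant bits of that query are "state": "peering", "blocked_by": [0, 1], and the GetInfo recovery state showing info requested from osd.0 and osd.1 but apparently never received. A quick filter for just those fields (assuming jq is available in the shell):

# Pull only the fields relevant to a stuck-peering diagnosis (jq assumed present)
ceph pg 1.23 query | jq '{state: .state,
                          blocked_by: .info.stats.blocked_by,
                          recovery_state: .recovery_state[0]}'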
Actions #13

Updated by Neha Ojha over 2 years ago

Jeff Layton wrote:

Ok. I wasn't clear on whether I needed to run "ceph config set debug_osd 20" on all the hosts or just one. I ran it on all of them this time, so hopefully this will have what you need.

As for the blocklist entries, I have no idea why there are so many. I rebuilt the cluster from scratch before I did this, so I definitely didn't do any manual blocklisting.

Jeff, this definitely looks like the osds are having trouble communicating with each other, which points to a networking issue. You said pacific works fine for you; do you see these blocklist entries in pacific?
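If it is networking, a basic reachability probe from each host toward the peer's heartbeat address might confirm it. A sketch, with the address and port taken from the heartbeat_check "no reply" message earlier in the thread (the other daemon ports are in the ceph osd dump output):

# From the host running osd.0, probe the address osd.2's heartbeats were sent to
nc -vz 192.168.1.83 6804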

Actions #14

Updated by Jeff Layton over 2 years ago

Odd. The hosts in question are all KVM nodes on the same physical host, so I wouldn't expect networking issues.

I rebuilt my cluster with pacific and when I reboot the box running all of the VMs, the cluster quickly returns to HEALTH_OK. I do see some blocklist entries just after install:

[ceph: root@cephadm1 /]# ceph --version
ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
[ceph: root@cephadm1 /]# ceph osd dump
epoch 13
fsid 73040e18-10be-11ec-ab41-52540031ba78
created 2021-09-08T16:04:55.640673+0000
modified 2021-09-08T16:10:03.284524+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 4
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release pacific
stretch_mode_enabled false
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 13 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
max_osd 3
osd.0 up   in  weight 1 up_from 9 up_thru 11 down_at 0 last_clean_interval [0,0) [v2:192.168.1.82:6800/3010177339,v1:192.168.1.82:6801/3010177339] [v2:192.168.1.82:6802/3010177339,v1:192.168.1.82:6803/3010177339] exists,up 174ae027-03c6-4275-a0f3-a85ea95be8fc
osd.1 up   in  weight 1 up_from 9 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:192.168.1.81:6802/992287293,v1:192.168.1.81:6803/992287293] [v2:192.168.1.81:6804/992287293,v1:192.168.1.81:6805/992287293] exists,up bd163834-e4b3-49b3-92b6-2abf883a74c3
osd.2 up   in  weight 1 up_from 10 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:192.168.1.83:6800/2300744829,v1:192.168.1.83:6801/2300744829] [v2:192.168.1.83:6802/2300744829,v1:192.168.1.83:6803/2300744829] exists,up ec8da6ff-27be-4f3c-a626-f32d34ec5197
blocklist 192.168.1.81:0/1872391922 expires 2021-09-09T16:05:33.954869+0000
blocklist 192.168.1.81:0/686890172 expires 2021-09-09T16:05:33.954869+0000
blocklist 192.168.1.81:6800/2656346969 expires 2021-09-09T16:05:33.954869+0000
blocklist 192.168.1.81:0/1927526433 expires 2021-09-09T16:05:14.930694+0000
blocklist 192.168.1.81:0/1190485869 expires 2021-09-09T16:05:14.930694+0000
blocklist 192.168.1.81:6800/2777673140 expires 2021-09-09T16:05:14.930694+0000
blocklist 192.168.1.81:6801/2656346969 expires 2021-09-09T16:05:33.954869+0000
blocklist 192.168.1.81:6801/2777673140 expires 2021-09-09T16:05:14.930694+0000

The cluster is HEALTH_OK at this point. Could these be left over from one-time setup tasks that cephadm does? Maybe it's not tearing down some OSD sessions correctly?

After rebooting, I see some more:

[ceph: root@cephadm1 /]# ceph osd dump
epoch 24
fsid 73040e18-10be-11ec-ab41-52540031ba78
created 2021-09-08T16:04:55.640673+0000
modified 2021-09-08T16:13:50.132577+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 6
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release pacific
stretch_mode_enabled false
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 24 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
max_osd 3
osd.0 up   in  weight 1 up_from 17 up_thru 17 down_at 16 last_clean_interval [9,13) [v2:192.168.1.82:6800/2298926360,v1:192.168.1.82:6801/2298926360] [v2:192.168.1.82:6802/2298926360,v1:192.168.1.82:6803/2298926360] exists,up 174ae027-03c6-4275-a0f3-a85ea95be8fc
osd.1 up   in  weight 1 up_from 17 up_thru 0 down_at 16 last_clean_interval [9,13) [v2:192.168.1.81:6800/2967190577,v1:192.168.1.81:6801/2967190577] [v2:192.168.1.81:6802/2967190577,v1:192.168.1.81:6803/2967190577] exists,up bd163834-e4b3-49b3-92b6-2abf883a74c3
osd.2 up   in  weight 1 up_from 16 up_thru 16 down_at 15 last_clean_interval [10,13) [v2:192.168.1.83:6800/4000736394,v1:192.168.1.83:6801/4000736394] [v2:192.168.1.83:6802/4000736394,v1:192.168.1.83:6803/4000736394] exists,up ec8da6ff-27be-4f3c-a626-f32d34ec5197
blocklist 192.168.1.81:0/2558941951 expires 2021-09-09T16:13:49.118715+0000
blocklist 192.168.1.81:0/4121153711 expires 2021-09-09T16:13:49.118715+0000
blocklist 192.168.1.81:6808/594445995 expires 2021-09-09T16:13:49.118715+0000
blocklist 192.168.1.82:0/1792548082 expires 2021-09-09T16:13:44.113696+0000
blocklist 192.168.1.81:6809/594445995 expires 2021-09-09T16:13:49.118715+0000
blocklist 192.168.1.82:6809/2484823990 expires 2021-09-09T16:13:44.113696+0000
blocklist 192.168.1.82:6808/2484823990 expires 2021-09-09T16:13:44.113696+0000
blocklist 192.168.1.82:0/172116376 expires 2021-09-09T16:13:44.113696+0000
blocklist 192.168.1.81:6800/4169171074 expires 2021-09-09T16:13:34.100082+0000
blocklist 192.168.1.81:6801/2777673140 expires 2021-09-09T16:05:14.930694+0000
blocklist 192.168.1.81:6809/212921974 expires 2021-09-09T16:13:39.101765+0000
blocklist 192.168.1.81:0/1927526433 expires 2021-09-09T16:05:14.930694+0000
blocklist 192.168.1.81:6800/2656346969 expires 2021-09-09T16:05:33.954869+0000
blocklist 192.168.1.81:0/3002483477 expires 2021-09-09T16:13:34.100082+0000
blocklist 192.168.1.81:0/686890172 expires 2021-09-09T16:05:33.954869+0000
blocklist 192.168.1.81:0/1190485869 expires 2021-09-09T16:05:14.930694+0000
blocklist 192.168.1.81:0/179561595 expires 2021-09-09T16:13:39.101765+0000
blocklist 192.168.1.81:0/1872391922 expires 2021-09-09T16:05:33.954869+0000
blocklist 192.168.1.81:6801/2656346969 expires 2021-09-09T16:05:33.954869+0000
blocklist 192.168.1.81:6801/4169171074 expires 2021-09-09T16:13:34.100082+0000
blocklist 192.168.1.81:6800/2777673140 expires 2021-09-09T16:05:14.930694+0000
blocklist 192.168.1.81:0/3338781454 expires 2021-09-09T16:13:34.100082+0000
blocklist 192.168.1.81:6808/212921974 expires 2021-09-09T16:13:39.101765+0000
blocklist 192.168.1.81:0/3589616606 expires 2021-09-09T16:13:39.101765+0000
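
For anyone chasing these later: the entries can be listed and (carefully) pruned by hand. A sketch, using one of the addresses from the dump above purely as an illustration:

# List current blocklist entries with their expiry times
ceph osd blocklist ls
# Drop a single entry by its entity address (address copied from above, for illustration)
ceph osd blocklist rm 192.168.1.81:6808/212921974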

In any case, this seems like a red herring, since the presence of these blocklist entries doesn't seem to affect Pacific the same way.

Actions #15

Updated by Jeff Layton over 2 years ago

Erm, in fact, right after doing cephadm bootstrap, before rebooting anything:

[ceph: root@cephadm1 /]# ceph osd dump
epoch 3
fsid 4eed21d2-10c2-11ec-bb67-52540031ba78
created 2021-09-08T16:32:12.392309+0000
modified 2021-09-08T16:32:50.176264+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 1
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release pacific
stretch_mode_enabled false
max_osd 0
blocklist 192.168.1.81:0/1034545389 expires 2021-09-09T16:32:50.176242+0000
blocklist 192.168.1.81:6801/3790415778 expires 2021-09-09T16:32:50.176242+0000
blocklist 192.168.1.81:6800/3790415778 expires 2021-09-09T16:32:50.176242+0000
blocklist 192.168.1.81:0/2873830775 expires 2021-09-09T16:32:50.176242+0000
blocklist 192.168.1.81:0/1433153503 expires 2021-09-09T16:32:31.170929+0000
blocklist 192.168.1.81:6801/3911022568 expires 2021-09-09T16:32:31.170929+0000
blocklist 192.168.1.81:6800/3911022568 expires 2021-09-09T16:32:31.170929+0000
blocklist 192.168.1.81:0/3673357812 expires 2021-09-09T16:32:31.170929+0000
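
One thing that stands out: the entries arrive in batches sharing an identical expiry timestamp, which would be consistent with one map update per daemon restart or mgr failover. A sketch for making the batching visible (field positions per the dump format above):

# Group blocklist entries by expiry timestamp; identical expiries were added together
ceph osd dump | awk '/^blocklist/ {print $4}' | sort | uniq -c | sort -rn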
Actions #16

Updated by Jeff Layton over 2 years ago

  • Status changed from New to Can't reproduce

I rebuilt my cluster yesterday using a container image based on commit d906f946e845, and I'm not able to reproduce this now. After rebooting the box running all of the VMs, it settles into HEALTH_OK rather quickly. I do still see a lot of blocklist entries after a reboot, but I'm not convinced they're related to the problem.

I'm not sure what fixed this, but I'll go ahead and close this bug out. If it happens again, I'll reopen it and we can pursue it further.
