Bug #64544

open

Scrub stuck and 'pg has invalid (post-split) stat'

Added by Cedric Lemarchand 2 months ago. Updated 2 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13) with ceph-ansible, we
hit an issue with a cache pool becoming completely stuck. This cluster provides RBD volumes to OpenStack instances; since the upgrade all VMs are unable to access the cluster and client I/O has fallen to 0. All Ceph daemons are on Pacific 16.2.13.

At the time of the failure:

cluster:
id: ea62fb29-d52a-4bef-baaa-1d67113cb5a8
health: HEALTH_WARN
1 pools have too few placement groups
1 pools have too many placement groups
15489 slow ops, oldest one blocked for 2472 sec, daemons [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.122,osd.123]... have slow ops.
services:
mon: 3 daemons, quorum srv12539,srv12540,srv12541 (age 61m)
mgr: srv12539(active, since 62m), standbys: srv12540, srv12541
osd: 72 osds: 72 up (since 37m), 72 in (since 4d)
data:
pools:   8 pools, 2281 pgs
objects: 14.59M objects, 92 TiB
usage:   274 TiB used, 145 TiB / 419 TiB avail
pgs:     2281 active+clean

HEALTH_WARN 1 pools have too few placement groups; 1 pools have too many placement groups; 15498 slow ops, oldest one blocked for 2476 sec, daemons [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.122,osd.123]... have slow ops.
[WRN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups
Pool backups has 8 placement groups, should have 32
[WRN] POOL_TOO_MANY_PGS: 1 pools have too many placement groups
Pool vms_cache has 128 placement groups, should have 32
[WRN] SLOW_OPS: 15498 slow ops, oldest one blocked for 2476 sec, daemons [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.122,osd.123]... have slow ops.

Ceph monitor logs:

2024-02-18T20:55:43.896369+0000 osd.5 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:43.934807+0000 osd.137 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:43.958227+0000 osd.168 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:43.960882+0000 osd.156 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:43.983814+0000 osd.179 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:43.994018+0000 osd.161 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:43.996826+0000 osd.148 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.024271+0000 osd.129 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.049936+0000 osd.153 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.058970+0000 osd.147 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.068820+0000 osd.2 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.102385+0000 osd.181 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.102636+0000 osd.164 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.115634+0000 osd.170 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.117991+0000 osd.152 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.120363+0000 osd.144 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.135708+0000 osd.9 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.163940+0000 osd.154 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.184635+0000 osd.3 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.197423+0000 osd.149 [WRN] 9 slow requests (by type [ 'reached pg' : 9 ] most affected pool [ 'volumes_cache' : 9 ])
2024-02-18T20:55:44.202329+0000 osd.130 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.202752+0000 osd.176 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.210942+0000 osd.184 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.245946+0000 osd.126 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.246265+0000 osd.175 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.254191+0000 osd.165 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.255694+0000 osd.136 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.323538+0000 osd.128 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.326127+0000 osd.143 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.353385+0000 osd.159 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.391411+0000 osd.158 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.414203+0000 osd.118 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.439213+0000 osd.10 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.441135+0000 osd.177 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.445179+0000 osd.119 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.485781+0000 osd.160 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.540852+0000 osd.162 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.563359+0000 osd.1 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.623261+0000 osd.8 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.685824+0000 osd.0 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.722310+0000 osd.4 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.764521+0000 osd.146 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.840433+0000 osd.125 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 254 ])
2024-02-18T20:55:44.853328+0000 osd.187 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.074622+0000 osd.167 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.162836+0000 osd.171 [WRN] 26 slow requests (by type [ 'reached pg' : 26 ] most affected pool [ 'volumes_cache' : 26 ])
2024-02-18T20:55:44.241211+0000 osd.123 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.500381+0000 osd.173 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.506071+0000 osd.122 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.521373+0000 osd.155 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.553376+0000 osd.134 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.661393+0000 osd.113 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.670939+0000 osd.150 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.675777+0000 osd.11 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.701456+0000 osd.186 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.732061+0000 osd.7 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.766065+0000 osd.182 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.783667+0000 osd.112 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.806487+0000 osd.178 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.852037+0000 osd.6 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.864096+0000 osd.142 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.886002+0000 osd.137 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.902095+0000 osd.140 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])
2024-02-18T20:55:44.924068+0000 osd.5 [WRN] 256 slow requests (by type [ 'reached pg' : 256 ] most affected pool [ 'vms_cache' : 256 ])

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
images                  1  1024  35 TiB   4.67M    104 TiB  51.36  33 TiB
volumes                 2  32    41 GiB   10.76k   124 GiB  0.12   33 TiB
vms                     3  1024  57 TiB   9.41M    170 TiB  63.38  33 TiB
images_cache           11  32    5.3 MiB  5.76k    68 MiB   0      33 TiB
vms_cache              12  128   11 MiB   486.29k  161 MiB  0      33 TiB
volumes_cache          13  32    281 KiB  1.54k    18 MiB   0      33 TiB
backups                14  8     0 B      0        0 B      0      33 TiB
device_health_metrics  15  1     0 B      72       0 B      0      33 TiB

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 419.18036 root default
-4 419.18036 datacenter EU
-97 209.59074 rack U72
-73 69.86388 host srv8967
112 nvme 5.82199 osd.112 up 1.00000 1.00000
122 nvme 5.82199 osd.122 up 1.00000 1.00000
128 nvme 5.82199 osd.128 up 1.00000 1.00000
136 nvme 5.82198 osd.136 up 1.00000 1.00000
141 nvme 5.82199 osd.141 up 1.00000 1.00000
147 nvme 5.82199 osd.147 up 1.00000 1.00000
153 nvme 5.82199 osd.153 up 1.00000 1.00000
159 nvme 5.82199 osd.159 up 1.00000 1.00000
165 nvme 5.82199 osd.165 up 1.00000 1.00000
171 nvme 5.82199 osd.171 up 1.00000 1.00000
177 nvme 5.82199 osd.177 up 1.00000 1.00000
184 nvme 5.82199 osd.184 up 1.00000 1.00000
-93 69.86388 host srv8968
113 nvme 5.82199 osd.113 up 1.00000 1.00000
123 nvme 5.82199 osd.123 up 1.00000 1.00000
131 nvme 5.82199 osd.131 up 1.00000 1.00000
138 nvme 5.82199 osd.138 up 1.00000 1.00000
143 nvme 5.82198 osd.143 up 1.00000 1.00000
149 nvme 5.82199 osd.149 up 1.00000 1.00000
154 nvme 5.82199 osd.154 up 1.00000 1.00000
160 nvme 5.82199 osd.160 up 1.00000 1.00000
166 nvme 5.82199 osd.166 up 1.00000 1.00000
172 nvme 5.82199 osd.172 up 1.00000 1.00000
178 nvme 5.82199 osd.178 up 1.00000 1.00000
186 nvme 5.82199 osd.186 up 1.00000 1.00000
-77 69.86299 host srv8969
118 nvme 5.82199 osd.118 up 1.00000 1.00000
126 nvme 5.82199 osd.126 up 1.00000 1.00000
132 nvme 5.82199 osd.132 up 1.00000 1.00000
137 nvme 5.82199 osd.137 up 1.00000 1.00000
144 nvme 5.82199 osd.144 up 1.00000 1.00000
150 nvme 5.82199 osd.150 up 1.00000 1.00000
156 nvme 5.82199 osd.156 up 1.00000 1.00000
161 nvme 5.82199 osd.161 up 1.00000 1.00000
167 nvme 5.82199 osd.167 up 1.00000 1.00000
173 nvme 5.82199 osd.173 up 1.00000 1.00000
179 nvme 5.82199 osd.179 up 1.00000 1.00000
187 nvme 5.82199 osd.187 up 1.00000 1.00000
-32 209.58963 rack U74
-89 69.86385 host srv8965
120 nvme 5.82199 osd.120 up 1.00000 1.00000
125 nvme 5.82199 osd.125 up 1.00000 1.00000
130 nvme 5.82199 osd.130 up 1.00000 1.00000
135 nvme 5.82199 osd.135 up 1.00000 1.00000
142 nvme 5.82199 osd.142 up 1.00000 1.00000
148 nvme 5.82199 osd.148 up 1.00000 1.00000
155 nvme 5.82199 osd.155 up 1.00000 1.00000
162 nvme 5.82198 osd.162 up 1.00000 1.00000
168 nvme 5.82199 osd.168 up 1.00000 1.00000
175 nvme 5.82198 osd.175 up 1.00000 1.00000
181 nvme 5.82199 osd.181 up 1.00000 1.00000
183 nvme 5.82198 osd.183 up 1.00000 1.00000
-81 69.86299 host srv8966
119 nvme 5.82199 osd.119 up 1.00000 1.00000
124 nvme 5.82199 osd.124 up 1.00000 1.00000
129 nvme 5.82199 osd.129 up 1.00000 1.00000
134 nvme 5.82199 osd.134 up 1.00000 1.00000
140 nvme 5.82199 osd.140 up 1.00000 1.00000
146 nvme 5.82199 osd.146 up 1.00000 1.00000
152 nvme 5.82199 osd.152 up 1.00000 1.00000
158 nvme 5.82199 osd.158 up 1.00000 1.00000
164 nvme 5.82199 osd.164 up 1.00000 1.00000
170 nvme 5.82199 osd.170 up 1.00000 1.00000
176 nvme 5.82199 osd.176 up 1.00000 1.00000
182 nvme 5.82199 osd.182 up 1.00000 1.00000
-85 69.86279 host srv8970
0 nvme 5.82190 osd.0 up 1.00000 1.00000
1 nvme 5.82190 osd.1 up 1.00000 1.00000
2 nvme 5.82190 osd.2 up 1.00000 1.00000
3 nvme 5.82190 osd.3 up 1.00000 1.00000
4 nvme 5.82190 osd.4 up 1.00000 1.00000
5 nvme 5.82190 osd.5 up 1.00000 1.00000
6 nvme 5.82190 osd.6 up 1.00000 1.00000
7 nvme 5.82190 osd.7 up 1.00000 1.00000
8 nvme 5.82190 osd.8 up 1.00000 1.00000
9 nvme 5.82190 osd.9 up 1.00000 1.00000
10 nvme 5.82190 osd.10 up 1.00000 1.00000
11 nvme 5.82190 osd.11 up 1.00000 1.00000

[
    {
        "rule_id": 0,
        "rule_name": "nvme_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -72,
                "item_name": "default~nvme"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "hdd_rule",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -30,
                "item_name": "default~hdd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 5,
        "rule_name": "mixed_rule",
        "ruleset": 5,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -72,
                "item_name": "default~nvme"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            },
            {
                "op": "take",
                "item": -30,
                "item_name": "default~hdd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

ID CLASS WEIGHT TYPE NAME
-1 419.18036 root default
-4 419.18036 datacenter EU
-97 209.59074 rack U72
-73 69.86388 host srv8967
112 nvme 5.82199 osd.112
122 nvme 5.82199 osd.122
128 nvme 5.82199 osd.128
136 nvme 5.82198 osd.136
141 nvme 5.82199 osd.141
147 nvme 5.82199 osd.147
153 nvme 5.82199 osd.153
159 nvme 5.82199 osd.159
165 nvme 5.82199 osd.165
171 nvme 5.82199 osd.171
177 nvme 5.82199 osd.177
184 nvme 5.82199 osd.184
-93 69.86388 host srv8968
113 nvme 5.82199 osd.113
123 nvme 5.82199 osd.123
131 nvme 5.82199 osd.131
138 nvme 5.82199 osd.138
143 nvme 5.82198 osd.143
149 nvme 5.82199 osd.149
154 nvme 5.82199 osd.154
160 nvme 5.82199 osd.160
166 nvme 5.82199 osd.166
172 nvme 5.82199 osd.172
178 nvme 5.82199 osd.178
186 nvme 5.82199 osd.186
-77 69.86299 host srv8969
118 nvme 5.82199 osd.118
126 nvme 5.82199 osd.126
132 nvme 5.82199 osd.132
137 nvme 5.82199 osd.137
144 nvme 5.82199 osd.144
150 nvme 5.82199 osd.150
156 nvme 5.82199 osd.156
161 nvme 5.82199 osd.161
167 nvme 5.82199 osd.167
173 nvme 5.82199 osd.173
179 nvme 5.82199 osd.179
187 nvme 5.82199 osd.187
-32 209.58963 rack U74
-89 69.86385 host srv8965
120 nvme 5.82199 osd.120
125 nvme 5.82199 osd.125
130 nvme 5.82199 osd.130
135 nvme 5.82199 osd.135
142 nvme 5.82199 osd.142
148 nvme 5.82199 osd.148
155 nvme 5.82199 osd.155
162 nvme 5.82198 osd.162
168 nvme 5.82199 osd.168
175 nvme 5.82198 osd.175
181 nvme 5.82199 osd.181
183 nvme 5.82198 osd.183
-81 69.86299 host srv8966
119 nvme 5.82199 osd.119
124 nvme 5.82199 osd.124
129 nvme 5.82199 osd.129
134 nvme 5.82199 osd.134
140 nvme 5.82199 osd.140
146 nvme 5.82199 osd.146
152 nvme 5.82199 osd.152
158 nvme 5.82199 osd.158
164 nvme 5.82199 osd.164
170 nvme 5.82199 osd.170
176 nvme 5.82199 osd.176
182 nvme 5.82199 osd.182
-85 69.86279 host srv8970
0 nvme 5.82190 osd.0
1 nvme 5.82190 osd.1
2 nvme 5.82190 osd.2
3 nvme 5.82190 osd.3
4 nvme 5.82190 osd.4
5 nvme 5.82190 osd.5
6 nvme 5.82190 osd.6
7 nvme 5.82190 osd.7
8 nvme 5.82190 osd.8
9 nvme 5.82190 osd.9
10 nvme 5.82190 osd.10
11 nvme 5.82190 osd.11

What was done right after the failure:

ceph osd pool set vms_cache pg_num 32
ceph osd pool set vms_cache pgp_num 32
ceph osd tier cache-mode vms_cache writeback (reverted to readproxy afterwards)

--- POOLS ---
POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
images 1 1024 35 TiB 35 TiB 954 KiB 4.67M 104 TiB 104 TiB 954 KiB 50.56 34 TiB N/A N/A N/A 0 B 0 B
volumes 2 32 41 GiB 41 GiB 1.7 KiB 10.76k 124 GiB 124 GiB 1.7 KiB 0.12 34 TiB N/A N/A N/A 0 B 0 B
vms 3 1024 57 TiB 57 TiB 7.2 MiB 9.41M 170 TiB 170 TiB 7.2 MiB 62.63 34 TiB N/A N/A N/A 0 B 0 B
images_cache 11 32 5.3 MiB 5.1 MiB 196 KiB 5.76k 68 MiB 68 MiB 196 KiB 0 34 TiB N/A N/A N/A 0 B 0 B
vms_cache 12 256 486 GiB 486 GiB 863 KiB 542.07k 1.4 TiB 1.4 TiB 863 KiB 1.39 34 TiB N/A N/A 294.57k 0 B 0 B
volumes_cache 13 32 284 KiB 284 KiB 228 B 1.54k 18 MiB 18 MiB 228 B 0 34 TiB N/A N/A 9 0 B 0 B
backups 14 8 0 B 0 B 0 B 0 0 B 0 B 0 B 0 34 TiB N/A N/A N/A 0 B 0 B
device_health_metrics 15 1 0 B 0 B 0 B 72 0 B 0 B 0 B 0 34 TiB N/A N/A N/A 0 B 0 B

pool 1 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 1564546 lfor 953426/953426/1546859 flags hashpspool,selfmanaged_snaps stripe_width 0 expected_num_objects 1 application rbd
pool 2 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 1560659 lfor 28784/950229/950227 flags hashpspool,selfmanaged_snaps tiers 13 read_tier 13 write_tier 13 stripe_width 0 expected_num_objects 1 application rbd
pool 3 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 1560660 lfor 28785/931603/1546859 flags hashpspool,selfmanaged_snaps tiers 12 read_tier 12 write_tier 12 stripe_width 0 expected_num_objects 1 application rbd
pool 11 'images_cache' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 1560661 lfor 953426/953426/953426 flags hashpspool,incomplete_clones,selfmanaged_snaps stripe_width 0 application rbd
pool 12 'vms_cache' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 1565798 lfor 28785/1562901/1564526 flags hashpspool,incomplete_clones,selfmanaged_snaps tier_of 3 cache_mode readproxy target_bytes 1000000000000 target_objects 600000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 7200s x12 decay_rate 0 search_last_n 0 stripe_width 0 application rbd
pool 13 'volumes_cache' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 1560663 lfor 28784/952190/952188 flags hashpspool,incomplete_clones,selfmanaged_snaps tier_of 2 cache_mode proxy hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 7200s x12 decay_rate 0 search_last_n 0 stripe_width 0 application rbd
pool 14 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode off last_change 1560664 flags hashpspool stripe_width 0
pool 15 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 1565137 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth

Then we discovered that scrubs are stuck on every PG of pool 12 (vms_cache), which blocks the tier agent:

pg xx.x has invalid (post-split) stats; must scrub before tier agent can activate

In the OSD logs, scrubs keep starting in a loop, never succeeding, for every PG of this pool.
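
For anyone trying to observe this, a scrub of a single affected PG can be requested manually and its primary located like this (12.0 is a placeholder PG id):

ceph pg map 12.0          # shows the up and acting sets; the primary is listed first
ceph pg scrub 12.0        # request a regular scrub
ceph pg deep-scrub 12.0   # request a deep scrub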

What is also broken:
- rados and rbd commands stay stuck
- the manager seems affected as well: ceph progress and ceph balancer status stay stuck

What we already tried without luck so far (a sketch of the corresponding commands follows the list):

- shutting down / restarting OSDs
- rebalancing PGs between OSDs
- raising the memory available to the OSDs
- repeering the PGs
- setting / unsetting noscrub and nodeep-scrub
- running fsck / reshard on all BlueStore OSDs (following the documentation in https://github.com/ceph/ceph/pull/54474)
- setting hit_set_count to 0
- checking that ceph config get osd osd_scrub_invalid_stats returns true
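
For reference, a sketch of how the last few attempts above were driven from the CLI (the PG id 12.0 is a placeholder):

ceph pg repeer 12.0
ceph osd set noscrub && ceph osd set nodeep-scrub
ceph osd unset noscrub && ceph osd unset nodeep-scrub
ceph osd pool set vms_cache hit_set_count 0
ceph config get osd osd_scrub_invalid_stats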

Looking at the code, it seems the post-split message is triggered when the PG has "stats_invalid": true. Here is the result of a query:

"stats_invalid": true,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,

Looking at the code [2], one action that sets this invalid state in the first place is marking unfound objects lost: the function PrimaryLogPG::mark_all_unfound_lost handles two cases:

case pg_log_entry_t::LOST_REVERT:
...
case pg_log_entry_t::LOST_DELETE:

And after handling either case it updates the stats and marks them invalid:

recovery_state.update_stats(
  [](auto &history, auto &stats) {
    stats.stats_invalid = true;
    return false;
  });

But according to the scrubbing code [3], the stats should be refreshed and the invalid flag cleared when a scrub finishes:

if (info.stats.stats_invalid) {
  m_pl_pg->recovery_state.update_stats([=](auto& history, auto& stats) {
    stats.stats = m_scrub_cstat;
    stats.stats_invalid = false;
    return false;
  });
}

So the PG should be "scrubbable"; I don't really understand why it isn't.

[2] https://github.com/ceph/ceph/blob/v16.2.13/src/osd/PrimaryLogPG.cc#L12407
[3] https://github.com/ceph/ceph/blob/v16.2.13/src/osd/PrimaryLogScrub.cc#L54

We are thinking about using the "ceph pg <pgid> mark_unfound_lost revert" action, but we wonder whether there is a risk of data loss.
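
For the record, the command takes the form below; according to the Ceph documentation, "revert" rolls an unfound object back to a previous version (or forgets it entirely if it was a new object), while "delete" always forgets it, which is why we would only consider revert:

ceph pg 12.0 mark_unfound_lost revert    # 12.0 is a placeholder PG id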

Any idea what is causing this? Any help would be greatly appreciated.

Thanks

#1

Updated by Cedric Lemarchand 2 months ago

Among all the actions taken, the last attempt, which fixed the issue, was to move the cache pools to dedicated OSDs (the context the cache tier feature was designed for). As a note, the reason the base pools and cache pools were collocated was the move to NVMe, and the fact that write cache eviction was not possible without shutting down all VMs, which wasn't an option for us, thus preventing removal of the cache tier.
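
A minimal sketch of how such a move can be done, assuming the dedicated OSDs are given their own device class (the class name nvme_cache and rule name cache_rule are hypothetical, not the ones we used):

# create a replicated CRUSH rule restricted to the dedicated device class
ceph osd crush rule create-replicated cache_rule default host nvme_cache
# point the cache pool at the new rule; Ceph then migrates its PGs
ceph osd pool set vms_cache crush_rule cache_rule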

It seems that right after the upgrade the cluster suffered, at some point, multiple slow ops that crippled all OSDs, creating a kind of deadlock between the base pools and the cache pools where no cache promotions and/or evictions were possible, locking down the whole cluster.

This bug can be closed.
