Bug #45292
pg_autoscaler merging issue
Description
Encountering an issue where placement groups (PGs) go into the stuck inactive state and hang there. This appears to happen when the pg_autoscaler is merging/scaling down placement groups on the main Ceph node at the same time as the Jenkins pipeline expansion phase, in which Ceph is being scaled out/deployed to the other nodes. This manifested, for example, in the pool default.rgw.buckets.data (pool id=13); below is a portion of that pool's PGs that were stuck inactive, followed by a sketch of how to check the autoscaler state on the pool:
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph health detail
HEALTH_WARN Reduced data availability: 210 pgs inactive
PG_AVAILABILITY Reduced data availability: 210 pgs inactive
pg 13.416 is stuck inactive for 1652.311455, current state activating, last acting [131,194,287]
pg 13.422 is stuck inactive for 1652.302794, current state activating, last acting [131,194,287]
pg 13.432 is stuck inactive for 1652.297366, current state activating, last acting [131,194,287]
pg 13.442 is stuck inactive for 1652.286949, current state activating, last acting [131,194,287]
pg 13.452 is stuck inactive for 1652.310763, current state activating, last acting [131,194,287]
pg 13.462 is stuck inactive for 1652.296110, current state activating, last acting [131,194,287]
pg 13.472 is stuck inactive for 1652.294760, current state activating, last acting [131,194,287]
pg 13.482 is stuck inactive for 1652.297505, current state activating, last acting [131,194,287]
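While this is happening, the autoscaler's planned merge can be confirmed from the mgr. This is a minimal sketch rather than output we captured; a NEW PG_NUM value lower than PG_NUM for the pool indicates a pending merge:
# sketch: inspect the autoscaler's plan for the affected pool
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool autoscale-status
# compare pg_num and pgp_num for the pool; pg_num shrinking over time indicates the merge is in progress
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool get default.rgw.buckets.data pg_num
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool get default.rgw.buckets.data pgp_num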
As a temporary workaround on the main Ceph node, we had to delete the pool default.rgw.buckets.data and recreate it, delete the OSD pod that was experiencing issues, delete the one OSD that was marked down and out, and then reweight the newly created OSD (sketched below).
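A rough sketch of that workaround, using placeholder values (the actual pod name, OSD id, pg_num, and weight depend on the cluster):
# delete and recreate the affected pool (placeholder pg_num; pool deletion also requires mon_allow_pool_delete=true on the mons)
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool delete default.rgw.buckets.data default.rgw.buckets.data --yes-i-really-really-mean-it
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool create default.rgw.buckets.data 128 128
# delete the misbehaving OSD pod so it gets rescheduled (placeholder pod name)
kubectl -n ceph delete pod ceph-osd-xxxxx
# purge the OSD that was marked down and out (placeholder id), then reweight the recreated OSD
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd purge <osd-id> --yes-i-really-mean-it
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd crush reweight osd.<osd-id> 1.0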
Updated by Neha Ojha almost 4 years ago
- Status changed from New to Need More Info
Can you provide pg query output for one of those PGs? Also, osd logs with debug_osd=20 will be helpful.
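For example, a minimal sketch of how that data can be collected, using pg 13.416 and osd.131 from the health detail above (substitute whichever stuck PG and acting OSD you examine):
# query one of the stuck PGs
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph pg 13.416 query
# raise the osd debug level on one of the acting OSDs before reproducing, then collect its log
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph tell osd.131 config set debug_osd 20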
Updated by Brian Wickersham almost 4 years ago
Sorry for the delay. We are working to get a reservation on one of our internal labs so we can recreate the issue and provide more detailed logs and the information requested.