Bug #45292

open

pg autoscaler merging issue

Added by Brian Wickersham about 4 years ago. Updated almost 4 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are encountering an issue where placement groups (pgs) become stuck inactive and hang in that state. This appears to happen when the pg_autoscaler is merging/scaling down placement groups on the main Ceph node while the Jenkins pipeline's expansion phase, in which Ceph is being scaled out/deployed to other nodes, is running. This manifested, for example, in the pool default.rgw.buckets.data (pool id=13); below is a portion of that pool's pgs that were stuck inactive:

kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph health detail
HEALTH_WARN Reduced data availability: 210 pgs inactive
PG_AVAILABILITY Reduced data availability: 210 pgs inactive
pg 13.416 is stuck inactive for 1652.311455, current state activating, last acting [131,194,287]
pg 13.422 is stuck inactive for 1652.302794, current state activating, last acting [131,194,287]
pg 13.432 is stuck inactive for 1652.297366, current state activating, last acting [131,194,287]
pg 13.442 is stuck inactive for 1652.286949, current state activating, last acting [131,194,287]
pg 13.452 is stuck inactive for 1652.310763, current state activating, last acting [131,194,287]
pg 13.462 is stuck inactive for 1652.296110, current state activating, last acting [131,194,287]
pg 13.472 is stuck inactive for 1652.294760, current state activating, last acting [131,194,287]
pg 13.482 is stuck inactive for 1652.297505, current state activating, last acting [131,194,287]
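A minimal sketch of how the autoscaler state for the affected pool can be inspected from the same mon pod (these are standard commands, not output captured from the affected cluster; pausing the autoscaler on this pool is only a possible mitigation, not something we have verified):

kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool autoscale-status
# current pg_num for the affected pool (id=13)
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool get default.rgw.buckets.data pg_num
# possible mitigation while the expansion phase is running: keep the autoscaler
# from merging this pool (warn only reports, off disables it entirely)
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool set default.rgw.buckets.data pg_autoscale_mode warn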

As a temporary workaround on the main Ceph node, we had to delete the pool default.rgw.buckets.data and recreate it, delete the osd pod that was experiencing issues, delete the one osd that was marked down and out, and then reweight the newly created osd.
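The workaround amounted to roughly the following sequence (a sketch only; the pg count, osd id, crush weight, and osd pod name below are placeholders, not the values used on our cluster):

# delete and recreate the affected pool
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool delete default.rgw.buckets.data default.rgw.buckets.data --yes-i-really-really-mean-it
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool create default.rgw.buckets.data 256
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool application enable default.rgw.buckets.data rgw
# remove the problematic osd pod and purge the osd that was down and out
kubectl -n ceph delete pod <ceph-osd-pod-name>
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd purge <osd-id> --yes-i-really-mean-it
# once the replacement osd registers, give it its crush weight back
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd crush reweight osd.<osd-id> <weight>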

#1

Updated by Greg Farnum almost 4 years ago

  • Project changed from Ceph to RADOS
#2

Updated by Neha Ojha almost 4 years ago

  • Status changed from New to Need More Info

Can you provide pg query output for one of those PGs? Also, osd logs with debug_osd=20 will be helpful.
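A minimal sketch of how that could be gathered on this kubernetes-based deployment (pg 13.416 and osd.131 are taken from the health detail above; the osd pod name is a placeholder):

# query one of the stuck pgs
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph pg 13.416 query
# raise osd debug logging on one of the acting osds
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph tell osd.131 injectargs '--debug_osd 20/20'
# then capture that osd's logs from its pod
kubectl -n ceph logs <ceph-osd-pod-name>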

#3

Updated by Brian Wickersham almost 4 years ago

Sorry for the delay. We are working to get a reservation on one of our internal labs so we can recreate the issue and provide more detailed logs and the information requested.
