Bug #45292
pg_autoscaler merging issue
Description
Encountering an issue where placement groups (PGs) go into the stuck inactive state and hang there. This appears to happen when the pg_autoscaler is merging/scaling down placement groups on the main Ceph node at the same time as the Jenkins pipeline expansion phase, in which Ceph is being scaled out/deployed to the other nodes. This manifested, for example, in the pool default.rgw.buckets.data (pool id=13); below is a portion of that pool's PGs that were stuck inactive, followed by a sketch of how to check the autoscaler state on the pool:
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph health detail
HEALTH_WARN Reduced data availability: 210 pgs inactive
PG_AVAILABILITY Reduced data availability: 210 pgs inactive
pg 13.416 is stuck inactive for 1652.311455, current state activating, last acting [131,194,287]
pg 13.422 is stuck inactive for 1652.302794, current state activating, last acting [131,194,287]
pg 13.432 is stuck inactive for 1652.297366, current state activating, last acting [131,194,287]
pg 13.442 is stuck inactive for 1652.286949, current state activating, last acting [131,194,287]
pg 13.452 is stuck inactive for 1652.310763, current state activating, last acting [131,194,287]
pg 13.462 is stuck inactive for 1652.296110, current state activating, last acting [131,194,287]
pg 13.472 is stuck inactive for 1652.294760, current state activating, last acting [131,194,287]
pg 13.482 is stuck inactive for 1652.297505, current state activating, last acting [131,194,287]
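While this is happening, the autoscaler's planned merge can be confirmed from the mgr. This is a minimal sketch rather than output we captured; a NEW PG_NUM value lower than PG_NUM for the pool indicates a pending merge:
# sketch: inspect the autoscaler's plan for the affected pool
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool autoscale-status
# compare pg_num and pgp_num for the pool; pg_num shrinking over time indicates the merge is in progress
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool get default.rgw.buckets.data pg_num
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool get default.rgw.buckets.data pgp_num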
As a temporary workaround on the main Ceph node, we had to delete the pool default.rgw.buckets.data and recreate it, delete the OSD pod that was experiencing issues, delete the one OSD that was marked down and out, and then reweight the newly created OSD (sketched below).
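A rough sketch of that workaround, using placeholder values (the actual pod name, OSD id, pg_num, and weight depend on the cluster):
# delete and recreate the affected pool (placeholder pg_num; pool deletion also requires mon_allow_pool_delete=true on the mons)
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool delete default.rgw.buckets.data default.rgw.buckets.data --yes-i-really-really-mean-it
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd pool create default.rgw.buckets.data 128 128
# delete the misbehaving OSD pod so it gets rescheduled (placeholder pod name)
kubectl -n ceph delete pod ceph-osd-xxxxx
# purge the OSD that was marked down and out (placeholder id), then reweight the recreated OSD
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd purge <osd-id> --yes-i-really-mean-it
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph osd crush reweight osd.<osd-id> 1.0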
Updated by Neha Ojha almost 4 years ago
- Status changed from New to Need More Info
Can you provide pg query output for one of those PGs? Also, osd logs with debug_osd=20 will be helpful.
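For example, a minimal sketch of how that data can be collected, using pg 13.416 and osd.131 from the health detail above (substitute whichever stuck PG and acting OSD you examine):
# query one of the stuck PGs
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph pg 13.416 query
# raise the osd debug level on one of the acting OSDs before reproducing, then collect its log
kubectl -n ceph exec -it ceph-mon-4pw5s -- ceph tell osd.131 config set debug_osd 20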
Updated by Brian Wickersham almost 4 years ago
Sorry for the delay. We are working to get a reservation on one of our internal labs so we can recreate the issue and provide more detailed logs and the information requested.