Bug #40622

PG stuck in active+clean+remapped

Added by Mike Almateia 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Backfill/Recovery
Target version:
Start date: 07/02/2019
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

The cluster has 6 servers in 3 racks, 2 servers per rack.
The replication rule distributes replicas across the 3 racks: one replica per rack.
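
For reference, a rack-level replicated rule of this kind usually looks like the sketch below; the actual rule used here is in the attached replicated_rule file, and the rule name and id below are placeholders:

  rule replicated_rule {
      id 0
      type replicated
      min_size 1
      max_size 10
      step take default
      step chooseleaf firstn 0 type rack
      step emit
  }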

We started removing one server from each rack: all replicas must move to the remaining server in each rack.

In the second rack, the remaining server has two fewer OSDs than the one being removed from the cluster.

While moving data off the server in the second rack, 1 PG got stuck in active+clean+remapped status: apparently CRUSH cannot find a suitable OSD to move it to inside the second rack.

I tried:
  • ceph osd out 21
  • ceph osd crush reweight osd.21 0

but the same PG (id 5.783) stays stuck in active+clean+remapped status.
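
Commands like these help confirm what CRUSH has mapped for the stuck PG (pg 5.783 is the PG from this report):

  # list all PGs currently in the remapped state
  ceph pg ls remapped

  # show the up/acting sets and recovery state of the stuck PG
  ceph pg 5.783 query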

I have mon_max_pg_per_osd=400 set; it could be acting as a barrier.
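
One way to rule that out is sketched below, assuming an admin socket is reachable on an OSD host (osd.0 is a placeholder):

  # read the effective value from a running daemon
  ceph daemon osd.0 config get mon_max_pg_per_osd

  # the PGS column shows how many PGs each OSD currently holds
  ceph osd df tree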

ceph_osd_df_tree (7.5 KB) Mike Almateia, 07/02/2019 02:27 PM

replicated_rule (417 Bytes) Mike Almateia, 07/02/2019 02:27 PM

crush_dump (35.1 KB) Mike Almateia, 07/02/2019 02:27 PM

History

#1 Updated by Sage Weil 5 months ago

  • Status changed from New to Resolved

This looks like CRUSH is just failing to find a good replica because 50% of the OSDs in the rack are down. Try using the optimal tunables. If you are already using those, try increasing choose_tries to something larger than 50 (70?) and the PGs will probably go active.
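
Concretely, applying both suggestions would look roughly like the sketch below (file names are arbitrary, and the set_choose_tries step goes into whichever rule the affected pool uses):

  # switch to the optimal tunables profile (this will trigger data movement)
  ceph osd crush tunables optimal

  # if the PG is still stuck, raise the number of placement attempts.
  # Export and decompile the current CRUSH map:
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # in crushmap.txt, add a "step set_choose_tries 100" line to the rule,
  # just before its "step take ..." line, then recompile and inject it:
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new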
