Bug #40622 (closed)

PG stuck in active+clean+remapped

Added by Mike Almateia almost 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Backfill/Recovery
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

The cluster has 6 servers in 3 racks, 2 servers per rack.
The replication rule distributes replicas across the 3 racks: one replica per rack.
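The actual rule is in the attached replicated_rule file; as a rough sketch (ids and names here are illustrative, not taken from the attachment), a rack-level replicated rule typically looks like:

    rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }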

We started removing one server from each rack: all replicas must move to the remaining server in each rack.

In the second rack, the remaining server has two fewer OSDs than the one being removed from the cluster.

When moving data off the server in the second rack, 1 PG got stuck in active+clean+remapped status: apparently it cannot find a suitable OSD inside the second rack to move to.

I tried using:
  • ceph osd out 21
  • ceph crush reweight osd.21 0

but the same PG (id 5.783) remains stuck in active+clean+remapped status.

I have mon_max_pg_per_osd=400 set up; that could be a barrier.
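For reference, the stuck mapping can be inspected with something like this (a rough sketch; only the PG id is taken from this cluster):

    ceph pg ls remapped    # list PGs currently in the remapped state
    ceph pg 5.783 query    # show the up/acting sets for the stuck PG
    ceph osd df tree       # per-OSD utilization (attached as ceph_osd_df_tree)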


Files

ceph_osd_df_tree (7.5 KB) - Mike Almateia, 07/02/2019 02:27 PM
replicated_rule (417 Bytes) - Mike Almateia, 07/02/2019 02:27 PM
crush_dump (35.1 KB) - Mike Almateia, 07/02/2019 02:27 PM
#1

Updated by Sage Weil almost 5 years ago

  • Status changed from New to Resolved

This looks like CRUSH just failing to find a good replica because 50% of the OSDs in a rack are down. Try using the optimal tunables; if you are already using those, try increasing choose_tries to something larger than 50 (70?) and the PGs will probably go active.
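As a rough sketch of applying that suggestion (file names below are placeholders): either switch the tunables profile directly, or raise the choose_total_tries tunable (the map-level setting behind the choose_tries mentioned above, default 50) by editing the decompiled CRUSH map.

    # switch to the optimal tunables profile (may trigger data movement)
    ceph osd crush tunables optimal

    # or: raise choose_total_tries in the CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt: change "tunable choose_total_tries 50" to e.g. 100
    crushtool -c crushmap.txt -o crushmap.new.bin
    ceph osd setcrushmap -i crushmap.new.bin

Giving CRUSH more attempts helps here because, with half the OSDs in a rack out, many placement attempts land on removed devices and are rejected before a valid one is found.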
