Bug #12716 (closed): Cluster health_warn stuck on active+remapped

Added by Steve Dainard over 8 years ago. Updated about 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: ceph cli
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I ran ceph osd reweight-by-utilization last week and partway through had a network interruption. After the network was restored the cluster continued to rebalance, but eventually idled with active+remapped and degraded PGs. I added 2 OSDs to the cluster and ran another reweight-by-utilization, and now the cluster is idle with 3 PGs stuck active+remapped.
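
For reference, a reweight pass of this kind is normally driven with the commands below; the threshold argument is illustrative, since the report does not say which value was used.

    # Lower the override reweight (0..1) of OSDs whose utilization is more
    # than <threshold>% of the cluster average. 120 is the default threshold
    # and is shown here only as an illustration.
    ceph osd reweight-by-utilization 120

    # Watch the resulting data movement settle.
    ceph -s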

[root@ceph1 media]# ceph -s
    cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
     health HEALTH_WARN
            3 pgs stuck unclean
            recovery 24379/66089446 objects misplaced (0.037%)
     monmap e24: 3 mons at {mon1=10.0.231.53:6789/0,mon2=10.0.231.54:6789/0,mon3=10.0.231.55:6789/0}
            election epoch 268, quorum 0,1,2 mon1,mon2,mon3
     osdmap e186553: 102 osds: 102 up, 102 in; 3 remapped pgs
      pgmap v3178336: 4144 pgs, 7 pools, 125 TB data, 32270 kobjects
            251 TB used, 118 TB / 370 TB avail
            24379/66089446 objects misplaced (0.037%)
                4141 active+clean
                   3 active+remapped
[root@ceph1 media]# ceph health detail
HEALTH_WARN 3 pgs stuck unclean; recovery 24379/66089446 objects misplaced (0.037%)
pg 2.e7f is stuck unclean for 517058.124297, current state active+remapped, last acting [58,5]
pg 2.b16 is stuck unclean for 434261.024579, current state active+remapped, last acting [40,90]
pg 2.782 is stuck unclean for 307997.053475, current state active+remapped, last acting [76,101]
recovery 24379/66089446 objects misplaced (0.037%)
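
The same stuck set can be re-listed in a scriptable form with ceph pg dump_stuck:

    # List only the PGs that have been stuck unclean past the stuck threshold.
    ceph pg dump_stuck unclean
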
[root@ceph1 media]# ceph pg 2.e7f query|head
{
    "state": "active+remapped",
    "snap_trimq": "[]",
    "epoch": 186553,
    "up": [
        58
    ],
    "acting": [
        58,
        5

[root@ceph1 media]# ceph pg 2.b16 query|head
{
    "state": "active+remapped",
    "snap_trimq": "[]",
    "epoch": 186553,
    "up": [
        40
    ],
    "acting": [
        40,
        90

[root@ceph1 media]# ceph pg 2.782 query|head
{
    "state": "active+remapped",
    "snap_trimq": "[]",
    "epoch": 186553,
    "up": [
        76
    ],
    "acting": [
        76,
        101
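
In each query the "up" set (what CRUSH currently computes for the PG) holds only one OSD, while the "acting" set (where the data currently lives) holds two; the PGs therefore stay active+remapped instead of going clean. The same up/acting comparison can be read per PG with ceph pg map, whose output looks like:

    [root@ceph1 media]# ceph pg map 2.e7f
    osdmap e186553 pg 2.e7f (2.e7f) -> up [58] acting [58,5]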

Full pg queries, the decompiled crushmap, and the OSD debug logs for the above PGs are attached.


Files

pg_query (28.2 KB) Steve Dainard, 08/17/2015 09:30 PM
decompiled-crushmap (5.25 KB) Steve Dainard, 08/17/2015 09:30 PM
osd-logs.tar.gz (355 KB) Steve Dainard, 08/17/2015 09:30 PM
Actions #1

Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info

My guess is that you need to set the vary_r tunable. Or, can you attach the osdmap so we can see why those PGs are only getting 2 replicas?
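
For reference, the osdmap can be exported with ceph osd getmap, and the vary_r tunable (chooseleaf_vary_r) is set by round-tripping the crushmap through crushtool; the file paths below are illustrative. Note that enabling the tunable will itself trigger data movement.

    # Export the binary osdmap so it can be attached to the ticket.
    ceph osd getmap -o /tmp/osdmap

    # Decompile the crushmap, enable the tunable, recompile, and inject it.
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
    # edit /tmp/crushmap.txt and add, in the tunables section:
    #   tunable chooseleaf_vary_r 1
    crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
    ceph osd setcrushmap -i /tmp/crushmap.new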

Actions #2

Updated by Sage Weil about 7 years ago

  • Status changed from Need More Info to Can't reproduce