Bug #20059

closed

miscounting degraded objects

Added by Sage Weil almost 7 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: David Zafman
Category: Administration/Usability
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor, pgmap
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

on bigbang,

    cluster f502e0e8-63e1-42c8-b38b-5b4f8daba3f8
     health HEALTH_WARN
            129288 pgs degraded
            1287 pgs recovering
            128001 pgs recovery_wait
            129288 pgs stuck degraded
            129288 pgs stuck unclean
            recovery 489383396/601270785 objects degraded (81.392%)
     monmap e2: 3 mons, quorum p06636710a37514,p06636710a59202,p06636710a82299
        mgr e502: active: p06636710a59202, standbys: p06636710a82299, p06636710a37514
     osdmap e37010: 6528 osds, 6526 up, 6526 in
      pgmap 262208 pgs, 2 pools, 192 TB data, 191M objects
            948 TB used, 34670 TB / 35618 TB avail
            489383396/601270785 objects degraded (81.392%)
              132920 active+clean
              128001 active+recovery_wait+degraded
                1287 active+recovering+degraded
recovery io 4712 MB/s, 4691 objects/s
  client io 3533 MB/s wr, 0 op/s rd, 7063 op/s wr

Each PG is 3x replicated. Note that almost exactly half of them are degraded (I did a big PG split that updated pg_num and then pgp_num from 131072 to 262144), so the degraded PGs probably have all 3 replicas in the wrong location, and the active+clean ones are obviously fine.

This should mean that no more than 50% of objects (instances/copies) are degraded... right? Not sure where the 80% arithmetic is coming from.
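The arithmetic behind that expectation can be checked directly from the status output above. The following is only a rough sketch: it assumes objects are spread about evenly across PGs (the report does not say so), and the variable names are made up for illustration.

```python
# Figures copied from the status output in the description.
degraded_copies = 489_383_396
total_copies = 601_270_785
degraded_pgs = 129_288
total_pgs = 262_208
replicas = 3  # "each PG is 3x"

# Ratio the cluster reports, and the fraction of PGs that are degraded.
reported_ratio = degraded_copies / total_copies
pg_ratio = degraded_pgs / total_pgs

# The denominator divides evenly by the replica count, which suggests it
# counts every copy of every object (objects * replicas).
objects, remainder = divmod(total_copies, replicas)

# Assumption (not stated in the report): objects are spread roughly evenly
# across PGs, so at most pg_ratio of all copies should be degraded.
expected_max_degraded = pg_ratio * total_copies

print(f"reported degraded ratio:  {reported_ratio:.3%}")
print(f"degraded PG fraction:     {pg_ratio:.3%}")
print(f"objects (copies / 3):     {objects} (remainder {remainder})")
print(f"objects in units of 2^20: {objects / 2**20:.1f}M")
print(f"excess degraded copies:   {degraded_copies - expected_max_degraded:,.0f}")
```

Under that assumption the degraded count is well above what the PG states alone would explain, while the denominator lines up with 3 copies of the roughly 191M objects shown in the pgmap line.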


Related issues 3 (0 open, 3 closed)

Related to Ceph - Bug #21803: objects degraded higher than 100% (Resolved, David Zafman, 10/13/2017)
Related to RADOS - Bug #22145: PG stuck in recovery_unfound (Resolved, Sage Weil, 11/16/2017)
Copied to RADOS - Backport #22724: luminous: miscounting degraded objects (Resolved, David Zafman)

Actions #1

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Administration/Usability
  • Component(RADOS) Monitor, pgmap added

Maybe we count each instance of an object when it's degraded (i.e., 3x for replicated pools), but the non-degraded ones only once? This would be a mismatch but you can see how it happens since if there is a missing copy of the object you want to say it's degraded; if there are two missing copies you want it to be twice as bad...
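To make the suspected mismatch concrete, here is a toy model of the two counting schemes. It is purely illustrative: the function names and the example numbers are hypothetical, it is not taken from the monitor code, and it does not try to reproduce the 81.392% figure.

```python
def mixed_counting_ratio(objects: int, degraded_objects: int, replicas: int) -> float:
    """Hypothetical mismatch: degraded objects are counted once per copy,
    while healthy objects are counted only once."""
    degraded_count = degraded_objects * replicas
    total_count = degraded_count + (objects - degraded_objects)
    return degraded_count / total_count


def per_copy_ratio(objects: int, degraded_copies: int, replicas: int) -> float:
    """Consistent counting: both numerator and denominator count copies."""
    return degraded_copies / (objects * replicas)


# Made-up example: 1000 objects, 3 replicas, half of the objects have all
# three copies in the wrong place.
objects, replicas = 1000, 3
degraded_objects = objects // 2

mixed = mixed_counting_ratio(objects, degraded_objects, replicas)
consistent = per_copy_ratio(objects, degraded_objects * replicas, replicas)

print(f"mixed counting:    {mixed:.1%}")
print(f"per-copy counting: {consistent:.1%}")
```

In this toy case the mixed scheme inflates the percentage above the 50% that consistent per-copy counting gives, which is the kind of skew being speculated about here.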

Actions #2

Updated by Sage Weil almost 7 years ago

  • Priority changed from Immediate to Urgent
Actions #3

Updated by Greg Farnum almost 7 years ago

  • Priority changed from Urgent to High
Actions #4

Updated by Joao Eduardo Luis almost 7 years ago

  • Assignee set to Joao Eduardo Luis
Actions #5

Updated by Sage Weil over 6 years ago

  • Status changed from New to In Progress
  • Assignee changed from Joao Eduardo Luis to David Zafman

David fixed most of this, but there is still one piece left (accurate accounting for missing_loc)

Actions #6

Updated by David Zafman over 6 years ago

I don't know what Sage is referring to regarding accounting for missing_loc. Should num_objects_missing be set for the PG?

The issue as described here may be fixed by https://github.com/ceph/ceph/pull/18554

Should I use this tracker for pull #18554 and add the "Fixes:" to the merge comment?

Actions #7

Updated by David Zafman over 6 years ago

  • Subject changed from mon: miscounting degraded objects to miscounting degraded objects
  • Backport set to luminous
Actions #8

Updated by David Zafman over 6 years ago

  • Status changed from In Progress to Fix Under Review
Actions #9

Updated by David Zafman over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by David Zafman over 6 years ago

Actions #11

Updated by David Zafman over 6 years ago

  • Related to Bug #21803: objects degraded higher than 100% added
Actions #13

Updated by David Zafman over 6 years ago

  • Related to Bug #22145: PG stuck in recovery_unfound added
Actions #14

Updated by David Zafman over 6 years ago

  • Status changed from Pending Backport to Resolved
Actions #15

Updated by Florian Haas over 5 years ago

Just adding another reference to #21803 here — this fix was meant to fix that issue as well, which it apparently did not.
