Bug #5226

closed

Some PG stay in "incomplete" state

Added by Olivier Bonvalet almost 11 years ago. Updated over 10 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

With bobtail I first lost OSD.25: the OSD process was crashing, and when its data was rebalanced onto other OSDs (because of a reweight, or because OSD.25 was marked "out"), those OSDs crashed too. So I chose to set the cluster to "noout" while waiting for a fix (yes, I should open a bug for that...). I also tried to mark it as "lost" and to reformat it (mkfs.xfs), without any success. And since cuttlefish I can't start this OSD anymore. It's down, out, and lost.

Then I lost OSD.19: the hard disk is dead, unable to read or write any data on it. So I marked it as "lost" and replaced the disk; OSD.19 is now running.

But I have 2 pools which use only 2 replicas, so I lost the PGs that were common to OSD.19 and OSD.25.
In fact I have 15 PGs stuck in the "incomplete" state, and I don't know how to recover from that. And 14 of those 15 PGs are related to OSD.19.
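
For reference, the kind of commands involved were roughly these (a sketch only: the device path is a placeholder, and the exact flags may differ slightly between releases):

$ ceph osd set noout                        # stop the cluster from rebalancing while osd.25 is down
$ ceph osd lost 25 --yes-i-really-mean-it   # declare osd.25 permanently lost
$ mkfs.xfs -f /dev/sdX1                     # reformat the disk before trying to rebuild the OSD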


Files

osd.19.extract.log.gz (19.4 MB) osd.19.extract.log.gz OSD.19 logs Olivier Bonvalet, 05/31/2013 04:21 PM
pg-query.txt (453 KB) pg-query.txt result of "ceph pg 4.0 query" Olivier Bonvalet, 06/03/2013 02:55 PM
Actions #1

Updated by Olivier Bonvalet almost 11 years ago

After replacing OSD.25, nearly all incomplete PGs are [19,25] or [25,19]:

$ ceph health detail
HEALTH_WARN 15 pgs incomplete; 15 pgs stuck inactive; 15 pgs stuck unclean
pg 4.5c is stuck inactive since forever, current state incomplete, last acting [19,25]
pg 8.71d is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.3fa is stuck inactive since forever, current state incomplete, last acting [19,25]
pg 8.3e0 is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.56c is stuck inactive since forever, current state incomplete, last acting [19,25]
pg 8.19f is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.792 is stuck inactive since forever, current state incomplete, last acting [19,25]
pg 4.0 is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.78a is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.23e is stuck inactive since forever, current state incomplete, last acting [32,13]
pg 8.2ff is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.5e2 is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.528 is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.20f is stuck inactive since forever, current state incomplete, last acting [25,19]
pg 8.372 is stuck inactive since forever, current state incomplete, last acting [19,25]
pg 4.5c is stuck unclean since forever, current state incomplete, last acting [19,25]
pg 8.71d is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.3fa is stuck unclean since forever, current state incomplete, last acting [19,25]
pg 8.3e0 is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.56c is stuck unclean since forever, current state incomplete, last acting [19,25]
pg 8.19f is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.792 is stuck unclean since forever, current state incomplete, last acting [19,25]
pg 4.0 is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.78a is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.23e is stuck unclean since forever, current state incomplete, last acting [32,13]
pg 8.2ff is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.5e2 is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.528 is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.20f is stuck unclean since forever, current state incomplete, last acting [25,19]
pg 8.372 is stuck unclean since forever, current state incomplete, last acting [19,25]
pg 8.792 is incomplete, acting [19,25]
pg 8.78a is incomplete, acting [25,19]
pg 8.71d is incomplete, acting [25,19]
pg 8.5e2 is incomplete, acting [25,19]
pg 8.56c is incomplete, acting [19,25]
pg 8.528 is incomplete, acting [25,19]
pg 8.3fa is incomplete, acting [19,25]
pg 8.3e0 is incomplete, acting [25,19]
pg 8.372 is incomplete, acting [19,25]
pg 8.2ff is incomplete, acting [25,19]
pg 8.23e is incomplete, acting [32,13]
pg 8.20f is incomplete, acting [25,19]
pg 8.19f is incomplete, acting [25,19]
pg 4.5c is incomplete, acting [19,25]
pg 4.0 is incomplete, acting [25,19]
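
For what it's worth, a shorter way to pull out just these PGs (I believe "dump_stuck" already existed in these releases; the output format may vary by version):

$ ceph pg dump_stuck inactive
$ ceph health detail | grep 'incomplete, acting'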

Actions #2

Updated by Sage Weil almost 11 years ago

  • Status changed from New to Need More Info

It sounds as though osd.19 was also missing the data prior to osd.25 going away. Can you look for the PG subdirectories on osd.19 and see if they are there?

You can also do 'ceph pg 4.0 query' (for example) to see what the internal RADOS state is.
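
A sketch of both checks, assuming the default FileStore layout /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head (adjust the OSD id and pool number to your setup):

$ ls -d /var/lib/ceph/osd/ceph-19/current/4.*_head   # on-disk PG directories for pool 4 on osd.19
$ ceph pg 4.0 query > pg-query.txt                   # dump the internal peering/recovery state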

Actions #3

Updated by Olivier Bonvalet almost 11 years ago

Well, if I look at /var/lib/ceph/osd/ceph-19/current/4.5c_head or /var/lib/ceph/osd/ceph-19/current/4.0_head, for example, they are both empty.
And since OSD.19 was replaced by a fresh new disk while OSD.25 was down, yes, there is data loss.

Actions #4

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to Won't Fix

Nothing much to be done here if 2 disks were replaced/failed.

Actions #5

Updated by Olivier Bonvalet almost 11 years ago

Well, I have pools on that cluster which are fine (thanks to 3 copies); so how can I recover a HEALTH_OK status, since RBD is not able to remove the damaged images?
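
If it helps, one way to see which RBD images actually touch a damaged PG is to map their objects by hand; the pool name, image name, and offset below are placeholders:

$ rbd -p <pool> info <image> | grep block_name_prefix   # object name prefix used by this image
$ ceph osd map <pool> <prefix>.<hex_offset>             # shows which PG (and OSDs) that object maps to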

Actions #6

Updated by Olivier Bonvalet over 10 years ago

For the record: today I replaced all OSDs with new models, and it fixed the problem: it seems that the incomplete data was not rebalanced onto the newer OSDs.

My cluster is now in HEALTH_OK state.
