Bug #4645

closed

osd: Adding osd causes long stall without restart

Added by Sam Lang about 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From the mailing list: http://thread.gmane.org/gmane.comp.file-systems.ceph.user/571/focus=572

Erdem Agaoglu wrote:

We are currently in the process of enlarging our bobtail cluster by adding OSDs. We have 12 disks per machine and we are creating one OSD per disk, adding them one by one as recommended. The only thing we don't do is start with a small weight and increase it slowly. Weights are all 1.
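For reference, the gradual-weight approach the reporter says they skip could be sketched roughly as follows. This is only an illustration: the helper name `ramp_commands`, the OSD id 12, and the number of steps are assumptions, not taken from the report. The sketch only builds `ceph osd crush reweight` command strings and does not run them; in practice one would presumably wait for the cluster to settle between steps.

```python
from typing import List


def ramp_commands(osd_id: int, target: float = 1.0, steps: int = 4) -> List[str]:
    """Build one `ceph osd crush reweight` command per step, ending at `target`.

    Hypothetical helper: the step count and target weight are illustrative.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    commands = []
    for i in range(1, steps + 1):
        # Evenly spaced weights: target/steps, 2*target/steps, ..., target.
        weight = round(target * i / steps, 2)
        commands.append(f"ceph osd crush reweight osd.{osd_id} {weight}")
    return commands


if __name__ == "__main__":
    # Print the commands only; nothing is executed against a cluster.
    for command in ramp_commands(12):
        print(command)
```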

In this scenario, both rbd and radosgw are unresponsive only for the first two minutes after a new OSD is added. After that small hiccup, we have some PGs in states like active+remapped+wait_backfill, active+remapped+backfilling, active+recovery_wait+remapped, active+degraded+remapped+backfilling, and everything works OK. After a few hours of backfilling and recovery, all PGs become active+clean and we add another OSD.

But sometimes that small hiccup lasts longer than a few minutes. At those times, the status shows some PGs stuck in active and some stuck in peering. When we look at the pg dump, we see that all of those active or peering PGs are on the same 2 OSDs and are unable to move forward. At this stage rbd works poorly and radosgw is completely stalled. Only after we restart one of those 2 OSDs do the PGs start to backfill and clients continue with their operations.
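The observation that the stuck PGs share the same two OSDs could be checked mechanically along these lines. This is a minimal sketch: the record layout (`pgid`, `state`, `acting`) is a simplified assumption and not the exact schema of `ceph pg dump`, and the sample PG ids and OSD numbers are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# States the report describes as stuck: plain "active" or "peering",
# as opposed to states such as active+clean or active+remapped+backfilling.
STUCK_STATES = {"active", "peering"}


def osds_shared_by_stuck_pgs(pgs: Iterable[Dict]) -> List[Tuple[int, int]]:
    """Count how often each OSD appears in the acting set of a stuck PG."""
    stuck = [pg for pg in pgs if pg["state"] in STUCK_STATES]
    counts = Counter(osd for pg in stuck for osd in pg["acting"])
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the reporter's cluster.
    sample = [
        {"pgid": "3.1a", "state": "peering", "acting": [4, 17]},
        {"pgid": "3.2b", "state": "active", "acting": [17, 4]},
        {"pgid": "3.3c", "state": "active+clean", "acting": [2, 9]},
        {"pgid": "3.4d", "state": "active+remapped+backfilling", "acting": [4, 9]},
    ]
    for osd, count in osds_shared_by_stuck_pgs(sample):
        print(f"osd.{osd}: {count} stuck PGs")
```

If a small number of OSDs accounts for every stuck PG, those OSDs are the natural candidates to inspect before restarting anything.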

Since this is a live cluster, we don't want to wait too long and usually restart the OSD in a hurry. That's why I cannot currently provide status or pg query outputs. We have some logs, but I don't know what to look for or whether they are verbose enough.
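One way to collect the missing status and pg query outputs quickly before a restart might look like the sketch below. The helper names and the PG ids are hypothetical; the commands it builds (`ceph -s`, `ceph health detail`, `ceph pg dump_stuck inactive`, `ceph pg <pgid> query`) are standard Ceph CLI calls, but whether each is available on a given release is an assumption to verify.

```python
import subprocess
from typing import List


def diagnostic_commands(stuck_pgids: List[str]) -> List[List[str]]:
    """Build the commands whose output would help diagnose stuck PGs."""
    commands = [
        ["ceph", "-s"],
        ["ceph", "health", "detail"],
        ["ceph", "pg", "dump_stuck", "inactive"],
    ]
    # One query per stuck PG id supplied by the operator.
    commands += [["ceph", "pg", pgid, "query"] for pgid in stuck_pgids]
    return commands


def capture(commands: List[List[str]], path: str) -> None:
    """Run each command and append its output to one file (needs the ceph CLI)."""
    with open(path, "a") as out:
        for command in commands:
            result = subprocess.run(command, capture_output=True, text=True)
            out.write("$ " + " ".join(command) + "\n")
            out.write(result.stdout + result.stderr + "\n")


if __name__ == "__main__":
    # Only print the commands here; call capture() on a host with the ceph CLI.
    for command in diagnostic_commands(["3.1a", "3.2b"]):
        print(" ".join(command))
```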

Could this be a known issue? If not, where should I look to find out what is happening when it occurs?

In addition, I was able to extract some logs from the last time the active/peering problem happened.
http://pastebin.com/BakFREFP

The log ends with me restarting the OSD.

Actions #1

Updated by Sage Weil almost 11 years ago

  • Status changed from New to Resolved

this should be fixed...
