Bug #7576: osd: large skew in pg epochs (dumpling)

Added by Sage Weil about 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Observed a cluster with pgs at very different pg epochs (~17000 and ~24000). This isn't supposed to happen on start because we flush the peering wq.

Maybe it can still happen while the osd is active, though? At some point we should mark ourselves down (maybe?) if there are pgs that are so far behind.

#1

Updated by Greg Farnum about 10 years ago

Hmm, this is partly deliberate — we allow PGs to move forward "at their own pace", so if they aren't getting any activity they can fall behind a bit so as not to preempt actual work. I don't recall what mechanisms exist to keep them from being completely out of date, though.

#2

Updated by Sage Weil about 10 years ago

  • Severity changed from 3 - minor to 2 - major

#3

Updated by Ian Colle about 10 years ago

  • Assignee set to Sage Weil

#4

Updated by Sage Weil about 10 years ago

How about this: in OSDService, add

Mutex pg_epoch_lock;
Cond pg_epoch_cond;
multiset<epoch_t> pg_epochs;
map<pg_t,epoch_t> pg_epoch;

and

void pg_update_epoch(pg_t pgid, epoch_t epoch);

that updates the pg_epoch map and the pg_epochs multiset. And then a

epoch_t pg_update_get_lower_bound()

that returns the oldest epoch. And a wait(). Then in handle_osd_map, if the lower bound is more than X epochs behind, we block. Or, mark ourselves down until we can catch up.
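
A minimal sketch of that bookkeeping, using std::mutex and std::condition_variable in place of Ceph's Mutex and Cond; the epoch_t and pg_t aliases and the max_skew threshold here are stand-ins for illustration, not the real Ceph definitions:

#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <set>

using epoch_t = uint32_t;   // stand-in for Ceph's epoch_t
using pg_t = uint64_t;      // stand-in for Ceph's pg_t

struct PgEpochTracker {
  std::mutex pg_epoch_lock;
  std::condition_variable pg_epoch_cond;
  std::multiset<epoch_t> pg_epochs;   // one entry per registered PG
  std::map<pg_t, epoch_t> pg_epoch;   // current epoch of each PG

  // Record that a PG has advanced to a new epoch.
  void pg_update_epoch(pg_t pgid, epoch_t epoch) {
    std::lock_guard<std::mutex> l(pg_epoch_lock);
    auto p = pg_epoch.find(pgid);
    if (p != pg_epoch.end())
      pg_epochs.erase(pg_epochs.find(p->second));  // drop the stale entry
    pg_epoch[pgid] = epoch;
    pg_epochs.insert(epoch);
    pg_epoch_cond.notify_all();
  }

  // Oldest epoch any PG is still at (0 when no PGs are registered).
  epoch_t pg_update_get_lower_bound() {
    std::lock_guard<std::mutex> l(pg_epoch_lock);
    return pg_epochs.empty() ? 0 : *pg_epochs.begin();
  }

  // Block until the slowest PG is within max_skew epochs of target;
  // handle_osd_map could call this before consuming a new map.
  void wait_min_pg_epoch(epoch_t target, epoch_t max_skew) {
    std::unique_lock<std::mutex> l(pg_epoch_lock);
    pg_epoch_cond.wait(l, [&] {
      return pg_epochs.empty() || *pg_epochs.begin() + max_skew >= target;
    });
  }
};

The multiset keeps the lower bound available at begin() in constant time, while the per-PG map tells each update which stale entry to drop.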

#5

Updated by Greg Farnum about 10 years ago

That doesn't seem like it's addressing the issue the right way. We've deliberately set things up so that PGs which don't get activity won't wake up and process new maps as frequently; the solution needs to wake those PGs up, not simply block because nothing has woken them. So perhaps if we're processing a map and we have PGs that are 100 epochs behind, we issue them a null event (to bring them up to date), but we don't block (or mark ourselves down) until they're much farther behind (e.g. 1000 epochs); see the sketch below.
Honestly I thought we had some mechanism like that already, but maybe not (or maybe it's not functioning properly)?
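
A sketch of that two-threshold policy, reusing the epoch_t/pg_t stand-ins from the sketch above; queue_null_event() and mark_self_down() are hypothetical placeholders for OSD::queue_null and the real mark-down path, and 100/1000 are just the thresholds floated here:

#include <cstdint>
#include <map>

using epoch_t = uint32_t;   // same stand-ins as the sketch above
using pg_t = uint64_t;

// Hypothetical placeholders for the real OSD machinery.
void queue_null_event(pg_t pgid, epoch_t epoch) { /* queue a null event for this PG */ }
void mark_self_down() { /* tell the monitor we are going down */ }

// Called while processing a new map: nudge lagging PGs with a null
// event, and only escalate when something is very far behind.
void check_pg_epoch_skew(const std::map<pg_t, epoch_t>& pg_epoch,
                         epoch_t cur_epoch) {
  const epoch_t wake_skew = 100;    // nudge PGs this far behind
  const epoch_t down_skew = 1000;   // past this, stop and catch up
  for (const auto& [pgid, e] : pg_epoch) {
    if (e + down_skew < cur_epoch) {
      mark_self_down();             // or block map consumption here
      return;
    }
    if (e + wake_skew < cur_epoch)
      queue_null_event(pgid, cur_epoch);  // wake the PG to process maps
  }
}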

#6

Updated by Greg Farnum about 10 years ago

We looked at this in standup today. There is a queue_null on every PG in OSD::consume_map(), so they should be getting woken up. (I.e., I was mistaken about the current state of affairs.) I'm not sure what else is going on around here.

#7

Updated by Sage Weil about 10 years ago

  • Status changed from 12 to In Progress

wip-7576

#8

Updated by Sage Weil about 10 years ago

  • Status changed from In Progress to Fix Under Review

#9

Updated by Sage Weil almost 10 years ago

  • Status changed from Fix Under Review to Pending Backport

#10

Updated by Sage Weil almost 10 years ago

  • Priority changed from Urgent to High

#11

Updated by Sage Weil almost 10 years ago

  • Status changed from Pending Backport to Resolved

#12

Updated by Sage Weil over 9 years ago

  • Status changed from Resolved to Pending Backport

still want to backport this to firefly ...

#13

Updated by Sage Weil over 9 years ago

...and when we do, include a52a855f6c92b03dd84cd0cc1759084f070a98c2!

#14

Updated by Sage Weil over 9 years ago

  • Priority changed from High to Normal

#15

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved