Bug #7576: osd: large skew in pg epochs (dumpling)

Added by Sage Weil about 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Observed a cluster with pgs at very different pg epochs (~17000 and ~24000). This isn't supposed to happen on start because we flush the peering wq.

Maybe it can still happen while the osd is active, though? At some point we should mark ourselves down (maybe?) if there are pgs that are so far behind.

#1

Updated by Greg Farnum about 10 years ago

Hmm, this is partly deliberate — we allow PGs to move forward "at their own pace", so if they aren't getting any activity they can fall behind a bit so as not to preempt actual work. I don't recall what mechanisms exist to keep them from being completely out of date, though.

#2

Updated by Sage Weil about 10 years ago

  • Severity changed from 3 - minor to 2 - major

#3

Updated by Ian Colle about 10 years ago

  • Assignee set to Sage Weil

#4

Updated by Sage Weil about 10 years ago

How about this: in OSDService, add

Mutex pg_epoch_lock;
Cond pg_epoch_cond;
multiset<epoch_t> pg_epochs;
map<pg_t,epoch_t> pg_epoch;

and

void pg_update_epoch(pg_t pgid, epoch_t epoch);

that updates the pg_epoch map and the pg_epochs multiset. And then a

epoch_t pg_update_get_lower_bound()

that returns the oldest epoch. And a wait(). Then in handle_osd_map, if the lower bound is more than X epochs behind, we block. Or, mark ourselves down until we can catch up.
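
A minimal sketch of that bookkeeping, using std::mutex and std::condition_variable in place of Ceph's Mutex and Cond; the epoch_t and pg_t aliases and the max_skew threshold here are stand-ins for illustration, not the real Ceph definitions:

#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <set>

using epoch_t = uint32_t;   // stand-in for Ceph's epoch_t
using pg_t = uint64_t;      // stand-in for Ceph's pg_t

struct PgEpochTracker {
  std::mutex pg_epoch_lock;
  std::condition_variable pg_epoch_cond;
  std::multiset<epoch_t> pg_epochs;   // one entry per registered PG
  std::map<pg_t, epoch_t> pg_epoch;   // current epoch of each PG

  // Record that a PG has advanced to a new epoch.
  void pg_update_epoch(pg_t pgid, epoch_t epoch) {
    std::lock_guard<std::mutex> l(pg_epoch_lock);
    auto p = pg_epoch.find(pgid);
    if (p != pg_epoch.end())
      pg_epochs.erase(pg_epochs.find(p->second));  // drop the stale entry
    pg_epoch[pgid] = epoch;
    pg_epochs.insert(epoch);
    pg_epoch_cond.notify_all();
  }

  // Oldest epoch any PG is still at (0 when no PGs are registered).
  epoch_t pg_update_get_lower_bound() {
    std::lock_guard<std::mutex> l(pg_epoch_lock);
    return pg_epochs.empty() ? 0 : *pg_epochs.begin();
  }

  // Block until the slowest PG is within max_skew epochs of target;
  // handle_osd_map could call this before consuming a new map.
  void wait_min_pg_epoch(epoch_t target, epoch_t max_skew) {
    std::unique_lock<std::mutex> l(pg_epoch_lock);
    pg_epoch_cond.wait(l, [&] {
      return pg_epochs.empty() || *pg_epochs.begin() + max_skew >= target;
    });
  }
};

The multiset keeps the lower bound available at begin() in constant time, while the per-PG map tells each update which stale entry to drop.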

#5

Updated by Greg Farnum about 10 years ago

That doesn't seem like it's addressing the issue the right way. We've deliberately set things up so that PGs which don't get activity won't wake up and process new maps as frequently; the solution needs to wake those PGs up, not simply block because nothing has woken them. So perhaps if we're processing a map and we have PGs that are 100 epochs behind, we issue them a null event (to bring them up to date), but we don't block (or mark ourselves down) until they're much farther behind (e.g. 1000 epochs); see the sketch below.
Honestly I thought we had some mechanism like that already, but maybe not (or maybe it's not functioning properly)?
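
A sketch of that two-threshold policy, reusing the epoch_t/pg_t stand-ins from the sketch above; queue_null_event() and mark_self_down() are hypothetical placeholders for OSD::queue_null and the real mark-down path, and 100/1000 are just the thresholds floated here:

#include <cstdint>
#include <map>

using epoch_t = uint32_t;   // same stand-ins as the sketch above
using pg_t = uint64_t;

// Hypothetical placeholders for the real OSD machinery.
void queue_null_event(pg_t pgid, epoch_t epoch) { /* queue a null event for this PG */ }
void mark_self_down() { /* tell the monitor we are going down */ }

// Called while processing a new map: nudge lagging PGs with a null
// event, and only escalate when something is very far behind.
void check_pg_epoch_skew(const std::map<pg_t, epoch_t>& pg_epoch,
                         epoch_t cur_epoch) {
  const epoch_t wake_skew = 100;    // nudge PGs this far behind
  const epoch_t down_skew = 1000;   // past this, stop and catch up
  for (const auto& [pgid, e] : pg_epoch) {
    if (e + down_skew < cur_epoch) {
      mark_self_down();             // or block map consumption here
      return;
    }
    if (e + wake_skew < cur_epoch)
      queue_null_event(pgid, cur_epoch);  // wake the PG to process maps
  }
}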

#6

Updated by Greg Farnum about 10 years ago

We looked at this in standup today. There is a queue_null on every PG in OSD::consume_map(), so they should be getting woken up. (I.e., I was mistaken about the current state of affairs.) I'm not sure what else is going on around here.

#7

Updated by Sage Weil about 10 years ago

  • Status changed from 12 to In Progress

wip-7576

#8

Updated by Sage Weil about 10 years ago

  • Status changed from In Progress to Fix Under Review

#9

Updated by Sage Weil almost 10 years ago

  • Status changed from Fix Under Review to Pending Backport

#10

Updated by Sage Weil almost 10 years ago

  • Priority changed from Urgent to High

#11

Updated by Sage Weil almost 10 years ago

  • Status changed from Pending Backport to Resolved

#12

Updated by Sage Weil over 9 years ago

  • Status changed from Resolved to Pending Backport

still want to backport this to firefly ...

#13

Updated by Sage Weil over 9 years ago

...and when we do, include a52a855f6c92b03dd84cd0cc1759084f070a98c2!

#14

Updated by Sage Weil over 9 years ago

  • Priority changed from High to Normal

#15

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved