Bug #8595

osd: client op blocks until backfill starts (dumpling)

Added by Sage Weil almost 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Observed on congress. Logs are at cephstore7098:~sage.

History

#1 Updated by Sage Weil almost 10 years ago

We are hitting this check in do_op():

  if (head == backfill_pos) {
    wait_for_backfill_pos(op);
    return;
  }
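The sketch below shows the failure mode. It assumes the snippet behaves as it reads: ops parked by wait_for_backfill_pos() are requeued only when backfill_pos moves. If backfill has not started, nothing advances backfill_pos, so the parked op waits. That matches this bug's title. The class BackfillGate and its methods are hypothetical, object names are plain strings, and this is not Ceph code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of the dumpling-era gate in do_op(); not Ceph code.
struct Op {
  std::string oid;  // object the client op targets (stands in for head)
  int id;
};

class BackfillGate {
 public:
  explicit BackfillGate(std::string backfill_pos)
      : backfill_pos_(std::move(backfill_pos)) {}

  // Analogue of: if (head == backfill_pos) { wait_for_backfill_pos(op); return; }
  // Returns true if the op may proceed, false if it was parked.
  bool do_op(const Op& op) {
    if (op.oid == backfill_pos_) {
      waiting_.push_back(op);
      return false;
    }
    return true;
  }

  // Only backfill progress moves backfill_pos; parked ops are handed back here
  // (a caller would re-run do_op on them). If backfill never starts, this is
  // never called and the ops stay parked.
  std::vector<Op> advance_backfill_pos(std::string next) {
    backfill_pos_ = std::move(next);
    std::vector<Op> ready(waiting_.begin(), waiting_.end());
    waiting_.clear();
    return ready;
  }

  std::size_t num_waiting() const { return waiting_.size(); }

 private:
  std::string backfill_pos_;
  std::deque<Op> waiting_;
};
```

For example, with backfill_pos at "obj_b", an op on "obj_b" is parked and stays parked until advance_backfill_pos() is called. Ops on every other object pass straight through.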

#2 Updated by Sage Weil almost 10 years ago

Commit 6f975e35a1e29a01347e4a6709b54a0422e063dd fixed it in emperor, but not intentionally.

#3 Updated by David Zafman almost 10 years ago

Commits 6f975e35a1e29a01347e4a6709b54a0422e063dd and 3d0d69fed09675fc466f6c5b2736ba674923823d both look like they need to be backported together.

I created wip-8595-dz on the dumpling branch for further testing.

#4 Updated by Sage Weil over 9 years ago

  • Priority changed from Normal to Urgent

#5 Updated by Sage Weil over 9 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil

#6 Updated by Sage Weil over 9 years ago

  • Status changed from In Progress to 7

#7 Updated by Sage Weil over 9 years ago

  • Status changed from 7 to In Progress

With this patch, I see FileStore tripping over ENOENT on clone:

ubuntu@teuthology:/a/teuthology-2014-08-11_19:00:01-rados-dumpling-testing-basic-multi/417411
ubuntu@teuthology:/a/teuthology-2014-08-11_19:00:01-rados-dumpling-testing-basic-multi/417465

These jobs were:

description: rados/thrash/{clusters/fixed-2.yaml fs/ext4.yaml msgr-failures/few.yaml thrashers/morepggrow.yaml workloads/snaps-many-objects.yaml}
description: rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/few.yaml thrashers/pggrow.yaml workloads/snaps-many-objects.yaml}

#8 Updated by Sage Weil over 9 years ago

  • Assignee deleted (Sage Weil)

#9 Updated by Sage Weil over 9 years ago

  • Status changed from In Progress to 12

The simple fixes here seem insufficient (they fail in QA). I haven't seen anybody else hitting this, which surprises me a bit.

#10 Updated by Samuel Just over 9 years ago

  • Status changed from 12 to 7
  • Assignee set to Samuel Just

I think the least distasteful solution is to actually backport the last_backfill_started modifications. I'll start testing a branch.
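This sketch suggests why tracking last_backfill_started could help. It uses a simplified model loosely based on later OSD code, not the backported patch. In this model a client op waits only while its own object is being pushed. A write goes to the backfill target only if backfill has already reached that object (oid <= last_backfill_started, or oid at or below the target's own last_backfill). BackfillState and its methods are hypothetical, and objects are ordered as plain strings.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical model of the last_backfill_started idea; not the backported patch.
// Objects are ordered as plain strings here for simplicity.
class BackfillState {
 public:
  // peer_last_backfill: how far the backfill target was already complete
  // (held fixed in this model). An empty string means "nothing yet".
  explicit BackfillState(std::string peer_last_backfill)
      : peer_last_backfill_(std::move(peer_last_backfill)) {}

  // Backfill begins pushing oid; last_backfill_started only moves forward.
  void start_push(const std::string& oid) {
    if (oid > last_backfill_started_) last_backfill_started_ = oid;
    in_flight_.insert(oid);
  }

  void finish_push(const std::string& oid) { in_flight_.erase(oid); }

  // A client op waits only while its own object is being pushed, so nothing
  // blocks merely because backfill has not started yet.
  bool must_wait(const std::string& oid) const {
    return in_flight_.count(oid) > 0;
  }

  // Writes to objects backfill has not reached are not sent to the target;
  // backfill will copy the latest version when it gets there.
  bool should_send_to_backfill_target(const std::string& oid) const {
    return oid <= last_backfill_started_ || oid <= peer_last_backfill_;
  }

 private:
  std::string peer_last_backfill_;
  std::string last_backfill_started_;  // empty: backfill has not started
  std::set<std::string> in_flight_;
};
```

The design trade-off in this model: client ops stop depending on backfill progress, at the cost of tracking exactly which objects are in flight.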

#11 Updated by Samuel Just over 9 years ago

It seems we also need to backport the update_range/scan_range changes from just after dumpling (intended to avoid backfill-related flushes):
0cb797b3e5064ec6b092dc28f23a85fcdbf96526
1f9e137b51d580bce6505773217bcca008225a47

Branch: wip-sam-dumpling-testing
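This sketch illustrates the idea behind update_range. It assumes a backfill interval records the log version it was scanned at. Refreshing the interval from newer PG log entries then avoids a rescan, and with it the flush a rescan would presumably need. LogEntry, BackfillInterval, and this update_range signature are hypothetical simplifications, not Ceph's types. The interval is treated as half-open [begin, end).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified types; not Ceph's pg_log_entry_t / BackfillInterval.
struct LogEntry {
  unsigned version;   // log version of this entry
  std::string oid;    // object it touched
  bool is_delete;     // true for a delete, false for a modify
};

struct BackfillInterval {
  std::string begin, end;                   // covers objects in [begin, end)
  unsigned version = 0;                     // log version the scan reflects
  std::map<std::string, unsigned> objects;  // oid -> object version
};

// Bring bi up to last_update by replaying newer log entries (assumed sorted
// oldest first) that fall in its range. Returns false when the log has been
// trimmed past bi.version; the caller would then have to rescan the range.
bool update_range(BackfillInterval& bi, const std::vector<LogEntry>& log,
                  unsigned log_tail, unsigned last_update) {
  if (bi.version >= last_update) return true;  // already current
  if (bi.version < log_tail) return false;     // log no longer covers the gap
  for (const LogEntry& e : log) {
    if (e.version <= bi.version) continue;              // already reflected
    if (e.oid < bi.begin || e.oid >= bi.end) continue;  // outside the interval
    if (e.is_delete)
      bi.objects.erase(e.oid);
    else
      bi.objects[e.oid] = e.version;
  }
  bi.version = last_update;
  return true;
}
```

In this model the fast path touches only log entries, and the expensive rescan is kept as the fallback for when the log has been trimmed too far.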

#12 Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Resolved
