Bug #2002

osd: racy push/pull for clones

Added by Sage Weil almost 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

There is currently a race in which:
- an adjacent clone is missing
- we (calculate some clone overlap? and) start pulling
- we get the adjacent clone
- we get the push, calculate a different overlap, and then get confused.
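
The sequence above can be illustrated with a toy model. This is only a sketch, not the OSD code: `Puller`, `subtract`, `racy_recover`, and `pinned_recover` are hypothetical names, and byte ranges are simplified to half-open `(start, end)` tuples. It shows how recomputing the overlap after the adjacent clone arrives yields a different set of ranges than the one requested at pull time. Pinning the ranges recorded at pull time is one possible way to avoid the mismatch; it is not claimed to be the fix that went into Ceph.

```python
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def subtract(whole: Range, shared: List[Range]) -> List[Range]:
    """Return the parts of `whole` that are not covered by `shared`."""
    out = []
    pos, end = whole
    for s, e in sorted(shared):
        if s > pos:
            out.append((pos, min(s, end)))
        pos = max(pos, e)
        if pos >= end:
            break
    if pos < end:
        out.append((pos, end))
    return out


@dataclass
class Puller:
    size: int
    # Ranges this clone shares with its adjacent clone (hypothetical field).
    overlap_with_adjacent: List[Range]
    have_adjacent: bool = False

    def ranges_to_fetch(self) -> List[Range]:
        # Shared ranges can only be skipped if the adjacent clone is local.
        shared = self.overlap_with_adjacent if self.have_adjacent else []
        return subtract((0, self.size), shared)


def racy_recover(p: Puller):
    requested = p.ranges_to_fetch()  # computed when the pull is sent
    p.have_adjacent = True           # adjacent clone arrives in the meantime
    expected = p.ranges_to_fetch()   # recomputed when the push arrives
    return requested, expected


def pinned_recover(p: Puller):
    requested = p.ranges_to_fetch()  # computed when the pull is sent
    p.have_adjacent = True           # adjacent clone arrives in the meantime
    expected = requested             # reuse what was recorded at pull time
    return requested, expected
```

With a 100-byte clone that shares `(20, 60)` with its neighbour, `racy_recover` requests the whole object but then expects only the non-shared ranges, while `pinned_recover` keeps the two views consistent.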

Also, we don't work efficiently when pulling clones in parallel. We should probably serialize on each object_t so that we don't waste disk space. Recovery will probably still be faster.
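
A minimal sketch of what serializing on each object could look like, assuming a simple in-memory scheduler. `PullScheduler` and its methods are hypothetical and do not correspond to any Ceph class; the object name stands in for `object_t` and the clone is an opaque id. The idea is that at most one clone pull per object is in flight, so a later clone can be recovered after its neighbour is already local.

```python
from collections import defaultdict, deque


class PullScheduler:
    """Allow at most one in-flight clone pull per object name."""

    def __init__(self):
        self.in_flight = {}                # object name -> clone being pulled
        self.waiting = defaultdict(deque)  # object name -> queued clones

    def request(self, obj, clone):
        """Return True if the pull may start now; otherwise queue it."""
        if obj in self.in_flight:
            self.waiting[obj].append(clone)
            return False
        self.in_flight[obj] = clone
        return True

    def complete(self, obj):
        """Finish the current pull; return the next clone to pull, or None."""
        del self.in_flight[obj]
        queue = self.waiting.get(obj)
        if queue:
            nxt = queue.popleft()
            self.in_flight[obj] = nxt
            return nxt
        # Nothing left for this object; drop the empty queue if present.
        self.waiting.pop(obj, None)
        return None
```

Pulls for different objects still proceed in parallel; only clones of the same object wait for each other.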


Related issues

Duplicated by Ceph - Feature #2055: osd: fix up push cloning (Duplicate)
Duplicated by Ceph - Bug #1943: osd: bad clone transaction on journal replay (Duplicate, 01/14/2012)

Associated revisions

Revision 2116f012 (diff)
Added by Sage Weil almost 8 years ago

osd: disable clone overlap for push/pull

There is a bug in the push/pull code. Disable the recovery smarts by
default until we fix #2002.

There is currently a race (in the callers) where:
- an adjacent clone is missing
- we (calculate some clone overlap? and) start pulling
- we get adjacent clone
- we get push, calc a different overlap, and then get confused.

Signed-off-by: Sage Weil <>

Revision 1775301b (diff)
Added by Sage Weil over 7 years ago

osd: reenable clone on recovery

This hasn't turned up problems in QA.

Fixes: #2002
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil almost 8 years ago

:osd.log.badpushpull

shows this (or similar) badness. The workload was:


kernel:
  branch: master
interactive-on-error: true

roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mon.c
  - client.0
  - osd.3
  - osd.4
  - osd.5
tasks:
- ceph:
    btrfs: 1
    log-whitelist:
    - wrongly marked me down or wrong addr
    conf:
      osd:
        debug ms: 1
        debug osd: 20
      mon:
        debug ms: 10
        debug mon: 20
- thrashosds:
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      write: 100
      snap_create: 50
      snap_remove: 50
      snap_rollback: 50
    ops: 4000

on 2116f012eddfe3278fcdfeb5a2ddc877491d210d

#2 Updated by Sage Weil almost 8 years ago

  • Target version set to v0.43

#3 Updated by Sage Weil almost 8 years ago

  • Status changed from New to 7
  • Source set to Development

reenabling this in my thrashing tests. if all goes well, i'll reenable in master under the assumption that sam's cleanups addressed the problem.

#4 Updated by Sage Weil almost 8 years ago

  • Target version changed from v0.43 to v0.44

#5 Updated by Sage Weil over 7 years ago

  • Status changed from 7 to Resolved

haven't seen this in forever; looks fixed.

#6 Updated by Sage Weil over 7 years ago

  • Status changed from Resolved to 7
  • Target version changed from v0.44 to v0.45

i take that back; this wasn't enabled in qa. adding to the teuthology ceph.conf file.

#7 Updated by Sage Weil over 7 years ago

  • Status changed from 7 to Resolved
