Bug #6585

closed

osd: backfill vs copy-from delay badness (was osd: ENOENT on clone)

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

with logs!

2013-10-17 16:28:01.222009 7fb15369e780  0 filestore(/var/lib/ceph/osd/ceph-1)  error (2) No such file or directory not handled on operation 17 (7273.1.0, or op 0, counting from 0)
2013-10-17 16:28:01.222038 7fb15369e780  0 filestore(/var/lib/ceph/osd/ceph-1) ENOENT on clone suggests osd bug

ubuntu@teuthology:/a/sage-bug-6582-a/57528
job is hung; log on plana18 and copy in the archive dir


Related issues 2 (0 open, 2 closed)

Related to Ceph - Bug #6593: osd: copy-from object blocking clashes with recovery (Duplicate, 10/18/2013)
Has duplicate Ceph - Bug #6602: osd/PG.cc: 2465: FAILED assert(r == 0) (SnapMapper::update_snaps() returns error) (Duplicate, Samuel Just, 10/21/2013)

Actions #1

Updated by Sage Weil over 10 years ago

dump of the journal is in the dir now. i see a transaction that deletes the object and starts writing to a temp object:

    { "offset": 11792384,
      "seq": 7271,
      "transactions": [
            { "trans_num": 0,
              "ops": [
                    { "op_num": 0,
                      "op_name": "mkcoll",
                      "collection": "3.8_TEMP"},
                    { "op_num": 1,
                      "op_name": "remove",
                      "collection": "3.8_head",
                      "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
                    { "op_num": 2,
                      "op_name": "remove",
                      "collection": "3.8_head",
                      "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
                    { "op_num": 3,
                      "op_name": "remove",
                      "collection": "3.8_TEMP",
                      "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
                    { "op_num": 4,
                      "op_name": "touch",
                      "collection": "3.8_TEMP",
                      "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
                    { "op_num": 5,
                      "op_name": "omap_setheader",
                      "collection": "3.8_TEMP",
                      "oid": "ebe36168\/plana186588-18\/fa\/\/3",
                      "header_length": "0"},
...

and then the next mention of the object is
            { "trans_num": 1,
              "ops": [
                    { "op_num": 0,
                      "op_name": "clone",
                      "collection": "3.8_head",
                      "src_oid": "ebe36168\/plana186588-18\/head\/\/3",
                      "dst_oid": "ebe36168\/plana186588-18\/111\/\/3"},
                    { "op_num": 1,
                      "op_name": "setattr",
                      "collection": "3.8_head",
                      "oid": "ebe36168\/plana186588-18\/111\/\/3",
                      "name": "_",
                      "length": 222},
                    { "op_num": 2,
                      "op_name": "rmattr",
                      "collection": "3.8_head",
                      "oid": "ebe36168\/plana186588-18\/111\/\/3",
                      "name": "snapset"},
                    { "op_num": 3,
                      "op_name": "remove",
                      "collection": "3.8_head",
                      "oid": "ebe36168\/plana186588-18\/head\/\/3"},
                    { "op_num": 4,

Actions #2

Updated by Sage Weil over 10 years ago

  • Assignee set to Samuel Just

around 16:27:52.761566 in the osd.5 log there is a modify that happens right in the middle of the same object being backfilled.

Actions #3

Updated by Sage Weil over 10 years ago

  • Status changed from New to 12
Actions #4

Updated by Samuel Just over 10 years ago

Once we are ready to actually move the object into place we need to consider the possibility that backfill is now blocking writes on that object.

Actions #5

Updated by Sage Weil over 10 years ago

  • Subject changed from osd: ENOENT on clone to osd: backfill vs copy-from delay badness (was osd: ENOENT on clone)
Actions #6

Updated by Sage Weil over 10 years ago

ubuntu@teuthology:/a/teuthology-2013-10-20_23:00:15-rados-master-testing-basic-plana/61776

Actions #7

Updated by Samuel Just over 10 years ago

  • Assignee changed from Samuel Just to Greg Farnum
Actions #8

Updated by Greg Farnum over 10 years ago

  • Status changed from 12 to In Progress

Proposed solution (thanks for walking me through this, Sam!):
In ReplicatedPG::recover_backfill, grab a RWTracker read lock on every object we are going to backfill. If we can't get one on an object, stop the loop early.
(Update the return values and state appropriately for having short-circuited the push start)
Add a callback to the RWTracker that starts up the backfill machinery again when we get our read lock.
Release the read lock at an appropriate time in the push code.
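
The steps above can be sketched roughly as follows. This is a minimal, single-threaded illustration only, not the code in wip-6585: ObjectLockTracker, Backfiller, wait_for_read and on_push_complete are hypothetical names standing in for RWTracker and the ReplicatedPG backfill/push code, and the real locking rules are assumed rather than taken from the Ceph source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified per-object read/write tracker (NOT Ceph's RWTracker).
class ObjectLockTracker {
 public:
  using Callback = std::function<void()>;

  // Take a shared (read) lock; fails while a writer holds the object.
  bool try_read_lock(const std::string& oid) {
    State& s = objects_[oid];
    if (s.writer) return false;
    ++s.readers;
    return true;
  }

  void put_read(const std::string& oid) {
    State& s = objects_[oid];
    assert(s.readers > 0);
    --s.readers;
  }

  // Take an exclusive (write) lock; fails while anyone else holds the object.
  bool try_write_lock(const std::string& oid) {
    State& s = objects_[oid];
    if (s.writer || s.readers > 0) return false;
    s.writer = true;
    return true;
  }

  // Drop the write lock, then run every callback that was waiting on it.
  void put_write(const std::string& oid) {
    State& s = objects_[oid];
    assert(s.writer);
    s.writer = false;
    std::vector<Callback> waiters;
    waiters.swap(s.waiters);  // detach first: callbacks may re-register
    for (auto& cb : waiters) cb();
  }

  // Register a callback to run once the current writer releases the object.
  void wait_for_read(const std::string& oid, Callback cb) {
    objects_[oid].waiters.push_back(std::move(cb));
  }

 private:
  struct State {
    int readers = 0;
    bool writer = false;
    std::vector<Callback> waiters;
  };
  std::map<std::string, State> objects_;
};

// Hypothetical stand-in for the backfill loop and the push completion path.
class Backfiller {
 public:
  Backfiller(ObjectLockTracker& t, std::vector<std::string> objs)
      : tracker(t), pending(std::move(objs)) {}

  // Start pushes in order; stop early at the first object we cannot read-lock.
  // Returns how many pushes were started by this call.
  std::size_t recover_backfill() {
    std::size_t started = 0;
    while (!pending.empty()) {
      const std::string oid = pending.front();
      if (!tracker.try_read_lock(oid)) {
        // Short-circuit, and restart backfill once the writer is done.
        tracker.wait_for_read(oid, [this] { ++restarts; recover_backfill(); });
        break;
      }
      pending.erase(pending.begin());
      in_flight.push_back(oid);
      ++started;
    }
    return started;
  }

  // Called when a push finishes: release the read lock taken above.
  void on_push_complete(const std::string& oid) {
    auto it = std::find(in_flight.begin(), in_flight.end(), oid);
    assert(it != in_flight.end());
    in_flight.erase(it);
    tracker.put_read(oid);
  }

  ObjectLockTracker& tracker;
  std::vector<std::string> pending;    // not yet started, in backfill order
  std::vector<std::string> in_flight;  // read-locked, push started
  int restarts = 0;                    // times the callback restarted backfill
};
```

In this sketch, if a write holds "b" while backfill wants to push "a", "b", "c", the first recover_backfill() call starts only "a" and parks a callback on "b". Releasing the write lock runs the callback, which picks up "b" and "c". While a push holds its read lock, a new write on that object cannot start until on_push_complete() runs.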

Actions #9

Updated by Greg Farnum over 10 years ago

  • Status changed from In Progress to Fix Under Review

Pushed to wip-6585. Haven't tested it yet; need it to build so I can start the teuthology thrashing copy-from tests.

Actions #10

Updated by Greg Farnum over 10 years ago

  • Status changed from Fix Under Review to 7

Sam liked it; have squashed and am scheduling a suite run now.

Actions #11

Updated by Samuel Just over 10 years ago

65817 FAIL scheduled_teuthology@teuthology rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/snaps-few-objects.yaml} 1550s
65889 FAIL scheduled_teuthology@teuthology rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/few.yaml thrashers/default.yaml workloads/snaps-few-objects.yaml} 1774s

Actions #12

Updated by Greg Farnum over 10 years ago

  • Assignee changed from Greg Farnum to Samuel Just

We think we might have fixed this, or at least most of it — but testing is shaking out a lot of long-standing bugs in other places that Sam has been working on, so he gets the ticket.

Actions #13

Updated by Samuel Just over 10 years ago

I've merged wip-6585, but it's not quite fixed yet.

Actions #14

Updated by Sage Weil over 10 years ago

  • Status changed from 7 to Resolved