Bug #6585
Closed
osd: backfill vs copy-from delay badness (was osd: ENOENT on clone)
Added by Sage Weil over 10 years ago. Updated over 10 years ago.
Description
with logs!
2013-10-17 16:28:01.222009 7fb15369e780 0 filestore(/var/lib/ceph/osd/ceph-1) error (2) No such file or directory not handled on operation 17 (7273.1.0, or op 0, counting from 0)
2013-10-17 16:28:01.222038 7fb15369e780 0 filestore(/var/lib/ceph/osd/ceph-1) ENOENT on clone suggests osd bug
ubuntu@teuthology:/a/sage-bug-6582-a/57528
job is hung; log on plana18 and copy in the archive dir
Updated by Sage Weil over 10 years ago
Dump of the journal is in the dir now. I see a transaction that deletes the object and starts writing to a temp object:
{ "offset": 11792384, "seq": 7271, "transactions": [
  { "trans_num": 0, "ops": [
    { "op_num": 0, "op_name": "mkcoll", "collection": "3.8_TEMP"},
    { "op_num": 1, "op_name": "remove", "collection": "3.8_head", "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
    { "op_num": 2, "op_name": "remove", "collection": "3.8_head", "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
    { "op_num": 3, "op_name": "remove", "collection": "3.8_TEMP", "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
    { "op_num": 4, "op_name": "touch", "collection": "3.8_TEMP", "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
    { "op_num": 5, "op_name": "omap_setheader", "collection": "3.8_TEMP", "oid": "ebe36168\/plana186588-18\/fa\/\/3", "header_length": "0"},
    ...
and then the next mention of the object is
{ "trans_num": 1, "ops": [
  { "op_num": 0, "op_name": "clone", "collection": "3.8_head", "src_oid": "ebe36168\/plana186588-18\/head\/\/3", "dst_oid": "ebe36168\/plana186588-18\/111\/\/3"},
  { "op_num": 1, "op_name": "setattr", "collection": "3.8_head", "oid": "ebe36168\/plana186588-18\/111\/\/3", "name": "_", "length": 222},
  { "op_num": 2, "op_name": "rmattr", "collection": "3.8_head", "oid": "ebe36168\/plana186588-18\/111\/\/3", "name": "snapset"},
  { "op_num": 3, "op_name": "remove", "collection": "3.8_head", "oid": "ebe36168\/plana186588-18\/head\/\/3"},
  { "op_num": 4,
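For anyone tracing an object through a dump like this, a small script can list every op that mentions it. This is only a sketch, assuming the dump can be loaded as a JSON list of journal entries shaped like the excerpts above. The top-level list, the helper name ops_mentioning, and the abbreviated sample are assumptions for illustration. The second sample entry has no "seq" because the excerpt does not show one.

```python
import json

# Abbreviated sample in the shape of the excerpts above (only a few ops kept).
# Wrapping the entries in a top-level list is an assumption about the dump format.
SAMPLE = r'''
[
  {"offset": 11792384, "seq": 7271, "transactions": [
    {"trans_num": 0, "ops": [
      {"op_num": 0, "op_name": "mkcoll", "collection": "3.8_TEMP"},
      {"op_num": 1, "op_name": "remove", "collection": "3.8_head",
       "oid": "ebe36168\/plana186588-18\/fa\/\/3"},
      {"op_num": 4, "op_name": "touch", "collection": "3.8_TEMP",
       "oid": "ebe36168\/plana186588-18\/fa\/\/3"}
    ]}
  ]},
  {"transactions": [
    {"trans_num": 1, "ops": [
      {"op_num": 0, "op_name": "clone", "collection": "3.8_head",
       "src_oid": "ebe36168\/plana186588-18\/head\/\/3",
       "dst_oid": "ebe36168\/plana186588-18\/111\/\/3"},
      {"op_num": 3, "op_name": "remove", "collection": "3.8_head",
       "oid": "ebe36168\/plana186588-18\/head\/\/3"}
    ]}
  ]}
]
'''

# Keys that can carry an object name in the ops shown in the excerpts.
OID_KEYS = ("oid", "src_oid", "dst_oid")


def ops_mentioning(entries, needle):
    """Return (seq, trans_num, op_num, op_name, collection) for every op
    whose oid, src_oid, or dst_oid contains `needle`, in journal order."""
    hits = []
    for entry in entries:
        for trans in entry.get("transactions", []):
            for op in trans.get("ops", []):
                if any(needle in op.get(key, "") for key in OID_KEYS):
                    hits.append((entry.get("seq"), trans.get("trans_num"),
                                 op.get("op_num"), op.get("op_name"),
                                 op.get("collection")))
    return hits


if __name__ == "__main__":
    for hit in ops_mentioning(json.loads(SAMPLE), "plana186588-18"):
        print(hit)
```

Matching on a substring of the object name rather than the full oid catches head, clone, and temp-collection variants of the same object in one pass.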
Updated by Sage Weil over 10 years ago
- Assignee set to Samuel Just
Around 16:27:52.761566 in the osd.5 log there is a modify that happens right in the middle of the same object being backfilled.
Updated by Samuel Just over 10 years ago
Once we are ready to actually move the object into place we need to consider the possibility that backfill is now blocking writes on that object.
Updated by Sage Weil over 10 years ago
- Subject changed from osd: ENOENT on clone to osd: backfill vs copy-from delay badness (was osd: ENOENT on clone)
Updated by Sage Weil over 10 years ago
ubuntu@teuthology:/a/teuthology-2013-10-20_23:00:15-rados-master-testing-basic-plana/61776
Updated by Samuel Just over 10 years ago
- Assignee changed from Samuel Just to Greg Farnum
Updated by Greg Farnum over 10 years ago
- Status changed from 12 to In Progress
Proposed solution (thanks for walking me through this, Sam!):
In ReplicatedPG::recover_backfill, grab a RWTracker read lock on every object we are going to backfill. If we can't get one on an object, stop the loop early.
(Update the return values and state appropriately for having short-circuited the push start)
Add a callback to the RWTracker that starts up the backfill machinery again when we get our read lock.
Release the read lock at an appropriate time in the push code.
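The steps above can be sketched as a toy model. This is not the code in wip-6585 and not Ceph's real RWTracker API. ToyRWTracker, ToyBackfill, and all method names are hypothetical, and the model is single-threaded Python purely to show the control flow: take a read lock per object, stop the loop early when one is unavailable, restart from a callback, and release the lock when the push finishes.

```python
from collections import defaultdict


class ToyRWTracker:
    """Hypothetical per-object read/write tracker (not Ceph's RWTracker)."""

    def __init__(self):
        self.readers = defaultdict(int)   # oid -> number of read holders
        self.writers = defaultdict(int)   # oid -> number of write holders
        self.waiters = defaultdict(list)  # oid -> callbacks waiting to read

    def get_write(self, oid):
        if self.readers[oid]:
            return False                  # backfill holds the object
        self.writers[oid] += 1
        return True

    def put_write(self, oid):
        self.writers[oid] -= 1
        if self.writers[oid] == 0:
            # Wake everything that was waiting for a read lock.
            callbacks, self.waiters[oid] = self.waiters[oid], []
            for callback in callbacks:
                callback()

    def get_read(self, oid, on_available=None):
        if self.writers[oid]:
            if on_available is not None:
                self.waiters[oid].append(on_available)
            return False
        self.readers[oid] += 1
        return True

    def put_read(self, oid):
        self.readers[oid] -= 1


class ToyBackfill:
    """Hypothetical stand-in for the backfill loop and push completion."""

    def __init__(self, tracker, objects):
        self.tracker = tracker
        self.pending = list(objects)      # objects still to backfill, in order
        self.pushing = []                 # objects with a push in flight

    def recover_backfill(self):
        """Start pushes until an object cannot be read-locked.
        Returns how many pushes were started in this call."""
        started = 0
        while self.pending:
            oid = self.pending[0]
            if not self.tracker.get_read(oid, on_available=self.recover_backfill):
                break                     # stop early; the callback restarts us
            self.pending.pop(0)
            self.pushing.append(oid)
            started += 1
        return started

    def on_push_complete(self, oid):
        self.pushing.remove(oid)
        self.tracker.put_read(oid)        # release the read lock after the push


if __name__ == "__main__":
    tracker = ToyRWTracker()
    tracker.get_write("b")                # a write is in flight on "b"
    backfill = ToyBackfill(tracker, ["a", "b", "c"])
    print(backfill.recover_backfill(), backfill.pending)
    tracker.put_write("b")                # write finishes, backfill resumes
    print(backfill.pushing, backfill.pending)
```

Stopping at the first blocked object, rather than skipping it, keeps the backfill in object order in this model. Whether the real fix needs that ordering is not stated above.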
Updated by Greg Farnum over 10 years ago
- Status changed from In Progress to Fix Under Review
Pushed to wip-6585. Haven't tested it yet; need it to build so I can start the teuthology thrashing copy-from tests.
Updated by Greg Farnum over 10 years ago
- Status changed from Fix Under Review to 7
Sam liked it; have squashed and am scheduling a suite run now.
Updated by Samuel Just over 10 years ago
65817 FAIL scheduled_teuthology@teuthology rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/snaps-few-objects.yaml} 1550s
65889 FAIL scheduled_teuthology@teuthology rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/few.yaml thrashers/default.yaml workloads/snaps-few-objects.yaml} 1774s
Updated by Greg Farnum over 10 years ago
- Assignee changed from Greg Farnum to Samuel Just
We think we might have fixed this, or at least most of it — but testing is shaking out a lot of long-standing bugs in other places that Sam has been working on, so he gets the ticket.
Updated by Samuel Just over 10 years ago
I've merged wip-6585, but it's not quite fixed yet.