Bug #20380
closedStale bucket index entries are left over after object deletions
0%
Description
On many of our clusters running Jewel (10.2.5+), am running into a strange problem of having stale bucket index entries left over for (some of the) objects deleted.
The symptoms are that the actual delete operation of an object is reported successful in the RGW logs, but a bucket list on the container would still show the deleted object. An attempt to download/stat of the object appropriately results in a failure. No failures are seen in the respective OSDs where the bucket index object is located. And rebuilding the bucket index by running ‘radosgw-admin bucket check –fix’ would fix the issue.
Though I could simulate the problem by instrumenting the code, to not to have invoked `complete_del` on the bucket index op https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793, but that call is always seem to be made unless there is a cascading error from the actual delete operation of the object, which doesn’t seem to be the case here.
A simple script like below would easily reproduce the issue:
$# cat delete_create_object #!/bin/sh i=1 objects=100 prefix=$1 while [ $i -lt $objects ] do object=$prefix.`date +%Y-%m-%d:%H:%M:%S` `touch $object` swift -A http://localhost:8000/auth -U test:tester -K testing upload test $object swift -A http://localhost:8000/auth -U test:tester -K testing delete test $object swift -A http://localhost:8000/auth -U test:tester -K testing list test i=`expr $i + 1` done $# cat script ./delete_create_object one& ./delete_create_object two& ./delete_create_object three& ./delete_create_object four& ./delete_create_object five& ./delete_create_object six& ./delete_create_object seven& ./delete_create_object eight& ./delete_create_object nine& ./delete_create_object ten&
In one such run of the above script, I ended up with the below on my test single node (vstart) cluster running 1 RGW, 3 OSDs. A bucket-check with --fix resolves the issue.
$# swift -A http://localhost:8000/auth -U test:tester -K testing list test eight.2017-06-22:15:25:59 eight.2017-06-22:15:26:31 eight.2017-06-22:15:26:39 eight.2017-06-22:15:27:22 eight.2017-06-22:15:27:56 eight.2017-06-22:15:29:20 eight.2017-06-22:15:30:10 eight.2017-06-22:15:31:24 eight.2017-06-22:15:31:31 eight.2017-06-22:15:32:17 eight.2017-06-22:15:33:26 eight.2017-06-22:15:33:53 eight.2017-06-22:15:34:20 five.2017-06-22:15:24:41 five.2017-06-22:15:25:27 five.2017-06-22:15:25:53 five.2017-06-22:15:27:07 five.2017-06-22:15:28:14 five.2017-06-22:15:28:18 five.2017-06-22:15:29:20 five.2017-06-22:15:31:04 five.2017-06-22:15:31:18 five.2017-06-22:15:31:31 five.2017-06-22:15:32:17 five.2017-06-22:15:33:40 five.2017-06-22:15:34:27 five.2017-06-22:15:35:05 five.2017-06-22:15:35:21 four.2017-06-22:15:24:48 four.2017-06-22:15:25:33 four.2017-06-22:15:26:06 four.2017-06-22:15:26:18 four.2017-06-22:15:26:25 four.2017-06-22:15:27:43 four.2017-06-22:15:28:18 four.2017-06-22:15:28:35 four.2017-06-22:15:32:31 four.2017-06-22:15:33:40 four.2017-06-22:15:33:53 four.2017-06-22:15:34:26 four.2017-06-22:15:35:05 four.2017-06-22:15:35:40 nine.2017-06-22:15:30:50 nine.2017-06-22:15:31:37 nine.2017-06-22:15:31:51 nine.2017-06-22:15:31:57 nine.2017-06-22:15:32:17 nine.2017-06-22:15:33:26 nine.2017-06-22:15:34:27 nine.2017-06-22:15:35:21 nine.2017-06-22:15:35:40 one.2017-06-22:15:25:53 one.2017-06-22:15:26:25 one.2017-06-22:15:26:39 one.2017-06-22:15:27:00 one.2017-06-22:15:27:07 one.2017-06-22:15:31:40 one.2017-06-22:15:31:51 one.2017-06-22:15:31:57 one.2017-06-22:15:32:37 one.2017-06-22:15:32:45 one.2017-06-22:15:34:00 one.2017-06-22:15:34:52 one.2017-06-22:15:34:58 seven.2017-06-22:15:25:20 seven.2017-06-22:15:25:27 seven.2017-06-22:15:25:33 seven.2017-06-22:15:26:49 seven.2017-06-22:15:27:00 seven.2017-06-22:15:27:13 seven.2017-06-22:15:27:22 seven.2017-06-22:15:28:18 seven.2017-06-22:15:28:35 seven.2017-06-22:15:31:04 seven.2017-06-22:15:31:11 seven.2017-06-22:15:31:24 seven.2017-06-22:15:31:44 seven.2017-06-22:15:34:20 seven.2017-06-22:15:34:27 six.2017-06-22:15:25:20 six.2017-06-22:15:26:06 six.2017-06-22:15:26:12 six.2017-06-22:15:27:13 six.2017-06-22:15:28:35 six.2017-06-22:15:28:42 six.2017-06-22:15:28:59 six.2017-06-22:15:29:35 six.2017-06-22:15:30:30 ten.2017-06-22:15:26:06 ten.2017-06-22:15:27:56 ten.2017-06-22:15:28:27 ten.2017-06-22:15:28:42 ten.2017-06-22:15:29:57 ten.2017-06-22:15:30:17 ten.2017-06-22:15:31:18 ten.2017-06-22:15:31:24 ten.2017-06-22:15:31:44 ten.2017-06-22:15:32:41 ten.2017-06-22:15:33:26 ten.2017-06-22:15:34:32 three.2017-06-22:15:24:41 three.2017-06-22:15:25:01 three.2017-06-22:15:25:53 three.2017-06-22:15:26:25 three.2017-06-22:15:27:43 three.2017-06-22:15:28:13 three.2017-06-22:15:28:59 three.2017-06-22:15:29:27 three.2017-06-22:15:30:04 three.2017-06-22:15:30:44 three.2017-06-22:15:30:57 three.2017-06-22:15:31:18 three.2017-06-22:15:32:31 three.2017-06-22:15:32:37 three.2017-06-22:15:32:52 three.2017-06-22:15:33:06 two.2017-06-22:15:25:27 two.2017-06-22:15:26:12 two.2017-06-22:15:26:18 two.2017-06-22:15:28:21 two.2017-06-22:15:31:44 two.2017-06-22:15:34:27 two.2017-06-22:15:34:52 two.2017-06-22:15:34:58 $# ./bin/radosgw-admin object stat --object three.2017-06-22:15:30:57 --bucket test ERROR: failed to stat object, returned error: (2) No such file or directory $# ./bin/radosgw-admin object stat --object ten.2017-06-22:15:30:17 --bucket test ERROR: failed to stat object, returned error: (2) No such file or directory $# ./bin/radosgw-admin bucket check --bucket test --check-objects [] $# ./bin/radosgw-admin bucket check --bucket test --check-objects --fix [] {} { "existing_header": { "usage": { "rgw.main": { "size": 0, "size_actual": 0, "size_utilized": 0, "size_kb": 0, "size_kb_actual": 0, "size_kb_utilized": 0, "num_objects": 0 } } }, "calculated_header": { "usage": {} } } $# swift -A http://localhost:8000/auth -U test:tester -K testing list test
Updated by Orit Wasserman almost 7 years ago
- Priority changed from Normal to High
Updated by Pavan Rallabhandi almost 7 years ago
Orit Wasserman wrote:
Pavan, are you using versioning?
No Orit, we are not using object versioning. One interesting observation from the repro steps is that, am not able to repro if the list operation is not present (out of create, delete, list). Only with those steps mentioned in the script am able to 100% repro.
Updated by Matt Benjamin almost 7 years ago
- Status changed from New to In Progress
Updated by Matt Benjamin almost 7 years ago
- Backport set to jewel kraken
Hi Pavan,
I haven't proved this resolves, but I did prove (in a non-reproducer scenario) that the tag_timeout values being produced did not compare equal with ones scaled as std::chrono::seconds. Assuming that's the race window, I'm not sure how large it is, maybe millis.
Updated by Pavan Rallabhandi almost 7 years ago
- Assignee changed from Pavan Rallabhandi to Matt Benjamin
Yes Matt, the fix seems to take effect. Thanks for the quick turnaround.
I was going through the commit where the resolution time had been changed, https://github.com/ceph/ceph/commit/416234b38be77ebb2decabc91a74055387a4b10a. Would this also mean we need to look at other places where `timespan` has been introduced and may possibly lead to other races?
Updated by Matt Benjamin almost 7 years ago
Yes, I think follows that there could be other ceph::timespan problems. I have so far grepped through the Ceph source and looked at all the instances.
From what I can tell, all the interesting instances use the make_timespan(const double d) inline function in common/ceph_time.h. Naively, it looks like this inline might induce the same problem, but for some reason, it appears as if that's not the case. I've proved that on my x86_64 system running F23, the following code does not assert:
auto t1 = make_timespan(20);
auto t2 = ceph::timespan(std::chrono::seconds(20));
assert(t1 == t2);
I intend to run this by Adam Emerson to see exactly how the correct unit is derived.
Matt
Updated by Matt Benjamin almost 7 years ago
- Status changed from In Progress to Pending Backport
Updated by Matt Benjamin almost 7 years ago
- Backport changed from jewel kraken to jewel luminous kraken
Updated by Nathan Cutler almost 7 years ago
- Copied to Backport #20513: jewel: Stale bucket index entries are left over after object deletions added
Updated by Nathan Cutler almost 7 years ago
- Backport changed from jewel luminous kraken to kraken, jewel
Updated by Nathan Cutler almost 7 years ago
- Copied to Backport #20519: kraken: Stale bucket index entries are left over after object deletions added
Updated by Nathan Cutler over 6 years ago
- Status changed from Pending Backport to Resolved