Bug #20380

Stale bucket index entries are left over after object deletions

Added by Pavan Rallabhandi 4 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
06/22/2017
Due date:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
kraken, jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

On many of our clusters running Jewel (10.2.5+), I am running into a strange problem: stale bucket index entries are left over for some of the deleted objects.

The symptoms are that the actual delete operation of an object is reported successful in the RGW logs, but a bucket list on the container still shows the deleted object. An attempt to download or stat the object appropriately fails. No failures are seen in the OSDs hosting the bucket index object, and rebuilding the bucket index by running `radosgw-admin bucket check --fix` fixes the issue.

I could simulate the problem by instrumenting the code to skip the `complete_del` call on the bucket index op (https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793), but that call always seems to be made unless there is a cascading error from the actual delete operation of the object, which doesn't seem to be the case here.

A simple script like the one below easily reproduces the issue:


$# cat delete_create_object

#!/bin/sh
# Repeatedly create, upload, delete, and list an object via Swift.
i=1
objects=100
prefix=$1

while [ $i -lt $objects ]
do
    object=$prefix.`date +%Y-%m-%d:%H:%M:%S`
    touch $object
    swift -A http://localhost:8000/auth -U test:tester -K testing upload test $object
    swift -A http://localhost:8000/auth -U test:tester -K testing delete test $object
    swift -A http://localhost:8000/auth -U test:tester -K testing list test

    i=`expr $i + 1`
done

$# cat script 
./delete_create_object one&
./delete_create_object two&
./delete_create_object three&
./delete_create_object four&
./delete_create_object five&
./delete_create_object six&
./delete_create_object seven&
./delete_create_object eight&
./delete_create_object nine&
./delete_create_object ten&

In one such run of the above script, I ended up with the listing below on my single-node test (vstart) cluster running 1 RGW and 3 OSDs. A bucket check with --fix resolves the issue.


$# swift -A http://localhost:8000/auth -U test:tester -K testing list test
eight.2017-06-22:15:25:59
eight.2017-06-22:15:26:31
eight.2017-06-22:15:26:39
eight.2017-06-22:15:27:22
eight.2017-06-22:15:27:56
eight.2017-06-22:15:29:20
eight.2017-06-22:15:30:10
eight.2017-06-22:15:31:24
eight.2017-06-22:15:31:31
eight.2017-06-22:15:32:17
eight.2017-06-22:15:33:26
eight.2017-06-22:15:33:53
eight.2017-06-22:15:34:20
five.2017-06-22:15:24:41
five.2017-06-22:15:25:27
five.2017-06-22:15:25:53
five.2017-06-22:15:27:07
five.2017-06-22:15:28:14
five.2017-06-22:15:28:18
five.2017-06-22:15:29:20
five.2017-06-22:15:31:04
five.2017-06-22:15:31:18
five.2017-06-22:15:31:31
five.2017-06-22:15:32:17
five.2017-06-22:15:33:40
five.2017-06-22:15:34:27
five.2017-06-22:15:35:05
five.2017-06-22:15:35:21
four.2017-06-22:15:24:48
four.2017-06-22:15:25:33
four.2017-06-22:15:26:06
four.2017-06-22:15:26:18
four.2017-06-22:15:26:25
four.2017-06-22:15:27:43
four.2017-06-22:15:28:18
four.2017-06-22:15:28:35
four.2017-06-22:15:32:31
four.2017-06-22:15:33:40
four.2017-06-22:15:33:53
four.2017-06-22:15:34:26
four.2017-06-22:15:35:05
four.2017-06-22:15:35:40
nine.2017-06-22:15:30:50
nine.2017-06-22:15:31:37
nine.2017-06-22:15:31:51
nine.2017-06-22:15:31:57
nine.2017-06-22:15:32:17
nine.2017-06-22:15:33:26
nine.2017-06-22:15:34:27
nine.2017-06-22:15:35:21
nine.2017-06-22:15:35:40
one.2017-06-22:15:25:53
one.2017-06-22:15:26:25
one.2017-06-22:15:26:39
one.2017-06-22:15:27:00
one.2017-06-22:15:27:07
one.2017-06-22:15:31:40
one.2017-06-22:15:31:51
one.2017-06-22:15:31:57
one.2017-06-22:15:32:37
one.2017-06-22:15:32:45
one.2017-06-22:15:34:00
one.2017-06-22:15:34:52
one.2017-06-22:15:34:58
seven.2017-06-22:15:25:20
seven.2017-06-22:15:25:27
seven.2017-06-22:15:25:33
seven.2017-06-22:15:26:49
seven.2017-06-22:15:27:00
seven.2017-06-22:15:27:13
seven.2017-06-22:15:27:22
seven.2017-06-22:15:28:18
seven.2017-06-22:15:28:35
seven.2017-06-22:15:31:04
seven.2017-06-22:15:31:11
seven.2017-06-22:15:31:24
seven.2017-06-22:15:31:44
seven.2017-06-22:15:34:20
seven.2017-06-22:15:34:27
six.2017-06-22:15:25:20
six.2017-06-22:15:26:06
six.2017-06-22:15:26:12
six.2017-06-22:15:27:13
six.2017-06-22:15:28:35
six.2017-06-22:15:28:42
six.2017-06-22:15:28:59
six.2017-06-22:15:29:35
six.2017-06-22:15:30:30
ten.2017-06-22:15:26:06
ten.2017-06-22:15:27:56
ten.2017-06-22:15:28:27
ten.2017-06-22:15:28:42
ten.2017-06-22:15:29:57
ten.2017-06-22:15:30:17
ten.2017-06-22:15:31:18
ten.2017-06-22:15:31:24
ten.2017-06-22:15:31:44
ten.2017-06-22:15:32:41
ten.2017-06-22:15:33:26
ten.2017-06-22:15:34:32
three.2017-06-22:15:24:41
three.2017-06-22:15:25:01
three.2017-06-22:15:25:53
three.2017-06-22:15:26:25
three.2017-06-22:15:27:43
three.2017-06-22:15:28:13
three.2017-06-22:15:28:59
three.2017-06-22:15:29:27
three.2017-06-22:15:30:04
three.2017-06-22:15:30:44
three.2017-06-22:15:30:57
three.2017-06-22:15:31:18
three.2017-06-22:15:32:31
three.2017-06-22:15:32:37
three.2017-06-22:15:32:52
three.2017-06-22:15:33:06
two.2017-06-22:15:25:27
two.2017-06-22:15:26:12
two.2017-06-22:15:26:18
two.2017-06-22:15:28:21
two.2017-06-22:15:31:44
two.2017-06-22:15:34:27
two.2017-06-22:15:34:52
two.2017-06-22:15:34:58

$# ./bin/radosgw-admin object stat --object three.2017-06-22:15:30:57 --bucket test
ERROR: failed to stat object, returned error: (2) No such file or directory

$# ./bin/radosgw-admin object stat --object ten.2017-06-22:15:30:17 --bucket test
ERROR: failed to stat object, returned error: (2) No such file or directory

$# ./bin/radosgw-admin bucket check --bucket test --check-objects
[]

$# ./bin/radosgw-admin bucket check --bucket test --check-objects --fix
[]
{}
{
    "existing_header": {
        "usage": {
            "rgw.main": {
                "size": 0,
                "size_actual": 0,
                "size_utilized": 0,
                "size_kb": 0,
                "size_kb_actual": 0,
                "size_kb_utilized": 0,
                "num_objects": 0
            }
        }
    },
    "calculated_header": {
        "usage": {}
    }
}

$# swift  -A http://localhost:8000/auth -U test:tester -K testing list test 


Related issues

Copied to rgw - Backport #20513: jewel: Stale bucket index entries are left over after object deletions Resolved
Copied to rgw - Backport #20519: kraken: Stale bucket index entries are left over after object deletions Rejected

History

#1 Updated by Pavan Rallabhandi 4 months ago

  • Source set to Community (dev)

#2 Updated by Pavan Rallabhandi 4 months ago

  • Description updated (diff)

#3 Updated by Pavan Rallabhandi 4 months ago

  • Description updated (diff)

#4 Updated by Orit Wasserman 4 months ago

  • Priority changed from Normal to High

#5 Updated by Orit Wasserman 4 months ago

Pavan, are you using versioning?

#6 Updated by Pavan Rallabhandi 4 months ago

Orit Wasserman wrote:

Pavan, are you using versioning?

No Orit, we are not using object versioning. One interesting observation from the repro steps: I am not able to reproduce the issue if the list operation is left out (of create, delete, list). Only with all the steps in the script am I able to reproduce it 100% of the time.

#7 Updated by Matt Benjamin 4 months ago

  • Status changed from New to In Progress

#8 Updated by Matt Benjamin 4 months ago

  • Backport set to jewel kraken

Hi Pavan,

I haven't proved this resolves the issue, but I did prove (in a non-reproducer scenario) that the tag_timeout values being produced did not compare equal to ones scaled as std::chrono::seconds. Assuming that's the race window, I'm not sure how large it is; maybe milliseconds.

https://github.com/ceph/ceph/pull/16026

#9 Updated by Pavan Rallabhandi 4 months ago

  • Assignee changed from Pavan Rallabhandi to Matt Benjamin

Yes Matt, the fix seems to take effect. Thanks for the quick turnaround.

I was going through the commit where the resolution time had been changed, https://github.com/ceph/ceph/commit/416234b38be77ebb2decabc91a74055387a4b10a. Would this also mean we need to look at other places where `timespan` has been introduced and may possibly lead to other races?

#10 Updated by Matt Benjamin 4 months ago

Yes, I think it follows that there could be other ceph::timespan problems. I have so far grepped through the Ceph source and looked at all the instances.

From what I can tell, all the interesting instances use the make_timespan(const double d) inline function in common/ceph_time.h. Naively, it looks like this inline might induce the same problem, but for some reason it appears that's not the case. I've verified that on my x86_64 system running F23, the following code does not assert:

#include <cassert>
#include "common/ceph_time.h"  // make_timespan, ceph::timespan

auto t1 = make_timespan(20);
auto t2 = ceph::timespan(std::chrono::seconds(20));
assert(t1 == t2);

I intend to run this by Adam Emerson to see exactly how the correct unit is derived.

Matt

#11 Updated by Matt Benjamin 4 months ago

  • Status changed from In Progress to Pending Backport

#12 Updated by Matt Benjamin 4 months ago

  • Backport changed from jewel kraken to jewel luminous kraken

#13 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #20513: jewel: Stale bucket index entries are left over after object deletions added

#14 Updated by Nathan Cutler 4 months ago

  • Backport changed from jewel luminous kraken to kraken, jewel

#15 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #20519: kraken: Stale bucket index entries are left over after object deletions added

#16 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved
