Bug #20107

rgw: Truncated objects

Added by Jens Harbott 7 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
-
Target version:
-
Start date:
05/30/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken, hammer
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
hammer
Needs Doc:
No

Description

On a large cluster, we are seeing some objects being truncated at 512 KB. The behaviour when trying to access such an object looks like this:

$ s3cmd get s3://bucket/object
download: 's3://bucket/object' -> './object'  [1 of 1]
download: 's3://bucket/object' -> './object'  [1 of 1]
    4096 of 1204514     0% in    0s    64.67 kB/s
<-- 30 second delay happening here -->
  525194 of 1204514    43% in   30s    17.04 kB/s  failed
WARNING: Retrying failed request: /object (EOF from S3!)
WARNING: Waiting 3 sec...
download: 's3://bucket/object' -> './object'  [1 of 1]
ERROR: S3 error: 404 (Not Found)

If I interrupt the download, I can see that the object has exactly 524288 bytes, corresponding to the head_size in the manifest. So it seems the head object was written correctly and can be read, but the next tail object has gone missing. According to the rgw log, though, the PUT for that object finished successfully, returning a 200 response.

The cluster is running version 0.94.10-1trusty on Ubuntu Trusty and has 388 OSDs, 494 TB of data and about 1300 Mobjects. It is unclear how many objects are affected; we see errors for about 100 per day.


Related issues

Copied to rgw - Backport #20717: hammer: rgw: Truncated objects Rejected
Copied to rgw - Backport #20718: kraken: rgw: Truncated objects Rejected
Copied to rgw - Backport #20719: jewel: rgw: Truncated objects Resolved

History

#1 Updated by Jens Harbott 6 months ago

It seems this happens when an object is copied: the copy has the same shadow-object prefix as the original object. So when the copied object is deleted, garbage collection removes the shadow object and leaves the original object truncated.

#2 Updated by Jens Harbott 6 months ago

I found an example that is recent enough that there are still some relevant logs available. Here an object got copied 4 times:

2017-06-05 16:56:33.053460 7f18e3e5f700  0 dest_obj=bucket1703:obj97f29802408d355959150000199d0c9a src_obj=bucket1703:obj3fc65bc21a9342f59be220000199d0c9a copy_itself=0
2017-06-05 17:04:25.524945 7f4a31d03700  0 dest_obj=bucket1703:obj25e9ef900198f355981070000199d0c9a src_obj=bucket1703:obj3fc65bc21a9342f59be220000199d0c9a copy_itself=0
2017-06-05 17:43:00.436986 7f628a674700  0 dest_obj=bucket1703:objd79ecc002498355981070000199d0c9a src_obj=bucket1703:obj3fc65bc21a9342f59be220000199d0c9a copy_itself=0
2017-06-05 17:44:57.549374 7f634a7f4700  0 dest_obj=bucket1703:obj5601b5039998355981070000199d0c9a src_obj=bucket1703:obj3fc65bc21a9342f59be220000199d0c9a copy_itself=0

The copied objects get deleted again soon afterwards:

2017-06-05 16:57:05.751995 7f18bbe0f700  1 civetweb: 0x7f1aa8099a80: 10.51.6.65 - - [05/Jun/2017:16:57:05 +0000] "DELETE /obj97f29802408d355959150000199d0c9a HTTP/1.1" 204 0 -
2017-06-05 17:04:39.745946 7f4b8a7b4700  1 civetweb: 0x7f4c000059e0: 10.51.6.65 - - [05/Jun/2017:17:04:39 +0000] "DELETE /obj25e9ef900198f355981070000199d0c9a HTTP/1.1" 204 0 -
2017-06-05 17:43:15.532089 7f61f1d43700  1 civetweb: 0x7f638c030e60: 10.51.6.65 - - [05/Jun/2017:17:43:15 +0000] "DELETE /objd79ecc002498355981070000199d0c9a HTTP/1.1" 204 0 -
2017-06-05 17:45:13.993036 7f61d750e700  1 civetweb: 0x7f6358038d00: 10.51.6.65 - - [05/Jun/2017:17:45:13 +0000] "DELETE /obj5601b5039998355981070000199d0c9a HTTP/1.1" 204 0 -

However, there are 5 garbage-collection events for the corresponding shadow object, resulting in it being removed and the original object getting truncated:

2017-06-05 19:16:57.205288 7fdf40ff9700  0 gc::process: removing .de-hamburg-1.rgw.buckets:de-hamburg-1.41044432.1507__shadow_.2s5bX_K7DQvgJdiTX2tgJKaQE3IVoFN_1
2017-06-05 19:22:25.249571 7fb7bcff9700  0 gc::process: removing .de-hamburg-1.rgw.buckets:de-hamburg-1.41044432.1507__shadow_.2s5bX_K7DQvgJdiTX2tgJKaQE3IVoFN_1
2017-06-05 20:32:45.609886 7f4c5cff9700  0 gc::process: removing .de-hamburg-1.rgw.buckets:de-hamburg-1.41044432.1507__shadow_.2s5bX_K7DQvgJdiTX2tgJKaQE3IVoFN_1
2017-06-05 20:53:01.954708 7f4c5cff9700  0 gc::process: removing .de-hamburg-1.rgw.buckets:de-hamburg-1.41044432.1507__shadow_.2s5bX_K7DQvgJdiTX2tgJKaQE3IVoFN_1
2017-06-05 20:53:01.996126 7f4c5cff9700  0 gc::process: removing .de-hamburg-1.rgw.buckets:de-hamburg-1.41044432.1507__shadow_.2s5bX_K7DQvgJdiTX2tgJKaQE3IVoFN_1

#3 Updated by Yehuda Sadeh 6 months ago

A copy operation should increase the tail's refcount. The garbage collection will try to remove it by dropping the refcount. However, it's not going to remove a tail that is shared by multiple entries. There was a bug that was fixed that dealt with objects that were copied across buckets, maybe that's the issue that you're seeing?

#4 Updated by Yehuda Sadeh 6 months ago

The bug I mentioned was dealing with copying objects that resided on different pools. Do you have a scenario where this reproduces easily? If so, can you set 'debug rgw = 20' and 'debug ms = 1', and provide logs?

#5 Updated by Jens Harbott 6 months ago

Do you have a reference to that bug? I haven't been able to find anything that looks like a match. Also, the copies that I am seeing take place within a single bucket.

I have verified multiple occurrences now, and in each of them I can see N copies being made and the N copies being deleted again, but N+1 matching gc::process removing entries, each time with one pair of them having timestamps only a couple of milliseconds apart.

Note that this affects only very few objects, maybe 100 per day out of 100M processed in total. Most of them are more than a week old, though, so it's difficult to gather the appropriate information from the logs.

Looking at the code, I don't see anything obvious that could explain this; either the object is being placed twice onto the gc queue, or the gc job is processing a single entry twice.

#6 Updated by Jens Harbott 6 months ago

Ah, no, no different pools involved.

And no, with the volume of requests we are dealing with, we are already at about 50GB logs per day, no way to increase the logging level to more than the default. We could consider adding some specific logging code, though, if you have some idea for that.

Also, I haven't been able to reproduce this on a non-production cluster yet, sadly.

#7 Updated by Jens Harbott 6 months ago

So I can reproduce this issue now, after finding and working around #20234:

1. Run the python script below, which uploads 4000 objects, copies them and deletes the copy again.
2. Wait for all gc timers to expire.
3. Run radosgw-admin gc process to handle all 4000 removals in one batch.

As a result, I'm seeing about 400 of the removals being processed twice as seen in the log in comment #2, resulting in the corresponding objects being truncated.

Please confirm whether my understanding of how refcounting works is correct:

1. PUT obj1 => shadow object is created with no refcount xattr.
2. COPY obj1 obj2 => shadow object gets refcount xattr with tag_for_obj2 plus a wildcard entry.
3. DEL obj2 => when GC runs, the tag_for_obj2 is removed from the refcount, but since there is still the wildcard entry, shadow object is kept in place.
4. GC processes the same entry a second time, wildcard is removed and now refcount is empty => shadow object is deleted.
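The four steps above can be sketched as a toy Python model (this is purely illustrative, not rgw code; the Shadow class and tag names are invented for the sketch) showing why processing the same gc entry twice deletes a tail that is still referenced:

```python
# Toy model of the refcount scheme described above (hypothetical names,
# not actual rgw code).
WILDCARD = '*'

class Shadow:
    def __init__(self):
        # step 1: PUT obj1 -> shadow object created with no refcount xattr
        self.refs = None                     # None means "no refcount xattr"

    def add_ref(self, tag):
        # step 2: COPY obj1 obj2 -> explicit tag for obj2 plus a wildcard
        # entry standing in for the original object
        if self.refs is None:
            self.refs = {WILDCARD}
        self.refs.add(tag)

    def drop_ref(self, tag):
        # gc drops one entry per processed gc record; the shadow object is
        # removed once the refcount set is empty
        if self.refs is None:
            return 'deleted'                 # no xattr: unconditional remove
        self.refs.discard(tag if tag in self.refs else WILDCARD)
        return 'deleted' if not self.refs else 'kept'

shadow = Shadow()
shadow.add_ref('tag_for_obj2')               # COPY obj1 obj2

# step 3: DEL obj2, gc entry processed once -> tag dropped, wildcard keeps it
assert shadow.drop_ref('tag_for_obj2') == 'kept'

# step 4: the SAME gc entry processed a second time -> wildcard dropped too,
# refcount now empty, shadow object deleted, obj1 truncated
assert shadow.drop_ref('tag_for_obj2') == 'deleted'
```

The model makes the non-idempotency visible: the first drop is safe, but any replay of the same gc entry eats the wildcard that the original object depends on.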

import boto
import boto.s3.connection

access_key = 'key'
secret_key = 'secret'

conn = boto.connect_s3(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        host='localhost', port=8000,
        is_secure=False,               # not using ssl
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
        )

bucket = conn.get_bucket('bla')

for i in range(1000, 5000):
    keyname1 = 'testa%d' % i
    keyname2 = 'copyb%d' % i
    key = bucket.new_key(keyname1)
    key.set_contents_from_string('Hello World!\n' * 50000)

    key2 = bucket.copy_key(keyname2, 'bla', keyname1)
    key2.delete()
    print('done with %s' % keyname1)

#8 Updated by Jens Harbott 6 months ago

After noticing that there is always a gc_list operation between the duplicated gc actions, I think I found the bug:

gc fetches up to 100 tags for processing, but deletes these tags in batches of 16 (or 17?). So when the next batch of tags is fetched, some tags have been processed but not yet deleted from the list, resulting in duplicates. The fix should be pretty easy:

diff --git a/src/rgw/rgw_gc.cc b/src/rgw/rgw_gc.cc
index 6b23d9a..0c51f51 100644
--- a/src/rgw/rgw_gc.cc
+++ b/src/rgw/rgw_gc.cc
@@ -223,6 +223,10 @@ int RGWGC::process(int index, int max_secs)
         }
       }
     }
+    if (!remove_tags.empty()) {
+      RGWGC::remove(index, remove_tags);
+      remove_tags.clear();
+    }
   } while (truncated);

 done:
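The listing/removal mismatch the patch addresses can be modelled with a short Python sketch (a toy simulation, not rgw code; gc_process, LIST_MAX and BATCH are invented names standing in for the gc_list window and the removal batch size):

```python
# Toy model of the batching bug: tags are listed in windows of up to 100,
# but deletions are only flushed every 16 tags, so tags processed at the
# tail of one window are still present in the next listing.
LIST_MAX = 100   # tags fetched per listing call
BATCH = 16       # deletions flushed per remove call

def gc_process(queue, flush_between_lists):
    processed = []
    remove_tags = []
    while queue:
        window = queue[:LIST_MAX]          # list: includes processed-but-
        for tag in window:                 # not-yet-removed leftovers
            processed.append(tag)
            remove_tags.append(tag)
            if len(remove_tags) >= BATCH:
                for t in remove_tags:      # flush one batch of deletions
                    if t in queue:
                        queue.remove(t)
                remove_tags = []
        if flush_between_lists:            # the proposed fix: flush the
            for t in remove_tags:          # leftovers before listing again
                if t in queue:
                    queue.remove(t)
            remove_tags = []
    return processed

tags = ['tag%d' % i for i in range(4000)]
buggy = gc_process(list(tags), flush_between_lists=False)
fixed = gc_process(list(tags), flush_between_lists=True)

assert len(buggy) > len(tags)    # without the flush, some tags gc'ed twice
assert len(fixed) == len(tags)   # with the flush, each tag gc'ed exactly once
```

With a window of 100 and a batch of 16, up to 100 mod 16 = 4 tags per window survive into the next listing, which roughly matches the ~10% duplicate rate observed in comment #7.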

#9 Updated by Yehuda Sadeh 6 months ago

Yeah, your analysis seems right. I think the big issue here is that we don't set a tag on the first write, but rather rely on a wildcard entry. iirc, the wildcard entry was used for backward compatibility when the refcounting was introduced, but we shouldn't rely on it on newer objects (though, it has been a while, I might be wrong).

#10 Updated by Yehuda Sadeh 6 months ago

actually, I take this backward compatibility claim back a bit. Looking at the code again, I think this is how it was designed to work. The problem here, as you pointed out, is that the removal is not idempotent. One way to fix it would be to drop the use of the wildcard, which I think will be tricky in terms of backward compatibility. Another way would be to keep a list of the references that were already removed, and check that list every time.
We also need to fix the gc, but that will not solve all cases since it's not atomic; it will probably just make the problem harder to hit.

#11 Updated by Yehuda Sadeh 6 months ago

#12 Updated by Orit Wasserman 6 months ago

  • Backport set to jewel, kraken

#13 Updated by Orit Wasserman 6 months ago

  • Backport changed from jewel, kraken to jewel, kraken, hammer

#14 Updated by Orit Wasserman 6 months ago

  • Priority changed from Normal to High

#15 Updated by Orit Wasserman 6 months ago

  • Status changed from New to Need Review

#16 Updated by Jens Harbott 6 months ago

Added a PR for my patch, too, assuming both are needed for a complete solution.

https://github.com/ceph/ceph/pull/15912

#17 Updated by Aaron Bassett 5 months ago

Would this issue affect a situation where a multi-part upload was partially completed, cleaned, and then successfully completed before the first, cleaned upload was gc'ed?

#19 Updated by Orit Wasserman 5 months ago

  • Status changed from Need Review to Pending Backport

#20 Updated by Nathan Cutler 5 months ago

#21 Updated by Nathan Cutler 5 months ago

#22 Updated by Nathan Cutler 5 months ago

#23 Updated by Nathan Cutler 3 months ago

  • Status changed from Pending Backport to Resolved
