Bug #16309

closed

rgw: bucket listing hangs on versioned bucket

Added by Osamu KIMURA almost 8 years ago. Updated almost 8 years ago.

Status:
Duplicate
Priority:
High
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

GET BUCKET (List Objects) S3 API hangs on a versioned bucket.
The API was terminated with 500 server error (FastCGI timeout), but RGW's internal listing process continued until the RGW was restarted.

RGW repeatedly tries to get information for a specific object that has both a null version and another version, as shown below.

<?xml version="1.0" encoding="UTF-8"?>
<ListVersionsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
 <Name>hojo-bucket-02</Name>
 <Prefix>testfile1896.txt</Prefix>
 <KeyMarker></KeyMarker>
 <MaxKeys>1000</MaxKeys>
 <IsTruncated>false</IsTruncated>
 <Version>
  <Key>testfile1896.txt</Key>
  <VersionId>2MSeTpZoDSqt2nTk1oYHZNmF1e84C.1</VersionId>
  <IsLatest>true</IsLatest>
  <LastModified>2016-05-30T08:39:52.000Z</LastModified>
  <ETag>&quot;1d24c7924b9798bb9064dcb043b3d989&quot;</ETag>
  <Size>3152</Size>
  <StorageClass>STANDARD</StorageClass>
  <Owner>
   <ID>XXXXXXXX</ID>
   <DisplayName>XXXXXXXX</DisplayName>
  </Owner>
 </Version>
 <Version>
  <Key>testfile1896.txt</Key>
  <VersionId>null</VersionId>
  <IsLatest>false</IsLatest>
  <LastModified>2016-05-30T02:43:22.000Z</LastModified>
  <ETag>&quot;20a4fc4c12598089a8937496a5eba67e&quot;</ETag>
  <Size>3052</Size>
  <StorageClass>STANDARD</StorageClass>
  <Owner>
   <ID>XXXXXXXX</ID>
   <DisplayName>XXXXXXXX</DisplayName>
  </Owner>
 </Version>
</ListVersionsResult>
<?xml version="1.0" encoding="UTF-8"?>
<ListVersionsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>hojo-bucket-02</Name>
<Prefix>testfile1897.txt</Prefix>
<KeyMarker></KeyMarker>
<MaxKeys>1000</MaxKeys>
<IsTruncated>false</IsTruncated>
<Version>
 <Key>testfile1897.txt</Key>
 <VersionId>bPMbvEae4KIxSJZn9folT4sQtt7h0w3</VersionId>
 <IsLatest>true</IsLatest>
 <LastModified>2016-05-30T08:39:52.000Z</LastModified>
 <ETag>&quot;71d8c0b2fc4e320f7f82ce88b737f2dd&quot;</ETag>
 <Size>3152</Size>
 <StorageClass>STANDARD</StorageClass>
 <Owner>
  <ID>XXXXXXXX</ID>
  <DisplayName>XXXXXXXX</DisplayName>
 </Owner>
</Version>
<Version>
 <Key>testfile1897.txt</Key>
 <VersionId>null</VersionId>
 <IsLatest>false</IsLatest>
 <LastModified>2016-05-30T02:43:22.000Z</LastModified>
 <ETag>&quot;27fe674eeca8f3fcff844ad7e91816c9&quot;</ETag>
 <Size>3052</Size>
 <StorageClass>STANDARD</StorageClass>
 <Owner>
  <ID>XXXXXXXX</ID>
  <DisplayName>XXXXXXXX</DisplayName>
 </Owner>
</Version>
</ListVersionsResult>
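
The pattern in the two listings above (a key that has both a "null" version and a named version) can be checked mechanically. The following is a minimal sketch, assuming the ListVersionsResult XML has been saved as text; the function name and the shortened sample document are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace used by the ListVersionsResult documents shown above.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def keys_with_null_and_named_versions(xml_text):
    """Return keys that have a 'null' version plus at least one other version."""
    # Encode to bytes so an XML declaration with an encoding is accepted.
    root = ET.fromstring(xml_text.encode("utf-8"))
    version_ids = defaultdict(set)
    for version in root.findall("s3:Version", S3_NS):
        key = version.findtext("s3:Key", namespaces=S3_NS)
        vid = version.findtext("s3:VersionId", namespaces=S3_NS)
        version_ids[key].add(vid)
    return sorted(
        key for key, ids in version_ids.items()
        if "null" in ids and len(ids) > 1
    )


# Shortened, hypothetical sample modeled on the first listing above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<ListVersionsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
 <Name>hojo-bucket-02</Name>
 <Version>
  <Key>testfile1896.txt</Key>
  <VersionId>2MSeTpZoDSqt2nTk1oYHZNmF1e84C.1</VersionId>
  <IsLatest>true</IsLatest>
 </Version>
 <Version>
  <Key>testfile1896.txt</Key>
  <VersionId>null</VersionId>
  <IsLatest>false</IsLatest>
 </Version>
</ListVersionsResult>"""

if __name__ == "__main__":
    print(keys_with_null_and_named_versions(SAMPLE))
```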

Files

radosgw-20160530.log.gz (580 KB) - Partial radosgw.log on 2016.05.30 - Osamu KIMURA, 06/15/2016 05:15 AM
radosgw-20160601.log.gz (53.7 KB) - Partial radosgw.log on 2016.06.01 (reproduced) - Osamu KIMURA, 06/15/2016 05:15 AM
radosgw-admin_bi_list-testfile1896.txt (5.69 KB) - Output of radosgw-admin bi list --bucket=hojo-bucket-02 --object=testfile1896.txt - Osamu KIMURA, 07/01/2016 09:28 AM
Actions #1

Updated by Orit Wasserman almost 8 years ago

  • Assignee set to Orit Wasserman
Actions #2

Updated by Orit Wasserman almost 8 years ago

I have tried to reproduce it on 0.94.6 without any luck. Can you give more details?

It is very easy to reproduce it on 0.94.5. Can you confirm your version?

Actions #3

Updated by Osamu KIMURA almost 8 years ago

As you can see from the last line of radosgw-20160530.log.gz, which was written when the RGW was restarted, the version is 0.94.6.

2016-05-30 20:09:55.595131 7f490ea78820  0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 1042

Unfortunately, I don't know the detailed procedure, and I suspect the customer doesn't remember the details either.
The system is configured with 3 RGWs and a load balancer, so some operations may have been handled by RGWs other than the one that produced this log.

I interpret the sequence of operations in radosgw-20160530.log.gz as follows:
  1. Create a bucket
  2. PUT Objects (testfile0.txt ... testfile2499.txt)
  3. GET Bucket (List objects) - no problem without versioning
  4. PUT Bucket versioning (Enable versioning)
  5. PUT Bucket versioning (I don't know why this was issued twice)
  6. GET Bucket versioning
  7. PUT Objects (testfile0.txt ... testfile2499.txt)
  8. GET Bucket (List objects) - no problem with no versioned objects
  9. PUT Objects (testfile0.txt ... testfile2499.txt)
  10. GET Bucket (List objects) - infinite loop on testfile1897.txt
  11. PUT Objects (test/testfile0.txt ... test/testfile2499.txt?)
  12. GET Bucket (List objects) - infinite loop on testfile1896.txt
  13. PUT Objects (test/testfile2000.txt ... test/testfile2499.txt?)
  14. GET Bucket (List objects) - infinite loop on testfile1896.txt
  15. GET Bucket (List objects) - infinite loop on testfile1896.txt
  16. GET Bucket (List objects) - infinite loop on testfile1896.txt
  17. GET Bucket (List objects) - infinite loop on test/testfile1898.txt
  18. GET Bucket (List objects) - infinite loop on test/testfile1898.txt
  19. GET Bucket (List objects) - infinite loop on testfile1896.txt
  20. GET Bucket (List objects) - infinite loop on testfile1896.txt
  21. HEAD Bucket
  22. GET Bucket (List objects) - infinite loop on testfile1896.txt
  23. Restart RGW

As I mentioned above, some other API calls may have been executed on other RGWs.

I noticed the following output in the log.

2016-05-30 17:43:55.642866 7fcb13f9f700 10 cls_bucket_list hojo-bucket-02(@{i=.rgw.buckets.index,e=.rgw.buckets.extra}.rgw.buckets[default.1456852.44]) start testfile1298.txt^@v913^@i7A6PTnyp3iM4CM3PR6n4hPeIz7VAXnI[] num_entries 667
...
2016-05-30 17:43:55.701421 7fcb13f9f700 10 cls_bucket_list hojo-bucket-02(@{i=.rgw.buckets.index,e=.rgw.buckets.extra}.rgw.buckets[default.1456852.44]) start testfile1897.txt^@v913^@ibPMbvEae4KIxSJZn9folT4sQtt7h0w3[] num_entries 2

Looped entries always have a "num_entries" value of 2 or 1. Non-looped entries have larger values.
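
To spot this in a larger log, one option is to count how often each cls_bucket_list start marker appears. The following is a minimal sketch, assuming that a looping listing shows up as the same start marker being logged repeatedly; the regular expression is derived only from the two log lines quoted above, and the function names and the repeat threshold are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... cls_bucket_list <bucket-descriptor> start <marker>[] num_entries <n>
LINE_RE = re.compile(
    r"cls_bucket_list \S+ start (?P<marker>.*?)\[\] num_entries (?P<n>\d+)"
)


def object_key(marker):
    """Return the object name part of a marker (text before the first NUL or '^@')."""
    return re.split(r"\^@|\x00", marker, maxsplit=1)[0]


def repeated_markers(lines, min_repeats=3):
    """Map each start marker seen at least min_repeats times to its count."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("marker")] += 1
    return {m: c for m, c in counts.items() if c >= min_repeats}


# Sample built from the two log lines quoted above; the repetition is simulated.
BUCKET = ("hojo-bucket-02(@{i=.rgw.buckets.index,e=.rgw.buckets.extra}"
          ".rgw.buckets[default.1456852.44])")
NORMAL = ("2016-05-30 17:43:55.642866 7fcb13f9f700 10 cls_bucket_list " + BUCKET +
          " start testfile1298.txt^@v913^@i7A6PTnyp3iM4CM3PR6n4hPeIz7VAXnI[]"
          " num_entries 667")
LOOPED = ("2016-05-30 17:43:55.701421 7fcb13f9f700 10 cls_bucket_list " + BUCKET +
          " start testfile1897.txt^@v913^@ibPMbvEae4KIxSJZn9folT4sQtt7h0w3[]"
          " num_entries 2")
SAMPLE_LINES = [NORMAL, LOOPED, LOOPED, LOOPED]

if __name__ == "__main__":
    for marker, count in repeated_markers(SAMPLE_LINES).items():
        print(object_key(marker), count)
```

In practice the lines would come from the decompressed radosgw log rather than from a hand-built list.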

Actions #4

Updated by Osamu KIMURA almost 8 years ago

I made a mistake. Please disregard step 7.

Actions #5

Updated by Orit Wasserman almost 8 years ago

  • Description updated (diff)
Actions #6

Updated by Orit Wasserman almost 8 years ago

I still cannot reproduce it on 0.94.6.
Can you provide more details about the customer environment? Which Ceph packages are installed?

Actions #7

Updated by Osamu KIMURA almost 8 years ago

I apologize for the wait.
I have obtained the environment details.

  • CentOS release 6.7 (Final)
  • kernel 2.6.32-573.18.1.el6.x86_64 #1 SMP Tue Feb 9 22:46:17 UTC 2016 x86_64 x86_64
  • ceph-radosgw-0.94.6-0.el6.x86_64
  • python-cephfs-0.94.6-0.el6.x86_64
  • ceph-0.94.6-0.el6.x86_64
  • ceph-common-0.94.6-0.el6.x86_64
  • libcephfs1-0.94.6-0.el6.x86_64
  • httpd-2.2.15-47.el6.centos.3.x86_64
  • httpd-tools-2.2.15-47.el6.centos.3.x86_64
  • httpd-devel-2.2.15-47.el6.centos.3.x86_64

They are using 3 RGW nodes behind a load balancer. All the RGW nodes have the same configuration.

Is this enough?

Actions #8

Updated by Orit Wasserman almost 8 years ago

Can you run:
radosgw-admin bi list --bucket=hojo-bucket-02 --object=testfile1896.txt

Also, can you increase the OSD objclass debug level and provide the logs:
ceph tell osd.\* injectargs --debug-objclass 20

Actions #9

Updated by Osamu KIMURA almost 8 years ago

Here is the information you requested (attached), the output of:
radosgw-admin bi list --bucket=hojo-bucket-02 --object=testfile1896.txt

The bucket is used only for testing, but the system as a whole is in production service.
It is therefore difficult to set a high debug level. It is also difficult to retry listing the bucket, because the listing operation would continue until the RGW is restarted, and that affects operations on other buckets.

Actions #10

Updated by Yehuda Sadeh almost 8 years ago

What version are the OSDs running? Have the OSDs been restarted since the upgrade? E.g., please run:

$ ceph tell osd.\* version
Actions #11

Updated by Osamu KIMURA almost 8 years ago

OSDs are running on 0.94.3.3. RGWs are running on 0.94.6.
OSDs are built on our appliance. RGWs are built on the customer's server. Different versions may co-exist.
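
A mixed-version setup like this one can be flagged with a simple comparison. The following is a minimal sketch, assuming the version strings have already been collected by hand (for example from the package list and the OSD version output); the function names and the OSD names are hypothetical, and the sketch only reports OSDs older than the gateway without saying which release contains any particular fix.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.94.3.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def osds_older_than_rgw(osd_versions, rgw_version):
    """Return the names of OSDs whose version is lower than the RGW version."""
    rgw = parse_version(rgw_version)
    return sorted(
        name for name, version in osd_versions.items()
        if parse_version(version) < rgw
    )


if __name__ == "__main__":
    # Versions taken from this thread; the OSD names are made up.
    osds = {"osd.0": "0.94.3.3", "osd.1": "0.94.3.3"}
    print(osds_older_than_rgw(osds, "0.94.6"))
```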

Actions #12

Updated by Orit Wasserman almost 8 years ago

Sadly, the fix is in the OSD, not in the gateway. This is why the user is encountering this issue.

Actions #13

Updated by Osamu KIMURA almost 8 years ago

Does this mean the fix for issue #13536 has to be applied to the OSDs?

Actions #14

Updated by Orit Wasserman almost 8 years ago

Yes, it is radosgw code that runs in the OSD (as an object class).

Actions #15

Updated by Orit Wasserman almost 8 years ago

  • Status changed from New to Duplicate