Bug #10445

closed

Radosgw hang, corrupt manifest?

Added by Aaron Bassett over 9 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm load testing my radosgw setup to get an idea of what kind of throughput I can get. I have it set up writing to a cache tier of size 2 in front of an EC storage pool at 10+2 (this data is non-critical). I'm testing it by attempting to upload 2000 files, alternating between several kB and 100 GB. I'm uploading 25 files at once, using 10 threads for each, with a part size of 512 MB.

I can get this to run and upload at nearly line speed for an hour or two, but eventually the gateway always melts. It keeps running, but stops processing any requests, and I'm not sure what it's doing. The cluster stays HEALTH_OK for the duration. Once I restart the gateway, it will sit and garbage collect happily and accept new uploads, but attempting to delete or re-upload any of the existing uploads results in a 500 error from the gateway. The only way I've been able to clear them out is to delete all the gateway's pools and start from scratch on the gateway.

While I've been unable to find the point where the gateway dies in the logs, I have captured a log of an attempt to cancel an existing upload with debug ms = 1 and debug rgw = 20. It's rather verbose, and I'm not sure what to look for in it, but when I was running it without verbosity turned up, I kept seeing a message like

"2015-01-02 14:16:55.409181 7f4d07fb7700 0 RGWObjManifest::operator++(): result: ofs=134217728 stripe_ofs=134217728 part_ofs=0 rule->part_size=0"

when I try to cancel an upload. I think it may be in a loop. As soon as I try to cancel, it just starts spitting those out as fast as it can and the cancel request eventually 500s. At this point I have to kill -9 the gateway process.
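For reference, here is a minimal sketch of the kind of parallel multipart upload described above, assuming boto 2 talking to the RGW S3 endpoint; the host, credentials, bucket, object name, and source file path are placeholders rather than values from this ticket:

# Minimal sketch of the parallel multipart upload described above (boto 2).
# Host, credentials, bucket, object and file path are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

import boto
import boto.s3.connection

PART_SIZE = 512 * 1024 * 1024            # 512 MB parts, as in the test
SRC_FILE = '/tmp/testfile-100GB'         # placeholder for one large test file

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.get_bucket('loadtest')
mp = bucket.initiate_multipart_upload('big-object-0001')

NUM_PARTS = -(-os.path.getsize(SRC_FILE) // PART_SIZE)   # ceiling division

def upload_part(part_num):
    # Each worker reads its own 512 MB slice of the source file.
    with open(SRC_FILE, 'rb') as fp:
        fp.seek((part_num - 1) * PART_SIZE)
        data = fp.read(PART_SIZE)
    mp.upload_part_from_file(BytesIO(data), part_num)

try:
    # 10 uploader threads per object, matching the setup described above.
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(upload_part, range(1, NUM_PARTS + 1)))
    mp.complete_upload()
except Exception:
    # Abort the upload on failure; see comment #11 below for why this must
    # not race with parts that are still uploading.
    mp.cancel_upload()
    raise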


Files

radosgw.log (2.37 MB) radosgw.log Aaron Bassett, 01/02/2015 06:23 AM
civet_cancel.log (34.8 MB) civet_cancel.log Aaron Bassett, 01/08/2015 07:13 AM
Actions #1

Updated by Yehuda Sadeh over 9 years ago

Not sure what triggers your original issue, although I assume it's similar to what I see in the log. Basically what I can see here is a removal of a multipart object, with all the corresponding parts, that just takes too long; it ends up being timed out (by Apache, I assume) and then retried. Everything goes very slowly, probably due to the EC backend. You may want to try adjusting your Apache 30-second timeout and see if it helps.

Actions #2

Updated by Aaron Bassett over 9 years ago

Well, I tried the same setup again with civetweb instead of Apache. It ran 10-15% slower, which was fine, and it made it far longer than the Apache run, but it still stalled out in the same way. I'm not sure if civetweb has the same timeout or not, or how to control it for that matter.

The initial run that causes it to stall is all uploads; the test script doesn't do any deleting or canceling of uploads. I suppose it could be an upload timing out that causes it. The problem I'm having now is that once it happens, I have no way of cleaning up. I can seemingly delete the objects, but I can't cancel the stalled multipart uploads.

At any rate, I'm going to try another run with the EC pool out of the mix.

Thanks Aaron

Actions #3

Updated by Aaron Bassett over 9 years ago

I did some more testing, this time I tried taking the cache tier out of play and set it up with:

a) civetweb, using a replicated pool - I was able to complete a run and upload 50 TB without a failure and at a nice fast clip

b) civetweb, using an EC pool - it started writing at a little less than half the speed of the replicated pool and quickly ran into the same problem, spitting out the "RGWObjManifest::operator++(): result: ofs=134217728 stripe_ofs=134217728 part_ofs=0 rule->part_size=0" line.

So it appears that there's a problem with the multipart upload manifest and EC pools. My cluster is sweating pretty hard; a lot of the CPUs are maxed out during the EC test. I think there's clearly something wrong if radosgw is breaking in such a way that I can't recover cleanly when the cluster gets overloaded. I would expect it to slow down, and I'm fine with it 500'ing as I can make my clients deal with that and/or throttle the traffic, but it makes me very nervous that I'm so easily able to get it into a state that currently seems impossible to recover from.

Actions #4

Updated by Aaron Bassett over 9 years ago

https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L589

Reading this section, it looks to me like if part_size == 0 it's going to loop indefinitely, so I guess the question is what's happening that causes part_size to be 0?

Actions #5

Updated by Aaron Bassett over 9 years ago

Aaron Bassett wrote:

https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L589

Reading this section, it looks to me like if part_size == 0 it's going to loop indefinitely, so I guess the question is what's happening that causes part_size to be 0?

Er, disregard that; it only gets hit if part_size > 0.

Actions #6

Updated by Aaron Bassett over 9 years ago

OK, I still don't know how it gets into this state to begin with, but right now I have objects or partial uploads where I can't cancel the multipart upload and I can't re-upload the object, because radosgw goes into some kind of infinite loop. Attached is a radosgw log with debugging on. I made one request using boto to cancel a multipart upload. The radosgw will now continue printing these manifest lines until it dies.

Actions #7

Updated by Yehuda Sadeh over 9 years ago

I don't see the gateway stuck in an infinite loop. It just goes too slowly, so the client retries (while the first operation is still pending). This leads to a snowball of retries that I assume at some point consumes all resources. We need to think of a better way to do the multipart abort (maybe do it through the garbage collection).
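As a conceptual sketch only (not the actual RGW code), the deferred-abort idea looks roughly like this: the abort handler just enqueues the parts and returns, and a background worker does the slow per-part deletes later, so retries can't pile up behind a long synchronous removal.

# Conceptual sketch, not the RGW implementation: abort returns after
# enqueueing the parts; a background "garbage collector" thread does the
# slow per-part deletes later.
import queue
import threading

gc_queue = queue.Queue()

def abort_multipart(parts, delete_part):
    # Fast path: hand every part to the GC queue and return immediately,
    # so the HTTP request cannot hit the frontend timeout.
    for part in parts:
        gc_queue.put((part, delete_part))

def gc_worker():
    while True:
        part, delete_part = gc_queue.get()
        try:
            delete_part(part)        # the slow backend removal happens here
        finally:
            gc_queue.task_done()

threading.Thread(target=gc_worker, daemon=True).start()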

Actions #8

Updated by Aaron Bassett over 9 years ago

Ah, I see what you mean. But I'm not sure what is causing these specific uploads to be so slow to remove. If I start a new upload, upload a couple of parts, and then cancel it, it goes fine. It seems to me like something during the heavily loaded run is causing these to get into some state where they are super slow to cancel. Is there any guidance you can give me towards inspecting the manifest to see if I can figure out what's different about these particular uploads?

Actions #9

Updated by Yehuda Sadeh over 9 years ago

you can run:

$ radosgw-admin object stat --bucket=<bucket> --object=<obj>
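If it helps to compare a stuck upload against a healthy object, a rough wrapper around that command might look like the following sketch; the JSON field names ('manifest' and anything inside it) are assumptions and vary between radosgw versions, so adjust to whatever your build actually prints.

# Rough helper for comparing manifests between a healthy object and a stuck
# one. Field names in the stat output are assumptions; adjust as needed.
import json
import subprocess

def object_stat(bucket, obj):
    out = subprocess.check_output(
        ['radosgw-admin', 'object', 'stat',
         '--bucket=' + bucket, '--object=' + obj])
    return json.loads(out.decode('utf-8'))

def dump_manifest(bucket, obj):
    stat = object_stat(bucket, obj)
    manifest = stat.get('manifest', {})
    # Print only the scalar top-level manifest fields; diff two of these
    # dumps to spot a zero part size or other oddity.
    for key, value in sorted(manifest.items()):
        if not isinstance(value, (list, dict)):
            print('%s: %s' % (key, value))

dump_manifest('loadtest', 'big-object-0001')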

Actions #10

Updated by Yehuda Sadeh over 9 years ago

I pushed a change to wip-10445. Now, instead of waiting for the completion, it'll just send it to the garbage collector, where it'll be picked up later.

Actions #11

Updated by Aaron Bassett over 9 years ago

Thanks for that, I'll try to test it if I get a chance.

I believe I've narrowed the original problem that gets me into this state down to a buggy uploading client. It was canceling multipart uploads when one part failed while other parts were still uploading. I've tried to reproduce it a little and was unable to; it seems to depend on the one failed part timing out, which I'm not sure how to reliably recreate. At any rate, I've fixed the uploader to wait until the other threads have stopped before canceling, and now I'm able to overload the gateway fairly heavily without any of these problems.
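A minimal sketch of that client-side fix, reusing the hypothetical boto 2 names (mp, upload_part, NUM_PARTS) from the sketch after the description above: record part failures, but only abort once every uploader thread has joined.

# Sketch of the client-side fix: never call cancel_upload() while other
# part uploads are still in flight; join all the uploader threads first.
# Assumes the mp / upload_part / NUM_PARTS names from the earlier sketch.
import threading

errors = []

def safe_upload_part(part_num):
    try:
        upload_part(part_num)
    except Exception as e:
        # Record the failure but let the other threads finish cleanly.
        errors.append((part_num, e))

threads = [threading.Thread(target=safe_upload_part, args=(n,))
           for n in range(1, NUM_PARTS + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()                  # wait for every part, failed or not

if errors:
    mp.cancel_upload()        # abort only once nothing else is uploading
else:
    mp.complete_upload()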

Actions #12

Updated by Samuel Just over 9 years ago

  • Project changed from Ceph to rgw
Actions #13

Updated by Aaron Bassett over 9 years ago

wip-10445 worked in that it allowed me to clean up the unwanted uploads. Unfortunately, I think it was only cosmetic. As soon as I cancel one of these uploads, I start getting "RGWObjManifest::operator++(): result: ofs=92274688 stripe_ofs=92274688 part_ofs=0 rule->part_size=0" lines as fast as it can write to the log. In an attempt to see if it was making any progress through the parts, I did a sort|uniq -c on the ofs and stripe_ofs chunks. The results looked like this:

9832 result: ofs=281018368 stripe_ofs=281018368
9832 result: ofs=285212672 stripe_ofs=285212672
9832 result: ofs=289406976 stripe_ofs=289406976
9832 result: ofs=293601280 stripe_ofs=293601280
9835 result: ofs=29360128 stripe_ofs=29360128
9832 result: ofs=297795584 stripe_ofs=297795584
1 result: ofs=297812909 stripe_ofs=297812909
9831 result: ofs=301989888 stripe_ofs=301989888
9831 result: ofs=306184192 stripe_ofs=306184192
9831 result: ofs=310378496 stripe_ofs=310378496

It seems to me like it's caught in some sort of loop. I've let it run overnight and it doesn't seem to make any progress.
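For reference, the same counting can be scripted; a sketch assuming the log lines match the RGWObjManifest::operator++ message quoted in this ticket:

# Tally how often each ofs/stripe_ofs pair shows up in the radosgw log.
# Assumes the lines match the manifest message quoted in this ticket.
import re
import sys
from collections import Counter

pattern = re.compile(r'result: (ofs=\d+ stripe_ofs=\d+)')
counts = Counter()

with open(sys.argv[1]) as log:
    for line in log:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1

for chunk, n in sorted(counts.items()):
    print('%7d result: %s' % (n, chunk))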

Actions #14

Updated by Yehuda Sadeh over 9 years ago

How many objects are getting processed? Does it keep processing the same object, or is it processing a different object every time?

Actions #15

Updated by Aaron Bassett about 9 years ago

I'm not sure. Last time it happened, I had ~15 multipart uploads I couldn't cancel. Once this starts happening, it takes over the whole radosgw until it can't process any more requests.

Actions #16

Updated by Sage Weil about 9 years ago

  • Status changed from New to Fix Under Review
Actions #17

Updated by Casey Bodley about 4 years ago

  • Status changed from Fix Under Review to Resolved