Project

General

Profile

Actions

Bug #47722

open

rgw sometimes hang multilple minutes during GET query

Added by Arnaud Lefebvre over 3 years ago. Updated about 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

I'm having an issue recently which results in the Rados Gateway process to be stuck for some time. I've seen a couple of times today (~20 times). After it froze, it happened that the process got unstuck and resumed serving requests as expected. Unfortunately, I've been unable to reproduce it yet with higher logs. We have a 1/5 logs configuration. I'll try again in the next couple of days.

I've also found sometimes a few hangs (2/3 minutes) where the log `0 iterate_obj() failed with -104` appears after. Not sure if this is related as this one is printed from time to time:


2020-10-01 16:30:56.567 7f4daddff700 1 ====== req done req=0x7f4d9ca64170 op status=0 http_status=200 latency=0.002s ======
2020-10-01 16:30:56.571 7f4daddff700 1 ====== starting new request req=0x7f4da52aad70 =====
2020-10-01 16:31:30.785 7f4dab5fa700 0 iterate_obj() failed with -104
2020-10-01 16:31:30.786 7f4dab5fa700 1 ====== req done req=0x7f4dd6147f70 op status=-104 http_status=206 latency=242.986s ======
2020-10-01 16:31:30.786 7f4dab5fa700 1 ====== starting new request req=0x7f4dd6147ef0 =====
2020-10-01 16:31:30.788 7f4dab5fa700 1 ====== req done req=0x7f4dd6147ef0 op status=0 http_status=200 latency=0.002s ======

Here you can see a gap of ~1 minute 36 without requests. We have at least one request per second as a monitoring check, so this isn't normal.

Please find in the attachments the logs 20/20 of a request that hanged. Unfortunately I had to restart the node after it started timing-out (I should have waited instead). The logs are anonymized since the file belongs to one of our customers.
The request in the logs is a GET for a video file with a size of 414177017 bytes. The download is done using a signed URL. I can't say if it was that file that was responsible for the other hangs.

Let me know if there is anything I can help you with here. I'm also not sure as to where I should look because I don't really know at which ceph step the request failed.

Thank you


Files

ceph-issue.txt (13.4 KB) ceph-issue.txt Arnaud Lefebvre, 10/01/2020 05:53 PM
Actions

Also available in: Atom PDF