Bug #47722


rgw sometimes hangs for multiple minutes during GET query

Added by Arnaud Lefebvre over 3 years ago. Updated about 3 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

I've recently been having an issue where the Rados Gateway process gets stuck for some time. I've seen it happen a number of times today (~20 times). After freezing, the process sometimes got unstuck on its own and resumed serving requests as expected. Unfortunately, I haven't been able to reproduce it yet with a higher log level; we currently run with a 1/5 log configuration. I'll try again in the next couple of days.

I've also occasionally seen shorter hangs (2-3 minutes) after which the log line `0 iterate_obj() failed with -104` appears. I'm not sure whether this is related, since that line shows up from time to time anyway:


2020-10-01 16:30:56.567 7f4daddff700 1 ====== req done req=0x7f4d9ca64170 op status=0 http_status=200 latency=0.002s ======
2020-10-01 16:30:56.571 7f4daddff700 1 ====== starting new request req=0x7f4da52aad70 =====
2020-10-01 16:31:30.785 7f4dab5fa700 0 iterate_obj() failed with -104
2020-10-01 16:31:30.786 7f4dab5fa700 1 ====== req done req=0x7f4dd6147f70 op status=-104 http_status=206 latency=242.986s ======
2020-10-01 16:31:30.786 7f4dab5fa700 1 ====== starting new request req=0x7f4dd6147ef0 =====
2020-10-01 16:31:30.788 7f4dab5fa700 1 ====== req done req=0x7f4dd6147ef0 op status=0 http_status=200 latency=0.002s ======

Here you can see a gap of about 1 minute 36 seconds without requests. We have a monitoring check that issues at least one request per second, so this is not normal.
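
For reference, here is a quick sketch of how such gaps and high-latency requests can be spotted in an rgw log (the log path and thresholds below are arbitrary placeholders, not something from our setup):

# Sketch: scan an rgw log for requests with a high reported latency and for
# gaps between consecutive "starting new request" lines.
import re
from datetime import datetime

LOG_PATH = "rgw.log"          # placeholder path
LATENCY_THRESHOLD = 30.0      # seconds
GAP_THRESHOLD = 30.0          # seconds

ts_re = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")
lat_re = re.compile(r"latency=([\d.]+)s")

prev_start = None
with open(LOG_PATH) as f:
    for line in f:
        m = ts_re.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if "req done" in line:
            lat = lat_re.search(line)
            if lat and float(lat.group(1)) > LATENCY_THRESHOLD:
                print("slow request:", line.rstrip())
        elif "starting new request" in line:
            if prev_start is not None:
                gap = (ts - prev_start).total_seconds()
                if gap > GAP_THRESHOLD:
                    print(f"gap of {gap:.1f}s before:", line.rstrip())
            prev_start = ts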

Please find attached the 20/20 logs of a request that hung. Unfortunately I had to restart the node after it started timing out (I should have waited instead). The logs are anonymized since the file belongs to one of our customers.
The request in the logs is a GET for a video file of 414177017 bytes, downloaded through a signed URL. I can't say whether that particular file was responsible for the other hangs.
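
For context, the access pattern is roughly equivalent to the sketch below: generate a presigned GET URL for a large object and stream it, flagging stalls. The endpoint, credentials, bucket and key are placeholders, not the customer's.

# Sketch: presigned GET of a large object, flagging stalls while streaming.
import time

import boto3
import requests

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",   # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "test-bucket", "Key": "video.mp4"},
    ExpiresIn=3600,
)

resp = requests.get(url, stream=True, timeout=300)
last = time.monotonic()
for chunk in resp.iter_content(chunk_size=1 << 20):
    now = time.monotonic()
    if now - last > 10:          # flag pauses longer than 10 seconds
        print(f"stream stalled for {now - last:.1f}s")
    last = now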

Let me know if there is anything I can do to help here. I'm also not sure where I should look, because I don't really know at which Ceph step the request failed.

Thank you


Files

ceph-issue.txt (13.4 KB) - Arnaud Lefebvre, 10/01/2020 05:53 PM
#1 - Updated by Arnaud Lefebvre over 3 years ago

I'm sorry, it seems I managed to report this issue in the wrong place. I can't edit or delete the issue, so I'll keep it open; just note:
- This concerns the rgw project.
- This happened on version 14.2.10, not 14.2.11 (which we will upgrade to soon, and we'll see if the issue persists).

As a side note, this issue did not occur again as much as I was expecting it to. It did happen a few hours ago, when all 4 radosgw processes were timing out at nearly the same time. Because they were all down, I did not have time to gather additional information; I had to put the service back online.

#2 - Updated by Neha Ojha over 3 years ago

  • Project changed from RADOS to rgw
#3 - Updated by Nathan Cutler over 3 years ago

  • Affected Versions v14.2.10 added
  • Affected Versions deleted (v14.2.11)
#4 - Updated by Or Friedmann over 3 years ago

Hi,

Can you please add the output of OSD utilization during those errors?
Did you have slow requests in the cluster during those errors?
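
Something along these lines can be used to capture both periodically while the errors occur (just a sketch; the command set, interval and output file are arbitrary choices):

# Sketch: periodically record OSD utilization and cluster health (which
# reports slow op warnings, if any) from a mon/admin node.
import subprocess
import time

COMMANDS = [
    ["ceph", "osd", "utilization"],
    ["ceph", "osd", "df"],
    ["ceph", "health", "detail"],
]

with open("ceph-diag.log", "a") as out:
    while True:
        out.write(f"--- {time.strftime('%Y-%m-%d %H:%M:%S')} ---\n")
        for cmd in COMMANDS:
            out.write("$ " + " ".join(cmd) + "\n")
            out.write(subprocess.run(cmd, capture_output=True, text=True).stdout)
        out.flush()
        time.sleep(60)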

Thank you

#5 - Updated by Arnaud Lefebvre over 3 years ago

Hello,

Thanks for your reply.

Here is the OSD utilization right now (it's not exactly the same as two weeks ago; we have one OSD fewer, sorry):

avg 106.971
stddev 31.3209 (expected baseline 10.1939)
min osd.52 with 19 pgs (0.177618 * mean)
max osd.71 with 128 pgs (1.19658 * mean)

Let me know if this is the information you were asking for. We had a total of 36 OSDs; none of them was under high CPU, network, or disk load. The cluster was basically running as usual.

We did not have any slow requests reported by the mons; the cluster state was reported as healthy, and it seemed that way. Restarting the rgw services fixed the issue every time.

In case this happens again, would you mind telling me what I should check right away to get more information? After it happened, I looked at the logs of the OSDs hosting the PG where the file would be stored, and nothing unusual came up.
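
For example, would capturing something like the following from a hung radosgw (before restarting it) be the right direction? Only a sketch; the admin socket path pattern and the OSD ids are placeholders for our deployment.

# Sketch: dump rgw's in-flight RADOS operations via its admin socket, then
# dump in-flight ops on the OSDs they point at. Each "ceph daemon" call has
# to run on the host where that daemon lives.
import glob
import subprocess

def ceph_daemon(target, command):
    """Run `ceph daemon <target> <command>` and return its output."""
    result = subprocess.run(["ceph", "daemon", target, command],
                            capture_output=True, text=True)
    return result.stdout

# In-flight librados requests issued by rgw (shows which OSDs a stuck
# request is waiting on). Socket path pattern is a placeholder.
for sock in glob.glob("/var/run/ceph/ceph-client.rgw.*.asok"):
    print(f"== {sock} objecter_requests ==")
    print(ceph_daemon(sock, "objecter_requests"))

# Then, on the relevant OSD hosts (ids below are placeholders):
for osd_id in (52, 71):
    print(f"== osd.{osd_id} dump_ops_in_flight ==")
    print(ceph_daemon(f"osd.{osd_id}", "dump_ops_in_flight"))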
