Bug #64527
Radosgw 504 timeouts & Garbage collection is frozen
Description
Ceph version: 17.2.6-1
OS: Ubuntu 20.04
Deployed without Cephadm using SystemD.
Cluster specs:
```
mcollins1@ceph-p-mon-01:~$ ceph -s
  cluster:
    id:     e428af4c-c109-497e-af6f-54ebe1dda867
    health: HEALTH_WARN
            4725 pgs not deep-scrubbed in time
            15 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum ceph-p-mon-01,ceph-p-mon-02,ceph-p-mon-03,ceph-p-mon-04,ceph-p-mon-05 (age 2d)
    mgr: ceph-p-mgr-01(active, since 2d), standbys: ceph-p-mgr-02, ceph-p-mgr-05, ceph-p-mgr-03, ceph-p-mgr-04
    osd: 2748 osds: 2665 up (since 4d), 2650 in (since 5d); 37 remapped pgs
    rgw: 12 daemons active (12 hosts, 1 zones)

  data:
    pools:   10 pools, 18857 pgs
    objects: 5.18G objects, 19 PiB
    usage:   26 PiB used, 11 PiB / 37 PiB avail
```
- Symptoms
There were 2 main symptoms of this issue when we encountered it:
1) We would frequently get 504 timeout errors when attempting to list buckets/objects using tools like rclone/awscli.
```
mcollins1@maas-rackd-01:~$ rclone ls hotsauce1:
2024/02/12 09:23:44 Failed to ls: SerializationError: failed to unmarshal error message
	status code: 504, request id: , host id:
caused by: UnmarshalError: failed to unmarshal error message
00000000  3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e  |<html><body><h1>|
00000010  35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65  |504 Gateway Time|
00000020  2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65  |-out</h1>.The se|
00000030  72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70  |rver didn't resp|
00000040  6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62  |ond in time..</b|
00000050  6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a              |ody></html>.|
caused by: expected element type <Error> but have <html>

mcollins1@maas-rackd-01:~/scripts$ rclone lsd hotsauce1:
2024/02/13 13:48:07 ERROR : : error listing: SerializationError: failed to unmarshal error message
	status code: 504, request id: , host id:
caused by: UnmarshalError: failed to unmarshal error message
00000000  3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e  |<html><body><h1>|
00000010  35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65  |504 Gateway Time|
00000020  2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65  |-out</h1>.The se|
00000030  72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70  |rver didn't resp|
00000040  6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62  |ond in time..</b|
00000050  6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a              |ody></html>.|
caused by: expected element type <Error> but have <html>
```
It would infrequently (~1%) produce 504 errors when writing objects, and occasionally (~0.1%) produce 504 errors when reading objects.
```
PUT completed successfully for ./100MBfile-test_000000098.
PUT completed successfully for ./100MBfile-test_000000099.
PUT failed for ./100MBfile-test_000000100. See error details below:
2024/02/06 12:05:28 INFO  :
Transferred:   	    0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      1m0.0s
2024/02/06 12:05:28 ERROR : Attempt 1/3 failed with 1 errors and: GatewayTimeout: Gateway Timeout
	status code: 504, request id: , host id:
```
2) The garbage collection was completely frozen, with the queue just growing over time.
```
radosgw-admin gc list | grep -c oid
4859555
```
The RGWs would become unresponsive; their logs showed them receiving new requests but not completing them. After enough time these RGW nodes would freeze entirely and stop producing any log messages. Restarting them would get them working again, but the above symptoms remained.
- Other Error Messages
Some other errors we were encountering at the time which might be relevant:
```
mcollins1@ceph-p-mon-01:~$ radosgw-admin bucket rm --purge-objects --bucket=hotsauce-bench000000000093
...
2024-02-13T13:28:45.878+0800 7f90d9bdae40  1 lifecycle: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):426 invalidation attempt 9 failed: (110) Connection timed out
2024-02-13T13:28:45.878+0800 7f90d9bdae40  0 lifecycle: ERROR: failed to distribute cache for default.rgw.meta:root:.bucket.meta.hotsauce-bench000000000093:c0a34c69-bce9-4053-9ead-8ee81faae4c1.64636333.10
...
2024-02-13T13:34:15.944+0800 7f90d9bdae40  1 int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):426 invalidation attempt 9 failed: (110) Connection timed out
2024-02-13T13:34:15.944+0800 7f90d9bdae40  0 ERROR: remove(): failed to distribute cache: r=-110
```
We also found these, but they are apparently harmless:
```
2024-02-07T07:39:49.444+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000000 to be held by another RGW process; skipping for now
2024-02-07T07:39:49.444+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000001 to be held by another RGW process; skipping for now
2024-02-07T07:39:49.448+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000003 to be held by another RGW process; skipping for now
```
Initially the RGWs were OOM'ing every 1-10 minutes. This later stopped when the cluster became healthier, but the 504 errors and frozen garbage collection persisted:
```
mcollins1@maas-rackd-01:~$ for i in 01 02 03 04 05 06 07 08 09; do echo ceph-p-rgw-vm-$i; ssh ceph-p-rgw-vm-$i sudo cat /var/log/syslog | grep 'Out of memory' | tail -n 1; done
ceph-p-rgw-vm-01
2024-02-01T02:59:32.850823+08:00 ceph-p-rgw-vm-01 kernel: [95794.667160] Out of memory: Killed process 968268 (radosgw) total-vm:32321156kB, anon-rss:6960428kB, file-rss:0kB, shmem-rss:0kB, UID:64045 pgtables:28892kB oom_score_adj:0
ceph-p-rgw-vm-02
2024-02-01T07:03:05.002802+08:00 ceph-p-rgw-vm-02 kernel: [110312.565025] Out of memory: Killed process 1261996 (radosgw) total-vm:32145712kB, anon-rss:6973744kB, file-rss:0kB, shmem-rss:0kB, UID:64045 pgtables:28872kB oom_score_adj:0
...
```
The RGW nodes didn't log any relevant error messages when we were triggering these 504 errors, even with the verbosity of debug_rgw and debug_ms set to 20/20.
We also noticed that RGW nodes weren't shutting down cleanly and would have to be SIGKILL'ed each time:
```
mcollins1@ceph-p-rgw-vm-01:~$ sudo journalctl -u ceph-radosgw@rgw.ceph-p-rgw-vm-01.service | tail -n 10
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201711 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201712 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201713 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201714 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201715 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Failed to kill control group /system.slice/system-ceph\x2dradosgw.slice/ceph-radosgw@rgw.ceph-p-rgw-vm-01.service, ignoring: Input/output error
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Main process exited, code=killed, status=9/KILL
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Failed with result 'timeout'.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: Stopped Ceph rados gateway.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: Started Ceph rados gateway.
```
When tracing the RGW process we would always see this output:
```
mcollins1@ceph-p-rgw-vm-09:~$ sudo strace -p 1031
strace: Process 1031 attached
read(103,
```
With the logs set to maximum verbosity we would also see these messenger errors frequently:
```
2024-02-07T14:43:36.619+0800 7f7c270d3700  1 -- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 msgr2=0x7f7c0cd50c30 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 1458
2024-02-07T14:43:36.619+0800 7f7c270d3700  1 -- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 msgr2=0x7f7c0cd50c30 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
...
2024-02-07T14:43:36.619+0800 7f7c270d3700  1 --2- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 0x7f7c0cd50c30 crc :-1 s=READY pgs=11466 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
...
2024-02-07T14:43:36.619+0800 7f7c270d3700  2 --2- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 0x7f7c0cd50c30 crc :-1 s=READY pgs=11466 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault on lossy channel, failing
```
- Fixes We Tried
We tried a wide array of fixes to bring the cluster into a healthier state and resolve these errors:
- Restarting RGW nodes.
- Redeploying RGW nodes.
- Just running 1 RGW node.
- Giving the RGW nodes more RAM.
- Slowing recovery by remapping PGs with upmap (https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py), which allowed the monitor DB trimming to occur.
- Turning down target_max_misplaced_ratio to slow recovery.
- Just waiting for the cluster to recover.
Nothing seemed to help though...
- The Solution
We weren't able to find log messages anywhere that explained why this was occurring; it seems like a silent bug.
Strangely, it all boiled down to one misbehaving OSD, which we located as follows:
1) First listing every PG in the cluster with `rados ls`:
```
user@ceph-p-mon-03:~$ ps -ef|grep rados
user     1173205 1173203  0 07:11 pts/1    00:00:01 rados --pgid 3.120 ls
user     1227544 1227543  0 07:28 pts/1    00:00:01 rados --pgid 3.aa1 ls
user     1257073 1257072  0 07:37 pts/1    00:00:02 rados --pgid 3.fde ls
user     1258837 1258832  0 07:37 pts/1    00:00:01 rados --pgid 3.102e ls
user     1263857 1263856  0 07:39 pts/1    00:00:01 rados --pgid 3.1111 ls
user     1294344 1294343  0 07:48 pts/1    00:00:01 rados --pgid 3.165e ls
user     1323898 1323897  0 07:57 pts/1    00:00:02 rados --pgid 3.1b9b ls
user     1349663 1349662  0 08:05 pts/1    00:00:02 rados --pgid 3.202e ls
user     1354707 1354706  0 08:06 pts/1    00:00:02 rados --pgid 3.2111 ls
user     1385196 1385195  0 08:16 pts/1    00:00:02 rados --pgid 3.265e ls
user     1438760 1438759  0 08:33 pts/1    00:00:00 rados --pgid 3.2fde ls
user     1454365 1454364  0 08:38 pts/1    00:00:01 rados --pgid 3.32a1 ls
user     1505945 1505944  0 08:55 pts/1    00:00:00 rados --pgid 3.3b9b ls
user     1520186 1520185  0 09:00 pts/1    00:00:01 rados --pgid 3.3e20 ls
user     1563605  826342  0 09:51 pts/2    00:00:00 grep --color=auto rados
```
2) Then observed which commands were frozen.
```
user@ceph-p-mon-03:~/rados.out$ ls -tlr
total 298476
drwxrwxr-x 2 user user   131072 Feb 15 07:01 old
-rw-rw-r-- 1 user user        0 Feb 15 07:11 rados.3.120.out
-rw-rw-r-- 1 user user        0 Feb 15 07:28 rados.3.aa1.out
-rw-rw-r-- 1 user user 43762290 Feb 15 07:37 rados.3.fde.out
-rw-rw-r-- 1 user user        0 Feb 15 07:37 rados.3.102e.out
-rw-rw-r-- 1 user user        0 Feb 15 07:39 rados.3.1111.out
-rw-rw-r-- 1 user user        0 Feb 15 07:48 rados.3.165e.out
-rw-rw-r-- 1 user user 43781444 Feb 15 07:57 rados.3.1b9b.out
-rw-rw-r-- 1 user user 43616684 Feb 15 08:05 rados.3.202e.out
-rw-rw-r-- 1 user user 43623850 Feb 15 08:06 rados.3.2111.out
-rw-rw-r-- 1 user user 43490750 Feb 15 08:16 rados.3.265e.out
-rw-rw-r-- 1 user user        0 Feb 15 08:33 rados.3.2fde.out
-rw-rw-r-- 1 user user 43615991 Feb 15 08:38 rados.3.32a1.out
-rw-rw-r-- 1 user user        0 Feb 15 08:55 rados.3.3b9b.out
-rw-rw-r-- 1 user user 43601664 Feb 15 09:00 rados.3.3e20.out
```
3) Then deduced the one common OSD they all contained.
```
user@ceph-p-mon-03:~/rados.out$ echo "3.120 3.aa1 3.102e 3.1111 3.165e 3.2fde 3.3b9b"|xargs -n 1 ceph pg map
osdmap e1488193 pg 3.120 (3.120) -> up [164,885,1995,683,618,2281,2116,2308,2438,12,1067] acting [164,885,1995,683,618,2281,2116,2308,2438,12,1067]
osdmap e1488193 pg 3.aa1 (3.aa1) -> up [164,1148,2551,2655,1975,1129,489,1515,16,2056,1014] acting [164,1148,2551,2655,1975,1129,489,1515,16,2056,1014]
osdmap e1488193 pg 3.102e (3.102e) -> up [164,1660,858,1531,476,874,615,2199,1935,1941,127] acting [164,1660,858,1531,476,874,615,2199,1935,1941,127]
osdmap e1488193 pg 3.1111 (3.1111) -> up [164,1468,2005,1112,2301,931,406,1591,1785,584,2561] acting [164,1468,2005,1112,2301,931,406,1591,1785,584,2561]
osdmap e1488193 pg 3.165e (3.165e) -> up [164,1729,1302,1758,2152,2221,980,323,1033,761,513] acting [164,1729,1302,1758,2152,2221,980,323,1033,761,513]
osdmap e1488193 pg 3.2fde (3.2fde) -> up [164,531,911,1668,2422,1924,2187,263,1720,2673,2197] acting [164,531,911,1668,2422,1924,2187,263,1720,2673,2197]
osdmap e1488193 pg 3.3b9b (3.3b9b) -> up [164,2114,2210,2584,1486,2025,2607,1396,2681,1888,32] acting [164,2114,2210,2584,1486,2025,2607,1396,2681,1888,32]
```
4) Here we see OSD 164 is the common OSD across all of these PGs. We then restarted that OSD:
```
root@storage-03-08046:~# sudo systemctl restart ceph-osd@164
```
Suddenly everything started working again! It became impossible to produce the 504 errors and the garbage collection started processing.
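For anyone repeating this triage: the common-OSD deduction in step 3 amounts to a set intersection over the acting sets printed by `ceph pg map`. A minimal sketch (the acting sets are copied from the output above; this is an illustration, not a ceph/radosgw tool):

```python
# Sketch: given the acting set of each stuck PG, find the OSD(s) they all share.
# Acting sets copied from the `ceph pg map` output above.
from functools import reduce

acting_sets = {
    "3.120":  [164, 885, 1995, 683, 618, 2281, 2116, 2308, 2438, 12, 1067],
    "3.aa1":  [164, 1148, 2551, 2655, 1975, 1129, 489, 1515, 16, 2056, 1014],
    "3.102e": [164, 1660, 858, 1531, 476, 874, 615, 2199, 1935, 1941, 127],
    "3.1111": [164, 1468, 2005, 1112, 2301, 931, 406, 1591, 1785, 584, 2561],
    "3.165e": [164, 1729, 1302, 1758, 2152, 2221, 980, 323, 1033, 761, 513],
    "3.2fde": [164, 531, 911, 1668, 2422, 1924, 2187, 263, 1720, 2673, 2197],
    "3.3b9b": [164, 2114, 2210, 2584, 1486, 2025, 2607, 1396, 2681, 1888, 32],
}

# Intersect every acting set; whatever survives is common to all stuck PGs.
common = reduce(lambda a, b: a & b, (set(s) for s in acting_sets.values()))
print(common)  # {164}
```

With more stuck PGs the intersection narrows quickly, which is why a single shared OSD stood out even in an 11-wide EC pool.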
The OSD in question had no relevant errors or warnings in its logs...
We unfortunately do not know how to reproduce this bug. We struggled for 3 weeks to resolve it. It's concerning that one misbehaving OSD was able to cause so much dysfunction in the cluster. As I understand it, Ceph is supposed to tolerate OSDs breaking or misbehaving, but with this bug we clearly saw just one bad OSD bring a large production cluster to its knees.
If you have any further questions about this issue I'm happy to provide more information.
Updated by Michael Collins 2 months ago
Ouch syntax errors... Sorry I expected markdown to work here.
Updated by Casey Bodley 2 months ago
- Related to Feature #56956: osdc: Add objecter fastfail added
Updated by Casey Bodley 2 months ago
- Priority changed from Normal to High
thanks Michael
this is unfortunately common for librados clients like radosgw when pgs are down or unresponsive. the default behavior for a librados request is to wait, potentially forever, for the pg/osd to come back up and process the request
so radosgw does not degrade gracefully when the underlying cluster isn't healthy. eventually all of radosgw's threads/requests will end up hanging on librados responses, and we won't be able to accept new requests at all
this issue has come up several times in the past. most recently in https://tracker.ceph.com/issues/56956, where we tried to fail requests to pgs that were down/inactive. but it doesn't look like that would have helped in your case, because the pgs were still up
there exists a config variable rados_osd_op_timeout that will time out librados requests and return an error, but that itself has caused consistency issues in rgw. see https://github.com/ceph/ceph/pull/47667 for example, where we made the wrong decision based on an ETIMEDOUT error for a request that eventually succeeded
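for illustration only, the option would be set in the client's ceph.conf along these lines; the 30s value is arbitrary, and per the caveat above enabling it risks exactly the ETIMEDOUT consistency problems described, so it is not a recommended workaround:

```ini
# ceph.conf on the radosgw host -- sketch only, value is arbitrary
[client]
    # fail librados requests with no reply within 30s instead of blocking forever
    rados_osd_op_timeout = 30
```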
this is worth further discussion with the greater ceph team, because i don't think rgw alone has a way to solve this. moving this to high prio in the meantime
Updated by Casey Bodley 2 months ago
- Related to Bug #64418: RGW garbage collection stuck and growing added