Bug #64527
Radosgw 504 timeouts & Garbage collection is frozen
Description
Ceph version: 17.2.6-1
OS: Ubuntu 20.04
Deployed without Cephadm using SystemD.
Cluster specs:
```
mcollins1@ceph-p-mon-01:~$ ceph -s
  cluster:
    id:     e428af4c-c109-497e-af6f-54ebe1dda867
    health: HEALTH_WARN
            4725 pgs not deep-scrubbed in time
            15 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum ceph-p-mon-01,ceph-p-mon-02,ceph-p-mon-03,ceph-p-mon-04,ceph-p-mon-05 (age 2d)
    mgr: ceph-p-mgr-01(active, since 2d), standbys: ceph-p-mgr-02, ceph-p-mgr-05, ceph-p-mgr-03, ceph-p-mgr-04
    osd: 2748 osds: 2665 up (since 4d), 2650 in (since 5d); 37 remapped pgs
    rgw: 12 daemons active (12 hosts, 1 zones)

  data:
    pools:   10 pools, 18857 pgs
    objects: 5.18G objects, 19 PiB
    usage:   26 PiB used, 11 PiB / 37 PiB avail
```
- Symptoms
There were 2 main symptoms of this issue when we encountered it:
1) We would frequently get 504 timeout errors when attempting to list buckets/objects using tools like rclone/awscli.
```
mcollins1@maas-rackd-01:~$ rclone ls hotsauce1:
2024/02/12 09:23:44 Failed to ls: SerializationError: failed to unmarshal error message
	status code: 504, request id: , host id:
caused by: UnmarshalError: failed to unmarshal error message
00000000  3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e  |<html><body><h1>|
00000010  35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65  |504 Gateway Time|
00000020  2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65  |-out</h1>.The se|
00000030  72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70  |rver didn't resp|
00000040  6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62  |ond in time..</b|
00000050  6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a              |ody></html>.|
caused by: expected element type <Error> but have <html>

mcollins1@maas-rackd-01:~/scripts$ rclone lsd hotsauce1:
2024/02/13 13:48:07 ERROR : : error listing: SerializationError: failed to unmarshal error message
	status code: 504, request id: , host id:
caused by: UnmarshalError: failed to unmarshal error message
00000000  3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e  |<html><body><h1>|
00000010  35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65  |504 Gateway Time|
00000020  2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65  |-out</h1>.The se|
00000030  72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70  |rver didn't resp|
00000040  6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62  |ond in time..</b|
00000050  6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a              |ody></html>.|
caused by: expected element type <Error> but have <html>
```
It would infrequently (~1%) produce 504 errors when writing objects, and occasionally (~0.1%) produce 504 errors when reading objects.
```
PUT completed successfully for ./100MBfile-test_000000098.
PUT completed successfully for ./100MBfile-test_000000099.
PUT failed for ./100MBfile-test_000000100. See error details below:
2024/02/06 12:05:28 INFO  :
Transferred:   	    0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      1m0.0s
2024/02/06 12:05:28 ERROR : Attempt 1/3 failed with 1 errors and: GatewayTimeout: Gateway Timeout
	status code: 504, request id: , host id:
```
2) The garbage collection was completely frozen, with the queue just growing over time.
```
radosgw-admin gc list | grep -c oid
4859555
```
The RGWs would become unresponsive; their logs showed them receiving new requests but not completing them. After enough time these RGW nodes would freeze entirely and stop producing any log messages. Restarting them would get them working again, but the above symptoms remained.
- Other Error Messages
Some other errors we were encountering at the time which might be relevant:
```
mcollins1@ceph-p-mon-01:~$ radosgw-admin bucket rm --purge-objects --bucket=hotsauce-bench000000000093
...
2024-02-13T13:28:45.878+0800 7f90d9bdae40  1 lifecycle: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):426 invalidation attempt 9 failed: (110) Connection timed out
2024-02-13T13:28:45.878+0800 7f90d9bdae40  0 lifecycle: ERROR: failed to distribute cache for default.rgw.meta:root:.bucket.meta.hotsauce-bench000000000093:c0a34c69-bce9-4053-9ead-8ee81faae4c1.64636333.10
...
2024-02-13T13:34:15.944+0800 7f90d9bdae40  1 int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):426 invalidation attempt 9 failed: (110) Connection timed out
2024-02-13T13:34:15.944+0800 7f90d9bdae40  0 ERROR: remove(): failed to distribute cache: r=-110
```
We also found these, but they are apparently harmless:
```
2024-02-07T07:39:49.444+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000000 to be held by another RGW process; skipping for now
2024-02-07T07:39:49.444+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000001 to be held by another RGW process; skipping for now
2024-02-07T07:39:49.448+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000003 to be held by another RGW process; skipping for now
```
Initially the RGWs were OOM'ing every 1-10 minutes. This later stopped when the cluster became healthier, but the 504 errors and frozen garbage collection persisted:
```
mcollins1@maas-rackd-01:~$ for i in 01 02 03 04 05 06 07 08 09; do echo ceph-p-rgw-vm-$i; ssh ceph-p-rgw-vm-$i sudo cat /var/log/syslog | grep 'Out of memory' | tail -n 1; done
ceph-p-rgw-vm-01
2024-02-01T02:59:32.850823+08:00 ceph-p-rgw-vm-01 kernel: [95794.667160] Out of memory: Killed process 968268 (radosgw) total-vm:32321156kB, anon-rss:6960428kB, file-rss:0kB, shmem-rss:0kB, UID:64045 pgtables:28892kB oom_score_adj:0
ceph-p-rgw-vm-02
2024-02-01T07:03:05.002802+08:00 ceph-p-rgw-vm-02 kernel: [110312.565025] Out of memory: Killed process 1261996 (radosgw) total-vm:32145712kB, anon-rss:6973744kB, file-rss:0kB, shmem-rss:0kB, UID:64045 pgtables:28872kB oom_score_adj:0
...
```
The RGW nodes didn't log any relevant error messages when we were triggering these 504 errors, even with the verbosity of debug_rgw and debug_ms set to 20/20.
We also noticed that RGW nodes weren't shutting down cleanly and would have to be SIGKILL'ed each time:
```
mcollins1@ceph-p-rgw-vm-01:~$ sudo journalctl -u ceph-radosgw@rgw.ceph-p-rgw-vm-01.service | tail -n 10
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201711 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201712 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201713 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201714 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201715 (n/a) with signal SIGKILL.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Failed to kill control group /system.slice/system-ceph\x2dradosgw.slice/ceph-radosgw@rgw.ceph-p-rgw-vm-01.service, ignoring: Input/output error
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Main process exited, code=killed, status=9/KILL
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Failed with result 'timeout'.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: Stopped Ceph rados gateway.
Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: Started Ceph rados gateway.
```
When tracing the RGW process we would always see this output:
```
mcollins1@ceph-p-rgw-vm-09:~$ sudo strace -p 1031
strace: Process 1031 attached
read(103,
```
With the logs set to maximum verbosity we would also see these messenger errors frequently:
```
2024-02-07T14:43:36.619+0800 7f7c270d3700  1 -- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 msgr2=0x7f7c0cd50c30 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 1458
2024-02-07T14:43:36.619+0800 7f7c270d3700  1 -- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 msgr2=0x7f7c0cd50c30 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
...
2024-02-07T14:43:36.619+0800 7f7c270d3700  1 --2- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 0x7f7c0cd50c30 crc :-1 s=READY pgs=11466 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
...
2024-02-07T14:43:36.619+0800 7f7c270d3700  2 --2- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 0x7f7c0cd50c30 crc :-1 s=READY pgs=11466 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault on lossy channel, failing
```
- Fixes We Tried
We tried a wide array of fixes to bring the cluster into a healthier state and resolve these errors:
- Restarting RGW nodes.
- Redeploying RGW nodes.
- Just running 1 RGW node.
- Giving the RGW nodes more RAM.
- Slowing recovery by remapping PGs with upmap (https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py), which allowed the monitor DB trimming to occur.
- Turning down target_max_misplaced_ratio to slow recovery.
- Just waiting for the cluster to recover.
Nothing seemed to help though...
- The Solution
We weren't able to find log messages anywhere that explained why this was occurring; it seems like a silent bug.
Strangely, it all boiled down to one misbehaving OSD, which we located as follows:
1) First listing every PG in the cluster with `rados ls`:
```
user@ceph-p-mon-03:~$ ps -ef|grep rados
user     1173205 1173203  0 07:11 pts/1    00:00:01 rados --pgid 3.120 ls
user     1227544 1227543  0 07:28 pts/1    00:00:01 rados --pgid 3.aa1 ls
user     1257073 1257072  0 07:37 pts/1    00:00:02 rados --pgid 3.fde ls
user     1258837 1258832  0 07:37 pts/1    00:00:01 rados --pgid 3.102e ls
user     1263857 1263856  0 07:39 pts/1    00:00:01 rados --pgid 3.1111 ls
user     1294344 1294343  0 07:48 pts/1    00:00:01 rados --pgid 3.165e ls
user     1323898 1323897  0 07:57 pts/1    00:00:02 rados --pgid 3.1b9b ls
user     1349663 1349662  0 08:05 pts/1    00:00:02 rados --pgid 3.202e ls
user     1354707 1354706  0 08:06 pts/1    00:00:02 rados --pgid 3.2111 ls
user     1385196 1385195  0 08:16 pts/1    00:00:02 rados --pgid 3.265e ls
user     1438760 1438759  0 08:33 pts/1    00:00:00 rados --pgid 3.2fde ls
user     1454365 1454364  0 08:38 pts/1    00:00:01 rados --pgid 3.32a1 ls
user     1505945 1505944  0 08:55 pts/1    00:00:00 rados --pgid 3.3b9b ls
user     1520186 1520185  0 09:00 pts/1    00:00:01 rados --pgid 3.3e20 ls
user     1563605  826342  0 09:51 pts/2    00:00:00 grep --color=auto rados
```
2) Then observed which commands were frozen.
```
user@ceph-p-mon-03:~/rados.out$ ls -tlr
total 298476
drwxrwxr-x 2 user user   131072 Feb 15 07:01 old
-rw-rw-r-- 1 user user        0 Feb 15 07:11 rados.3.120.out
-rw-rw-r-- 1 user user        0 Feb 15 07:28 rados.3.aa1.out
-rw-rw-r-- 1 user user 43762290 Feb 15 07:37 rados.3.fde.out
-rw-rw-r-- 1 user user        0 Feb 15 07:37 rados.3.102e.out
-rw-rw-r-- 1 user user        0 Feb 15 07:39 rados.3.1111.out
-rw-rw-r-- 1 user user        0 Feb 15 07:48 rados.3.165e.out
-rw-rw-r-- 1 user user 43781444 Feb 15 07:57 rados.3.1b9b.out
-rw-rw-r-- 1 user user 43616684 Feb 15 08:05 rados.3.202e.out
-rw-rw-r-- 1 user user 43623850 Feb 15 08:06 rados.3.2111.out
-rw-rw-r-- 1 user user 43490750 Feb 15 08:16 rados.3.265e.out
-rw-rw-r-- 1 user user        0 Feb 15 08:33 rados.3.2fde.out
-rw-rw-r-- 1 user user 43615991 Feb 15 08:38 rados.3.32a1.out
-rw-rw-r-- 1 user user        0 Feb 15 08:55 rados.3.3b9b.out
-rw-rw-r-- 1 user user 43601664 Feb 15 09:00 rados.3.3e20.out
```
3) Then deduced the one common OSD they all contained.
```
user@ceph-p-mon-03:~/rados.out$ echo "3.120 3.aa1 3.102e 3.1111 3.165e 3.2fde 3.3b9b"|xargs -n 1 ceph pg map
osdmap e1488193 pg 3.120 (3.120) -> up [164,885,1995,683,618,2281,2116,2308,2438,12,1067] acting [164,885,1995,683,618,2281,2116,2308,2438,12,1067]
osdmap e1488193 pg 3.aa1 (3.aa1) -> up [164,1148,2551,2655,1975,1129,489,1515,16,2056,1014] acting [164,1148,2551,2655,1975,1129,489,1515,16,2056,1014]
osdmap e1488193 pg 3.102e (3.102e) -> up [164,1660,858,1531,476,874,615,2199,1935,1941,127] acting [164,1660,858,1531,476,874,615,2199,1935,1941,127]
osdmap e1488193 pg 3.1111 (3.1111) -> up [164,1468,2005,1112,2301,931,406,1591,1785,584,2561] acting [164,1468,2005,1112,2301,931,406,1591,1785,584,2561]
osdmap e1488193 pg 3.165e (3.165e) -> up [164,1729,1302,1758,2152,2221,980,323,1033,761,513] acting [164,1729,1302,1758,2152,2221,980,323,1033,761,513]
osdmap e1488193 pg 3.2fde (3.2fde) -> up [164,531,911,1668,2422,1924,2187,263,1720,2673,2197] acting [164,531,911,1668,2422,1924,2187,263,1720,2673,2197]
osdmap e1488193 pg 3.3b9b (3.3b9b) -> up [164,2114,2210,2584,1486,2025,2607,1396,2681,1888,32] acting [164,2114,2210,2584,1486,2025,2607,1396,2681,1888,32]
```
4) Here we see OSD 164 is the common OSD across all of these PGs. We then restarted that OSD:
```
root@storage-03-08046:~# sudo systemctl restart ceph-osd@164
```
Suddenly everything started working again! It became impossible to produce the 504 errors and the garbage collection started processing.
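For anyone repeating this triage: the common-OSD deduction in step 3 amounts to a set intersection over the acting sets printed by `ceph pg map`. A minimal sketch (the acting sets are copied from the output above; this is an illustration, not a ceph/radosgw tool):

```python
# Sketch: given the acting set of each stuck PG, find the OSD(s) they all share.
# Acting sets copied from the `ceph pg map` output above.
from functools import reduce

acting_sets = {
    "3.120":  [164, 885, 1995, 683, 618, 2281, 2116, 2308, 2438, 12, 1067],
    "3.aa1":  [164, 1148, 2551, 2655, 1975, 1129, 489, 1515, 16, 2056, 1014],
    "3.102e": [164, 1660, 858, 1531, 476, 874, 615, 2199, 1935, 1941, 127],
    "3.1111": [164, 1468, 2005, 1112, 2301, 931, 406, 1591, 1785, 584, 2561],
    "3.165e": [164, 1729, 1302, 1758, 2152, 2221, 980, 323, 1033, 761, 513],
    "3.2fde": [164, 531, 911, 1668, 2422, 1924, 2187, 263, 1720, 2673, 2197],
    "3.3b9b": [164, 2114, 2210, 2584, 1486, 2025, 2607, 1396, 2681, 1888, 32],
}

# Intersect every acting set; whatever survives is common to all stuck PGs.
common = reduce(lambda a, b: a & b, (set(s) for s in acting_sets.values()))
print(common)  # {164}
```

With more stuck PGs the intersection narrows quickly, which is why a single shared OSD stood out even in an 11-wide EC pool.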
The OSD in question had no relevant errors or warnings in its logs...
We unfortunately do not know how to reproduce this bug. We struggled for 3 weeks to resolve it. It's concerning that one misbehaving OSD was able to cause so much dysfunction in the cluster. As I understand it, Ceph is supposed to tolerate OSDs breaking or misbehaving, but with this bug we clearly saw just one bad OSD bring a large production cluster to its knees.
If you have any further questions about this issue I'm happy to provide more information.
Updated by Michael Collins 2 months ago
Ouch syntax errors... Sorry I expected markdown to work here.
Updated by Casey Bodley 2 months ago
- Related to Feature #56956: osdc: Add objecter fastfail added
Updated by Casey Bodley 2 months ago
- Priority changed from Normal to High
thanks Michael
this is unfortunately common for librados clients like radosgw when pgs are down or unresponsive. the default behavior for a librados request is to wait, potentially forever, for the pg/osd to come back up and process the request
so radosgw does not degrade gracefully when the underlying cluster isn't healthy. eventually all of radosgw's threads/requests will end up hanging on librados responses, and we won't be able to accept new requests at all
this issue has come up several times in the past. most recently in https://tracker.ceph.com/issues/56956, where we tried to fail requests to pgs that were down/inactive. but it doesn't look like that would have helped in your case, because the pgs were still up
there exists a config variable rados_osd_op_timeout that will time out librados requests and return an error, but that itself has caused consistency issues in rgw. see https://github.com/ceph/ceph/pull/47667 for example, where we made the wrong decision based on an ETIMEDOUT error for a request that eventually succeeded
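for illustration only, the option would be set in the client's ceph.conf along these lines; the 30s value is arbitrary, and per the caveat above enabling it risks exactly the ETIMEDOUT consistency problems described, so it is not a recommended workaround:

```ini
# ceph.conf on the radosgw host -- sketch only, value is arbitrary
[client]
    # fail librados requests with no reply within 30s instead of blocking forever
    rados_osd_op_timeout = 30
```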
this is worth further discussion with the greater ceph team, because i don't think rgw alone has a way to solve this. moving this to high prio in the meantime
Updated by Casey Bodley 2 months ago
- Related to Bug #64418: RGW garbage collection stuck and growing added