Bug #64527

 
 Ceph version: 17.2.6-1 
 OS: Ubuntu 20.04 
 Deployed without Cephadm using SystemD. 

 Cluster specs: 
```
 mcollins1@ceph-p-mon-01:~$ ceph -s 
   cluster: 
     id:       e428af4c-c109-497e-af6f-54ebe1dda867 
     health: HEALTH_WARN 
             4725 pgs not deep-scrubbed in time 
             15 pgs not scrubbed in time 

   services: 
     mon: 5 daemons, quorum ceph-p-mon-01,ceph-p-mon-02,ceph-p-mon-03,ceph-p-mon-04,ceph-p-mon-05 (age 2d) 
     mgr: ceph-p-mgr-01(active, since 2d), standbys: ceph-p-mgr-02, ceph-p-mgr-05, ceph-p-mgr-03, ceph-p-mgr-04 
     osd: 2748 osds: 2665 up (since 4d), 2650 in (since 5d); 37 remapped pgs 
     rgw: 12 daemons active (12 hosts, 1 zones) 

   data: 
     pools:     10 pools, 18857 pgs 
     objects: 5.18G objects, 19 PiB 
     usage:     26 PiB used, 11 PiB / 37 PiB avail 
```

 ## Symptoms 

 There were 2 main symptoms of this issue when we encountered it: 

 1) We would frequently get 504 timeout errors when attempting to list buckets/objects using tools like rclone/awscli. 
```
 mcollins1@maas-rackd-01:~$ rclone ls hotsauce1: 
 2024/02/12 09:23:44 Failed to ls: SerializationError: failed to unmarshal error message 
         status code: 504, request id: , host id: 
 caused by: UnmarshalError: failed to unmarshal error message 
         00000000    3c 68 74 6d 6c 3e 3c 62    6f 64 79 3e 3c 68 31 3e    |<html><body><h1>| 
 00000010    35 30 34 20 47 61 74 65    77 61 79 20 54 69 6d 65    |504 Gateway Time| 
 00000020    2d 6f 75 74 3c 2f 68 31    3e 0a 54 68 65 20 73 65    |-out</h1>.The se| 
 00000030    72 76 65 72 20 64 69 64    6e 27 74 20 72 65 73 70    |rver didn't resp| 
 00000040    6f 6e 64 20 69 6e 20 74    69 6d 65 2e 0a 3c 2f 62    |ond in time..</b| 
 00000050    6f 64 79 3e 3c 2f 68 74    6d 6c 3e 0a                |ody></html>.| 
 caused by: expected element type <Error> but have <html> 

 mcollins1@maas-rackd-01:~/scripts$ rclone lsd hotsauce1: 
 2024/02/13 13:48:07 ERROR : : error listing: SerializationError: failed to unmarshal error message 
         status code: 504, request id: , host id: 
 caused by: UnmarshalError: failed to unmarshal error message 
         00000000    3c 68 74 6d 6c 3e 3c 62    6f 64 79 3e 3c 68 31 3e    |<html><body><h1>| 
 00000010    35 30 34 20 47 61 74 65    77 61 79 20 54 69 6d 65    |504 Gateway Time| 
 00000020    2d 6f 75 74 3c 2f 68 31    3e 0a 54 68 65 20 73 65    |-out</h1>.The se| 
 00000030    72 76 65 72 20 64 69 64    6e 27 74 20 72 65 73 70    |rver didn't resp| 
 00000040    6f 6e 64 20 69 6e 20 74    69 6d 65 2e 0a 3c 2f 62    |ond in time..</b| 
 00000050    6f 64 79 3e 3c 2f 68 74    6d 6c 3e 0a                |ody></html>.| 
 caused by: expected element type <Error> but have <html> 
```

It would also produce 504 errors on roughly 1% of object writes and roughly 0.1% of object reads:
```
 PUT completed successfully for ./100MBfile-test_000000098. 
 PUT completed successfully for ./100MBfile-test_000000099. 
 PUT failed for ./100MBfile-test_000000100. See error details below: 
 2024/02/06 12:05:28 INFO    : 
 Transferred:               0 / 0 Bytes, -, 0 Bytes/s, ETA - 
 Elapsed time:         1m0.0s2024/02/06 12:05:28 ERROR : Attempt 1/3 failed with 1 errors and: GatewayTimeout: Gateway Timeout 
         status code: 504, request id: , host id: 
```
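
For anyone trying to reproduce the write-path failures, a loop along these lines is enough to estimate the 504 rate (the `hotsauce1:` remote is the same one used above; the bucket name and object count are illustrative, not our exact script):

```
# generate a 100 MB test file, then upload numbered copies and log any failures
dd if=/dev/urandom of=100MBfile bs=1M count=100
for i in $(seq -f '%09g' 1 200); do
    if rclone copyto 100MBfile "hotsauce1:test-bucket/100MBfile-test_${i}" \
         --retries 1 --low-level-retries 1; then
        echo "PUT completed successfully for ./100MBfile-test_${i}."
    else
        echo "PUT failed for ./100MBfile-test_${i}."
    fi
done
```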


 2) The garbage collection was completely frozen, with the queue just growing over time. 
```
 # radosgw-admin gc list | grep -c oid 
 4859555 
```
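
For context, the GC queue can be inspected and manually driven with radosgw-admin; the usual commands are roughly:

```
# count pending GC entries, including those not yet due for processing
radosgw-admin gc list --include-all | grep -c oid
# attempt to process the queue immediately instead of waiting for the next GC cycle
radosgw-admin gc process --include-all
```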

The RGWs would become unresponsive; the logs showed them receiving new requests but not completing them. After enough time these RGW nodes would freeze entirely and stop producing any log messages. We would then restart them and they would work again, but the above symptoms remained.


 ## Other Error Messages 

 Some other errors we were encountering at the time which might be relevant: 
```
 mcollins1@ceph-p-mon-01:~$ radosgw-admin bucket rm --purge-objects --bucket=hotsauce-bench000000000093 
 ... 
 2024-02-13T13:28:45.878+0800 7f90d9bdae40  1 lifecycle: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):426 invalidation attempt 9 failed: (110) Connection timed out 
 2024-02-13T13:28:45.878+0800 7f90d9bdae40  0 lifecycle: ERROR: failed to distribute cache for default.rgw.meta:root:.bucket.meta.hotsauce-bench000000000093:c0a34c69-bce9-4053-9ead-8ee81faae4c1.64636333.10 
 ... 
 2024-02-13T13:34:15.944+0800 7f90d9bdae40  1 int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):426 invalidation attempt 9 failed: (110) Connection timed out 
 2024-02-13T13:34:15.944+0800 7f90d9bdae40  0 ERROR: remove(): failed to distribute cache: r=-110 
```

 We also found these but they are apparently harmless: 
```
 2024-02-07T07:39:49.444+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000000 to be held by another RGW process; skipping for now 
 2024-02-07T07:39:49.444+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000001 to be held by another RGW process; skipping for now 
 2024-02-07T07:39:49.448+0800 7f5683fef700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000003 to be held by another RGW process; skipping for now 
```

Initially the RGWs were OOM'ing every 1-10 minutes. This later stopped when the cluster became healthier, but the 504 errors and frozen garbage collection persisted:
```
mcollins1@maas-rackd-01:~$ for i in 01 02 03 04 05 06 07 08 09; do echo ceph-p-rgw-vm-$i; ssh ceph-p-rgw-vm-$i sudo cat /var/log/syslog | grep 'Out of memory' | tail -n 1; done
 ceph-p-rgw-vm-01 
 2024-02-01T02:59:32.850823+08:00 ceph-p-rgw-vm-01 kernel: [95794.667160] Out of memory: Killed process 968268 (radosgw) total-vm:32321156kB, anon-rss:6960428kB, file-rss:0kB, shmem-rss:0kB, UID:64045 pgtables:28892kB oom_score_adj:0 
 ceph-p-rgw-vm-02 
 2024-02-01T07:03:05.002802+08:00 ceph-p-rgw-vm-02 kernel: [110312.565025] Out of memory: Killed process 1261996 (radosgw) total-vm:32145712kB, anon-rss:6973744kB, file-rss:0kB, shmem-rss:0kB, UID:64045 pgtables:28872kB oom_score_adj:0  
 ... 
```

The RGW nodes didn't give any relevant error messages when we were triggering these 504 errors, even with debug_rgw and debug_ms verbosity set to 20/20.
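
For reference, one way to raise that verbosity is via the centralized config database (with a plain ceph.conf the same options can be set under the [client.rgw.<name>] sections and the daemons restarted):

```
# raise RGW and messenger debug levels for all rgw daemons
ceph config set client.rgw debug_rgw 20/20
ceph config set client.rgw debug_ms 20/20
```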

 We also noticed that RGW nodes weren't shutting down cleanly and would have to be SIGKILL'ed each time: 
```
 mcollins1@ceph-p-rgw-vm-01:~$ sudo journalctl -u ceph-radosgw@rgw.ceph-p-rgw-vm-01.service | tail -n 10 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201711 (n/a) with signal SIGKILL. 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201712 (n/a) with signal SIGKILL. 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201713 (n/a) with signal SIGKILL. 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201714 (n/a) with signal SIGKILL. 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Killing process 201715 (n/a) with signal SIGKILL. 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Failed to kill control group /system.slice/system-ceph\x2dradosgw.slice/ceph-radosgw@rgw.ceph-p-rgw-vm-01.service, ignoring: Input/output error 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Main process exited, code=killed, status=9/KILL 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: ceph-radosgw@rgw.ceph-p-rgw-vm-01.service: Failed with result 'timeout'. 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: Stopped Ceph rados gateway. 
 Jan 29 14:31:43 ceph-p-rgw-vm-01 systemd[1]: Started Ceph rados gateway. 
```

 When tracing the RGW process we would always see this output: 
```
 mcollins1@ceph-p-rgw-vm-09:~$ sudo strace -p 1031 
 strace: Process 1031 attached 
 read(103, 
```
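
For completeness, the file descriptor the process blocks on can be identified from procfs (pid 1031 and fd 103 taken from the strace output above):

```
# show what fd 103 of the hung radosgw process points to (file, pipe, or socket)
sudo ls -l /proc/1031/fd/103
# if it is a socket, match it against the process's open TCP connections
sudo ss -tpn | grep 'pid=1031,'
```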

 With the logs set to maximum verbosity we would also see these messenger errors frequently: 
```
 2024-02-07T14:43:36.619+0800 7f7c270d3700    1 -- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 msgr2=0x7f7c0cd50c30 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 1458 
 2024-02-07T14:43:36.619+0800 7f7c270d3700    1 -- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 msgr2=0x7f7c0cd50c30 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed 
 ... 
 2024-02-07T14:43:36.619+0800 7f7c270d3700    1 --2- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 0x7f7c0cd50c30 crc :-1 s=READY pgs=11466 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted) 
 ... 
 2024-02-07T14:43:36.619+0800 7f7c270d3700    2 --2- 10.235.7.221:0/845349610 >> [v2:10.235.8.28:6854/7005,v1:10.235.8.28:6873/7005] conn(0x7f7c0cd369d0 0x7f7c0cd50c30 crc :-1 s=READY pgs=11466 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault on lossy channel, failing 
```


 ## Fixes We Tried 

We tried a wide array of fixes to try to bring the cluster into a healthier state and resolve these errors:
 - Restarting RGW nodes. 
 - Redeploying RGW nodes. 
 - Just running 1 RGW node. 
 - Giving the RGW nodes more RAM. 
- Slowing recovery by remapping PGs with upmap (https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py), which allowed the monitor DB trimming to occur.
- Turning down target_max_misplaced_ratio to slow recovery (see the sketch after this list).
- Just waiting for the cluster to recover.
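
For reference, the recovery-throttling items above were of this shape (the exact ratio value here is illustrative):

```
# upmap-remapped.py emits `ceph osd pg-upmap-items ...` commands that map
# misplaced PGs back to their current OSDs, pausing backfill so the mons can trim
./upmap-remapped.py | sh
# let the balancer move only a small fraction of data at a time
ceph config set mgr target_max_misplaced_ratio 0.01
```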

 Nothing seemed to help though... 


 ## The Solution 

We weren't able to find log messages anywhere that explained why this was occurring; it seems like a silent bug.

Strangely, it all boiled down to one misbehaving OSD, which we located as follows:

1) First, we listed the objects in every PG with `rados ls`, one output file per PG (a sketch of the driver loop follows the process listing below):
```
 user@ceph-p-mon-03:~$ ps -ef|grep rados 
 user     1173205 1173203    0 07:11 pts/1      00:00:01 rados --pgid 3.120 ls 
 user     1227544 1227543    0 07:28 pts/1      00:00:01 rados --pgid 3.aa1 ls 
 user     1257073 1257072    0 07:37 pts/1      00:00:02 rados --pgid 3.fde ls 
 user     1258837 1258832    0 07:37 pts/1      00:00:01 rados --pgid 3.102e ls 
 user     1263857 1263856    0 07:39 pts/1      00:00:01 rados --pgid 3.1111 ls 
 user     1294344 1294343    0 07:48 pts/1      00:00:01 rados --pgid 3.165e ls 
 user     1323898 1323897    0 07:57 pts/1      00:00:02 rados --pgid 3.1b9b ls 
 user     1349663 1349662    0 08:05 pts/1      00:00:02 rados --pgid 3.202e ls 
 user     1354707 1354706    0 08:06 pts/1      00:00:02 rados --pgid 3.2111 ls 
 user     1385196 1385195    0 08:16 pts/1      00:00:02 rados --pgid 3.265e ls 
 user     1438760 1438759    0 08:33 pts/1      00:00:00 rados --pgid 3.2fde ls 
 user     1454365 1454364    0 08:38 pts/1      00:00:01 rados --pgid 3.32a1 ls 
 user     1505945 1505944    0 08:55 pts/1      00:00:00 rados --pgid 3.3b9b ls 
 user     1520186 1520185    0 09:00 pts/1      00:00:01 rados --pgid 3.3e20 ls 
 user     1563605    826342    0 09:51 pts/2      00:00:00 grep --color=auto rados 
```
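
A minimal sketch of that driver loop (pool name is illustrative, JSON field names assumed from Quincy's `ceph pg ls-by-pool` output; not the exact script we used):

```
# list the objects in every PG of the affected pool, one output file per PG;
# a PG whose listing hangs leaves an empty rados.<pgid>.out behind (see step 2)
pool=default.rgw.buckets.data   # illustrative; ours was pool id 3
for pgid in $(ceph pg ls-by-pool "$pool" -f json | jq -r '.pg_stats[].pgid'); do
    timeout 600 rados --pgid "$pgid" ls > "rados.${pgid}.out"
done
```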

2) Then we observed which commands were frozen (their output files remained empty):
```
 user@ceph-p-mon-03:~/rados.out$ ls -tlr 
 total 298476 
 drwxrwxr-x 2 user user     131072 Feb 15 07:01 old 
 -rw-rw-r-- 1 user user          0 Feb 15 07:11 rados.3.120.out 
 -rw-rw-r-- 1 user user          0 Feb 15 07:28 rados.3.aa1.out 
 -rw-rw-r-- 1 user user 43762290 Feb 15 07:37 rados.3.fde.out 
 -rw-rw-r-- 1 user user          0 Feb 15 07:37 rados.3.102e.out 
 -rw-rw-r-- 1 user user          0 Feb 15 07:39 rados.3.1111.out 
 -rw-rw-r-- 1 user user          0 Feb 15 07:48 rados.3.165e.out 
 -rw-rw-r-- 1 user user 43781444 Feb 15 07:57 rados.3.1b9b.out 
 -rw-rw-r-- 1 user user 43616684 Feb 15 08:05 rados.3.202e.out 
 -rw-rw-r-- 1 user user 43623850 Feb 15 08:06 rados.3.2111.out 
 -rw-rw-r-- 1 user user 43490750 Feb 15 08:16 rados.3.265e.out 
 -rw-rw-r-- 1 user user          0 Feb 15 08:33 rados.3.2fde.out 
 -rw-rw-r-- 1 user user 43615991 Feb 15 08:38 rados.3.32a1.out 
 -rw-rw-r-- 1 user user          0 Feb 15 08:55 rados.3.3b9b.out 
 -rw-rw-r-- 1 user user 43601664 Feb 15 09:00 rados.3.3e20.out 
```

3) Then we deduced the one OSD common to all of the frozen PGs (a mechanical version of this check follows the output):
```
 user@ceph-p-mon-03:~/rados.out$ echo "3.120 3.aa1 3.102e 3.1111 3.165e 3.2fde 3.3b9b"|xargs -n 1 ceph pg map  
 osdmap e1488193 pg 3.120 (3.120) -> up [164,885,1995,683,618,2281,2116,2308,2438,12,1067] acting [164,885,1995,683,618,2281,2116,2308,2438,12,1067] 
 osdmap e1488193 pg 3.aa1 (3.aa1) -> up [164,1148,2551,2655,1975,1129,489,1515,16,2056,1014] acting [164,1148,2551,2655,1975,1129,489,1515,16,2056,1014] 
 osdmap e1488193 pg 3.102e (3.102e) -> up [164,1660,858,1531,476,874,615,2199,1935,1941,127] acting [164,1660,858,1531,476,874,615,2199,1935,1941,127] 
 osdmap e1488193 pg 3.1111 (3.1111) -> up [164,1468,2005,1112,2301,931,406,1591,1785,584,2561] acting [164,1468,2005,1112,2301,931,406,1591,1785,584,2561] 
 osdmap e1488193 pg 3.165e (3.165e) -> up [164,1729,1302,1758,2152,2221,980,323,1033,761,513] acting [164,1729,1302,1758,2152,2221,980,323,1033,761,513] 
 osdmap e1488193 pg 3.2fde (3.2fde) -> up [164,531,911,1668,2422,1924,2187,263,1720,2673,2197] acting [164,531,911,1668,2422,1924,2187,263,1720,2673,2197] 
 osdmap e1488193 pg 3.3b9b (3.3b9b) -> up [164,2114,2210,2584,1486,2025,2607,1396,2681,1888,32] acting [164,2114,2210,2584,1486,2025,2607,1396,2681,1888,32] 
```
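
With a larger set of stuck PGs the intersection can be computed mechanically; a rough one-liner over the same `ceph pg map` output would be:

```
# tally how often each OSD appears in the acting sets of the stuck PGs;
# an OSD that shows up in every one of them is the prime suspect
echo "3.120 3.aa1 3.102e 3.1111 3.165e 3.2fde 3.3b9b" | xargs -n1 ceph pg map \
  | sed -e 's/.*acting \[//' -e 's/\].*$//' \
  | tr ',' '\n' | sort -n | uniq -c | sort -rn | head
```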

 4) Here we see OSD 164 is the common OSD across all of these PGs. We then restarted that OSD: 
```
 root@storage-03-08046:~# sudo systemctl restart ceph-osd@164 
```

 Suddenly everything started working again! It became impossible to produce the 504 errors and the garbage collection started processing. 

The OSD in question had no relevant errors or warnings in its logs...

Unfortunately, we do not know how to reproduce this bug. We struggled with it for 3 weeks. It's concerning that a single misbehaving OSD was able to cause so much dysfunction in the cluster. As I understand it, Ceph is supposed to tolerate OSDs breaking or misbehaving, but with this bug we clearly saw just 1 bad OSD bring a large production cluster to its knees.

 If you have any further questions about this issue I'm happy to provide more information. 
