Bug #53025

OSD timing out causes all the RGW to stop providing service

Added by Alex Kershaw over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen on a containerised cluster running across 3 VMs with 3x MON, 3x OSD, 3x MDS, 6x RGW (3 regular, 3 pubsub).

ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)

During performance testing we noticed an OSD hitting timeouts causing all 3 regular RGWs to stop providing service temporarily.

Expected outcome
================
The OSD gets caught up hitting timeouts and eventually is restarted. There is no interruption to service during this.

Actual outcome
==============
The OSD gets caught up hitting timeouts and eventually is restarted. All S3 access through the RGWs went down during this time, eventually recovering after the OSD was restarted.

I've attached both the OSD and RGW container logs around one occurrence. The OSD died several times over the day:

Oct 21 05:10:54 condor_sc1 container_name/ceph-osd-1[1120]: teardown: Bye Bye, container will die with return code 0
Oct 21 09:58:59 condor_sc1 container_name/ceph-osd-1[1120]: teardown: Bye Bye, container will die with return code 0
Oct 21 11:56:39 condor_sc1 container_name/ceph-osd-1[1120]: teardown: Bye Bye, container will die with return code 0
Oct 21 14:57:05 condor_sc1 container_name/ceph-osd-1[1120]: teardown: Bye Bye, container will die with return code 0
Oct 21 16:41:02 condor_sc1 container_name/ceph-osd-1[1120]: teardown: Bye Bye, container will die with return code 0
Oct 21 18:34:14 condor_sc1 container_name/ceph-osd-1[1120]: teardown: Bye Bye, container will die with return code 0

Something odd I spotted in the RGW logs around each occurrence was a several-minute gap in the logs:

Oct 21 09:50:43 condor_sc1 container_name/ceph-rgw-condor_sc1-rgw0[1120]: 2021-10-21T09:50:43.078+0000 7f0207d6c700 0 RGW-SYNC:data:sync:shard[51]: ERROR: failed to read remote data log info: ret=-5
Oct 21 09:50:43 condor_sc1 container_name/ceph-rgw-condor_sc1-rgw0[1120]: 2021-10-21T09:50:43.078+0000 7f0207d6c700 0 RGW-SYNC:data:sync:shard[119]: ERROR: failed to read remote data log info: ret=-5
Oct 21 09:54:33 condor_sc1 container_name/ceph-rgw-condor_sc1-rgw0[1120]: 2021-10-21T09:54:33.745+0000 7f022059d700 -1 RGWWatcher::handle_error cookie 94409032206592 err (107) Transport endpoint is not connected
Oct 21 09:54:33 condor_sc1 container_name/ceph-rgw-condor_sc1-rgw0[1120]: 2021-10-21T09:54:33.745+0000 7f022059d700 -1 RGWWatcher::handle_error cookie 94409006694400 err (107) Transport endpoint is not connected
Oct 21 09:54:34 condor_sc1 container_name/ceph-rgw-condor_sc1-rgw0[1120]: 2021-10-21T09:54:34.105+0000 7f01cecfa700 0 ERROR: client_io->complete_request() returned Connection reset by peer

We have some large omap alarms on the cluster that we suspect may be the cause of the OSD timeouts. Even so, we still expect service not to be interrupted when an OSD gets hung up like this.

  cluster:
    id:     8f9a7ae9-968d-42ae-86e9-7f49d1e61624
    health: HEALTH_WARN
            5 large omap objects

  services:
    mon: 3 daemons, quorum condor_sc0,condor_sc1,condor_sc2 (age 7d)
    mgr: condor_sc0(active, since 7d), standbys: condor_sc2, condor_sc1
    mds: cephfs:1 {0=condor_sc0=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 3d), 3 in (since 3d)
    rgw: 6 daemons active (condor_sc0.pubsub, condor_sc0.rgw0, condor_sc1.pubsub, condor_sc1.rgw0, condor_sc2.pubsub, condor_sc2.rgw0)

  task status:

  data:
    pools:   14 pools, 137 pgs
    objects: 3.78M objects, 98 GiB
    usage:   543 GiB used, 507 GiB / 1.0 TiB avail
    pgs:     137 active+clean

  io:
    client: 26 MiB/s rd, 3.2 MiB/s wr, 1.44k op/s rd, 294 op/s wr

At the time this was running we had bumped two config timeouts on all the OSDs:
- osd_op_thread_timeout from 15 to 200
- osd_op_thread_suicide_timeout from 150 to 500

However I don't see how this change could explain this behaviour.
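
(For reference, overrides like these can be applied through the monitors' centralized config store; the commands below are illustrative only, not necessarily how we applied the change:)

ceph config set osd osd_op_thread_timeout 200
ceph config set osd osd_op_thread_suicide_timeout 500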

I'm particularly interested in the "Transport endpoint is not connected" errors. That sounds fairly worrying and may indicate some setup issue with our deployment. I didn't get anywhere debugging this, so any pointers would be much appreciated.

While we have no evidence from Ceph itself that the RGWs actually stopped providing service, we have irrefutable evidence from our clients that the regular RGW endpoints were not contactable during these OSD restarts. Notably, the pubsub RGW endpoints did not go down.

osd.txt - osd container log (971 KB) Alex Kershaw, 10/25/2021 10:39 AM

rgw.txt - rgw container log (971 KB) Alex Kershaw, 10/25/2021 10:39 AM

History

#2 Updated by Alex Kershaw over 2 years ago

We've also got large omap alarms on our cluster.

#3 Updated by Igor Fedotov over 2 years ago

You might want to try compacting the OSDs' database using ceph-kvstore-tool's compact command. We've seen plenty of cases where the suicide timeout was caused by RocksDB "fragmentation" and a subsequent performance drop after bulk data removal. I think you're facing the same issue with your OSDs, so compaction should help (at least temporarily) - a rough example is below the list.
Other factors that might negatively impact DB performance (which I believe is [one of] the primary triggers for your issues) are:
1) Running with the bluefs_buffered_io parameter set to false, which isn't the default in 15.2.13 though...
2) A significant amount of DB data residing on slow drives, e.g. spinning ones.
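
For an offline compaction the general shape is something like this, run on the OSD node with the daemon stopped; the data path is the usual default and will differ inside your containers, so treat it as illustrative:

# stop the OSD daemon first (however your deployment manages that), then:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
# restart the OSD afterwards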

#4 Updated by Alex Kershaw over 2 years ago

Hi Igor, thanks for your response, it was massively helpful.

1) I checked the bluefs_buffered_io parameter; it was set to true.
2) We've previously hit our performance targets with these spinning disks, so I'm reluctant to blame them alone for the problems.
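
(Something along these lines should show the running value when run on the OSD host / inside the OSD container; osd.0 and the exact invocation are just illustrative:)

ceph daemon osd.0 config get bluefs_buffered_io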

Compacting the OSDs worked wonders. Both OMAP and META usage decreased drastically. Things seem to be running much more smoothly now, and all my large omap alarms have cleared.

Before compaction:

[qs-admin@condor_sc0 ~]$ ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.34180 1.00000 350 GiB 227 GiB 123 GiB 96 GiB 8.1 GiB 123 GiB 64.95 1.13 137 up
1 hdd 0.34180 1.00000 350 GiB 182 GiB 123 GiB 50 GiB 9.1 GiB 168 GiB 52.13 0.90 137 up
2 hdd 0.34180 1.00000 350 GiB 196 GiB 123 GiB 66 GiB 7.7 GiB 154 GiB 56.05 0.97 137 up
TOTAL 1.0 TiB 606 GiB 369 GiB 212 GiB 25 GiB 444 GiB 57.71
MIN/MAX VAR: 0.90/1.13 STDDEV: 5.36

After compaction:

[qs-admin@condor_sc0 ~]$ ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.34180 1.00000 350 GiB 131 GiB 123 GiB 3.3 GiB 4.4 GiB 219 GiB 37.33 1.00 137 up
1 hdd 0.34180 1.00000 350 GiB 131 GiB 123 GiB 3.3 GiB 4.4 GiB 219 GiB 37.35 1.00 137 up
2 hdd 0.34180 1.00000 350 GiB 131 GiB 123 GiB 3.3 GiB 4.4 GiB 219 GiB 37.35 1.00 137 up
TOTAL 1.0 TiB 392 GiB 369 GiB 9.8 GiB 13 GiB 658 GiB 37.34
MIN/MAX VAR: 1.00/1.00 STDDEV: 0.01

We use ceph-ansible and run within containers, so I couldn't get ceph-kvstore-tool to work. Instead I did a 'live' compaction (i.e. the OSD stayed active while I ran "ceph tell osd.N compact").
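
Roughly what I ran, one OSD at a time (the loop is just illustrative shorthand):

for id in 0 1 2; do ceph tell osd.$id compact; done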

We did do a bulk data deletion on this cluster (we had a bug that caused our pubsub queue to grow so large that it was faster to delete it than to deal with it), so perhaps that was the cause. What was the giveaway for you, so we know what to look out for in the future?

Thanks again!
