
Bug #48021

severe RBD performance degradation with long-running write-heavy workload

Added by Shridhar S over 3 years ago.

Status: New
Priority: Normal
Assignee: -
% Done: 0%
Regression: No
Severity: 1 - critical

Description

We are running fio benchmark tests concurrently on a large number of RBD volumes (over 8,000), each 6 GB in size. Before running the actual read/write benchmark, fio creates a 6 GB file on each volume, for a total write of 48 TB. The volumes are configured with a replication factor of 3, so the total written to the cluster is about 144 TB.
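
For context, the layout phase per volume looks roughly like the fio invocation below. This is a minimal sketch, not our exact job file: the ioengine, block size, queue depth, and mount path are illustrative; only size=6g reflects the per-volume file described above. Each client runs 40 such jobs, one per mapped volume.

  fio --name=vol00 --filename=/mnt/vol00/fio.testfile --size=6g \
      --rw=write --bs=1M --iodepth=32 --direct=1 --ioengine=libaio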

We are noticing that the write throughput of the cluster starts at about 40-45 GB/s and stays there for a few minutes before dropping drastically to 1-5 GB/s. On the Nautilus release this is even worse: throughput drops to ~500 MB/s. After the drop, the writes continue at this pace for several hours until all writes are complete. When fio gets to the actual benchmark, throughput bounces back to over 30 GB/s for both reads and writes.

While the throughput is degraded, there is no CPU or memory pressure on any of the OSD, mon, or mgr processes. Most of the threads appear to be sleeping on mutexes, with very few in the running state; CPU utilization is just under 5%.
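
The thread-state observations can be reproduced with standard tools; a sketch, where the OSD id and pid are placeholders:

  top -H -p $(pidof -s ceph-osd)                     # per-thread states: mostly S (sleeping)
  ceph daemon osd.0 dump_ops_in_flight               # ages of in-flight ops on one OSD
  gdb -p <osd-pid> -batch -ex 'thread apply all bt'  # where the sleeping threads are blocked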

We have tried several configuration changes but saw the problem at least once in every configuration. Details of the cluster and of the things we have tried follow.

The cluster has 10 data nodes and 5 monitor nodes. Each data node has 10 Samsung PM1725B NVMe drives, and we run 2 OSDs per device, configured using the ceph-volume tool. All nodes have 2x100G links in a bonded configuration. We are using 208 client machines, each with a 25G link, and run 40 RBD volumes per client to simulate a workload of 8,320 concurrent volume I/O streams.
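
For reference, the 2-OSDs-per-device layout was prepared with ceph-volume; the invocation was along these lines (the exact subcommand and device paths here are illustrative, not copied from our deployment):

  ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1   # ... one entry per NVMe drive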

Some of the things we tried, without much success:
1. Reduced the number of mons to 3.
2. Reduced the number of OSDs to 1 per NVMe drive. This gave a better result, but we still hit the problem 2 times out of 6.
3. Tried increasing OSD memory.
4. On the Nautilus release the problem is more severe: we saw it with only 2,000 concurrent fio operations, whereas with Mimic we saw it 1 time in 4 with 8,000 volumes.
5. When we hit the problem, new volume creation works normally, but any attempt to map, format, or write to a volume hangs for a very long time (see the probe commands sketched after this list).
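
To make item 5 concrete, this is the kind of probe we run while the cluster is in the degraded state (pool and image names are illustrative):

  rbd create --size 6144 testpool/probe                                     # completes normally
  rbd map testpool/probe                                                    # hangs for a very long time
  mkfs.ext4 /dev/rbd/testpool/probe                                         # hangs
  dd if=/dev/zero of=/dev/rbd/testpool/probe bs=1M count=100 oflag=direct   # hangs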

=== Output from ceph.log showing the read/write performance described above ===
2020-10-12 11:04:26.658522 mgr.volume-mon-05 (mgr.1240949) 117575 : cluster [DBG] pgmap v118418: 8192 pgs: 8192 active+clean; 15 TiB data, 44 TiB used, 246 TiB / 291 TiB avail; 9.8 KiB/s rd, 40 GiB/s wr, 10.19k op/s
2020-10-12 11:04:28.672348 mgr.volume-mon-05 (mgr.1240949) 117576 : cluster [DBG] pgmap v118419: 8192 pgs: 8192 active+clean; 15 TiB data, 45 TiB used, 246 TiB / 291 TiB avail; 10 KiB/s rd, 40 GiB/s wr, 10.44k op/s
2020-10-12 11:04:30.682339 mgr.volume-mon-05 (mgr.1240949) 117577 : cluster [DBG] pgmap v118420: 8192 pgs: 8192 active+clean; 15 TiB data, 45 TiB used, 246 TiB / 291 TiB avail; 10 KiB/s rd, 38 GiB/s wr, 9.84k op/s
2020-10-12 11:04:32.694943 mgr.volume-mon-05 (mgr.1240949) 117578 : cluster [DBG] pgmap v118421: 8192 pgs: 8192 active+clean; 15 TiB data, 45 TiB used, 246 TiB / 291 TiB avail; 9.6 KiB/s rd, 39 GiB/s wr, 9.98k op/s
2020-10-12 11:04:34.706667 mgr.volume-mon-05 (mgr.1240949) 117579 : cluster [DBG] pgmap v118422: 8192 pgs: 8192 active+clean; 15 TiB data, 45 TiB used, 245 TiB / 291 TiB avail; 10 KiB/s rd, 38 GiB/s wr, 9.77k op/s
2020-10-12 11:04:36.720519 mgr.volume-mon-05 (mgr.1240949) 117580 : cluster [DBG] pgmap v118423: 8192 pgs: 8192 active+clean; 15 TiB data, 45 TiB used, 245 TiB / 291 TiB avail; 9.9 KiB/s rd, 37 GiB/s wr, 9.68k op/s
2020-10-12 11:04:38.732976 mgr.volume-mon-05 (mgr.1240949) 117581 : cluster [DBG] pgmap v118424: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 245 TiB / 291 TiB avail; 1.6 MiB/s rd, 37 GiB/s wr, 9.65k op/s
2020-10-12 11:04:40.741539 mgr.volume-mon-05 (mgr.1240949) 117582 : cluster [DBG] pgmap v118425: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 245 TiB / 291 TiB avail; 3.9 MiB/s rd, 34 GiB/s wr, 8.88k op/s
2020-10-12 11:04:42.753725 mgr.volume-mon-05 (mgr.1240949) 117583 : cluster [DBG] pgmap v118426: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 245 TiB / 291 TiB avail; 9.3 MiB/s rd, 34 GiB/s wr, 8.79k op/s
2020-10-12 11:04:44.763885 mgr.volume-mon-05 (mgr.1240949) 117584 : cluster [DBG] pgmap v118427: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 245 TiB / 291 TiB avail; 12 MiB/s rd, 31 GiB/s wr, 8.06k op/s
2020-10-12 11:04:46.776046 mgr.volume-mon-05 (mgr.1240949) 117585 : cluster [DBG] pgmap v118428: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 20 MiB/s rd, 27 GiB/s wr, 7.09k op/s
2020-10-12 11:04:48.788287 mgr.volume-mon-05 (mgr.1240949) 117586 : cluster [DBG] pgmap v118429: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 30 MiB/s rd, 23 GiB/s wr, 6.04k op/s

2020-10-12 11:04:55.177054 mgr.volume-mon-05 (mgr.1240949) 117589 : cluster [DBG] pgmap v118432: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 22 MiB/s rd, 8.7 GiB/s wr, 2.34k op/s
2020-10-12 11:04:57.191067 mgr.volume-mon-05 (mgr.1240949) 117590 : cluster [DBG] pgmap v118433: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 20 MiB/s rd, 5.9 GiB/s wr, 1.61k op/s
2020-10-12 11:04:59.203656 mgr.volume-mon-05 (mgr.1240949) 117591 : cluster [DBG] pgmap v118434: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 14 MiB/s rd, 3.8 GiB/s wr, 1.04k op/s
2020-10-12 11:05:01.216284 mgr.volume-mon-05 (mgr.1240949) 117592 : cluster [DBG] pgmap v118435: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 4.1 MiB/s rd, 2.2 GiB/s wr, 589 op/s
2020-10-12 11:05:03.231017 mgr.volume-mon-05 (mgr.1240949) 117593 : cluster [DBG] pgmap v118436: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 2.4 MiB/s rd, 1.9 GiB/s wr, 503 op/s
2020-10-12 11:05:05.241061 mgr.volume-mon-05 (mgr.1240949) 117594 : cluster [DBG] pgmap v118437: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 2.4 MiB/s rd, 1.5 GiB/s wr, 402 op/s
2020-10-12 11:05:07.254532 mgr.volume-mon-05 (mgr.1240949) 117595 : cluster [DBG] pgmap v118438: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 2.4 MiB/s rd, 1.5 GiB/s wr, 395 op/s
2020-10-12 11:05:09.266649 mgr.volume-mon-05 (mgr.1240949) 117596 : cluster [DBG] pgmap v118439: 8192 pgs: 8192 active+clean; 15 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 2.2 MiB/s rd, 1.4 GiB/s wr, 382 op/s
2020-10-12 11:05:11.279722 mgr.volume-mon-05 (mgr.1240949) 117597 : cluster [DBG] pgmap v118440: 8192 pgs: 8192 active+clean; 16 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 2.5 MiB/s rd, 1.4 GiB/s wr, 372 op/s
2020-10-12 11:05:13.291635 mgr.volume-mon-05 (mgr.1240949) 117598 : cluster [DBG] pgmap v118441: 8192 pgs: 8192 active+clean; 16 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 5.3 MiB/s rd, 1.7 GiB/s wr, 463 op/s
2020-10-12 11:05:15.301944 mgr.volume-mon-05 (mgr.1240949) 117599 : cluster [DBG] pgmap v118442: 8192 pgs: 8192 active+clean; 16 TiB data, 46 TiB used, 244 TiB / 291 TiB avail; 7.4 MiB/s rd, 1.8 GiB/s wr, 499 op/s

2020-10-12 11:08:08.748316 mgr.volume-mon-05 (mgr.1240949) 117685 : cluster [DBG] pgmap v118528: 8192 pgs: 8192 active+clean; 16 TiB data, 47 TiB used, 244 TiB / 291 TiB avail; 678 B/s rd, 527 MiB/s wr, 133 op/s
2020-10-12 11:08:10.758094 mgr.volume-mon-05 (mgr.1240949) 117686 : cluster [DBG] pgmap v118529: 8192 pgs: 8192 active+clean; 16 TiB data, 47 TiB used, 244 TiB / 291 TiB avail; 1017 B/s rd, 490 MiB/s wr, 124 op/s
2020-10-12 11:08:12.773242 mgr.volume-mon-05 (mgr.1240949) 117687 : cluster [DBG] pgmap v118530: 8192 pgs: 8192 active+clean; 16 TiB data, 47 TiB used, 244 TiB / 291 TiB avail; 1.3 KiB/s rd, 532 MiB/s wr, 135 op/s
2020-10-12 11:08:14.784545 mgr.volume-mon-05 (mgr.1240949) 117688 : cluster [DBG] pgmap v118531: 8192 pgs: 8192 active+clean; 16 TiB data, 47 TiB used, 244 TiB / 291 TiB avail; 1.3 KiB/s rd, 488 MiB/s wr, 124 op/s
2020-10-12 11:08:16.796246 mgr.volume-mon-05 (mgr.1240949) 117689 : cluster [DBG] pgmap v118532: 8192 pgs: 8192 active+clean; 16 TiB data, 47 TiB used, 244 TiB / 291 TiB avail; 1.7 KiB/s rd, 466 MiB/s wr, 119 op/s
2020-10-12 11:08:18.809808 mgr.volume-mon-05 (mgr.1240949) 117690 : cluster [DBG] pgmap v118533: 8192 pgs: 8192 active+clean; 16 TiB data, 47 TiB used, 244 TiB / 291 TiB avail; 1.7 KiB/s rd, 471 MiB/s wr, 121 op/s
2020-10-12 11:08:21.154185 mgr.volume-mon-05 (mgr.1240949) 117691 : cluster [DBG] pgmap v118534: 8192 pgs: 8192 active+clean; 16 TiB data, 47 TiB used, 244 TiB / 291 TiB avail; 1.6 KiB/s rd, 434 MiB/s wr, 112 op/s

2020-10-12 11:23:55.557684 mgr.volume-mon-05 (mgr.1240949) 118154 : cluster [DBG] pgmap v118997: 8192 pgs: 8192 active+clean; 16 TiB data, 48 TiB used, 242 TiB / 291 TiB avail; 32 GiB/s rd, 32 GiB/s wr, 130.62k op/s
2020-10-12 11:23:57.571413 mgr.volume-mon-05 (mgr.1240949) 118155 : cluster [DBG] pgmap v118998: 8192 pgs: 8192 active+clean; 16 TiB data, 48 TiB used, 242 TiB / 291 TiB avail; 34 GiB/s rd, 35 GiB/s wr, 142.26k op/s
2020-10-12 11:23:59.582741 mgr.volume-mon-05 (mgr.1240949) 118156 : cluster [DBG] pgmap v118999: 8192 pgs: 8192 active+clean; 16 TiB data, 48 TiB used, 242 TiB / 291 TiB avail; 33 GiB/s rd, 34 GiB/s wr, 137.60k op/s
2020-10-12 11:24:01.594909 mgr.volume-mon-05 (mgr.1240949) 118157 : cluster [DBG] pgmap v119000: 8192 pgs: 8192 active+clean; 16 TiB data, 48 TiB used, 242 TiB / 291 TiB avail; 32 GiB/s rd, 33 GiB/s wr, 133.42k op/s
2020-10-12 11:24:03.607921 mgr.volume-mon-05 (mgr.1240949) 118158 : cluster [DBG] pgmap v119001: 8192 pgs: 8192 active+clean; 16 TiB data, 48 TiB used, 242 TiB / 291 TiB avail; 34 GiB/s rd, 35 GiB/s wr, 142.32k op/s

Attachment: image.png (230 KB), Shridhar S, 10/27/2020 09:23 PM
