Project

General

Profile

Actions

Bug #21375

closed

Average latency output in rados bench is arithmetically flawed

Added by Florian Haas over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Mohamad Gebai
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen on v12.2.0-296-g706f78da1a on a virtualized test system:

sudo rados -p bench bench 30 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_deploy_22718
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        58        42   167.914       168    0.734898    0.300907
    2      16       105        89   177.941       188    0.358432    0.322523
    3      16       154       138    183.95       196    0.351548    0.322912
    4      16       206       190   189.954       208    0.141235    0.319715
    5      16       254       238   190.358       192    0.156576     0.31991
    6      16       304       288   191.961       200    0.231245    0.323936
    7      16       348       332   189.678       176    0.120068    0.323501
    8      16       400       384   191.963       208    0.763719    0.323801
    9      16       452       436   193.742       208    0.336659    0.324155
   10      16       502       486   194.366       200    0.412668    0.323478
   11      16       550       534   194.135       192    0.318614    0.323487
   12      16       601       585   194.954       204    0.330679    0.323183
   13      16       650       634   195.032       196    0.546288    0.322944
   14      16       695       679   193.958       180    0.139749    0.322562
   15      16       779       763   203.424       336    0.638265 9.00648e+07
   16      16       828       812   202.958       196    0.492445 8.46299e+07
   17      16       877       861   202.548       196    0.164353 7.98136e+07
   18      16       923       907   201.516       184    0.254678 7.57657e+07
   19      16       976       960   202.066       212    0.237107 7.15828e+07
2017-09-12 16:20:08.253560 min lat: 0.0693709 max lat: 4.29497e+09 avg lat: 6.79718e+07
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      16      1027      1011   202.162       204    0.219402 6.79718e+07
   21      16      1074      1058   201.485       188    0.194049 6.49522e+07
   22      16      1124      1108   201.416       200    0.254869 6.20212e+07
   23      16      1175      1159   201.527       204    0.530085  5.9292e+07
   24      16      1213      1197   199.462       152      0.5128 5.74098e+07
   25      16      1270      1254   200.602       228    0.212382 5.48002e+07
   26      16      1317      1301   200.116       188    0.161623 5.28205e+07
   27      16      1366      1350   199.963       196    0.183112 5.09033e+07
   28      16      1414      1398   199.677       192    0.281947 4.91556e+07
   29      16      1453      1437   198.171       156    0.362389 4.78215e+07
   30      16      1499      1483   197.698       184    0.186634 4.63382e+07
Total time run:         30.448048
Total writes made:      1500
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     197.057
Stddev Bandwidth:       30.4936
Max bandwidth (MB/sec): 336
Min bandwidth (MB/sec): 152
Average IOPS:           49
Stddev IOPS:            7
Max IOPS:               84
Min IOPS:               38
Average Latency(s):     4.5813e+07
Stddev Latency(s):      4.41357e+08
Max latency(s):         4.29497e+09
Min latency(s):         0.0693709

None of the numbers here would be realistic if running on real hardware, but the latency from second 15 onward looks just implausible. :)

Actions #1

Updated by Florian Haas over 6 years ago

  • Subject changed from Latency output in rados bench is arithmetically flawed to Average latency output in rados bench is arithmetically flawed

Dropped an important word here: it's obviously the average latency from second 15 onward that looks off.

Actions #2

Updated by Florian Haas over 6 years ago

  • Affected Versions v12.2.1 added

Just checking: is anyone looking into this? This still exists on 12.2.1, and unfortunately makes rados bench results rather unsuitable for consumption...

Below is reference output from 12.2.1:

# rados bench -p bench 30 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_daisy_35973
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        32        16   63.9748        64    0.524363    0.380571
    2      16        59        43   85.9721       108    0.491057    0.596051
    3      16       109        93   123.964       200    0.502876  1.8473e+08
    4      16       129       113   112.971        80    0.748528 1.52034e+08
    5      16       141       125   99.9765        48    0.102788 1.37439e+08
    6      16       170       154   102.644       116    0.175975 1.11558e+08
    7      16       206       190   108.548       144    0.526126 9.04204e+07
    8      16       239       223   111.477       132    0.634028 7.70398e+07
    9      16       279       263   116.866       160   0.0811387 6.53227e+07
   10      16       291       275   109.979        48    0.396706 6.24723e+07
   11      16       324       308   111.979       132   0.0802478 5.57788e+07
   12      16       337       321    106.98        52    0.138103 5.35198e+07
   13      16       355       339   104.289        72    0.226504 5.06781e+07
   14      16       393       377   107.695       152     0.27853 4.55699e+07
   15      16       461       445   118.646       272    0.081326 3.86064e+07
   16      16       507       491   122.728       184    0.330384 3.49896e+07
   17      16       578       562   132.212       284    0.538753 3.05692e+07
   18      16       639       623    138.42       244      0.5365  2.7576e+07
   19      16       689       673   141.483       200    0.492813 2.55273e+07
2017-10-12 09:38:20.735721 min lat: 0.0502576 max lat: 4.29497e+09 avg lat: 2.3216e+07
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      16       756       740   147.799       268    0.350281  2.3216e+07
   21      16       820       804   152.943       256    0.161733  2.1368e+07
   22      16       878       862    156.53       232    0.239297 1.99302e+07
   23      16       937       921    159.98       236   0.0916112 1.86535e+07
   24      16       974       958    159.48       148     0.24256 1.79331e+07
   25      16      1008       992   158.424       136    0.319072 1.73184e+07
   26      16      1053      1037   159.252       180    0.142151 1.65669e+07
   27      16      1101      1085   160.462       192     0.62604  1.5834e+07
   28      16      1132      1116   159.161       124    0.124207 1.53941e+07
   29      16      1168      1152   158.638       144    0.149339 1.49131e+07
   30      16      1170      1154   153.563         8     1.33382 1.48872e+07
Total time run:         30.220721
Total writes made:      1170
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     154.861
Stddev Bandwidth:       74.9491
Max bandwidth (MB/sec): 284
Min bandwidth (MB/sec): 8
Average IOPS:           38
Stddev IOPS:            18
Max IOPS:               71
Min IOPS:               2
Average Latency(s):     1.46836e+07
Stddev Latency(s):      2.50807e+08
Max latency(s):         4.29497e+09
Min latency(s):         0.0502576
Cleaning up (deleting benchmark objects)
Removed 1170 objects
Clean up completed and total clean up time :0.472365
Actions #3

Updated by Shinobu Kinjo over 6 years ago

  • Assignee set to Shinobu Kinjo
Actions #4

Updated by Shinobu Kinjo over 6 years ago

I don't see same result on physical machine with:

$ ceph -v
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

How did you build a virtualized test system?

Actions #5

Updated by Florian Haas over 6 years ago

Shinobu Kinjo wrote:

How did you build a virtualized test system?

It's three nodes in an OpenStack public cloud, but how exactly would that make a difference? Is the latency being calculated based on system clock time? Would an adjustment in the system clock cause this issue?

Actions #6

Updated by Shinobu Kinjo over 6 years ago

Have you you ever tried Kraken, Jewel? If not, can you try? And what's your exact kernel version?

Actions #7

Updated by Florian Haas over 6 years ago

Shinobu Kinjo wrote:

Have you you ever tried Kraken, Jewel? If not, can you try? And what's your exact kernel version?

I've been using Ceph since before Argonaut. :) I've never seen rados
bench act up in this particular way before.

This issue definitely did not exist in Jewel. I can't say for sure
whether it existed in Kraken, because I don't have Kraken "rados
bench" records to go back to.

Actions #8

Updated by Mohamad Gebai over 6 years ago

Looking at the code, it seems that clock_gettime() with the REALTIME clock is used to calculate the latency. The reference for this clock is the epoch, which means that any changes made to the system's clock will affect the REATLTIME clock. Is it possible that ntp kicked in during the rados bench? Alternatnively, maybe the paravirtualization clock (kvm clock) made changes to the system's clock during the benchmark? Dmesg would tell you that.

Actions #9

Updated by Mohamad Gebai over 6 years ago

But again, since you don't have any issue with the 'last lat', it might not be that...

Actions #10

Updated by Florian Haas over 6 years ago

Mohamad Gebai wrote:

But again, since you don't have any issue with the 'last lat', it might not be that...

Right. In addition, look at the following line:

2017-10-12 09:38:20.735721 min lat: 0.0502576 max lat: 4.29497e+09 avg lat: 2.3216e+07

The excessive value for max lat is never actually found in any prior value of last lat.

Actions #11

Updated by Mohamad Gebai over 6 years ago

Right, interesting.. After trying a bunch of things, I was able to reproduce this:

  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        16         0         0         0           -           0
    2      16        16         0         0         0           -           0
    3      16        19         3   3.99923         4     2.91914      2.6029
    4      16        19         3   2.99939         0           -      2.6029
    5      16        33        17   13.5974        28     2.32565     3.89453
    6      16        34        18   11.9978         4     5.88878     4.00532
    7      16        35        19   10.8552         4     3.97716     4.00384
    8      16        45        29      12.7        40 4.29492e+09 1.48101e+09
    9      16        51        35   13.8149        24     47428.7 1.22712e+09
   10      16        53        37   13.2924         8     47429.3 1.16079e+09
   11      16        59        43   14.1746        24      2.3183 9.98825e+08
   12      16        62        46   14.0089        12     3.26318 9.33685e+08
   13      16        68        52   14.7156        24     2.67895 8.25955e+08
   14      16        75        59   15.5932        28     3.74444 7.27961e+08
   15      16        79        63   15.6182        16     2.65534 6.81741e+08
   16      16        82        66   15.4069        12       2.549 6.50753e+08
   17      16        83        67   14.7777         4       3.879  6.4104e+08
   18      16        90        74   15.4686        28     2.82302 5.80401e+08
   19      16        96        80   15.8921        24     2.88408 5.36871e+08
2017-10-16 21:26:40.002227 min lat: 1.73544 max lat: 4.29492e+09 avg lat: 5.17466e+08
[...]

I also don't have the 'min lat' anywhere. The only way I was able to reproduce this was to start rados bench, go back in time (which makes rados bench quiet because its waiting for the right time to output), then immediately go forward to the correct time. In other words, the slightest time travel backwards followed by a readjustment causes this. The readjustment is necessary as 'rados bench' doesn't return otherwise (it's waiting on the wall clock to complete).

Anyway, this might be a different problem than what you're experiencing. How easily can you reproduce this issue? And is there anything on your system that might be playing with the clock? Or maybe something that might cause the clock to fall behind for a moment? Are you using the kvm clock? Does your CPU have constant_tsc? Are you using a tickless kernel?

Actions #12

Updated by Mohamad Gebai over 6 years ago

PS: I know that 'last lat' was also wrong in the example I just pasted, but that's not always the case:

  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
[...]
   11      16        51        35   12.7248        32     3.00379     4.40251
   12      16        55        39   12.9975        16     3.91942     4.32376
   13      16        60        44   13.5357        20     2.54609 9.76118e+07
   14      16        63        47   13.4259        12     3.11633 9.13813e+07
[...]
Actions #13

Updated by Florian Haas over 6 years ago

Mohamad Gebai wrote:

Anyway, this might be a different problem than what you're experiencing. How easily can you reproduce this issue? And is there anything on your system that might be playing with the clock? Or maybe something that might cause the clock to fall behind for a moment? Are you using the kvm clock? Does your CPU have constant_tsc? Are you using a tickless kernel?

Yes, this is a VM using the KVM clock source. I'll try to find some time in the next few days to reproduce this (currently traveling), but assuming this is indeed linked to the clock jumping around (which sounds entirely plausible!), maybe the best option is to just add a warning to the output, saying something to the effect of "clock adjusted, results will be unreliable"? Or would you expect other Ceph components, apart from `rados bench`, to also have a problem with this?

Actions #14

Updated by Mohamad Gebai over 6 years ago

  • Assignee changed from Shinobu Kinjo to Mohamad Gebai
Actions #15

Updated by Mohamad Gebai over 6 years ago

Actions #16

Updated by Florian Haas over 6 years ago

Mohamad Gebai wrote:

PR submitted: https://github.com/ceph/ceph/pull/18588

Nice, thanks Mohamad!

Actions #17

Updated by Mohamad Gebai over 6 years ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF