Bug #7765

closed

Bogus bandwidth statistics during pg creation

Added by John Spray about 10 years ago. Updated about 9 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Noticed this while growing the number of PGs in an existing pool on a cluster running the firefly branch.

ceph version 0.77-904-g76beb5c (76beb5ce82d6e7c6bd6dd3b7f844851ad67db18d)
root@gravel1:~# ceph -w
    cluster 1de82a3e-7ac7-4ca2-a123-95c21f525bfb
     health HEALTH_WARN 96 pgs stuck inactive; 96 pgs stuck unclean; pool data pg_num 736 > pgp_num 64; clock skew detected on mon.gravel2, mon.gravel3
     monmap e1: 3 mons at {gravel1=192.168.18.1:6789/0,gravel2=192.168.18.2:6789/0,gravel3=192.168.18.3:6789/0}, election epoch 76, quorum 0,1,2 gravel1,gravel2,gravel3
     mdsmap e37: 1/1/1 up {0=gravel1=up:active}
     osdmap e146: 3 osds: 3 up, 3 in
      pgmap v1600: 864 pgs, 3 pools, 320 GB data, 83652 objects
            97840 MB used, 2682 GB / 2778 GB avail
                  96 creating
                 768 active+clean

2014-03-18 15:05:29.852257 mon.0 [INF] osdmap e146: 3 osds: 3 up, 3 in
2014-03-18 15:05:29.943042 mon.0 [INF] pgmap v1600: 864 pgs: 96 creating, 768 active+clean; 320 GB data, 97840 MB used, 2682 GB / 2778 GB avail
2014-03-18 15:05:58.695455 mon.0 [INF] pgmap v1601: 864 pgs: 96 creating, 34 peering, 734 active+clean; 320 GB data, 97835 MB used, 2682 GB / 2778 GB avail
2014-03-18 15:05:59.861263 mon.0 [INF] pgmap v1602: 864 pgs: 96 creating, 54 peering, 714 active+clean; 320 GB data, 97831 MB used, 2682 GB / 2778 GB avail
2014-03-18 15:06:03.895804 mon.0 [INF] pgmap v1603: 864 pgs: 95 creating, 55 peering, 714 active+clean; 320 GB data, 97832 MB used, 2682 GB / 2778 GB avail; 100417 kB/s wr, 27 op/s
2014-03-18 15:06:05.087930 mon.0 [INF] pgmap v1604: 864 pgs: 94 creating, 98 peering, 672 active+clean; 321 GB data, 97832 MB used, 2682 GB / 2778 GB avail; 199 MB/s wr, 56 op/s
2014-03-18 15:06:07.243323 mon.0 [INF] pgmap v1605: 864 pgs: 89 creating, 103 peering, 672 active+clean; 323 GB data, 97833 MB used, 2682 GB / 2778 GB avail; 970 MB/s wr, 276 op/s
2014-03-18 15:06:09.402235 mon.0 [INF] pgmap v1606: 864 pgs: 88 creating, 104 peering, 672 active+clean; 323 GB data, 97833 MB used, 2682 GB / 2778 GB avail; 755 MB/s wr, 216 op/s
2014-03-18 15:06:10.576723 mon.0 [INF] pgmap v1607: 864 pgs: 84 creating, 108 peering, 672 active+clean; 326 GB data, 97833 MB used, 2682 GB / 2778 GB avail; 908 MB/s wr, 254 op/s

This is on three 1 TB SATA OSDs, so the ~1 GByte/s of writes reported is impossible. The disks are actually seeing ~150-200 kB/s of writes during this period, and there is no significant traffic on the frontend or backend networks.

I'm guessing this is an accounting issue: data being reassigned from one PG to another during splitting is erroneously counted in the overall I/O figures (a rough check of the numbers is sketched below). This is the first time I've noticed this, but I haven't compared the firefly code with anything older for this specific case.
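For illustration, the bogus rate is about what you get by simply differencing the reported data totals of consecutive pgmaps: between v1604 and v1605 the total jumps from 321 GB to 323 GB in roughly 2.2 s, i.e. ~950 MB/s, close to the 970 MB/s "wr" figure logged. A minimal sketch of that arithmetic, assuming the rate behaves like delta-bytes over delta-time (the actual pgmap stat fields the monitor differences may be different):

```python
from datetime import datetime

# Two consecutive pgmap summaries from the log above (data totals in GB).
# Assumption: the reported "wr" rate behaves like (delta data) / (delta time);
# the real monitor code may difference other per-PG stat fields.
samples = [
    ("2014-03-18 15:06:05.087930", 321),  # pgmap v1604
    ("2014-03-18 15:06:07.243323", 323),  # pgmap v1605, logged as "970 MB/s wr"
]

(t0, gb0), (t1, gb1) = samples
fmt = "%Y-%m-%d %H:%M:%S.%f"
dt = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
rate_mb_s = (gb1 - gb0) * 1024 / dt  # GB -> MB, per second

# Prints roughly 950 MB/s, in the same ballpark as the bogus figure, even
# though the disks were only doing ~150-200 kB/s of real writes.
print(f"delta {gb1 - gb0} GB over {dt:.2f} s -> ~{rate_mb_s:.0f} MB/s")
```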

As well as being misleading in itself, this output has a nasty side effect for anyone plotting charts of I/O: once an invalid maximum value has appeared, it stretches the chart's y-axis so that the real I/O rates afterwards are much harder to see. For example, it pushes my graph onto a 0-1 GB/s axis, while the valid data are only in the 0-50 MB/s range and so appear as a squiggle across the bottom (see attached).
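As a charting workaround, one option is to clamp obviously impossible samples before plotting, so that a single bogus spike doesn't set the y-axis scale. A minimal sketch, assuming write-rate samples in MB/s and a per-cluster ceiling chosen by hand (the 300 MB/s cap here is an arbitrary example, not anything derived from Ceph):

```python
import matplotlib.pyplot as plt

def clamp_samples(samples_mb_s, cap_mb_s=300.0):
    """Clamp samples to a plausible hardware ceiling so one bogus spike
    doesn't stretch the chart's y-axis. The cap is a per-cluster judgement
    call, not a value reported by Ceph."""
    return [min(s, cap_mb_s) for s in samples_mb_s]

# Example series: mostly real traffic in the 0-50 MB/s range, plus one
# bogus ~970 MB/s reading like the one logged during pg creation.
raw = [12, 35, 48, 970, 22, 40, 31]
plt.plot(clamp_samples(raw), label="wr MB/s (clamped)")
plt.ylabel("MB/s")
plt.legend()
plt.savefig("io_rate.png")
```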

