Bug #6736 (closed): Bugs in per pool IOPs/recovery statistics

Added by John Spray over 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assignee: Loïc Dachary
Category: -
Target version: v0.77
% Done: 100%
Source: other
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

So I'm playing with the new 'ceph osd pool stats' in Emperor.

Initially I had a healthy cluster (3 OSDs, 3 mons, 1 MDS) with some load driven by a simple "rados bench -p pbench 100 write". Then I did a 'ceph osd out 2' to get some recovery stuff going.

There are some issues with the format of the output:

  • Typo in "degrated_ratio"
  • The various sections (recovery, recovery_rate, client_io_rate) are only populated when they are nonzero. This is inconvenient for monitoring tools consuming the statistics, because the consumer has to remember the names of all the fields and synthesize zeros for them when they are absent (see the sketch after this list). It would be much better to always output the full, consistent set of fields and print zero values rather than hiding the fields.
  • The 'degrated_ratio' field is a string instead of a numeric type. Perhaps this was done to accommodate infinities? There shouldn't be any.
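
In the meantime a consumer has to paper over the missing sections itself. Below is a minimal sketch of such a wrapper (hypothetical, not part of Ceph); it assumes the exact field names shown in the pbench entry further down, including the 'degrated_ratio' typo and the string-typed ratio:

import json
import subprocess

# Zeroed defaults a consumer currently has to synthesize, using the field
# names from the populated "pbench" entry in this report (including the
# typo'd, string-typed ratio). The wrapper itself is hypothetical.
DEFAULTS = {
    "recovery": {"degraded_objects": 0,
                 "degraded_total": 0,
                 "degrated_ratio": "0.000"},
    "recovery_rate": {"recovering_objects_per_sec": 0,
                      "recovering_bytes_per_sec": 0,
                      "recovering_keys_per_sec": 0},
    "client_io_rate": {"read_bytes_sec": 0,
                       "write_bytes_sec": 0,
                       "op_per_sec": 0},
}

def pool_stats():
    """Run 'ceph osd pool stats' and fill in the sections the mon omits."""
    out = subprocess.check_output(["ceph", "-f", "json", "osd", "pool", "stats"])
    pools = json.loads(out)
    for pool in pools:
        for section, defaults in DEFAULTS.items():
            merged = dict(defaults)
            merged.update(pool.get(section) or {})
            pool[section] = merged
    return pools

if __name__ == "__main__":
    print(json.dumps(pool_stats(), indent=2))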

There are also outright errors in the output:

# ceph --version
ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217)
# ceph -s ; ceph pg stat ; ceph -f json-pretty osd pool stats
    cluster 9c085505-c637-4fec-bcbb-2e05b124ba39
     health HEALTH_WARN 218 pgs backfill; 124 pgs peering; 7 pgs recovering; 170 pgs recovery_wait; 109 pgs stuck inactive; 486 pgs stuck unclean; 22 requests are blocked > 32 sec; recovery 10324/32442 objects degraded (31.823%)
     monmap e1: 3 mons at {gravel1=192.168.18.1:6789/0,gravel2=192.168.18.2:6789/0,gravel3=192.168.18.3:6789/0}, election epoch 466, quorum 0,1,2 gravel1,gravel2,gravel3
     mdsmap e16: 1/1/1 up {0=gravel1=up:active}
     osdmap e710: 3 osds: 3 up, 3 in
      pgmap v61332: 1192 pgs, 4 pools, 16833 MB data, 13667 objects
            36581 MB used, 2754 GB / 2790 GB avail
            10324/32442 objects degraded (31.823%)
                 112 active
                 435 active+clean
                 218 active+remapped+wait_backfill
                 168 active+recovery_wait
                 124 peering
                 126 active+remapped
                   2 active+recovery_wait+remapped
                   7 active+recovering
recovery io 168 MB/s, 279 objects/s
  client io 10318 kB/s rd, 14462 MB/s wr, 9828 op/s

v61332: 1192 pgs: 112 active, 435 active+clean, 218 active+remapped+wait_backfill, 168 active+recovery_wait, 124 peering, 126 active+remapped, 2 active+recovery_wait+remapped, 7 active+recovering; 16833 MB data, 36581 MB used, 2754 GB / 2790 GB avail; 10318 kB/s rd, 14462 MB/s wr, 9828 op/s; 10324/32442 objects degraded (31.823%); 168 MB/s, 279 objects/s recovering

[
    { "pool_name": "data",
      "pool_id": 0,
      "recovery": {},
      "recovery_rate": {},
      "client_io_rate": {}},
    { "pool_name": "metadata",
      "pool_id": 1,
      "recovery": {},
      "recovery_rate": {},
      "client_io_rate": {}},
    { "pool_name": "rbd",
      "pool_id": 2,
      "recovery": {},
      "recovery_rate": {},
      "client_io_rate": {}},
    { "pool_name": "pbench",
      "pool_id": 3,
      "recovery": { "degraded_objects": 18446744073709551562,
          "degraded_total": 412,
          "degrated_ratio": "-13.107"},
      "recovery_rate": { "recovering_objects_per_sec": 279,
          "recovering_bytes_per_sec": 176401059,
          "recovering_keys_per_sec": 0},
      "client_io_rate": { "read_bytes_sec": 10566067,
          "write_bytes_sec": 15165220376,
          "op_per_sec": 9828}}]

In the above output:

  • recovery.degraded_objects is bogus (far too big to be true; see the check after this list).
  • The 'degrated_ratio' value is negative.
  • client_io_rate.write_bytes_sec is bogus (far too big to be true for a 3-drive cluster).
  • client_io_rate.read_bytes_sec is nonzero even though there is no client read activity (the only reads should come from recovery). I haven't looked at where this statistic comes from, but given the name I would not expect it to include reads from recovery.
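
For reference, the first two errors look consistent with a small negative degraded count being stored in an unsigned 64-bit field: 18446744073709551562 is 2^64 - 54, i.e. -54 wrapped around, and -54/412 is roughly -13.107%, which matches the negative ratio string. A quick arithmetic check (not from the original report, just illustrating that hypothesis):

# Check that the bogus degraded_objects value is -54 wrapped into a uint64,
# and that -54/412 reproduces the reported "degrated_ratio" of -13.107.
degraded_objects = 18446744073709551562   # from the pbench "recovery" section
degraded_total = 412

# Reinterpret the unsigned 64-bit value as a signed 64-bit integer.
signed = degraded_objects - 2**64 if degraded_objects >= 2**63 else degraded_objects
print(signed)                                        # -54
print(round(100.0 * signed / degraded_total, 3))     # -13.107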
#1

Updated by Loïc Dachary over 10 years ago

  • Status changed from New to 12
  • Assignee set to Loïc Dachary
#2

Updated by Loïc Dachary over 10 years ago

  • Status changed from 12 to Fix Under Review
#3

Updated by Loïc Dachary about 10 years ago

  • Status changed from Fix Under Review to Resolved
  • Target version set to v0.77
  • % Done changed from 0 to 100