Bug #6736
Bugs in per pool IOPs/recovery statistics
Status:
Closed
% Done:
100%
Source:
other
Severity:
3 - minor
Description
So I'm playing with the new 'ceph osd pool stats' in Emperor.
Initially I had a healthy cluster (3 OSDs, 3 mons, 1 MDS) with some load driven by a simple "rados bench -p pbench 100 write". Then I did a 'ceph osd out 2' to get some recovery stuff going.
There are some issues with the format of the output:
- Typo in "degrated_ratio"
- The various sections (recovery, recovery_rate, client_io_rate) are only populated when they are nonzero. This is inconvenient for monitoring tools consuming the statistics: the tool has to remember the names of the fields and synthesize zeros for them when they are absent. It would be much better to always output a consistent set of fields, and when a value is zero just print the zero rather than hiding the field.
- The 'degrated_ratio' field is a string instead of a numeric type. Perhaps this was done to accommodate infinities? There shouldn't be any.
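The zero-synthesizing burden described above can be sketched as follows. This is a hypothetical consumer, not part of Ceph; the field names are taken from the JSON output shown below (including the 'degrated_ratio' typo, sic), and a real tool would read `ceph -f json osd pool stats` rather than a literal string.

```python
import json

# Fields each section is expected to carry. The consumer has to hard-code
# these names because the server omits empty sections entirely.
EXPECTED = {
    "recovery": ["degraded_objects", "degraded_total", "degrated_ratio"],
    "recovery_rate": ["recovering_objects_per_sec",
                      "recovering_bytes_per_sec",
                      "recovering_keys_per_sec"],
    "client_io_rate": ["read_bytes_sec", "write_bytes_sec", "op_per_sec"],
}

def fill_missing(pool_stats):
    """Synthesize zeros for sections/fields the server omitted."""
    for pool in pool_stats:
        for section, fields in EXPECTED.items():
            values = pool.setdefault(section, {})
            for field in fields:
                values.setdefault(field, 0)
    return pool_stats

# Idle pools come back with empty sections, as in the output below.
stats = json.loads('[{"pool_name": "data", "pool_id": 0, '
                   '"recovery": {}, "recovery_rate": {}, '
                   '"client_io_rate": {}}]')
print(fill_missing(stats)[0]["client_io_rate"]["op_per_sec"])  # 0
```

If the server always emitted every field, this entire shim (and its hard-coded field list, which silently breaks whenever a field is added or renamed) would be unnecessary.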
There are also outright errors in the output:
# ceph --version
ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217)
# ceph -s ; ceph pg stat ; ceph -f json-pretty osd pool stats
    cluster 9c085505-c637-4fec-bcbb-2e05b124ba39
     health HEALTH_WARN 218 pgs backfill; 124 pgs peering; 7 pgs recovering; 170 pgs recovery_wait; 109 pgs stuck inactive; 486 pgs stuck unclean; 22 requests are blocked > 32 sec; recovery 10324/32442 objects degraded (31.823%)
     monmap e1: 3 mons at {gravel1=192.168.18.1:6789/0,gravel2=192.168.18.2:6789/0,gravel3=192.168.18.3:6789/0}, election epoch 466, quorum 0,1,2 gravel1,gravel2,gravel3
     mdsmap e16: 1/1/1 up {0=gravel1=up:active}
     osdmap e710: 3 osds: 3 up, 3 in
      pgmap v61332: 1192 pgs, 4 pools, 16833 MB data, 13667 objects
            36581 MB used, 2754 GB / 2790 GB avail
            10324/32442 objects degraded (31.823%)
                 112 active
                 435 active+clean
                 218 active+remapped+wait_backfill
                 168 active+recovery_wait
                 124 peering
                 126 active+remapped
                   2 active+recovery_wait+remapped
                   7 active+recovering
recovery io 168 MB/s, 279 objects/s
  client io 10318 kB/s rd, 14462 MB/s wr, 9828 op/s

v61332: 1192 pgs: 112 active, 435 active+clean, 218 active+remapped+wait_backfill, 168 active+recovery_wait, 124 peering, 126 active+remapped, 2 active+recovery_wait+remapped, 7 active+recovering; 16833 MB data, 36581 MB used, 2754 GB / 2790 GB avail; 10318 kB/s rd, 14462 MB/s wr, 9828 op/s; 10324/32442 objects degraded (31.823%); 168 MB/s, 279 objects/s recovering

[
  { "pool_name": "data",
    "pool_id": 0,
    "recovery": {},
    "recovery_rate": {},
    "client_io_rate": {}},
  { "pool_name": "metadata",
    "pool_id": 1,
    "recovery": {},
    "recovery_rate": {},
    "client_io_rate": {}},
  { "pool_name": "rbd",
    "pool_id": 2,
    "recovery": {},
    "recovery_rate": {},
    "client_io_rate": {}},
  { "pool_name": "pbench",
    "pool_id": 3,
    "recovery": { "degraded_objects": 18446744073709551562,
        "degraded_total": 412,
        "degrated_ratio": "-13.107"},
    "recovery_rate": { "recovering_objects_per_sec": 279,
        "recovering_bytes_per_sec": 176401059,
        "recovering_keys_per_sec": 0},
    "client_io_rate": { "read_bytes_sec": 10566067,
        "write_bytes_sec": 15165220376,
        "op_per_sec": 9828}}]
In the above output:
- recovery.degraded_objects is bogus (too big to be true; 18446744073709551562 is 2^64 - 54, which looks like -54 underflowed into an unsigned 64-bit counter)
- degrated_ratio is negative (-54/412 is about -13.107%, consistent with the underflow)
- client_io_rate.write_bytes_sec is bogus (too big to be true, this is a 3-drive cluster).
- client_io_rate.read_bytes_sec is nonzero, although there is no client read activity (the only reads will be due to recovery). I haven't looked at where this statistic comes from, but given the name I would not expect it to include the reads from recovery.
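The first two errors above are consistent with a single unsigned underflow; a quick check reproduces the reported ratio exactly. This is only a hypothesis about the cause, not something confirmed against the Ceph source.

```python
import struct

# degraded_objects exactly as reported in the JSON above
raw = 18446744073709551562

# Reinterpret the 64-bit pattern as a signed integer
signed = struct.unpack("<q", struct.pack("<Q", raw))[0]
print(signed)                        # -54

# -54 out of degraded_total=412 reproduces "degrated_ratio": "-13.107"
print(round(signed / 412 * 100, 3))  # -13.107
```

So the counter apparently went 54 below zero (perhaps objects were counted as recovered more than once), and both degraded_objects and the ratio are downstream symptoms of that one negative value.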
Updated by Loïc Dachary over 10 years ago
- Status changed from New to 12
- Assignee set to Loïc Dachary
Updated by Loïc Dachary over 10 years ago
- Status changed from 12 to Fix Under Review
Updated by Loïc Dachary about 10 years ago
- Status changed from Fix Under Review to Resolved
- Target version set to v0.77
- % Done changed from 0 to 100