Jessica Mack, 06/22/2015 05:06 AM

h1. 2H - Ceph stats and monitoring tools

h3. Live Pad

The live pad can be found here: "[pad]":http://pad.ceph.com/p/ceph_stats_and_monitoring_tools

h3. Summit Snapshot

Tasks

# upstream collectd plugin, updating as needed
# upstream nagios plugin
# ganglia integration?
# statsd?
# alternatives?

The collectd plugin for Ceph generates types dynamically, based on counters that can be added to the admin socket dump at runtime.
Where is the collectd code? https://github.com/ceph/collectd-4.10.1

Caveat: the collectd network protocol does not send the dynamically detected type data to the collectd instance running on the server side (collectd/network with a carbon plugin, or into bucky).
How many data types are there? Can you just pre-generate them?
I'm not sure of the count offhand. You can pre-generate them and pass them to the server side, but I recall having an issue if the collectd plugin suddenly gets a new metric and the types.db isn't updated on the server side.
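One way to catch the stale types.db case early is to compare the counters in an admin socket dump against the pre-generated list. The following is only a sketch: it assumes the dump is nested JSON of the kind printed by a perf dump on the admin socket, and the sample counters, the @known@ set and both function names are made up for illustration.

```python
import json


def flatten_counters(dump, prefix=""):
    """Flatten a nested perf-dump dict into dotted metric names."""
    flat = {}
    for key, value in dump.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_counters(value, name))
        else:
            flat[name] = value
    return flat


def unknown_metrics(flat, known_names):
    """Return metric names that are missing from the pre-generated list."""
    return sorted(set(flat) - set(known_names))


# Illustrative sample only; a real dump has many more sections and counters.
sample_dump = json.loads("""
{
  "osd": {
    "op": 1200,
    "op_latency": {"avgcount": 1200, "sum": 3.5},
    "new_counter": 7
  }
}
""")

# Hypothetical pre-generated list, e.g. the names already present in types.db.
known = {"osd.op", "osd.op_latency.avgcount", "osd.op_latency.sum"}

flat = flatten_counters(sample_dump)
print(unknown_metrics(flat, known))
```

Any name reported here would need to reach the server-side types.db before the new metric can be stored.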

Diamond (admin socket collector agent):

https://github.com/BrightcoveOS/Diamond
https://github.com/BrightcoveOS/Diamond/blob/master/src/collectors/ceph/ceph.py

Bucky (collects stats from multiple protocols and writes to carbon, a component of Graphite):

https://github.com/cloudant/bucky

Bucky supports receiving statsd, ganglia, collectd and metricsd.

http://graphite.wikidot.com/
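For reference, whatever collector sits in front of it, carbon's plaintext listener accepts one metric per line in the form "path value timestamp". A minimal sketch of that format follows; the metric paths, host and helper names are assumptions, and port 2003 is carbon's default plaintext port, which your setup may override.

```python
import socket
import time


def carbon_lines(metrics, timestamp=None):
    """Format metrics as carbon plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return "".join(f"{path} {value} {ts}\n" for path, value in sorted(metrics.items()))


def send_to_carbon(payload, host="localhost", port=2003):
    """Send a plaintext payload over TCP. Host and port are assumptions."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))


# Hypothetical metric paths; the naming scheme is up to you.
payload = carbon_lines({"ceph.osd.0.op": 1200, "ceph.osd.0.op_w": 800}, timestamp=1434949560)
print(payload, end="")
# send_to_carbon(payload)  # uncomment when a carbon listener is reachable
```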

Statsd is not appropriate for large volumes of metrics because each metric is sent as a UDP packet. If you have lots of metrics, like Ceph does, you end up with a UDP storm at your metric collection point. Statsd is fantastic for sending success/fail (0/1) counters as the result of CI-type tests (s3-tests running in Jenkins). You can use the drawAsInfinite function in Graphite to draw vertical bars on all your other graphs where tests failed, allowing you to correlate CI failures with anomalies in system/ceph metrics.
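To make the one-packet-per-metric point concrete, here is a rough sketch of a CI result reported as a statsd counter. The datagram format "name:value|c" is the statsd counter format; the metric name, host and helper names are hypothetical, and 8125 is the usual statsd port rather than something taken from these notes.

```python
import socket


def statsd_counter(name, value=1):
    """Build a statsd counter datagram: '<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def result_packet(test_name, passed):
    """Build the counter packet for one CI test outcome."""
    outcome = "success" if passed else "fail"
    return statsd_counter(f"{test_name}.{outcome}")


def send_packet(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send. Host and port are assumptions."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    finally:
        sock.close()


# Hypothetical metric name for an s3-tests run.
packet = result_packet("ci.s3tests.bucket_create", passed=False)
print(packet)
# send_packet(packet)  # one datagram per metric, hence the UDP storm at scale
```

Graphing the ".fail" series with drawAsInfinite then gives the vertical bars described above.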

Log collection:

Logstash: http://logstash.net/

Elasticsearch backend; push ceph logs with either syslog, logstash configured as a "shipper", or lumberjack (a lightweight log gathering tool for sending to logstash).

1. Get information like 2xx, 3xx, 4xx and 5xx response codes from radosgw and any form of load balancer you have in front of it.
2. Get ceph.log from monitors (search for slow requests, osd up/down, remapping, etc.)
3. OSD logs could have lots of potentially interesting data but are very verbose
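The first item above amounts to bucketing responses by status class. A small stand-alone sketch of that idea, assuming access-log lines in common log format; the sample lines are made up and are not real radosgw output, and in a Logstash setup the equivalent work would be done by a filter rather than by this script.

```python
import re
from collections import Counter

# Matches the status code that follows the quoted request in common log format.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def status_classes(lines):
    """Count responses per class (2xx, 3xx, 4xx, 5xx) in access-log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


# Made-up sample lines for illustration.
sample = [
    '10.0.0.5 - - [22/Jun/2015:05:06:01 +0000] "PUT /bucket/obj HTTP/1.1" 200 0',
    '10.0.0.5 - - [22/Jun/2015:05:06:02 +0000] "GET /bucket/missing HTTP/1.1" 404 219',
    '10.0.0.6 - - [22/Jun/2015:05:06:03 +0000] "GET /bucket/obj HTTP/1.1" 500 0',
    '10.0.0.6 - - [22/Jun/2015:05:06:04 +0000] "GET /bucket/obj HTTP/1.1" 503 0',
]
counts = status_classes(sample)
print(dict(counts))
```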

Grok filters: http://logstash.net/docs/1.0.17/filters/grok

https://github.com/Dieterbe/anthracite

Logstash can ship metrics to Graphite, so you can log the velocity of certain message types and set up alerting.
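"Velocity" here just means messages of one type per unit of time. The sketch below shows the counting in plain Python rather than in Logstash configuration; it assumes each log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp, and the sample lines are invented and only loosely modelled on ceph.log.

```python
from collections import Counter


def message_velocity(lines, needle):
    """Count lines containing `needle`, bucketed by minute.

    Assumes each line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp.
    """
    per_minute = Counter()
    for line in lines:
        if needle in line:
            per_minute[line[:16]] += 1  # keep 'YYYY-MM-DD HH:MM'
    return per_minute


# Made-up lines; real entries carry more fields.
sample = [
    "2015-06-22 05:06:01.100 osd.3 [WRN] slow request 30.1 seconds old",
    "2015-06-22 05:06:40.200 osd.7 [WRN] slow request 31.4 seconds old",
    "2015-06-22 05:07:02.300 mon.a [INF] osd.3 marked down",
    "2015-06-22 05:07:15.400 osd.3 [WRN] slow request 60.2 seconds old",
]
velocity = message_velocity(sample, "slow request")
for minute, count in sorted(velocity.items()):
    print(minute, count)
```

Each per-minute count can then be written to Graphite as a single data point and alerted on when it crosses a threshold.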

Sensu: https://github.com/sensu/sensu

I like it overall, but I'm not a big fan of rabbitmq. There is a zeromq fork, but I'm unsure how well tested it is and whether it's still under development.

1. Daemon/agent status checks
2. 
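A status check for Sensu or Nagios comes down to mapping cluster state onto the plugin exit codes (0 OK, 1 warning, 2 critical, 3 unknown). Here is a hedged sketch that maps the leading word of @ceph health@ output onto those codes; the function name and the sample strings are illustrative, and a real check would run the command and exit with the returned code.

```python
# Nagios plugin exit codes, which Sensu checks also follow.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

STATUS_MAP = {
    "HEALTH_OK": OK,
    "HEALTH_WARN": WARNING,
    "HEALTH_ERR": CRITICAL,
}


def check_health(health_output):
    """Map the first word of `ceph health` output to (exit code, message)."""
    words = health_output.split()
    status = words[0] if words else ""
    message = health_output.strip() or "no output from ceph health"
    return STATUS_MAP.get(status, UNKNOWN), message


# Made-up sample outputs; the detail text after the status word is illustrative.
for sample in ("HEALTH_OK", "HEALTH_WARN 12 pgs degraded", ""):
    print(check_health(sample))
```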

Anyone tried Riemann? http://riemann.io/

Logstash trigger events in anthracite?

https://github.com/Dieterbe/anthracite

Documentation tasks

# nagios
# collectd
# graphite
# ganglia
# saltstack

Suggestions:

p(. Use Chef/Puppet to configure the relevant monitoring tool
    Aggregate all the monitoring tool info on a single docs page/wiki