2H - Ceph stats and monitoring tools » History » Version 1
Jessica Mack, 06/22/2015 05:06 AM
h1. 2H - Ceph stats and monitoring tools

h3. Live Pad

The live pad can be found here: "[pad]":http://pad.ceph.com/p/ceph_stats_and_monitoring_tools

h3. Summit Snapshot

Tasks

# upstream collectd plugin, updating as needed
# upstream nagios plugin
# ganglia integration?
# statsd?
# alternatives?

The collectd plugin for Ceph generates dynamic types based on counters that can be dynamically added to the admin socket dump.
Where is the collectd code? https://github.com/ceph/collectd-4.10.1

Caveat: the collectd network protocol does not send the dynamically detected type data to the collectd instance running on the server side (collectd/network with a carbon plugin, or into bucky).
How many data types are there? Can you just pre-generate them?
I'm not sure of the count offhand. You can pre-generate them and pass them to the server side, but I recall having an issue if the collectd plugin suddenly gets a new metric and the types.db isn't updated on the server side.

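A rough sketch of the "pre-generate them" idea, assuming the admin socket dump is a nested JSON object of counters. The sample dump, the @ceph_@ prefix, the naming scheme and the choice of GAUGE for every counter are illustrative assumptions, not what the plugin linked above actually does. Only the @name ds:TYPE:min:max@ line layout follows the types.db format.

```python
import json


def flatten(counters, prefix=""):
    """Yield (dotted_name, value) for every numeric leaf of a nested dict."""
    for key, value in sorted(counters.items()):
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, name)
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            yield name, value


def types_db_lines(counters, prefix="ceph"):
    """Return one types.db line per counter (hypothetical naming, all GAUGE)."""
    lines = []
    for name, _ in flatten(counters):
        type_name = f"{prefix}_{name}".replace(".", "_")
        lines.append(f"{type_name}\tvalue:GAUGE:U:U")
    return lines


def missing_types(existing_types_db, wanted_lines):
    """Return the wanted lines whose type name is absent from an existing types.db text."""
    known = {
        line.split()[0]
        for line in existing_types_db.splitlines()
        if line.strip() and not line.startswith("#")
    }
    return [line for line in wanted_lines if line.split()[0] not in known]


# Invented sample data standing in for an admin socket dump.
SAMPLE_DUMP = json.loads("""
{"osd": {"op": 1200, "op_latency": {"avgcount": 1200, "sum": 3.5}},
 "filestore": {"bytes": 4096}}
""")

if __name__ == "__main__":
    for line in types_db_lines(SAMPLE_DUMP):
        print(line)
```

Running something like @missing_types@ against the server-side types.db before a rollout would be one way to catch the "new metric, stale types.db" case described above.
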
Diamond (admin socket collector agent):

https://github.com/BrightcoveOS/Diamond
https://github.com/BrightcoveOS/Diamond/blob/master/src/collectors/ceph/ceph.py

Bucky (collects stats from multiple protocols and writes to carbon, a component of Graphite):

https://github.com/cloudant/bucky

Bucky supports receiving statsd, ganglia, collectd and metricsd.

http://graphite.wikidot.com/

Statsd is not appropriate for large volumes of metrics because each metric is sent as a UDP packet. If you have lots of metrics, as Ceph does, you end up with a UDP storm at your metric collection point. Statsd is fantastic for sending success/fail (0/1) counters as the result of CI-type tests (s3-tests running in Jenkins). You can use the "draw as infinite" option (drawAsInfinite) in Graphite to draw vertical bars on all your other graphs where tests failed, allowing you to correlate CI failures with anomalies in system/Ceph metrics.

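To make the one-packet-per-metric point concrete, here is a minimal sketch of CI success/fail counters in the statsd wire format (@bucket:value|c@). The bucket names (@ci.<suite>.success@, @ci.<suite>.fail@) are made up for illustration. 8125 is the usual statsd port, but treat the host and port as assumptions about your setup.

```python
import socket


def statsd_counter(bucket, value):
    """Format one statsd counter datagram: '<bucket>:<value>|c'."""
    return f"{bucket}:{value}|c".encode("ascii")


def ci_result_datagrams(suite, passed):
    """Build success/fail (0/1) counters for one CI run (hypothetical bucket names)."""
    return [
        statsd_counter(f"ci.{suite}.success", 1 if passed else 0),
        statsd_counter(f"ci.{suite}.fail", 0 if passed else 1),
    ]


def send(datagrams, host="127.0.0.1", port=8125):
    """Send every metric as its own UDP packet and return the packet count."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for datagram in datagrams:
            sock.sendto(datagram, (host, port))
    return len(datagrams)


if __name__ == "__main__":
    for datagram in ci_result_datagrams("s3-tests", passed=False):
        print(datagram.decode("ascii"))
```

Two packets per test run is harmless. The same loop over every Ceph counter on every daemon at a short interval is where the packet count gets out of hand.
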
Log collection:

Logstash: http://logstash.net/

Elasticsearch backend; push Ceph logs with either syslog, Logstash configured as a "shipper", or lumberjack (a lightweight log-gathering tool for sending to Logstash).

1. Get information like 2xx, 3xx, 4xx and 5xx response codes from radosgw and any form of load balancer you have in front of it.
2. Get ceph.log from the monitors (search for slow requests, osd up/down, remapping, etc.)
3. OSD logs could have lots of potentially interesting data but are very verbose.

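As a sketch of item 1, the following buckets HTTP status codes into 2xx/3xx/4xx/5xx classes. It assumes access log lines in a common-log-style layout where the status code follows the quoted request. The real radosgw or load balancer format may differ, and the sample lines are invented.

```python
import re
from collections import Counter

# Assumption: the status code is the 3-digit field right after the quoted request.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def status_classes(lines):
    """Count responses per status class ('2xx', '4xx', ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


# Invented sample lines, not real radosgw output.
SAMPLE = [
    '10.0.0.1 - - [22/Jun/2015:05:06:00 +0000] "GET /bucket/key HTTP/1.1" 200 512',
    '10.0.0.2 - - [22/Jun/2015:05:06:01 +0000] "PUT /bucket/key HTTP/1.1" 403 0',
    '10.0.0.3 - - [22/Jun/2015:05:06:02 +0000] "GET /bucket/missing HTTP/1.1" 404 0',
    '10.0.0.4 - - [22/Jun/2015:05:06:03 +0000] "GET /bucket/key HTTP/1.1" 500 0',
]

if __name__ == "__main__":
    for status_class, count in sorted(status_classes(SAMPLE).items()):
        print(status_class, count)
```

In a Logstash setup the equivalent extraction would normally be done with a grok filter rather than a separate script.
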
http://logstash.net/docs/1.0.17/filters/grok

Grok filters

https://github.com/Dieterbe/anthracite

Logstash can ship metrics to Graphite, so you can log the velocity of certain message types and set up alerting.

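A small sketch of the "velocity of message types" idea: count matching log lines per interval and format them in carbon's plaintext protocol (@path value timestamp@, one metric per line). The match strings, metric prefix and sample lines are assumptions made up for illustration, not real ceph.log wording.

```python
from collections import Counter

# Hypothetical message types and the substrings used to detect them.
PATTERNS = {
    "slow_request": "slow request",
    "osd_down": "marked down",
}


def count_messages(lines, patterns):
    """Count how many lines contain each pattern's substring."""
    counts = Counter({name: 0 for name in patterns})
    for line in lines:
        for name, needle in patterns.items():
            if needle in line:
                counts[name] += 1
    return counts


def carbon_lines(counts, timestamp, prefix="ceph.log"):
    """Format counts as carbon plaintext lines: '<path> <value> <timestamp>\\n'."""
    return [
        f"{prefix}.{name} {value} {int(timestamp)}\n"
        for name, value in sorted(counts.items())
    ]


# Invented sample lines.
SAMPLE = [
    "osd.3 slow request 30.5 seconds old",
    "osd.7 slow request 61.0 seconds old",
    "osd.3 marked down",
    "pgmap v100: 512 pgs active+clean",
]

if __name__ == "__main__":
    counts = count_messages(SAMPLE, PATTERNS)
    print("".join(carbon_lines(counts, 1434949560)), end="")
```

The resulting lines could be written to carbon's plaintext listener (commonly TCP port 2003), and alert thresholds can then be set on the Graphite series.
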
Sensu: https://github.com/sensu/sensu

I like it overall, but I'm not a big fan of RabbitMQ. There is a ZeroMQ fork, but I'm unsure how well tested it is and whether it's still under development.

1. Daemon/agent status checks
2.

Has anyone tried Riemann? http://riemann.io/

Logstash trigger events in anthracite?

https://github.com/Dieterbe/anthracite

Documentation tasks

# nagios
# collectd
# graphite
# ganglia
# saltstack

Suggestions:

p(. Use Chef/Puppet to configure the relevant monitoring tool
Aggregate all the monitoring tool info on a single docs page/wiki