2H - Ceph stats and monitoring tools

Live Pad

The live pad can be found here: [pad]

Summit Snapshot

Tasks

  1. upstream collectd plugin, updating as needed
  2. upstream nagios plugin
  3. ganglia integration?
  4. statsd?
  5. alternatives?

The collectd plugin for Ceph generates dynamic types from the counters in the admin socket dump, and new counters can be
added to that dump at any time.
Where is the collectd code? https://github.com/ceph/collectd-4.10.1

Caveat: the collectd network protocol does not send the dynamically detected type definitions to the collectd instance
running on the server side (collectd/network with a carbon plugin, or into Bucky).
How many data types are there? Can you just pre-generate them?
I'm not sure of the count offhand. You can pre-generate them and pass them to the server side, but I recall having an
issue when the collectd plugin suddenly gets a new metric and the types.db isn't updated on the server side.
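
The pre-generation idea above could be sketched roughly as follows. This is a minimal sketch, not the plugin's actual
behaviour: the counter names are hypothetical, and it assumes every counter maps to a single-value GAUGE data set,
which may not match how the Ceph collectd plugin names or types its metrics.

```python
def types_db_lines(metric_names, ds_type="GAUGE"):
    """Build collectd types.db entries, one single-value data set per metric."""
    lines = []
    for name in sorted(set(metric_names)):
        # types.db format: <type-name> <ds-name>:<ds-type>:<min>:<max>
        # 'U' means the bound is unknown.
        lines.append(f"{name}\tvalue:{ds_type}:U:U")
    return lines


def missing_types(client_metrics, server_types_db_text):
    """Return metrics the client reports that the server's types.db lacks."""
    known = set()
    for line in server_types_db_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            known.add(line.split()[0])
    return sorted(set(client_metrics) - known)


# Hypothetical counter names, for illustration only.
sample = ["ceph_osd_op_r", "ceph_osd_op_w", "ceph_osd_op_r"]
generated = types_db_lines(sample)
```

Running something like missing_types on the server side before a deploy would at least flag the case described above,
where a new metric shows up on a client and the server's types.db has not been updated.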

Diamond (admin socket collector agent):

https://github.com/BrightcoveOS/Diamond
https://github.com/BrightcoveOS/Diamond/blob/master/src/collectors/ceph/ceph.py
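
An admin socket collector essentially reads the nested JSON from a perf dump and flattens it into dotted metric paths.
The sketch below shows only the flattening step and is not Diamond's code; the sample JSON values are made up, and in
practice the input would come from something like `ceph --admin-daemon <socket path> perf dump`.

```python
import json


def flatten(stats, prefix=""):
    """Flatten nested perf-dump style JSON into {dotted.path: number}."""
    flat = {}
    for key, value in stats.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested sections (and avgcount/sum pairs) become deeper paths.
            flat.update(flatten(value, path))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[path] = value
        # Non-numeric values are skipped: they cannot be graphed.
    return flat


# Made-up sample in the general shape of a perf dump.
sample_dump = json.loads(
    '{"osd": {"op_r": 12, "op_latency": {"avgcount": 4, "sum": 0.5}}}'
)
flat = flatten(sample_dump)
```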

Bucky (collects stats from multiple protocols and writes to carbon, a component of Graphite):

https://github.com/cloudant/bucky

Bucky supports receiving statsd, ganglia, collectd and metricsd

http://graphite.wikidot.com/

Statsd is not appropriate for large volumes of metrics because each metric is sent as its own UDP packet. If you have
lots of metrics, like Ceph does, you end up with a UDP storm at your metric collection point. Statsd is fantastic for
sending success/fail (0/1) counters as the result of CI-type tests (s3-tests running in Jenkins). You can use
drawAsInfinite in Graphite to draw vertical bars on all your other graphs where tests failed, allowing you to correlate
CI failures with anomalies in system/Ceph metrics.
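
To make the contrast concrete, here is a small sketch of the two wire formats: one statsd counter per datagram versus
many metrics joined into a single carbon plaintext payload. The metric names are hypothetical, and the helper that
wraps a series in Graphite's drawAsInfinite function only builds a render target string; it does not talk to Graphite.

```python
import time


def statsd_counter(name, value=1):
    """Payload for one statsd counter datagram: '<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def carbon_batch(metrics, timestamp=None):
    """Join many metrics into one carbon plaintext payload.

    Each line is '<path> <value> <unix timestamp>' followed by a newline.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    lines = (f"{path} {value} {ts}\n" for path, value in sorted(metrics.items()))
    return "".join(lines).encode("ascii")


def ci_failure_target(metric):
    """Graphite render target that draws nonzero points as vertical bars."""
    return f"drawAsInfinite({metric})"


# Hypothetical names: one CI result counter, and a batch of Ceph metrics.
packet = statsd_counter("ci.s3_tests.failed")
batch = carbon_batch({"ceph.osd.0.op_w": 3, "ceph.osd.0.op_r": 12}, timestamp=1000)
```

The statsd payload would go out with a UDP socket's sendto, one call per metric, while the carbon batch can be written
to a single TCP connection, which is why the batch form scales better for the volume of counters Ceph exposes.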

Log collection:

Logstash: http://logstash.net/

Elasticsearch backend; push Ceph logs with either syslog, Logstash configured as a "shipper", or lumberjack (a
lightweight log gathering tool for sending to Logstash).

1. Get information like 2xx, 3xx, 4xx and 5xx response counts from radosgw and any load balancer you have in front of it.
2. Get ceph.log from monitors (search for slow requests, osd up/down, remapping, etc.)
3. OSD logs could have lots of potentially interesting data but are very verbose
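
The first item above could look roughly like this outside of Logstash. It is only a sketch: it assumes access log
lines in a Common Log Format style where the status code follows the quoted request, which may not match your radosgw
or load balancer log format, and the sample lines are invented.

```python
import re
from collections import Counter

# Assumption: the 3-digit status code directly follows the quoted request line.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def status_classes(lines):
    """Count responses per status class ('2xx', '3xx', '4xx', '5xx')."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


# Invented sample lines, for illustration only.
sample_lines = [
    '10.0.0.5 - - [07/May/2013:10:00:01 +0000] "GET /bucket/key HTTP/1.1" 200 512',
    '10.0.0.5 - - [07/May/2013:10:00:02 +0000] "PUT /bucket/key HTTP/1.1" 403 0',
    '10.0.0.6 - - [07/May/2013:10:00:03 +0000] "GET /bucket/other HTTP/1.1" 500 0',
    '10.0.0.6 - - [07/May/2013:10:00:04 +0000] "GET /bucket/key HTTP/1.1" 204 0',
]
classes = status_classes(sample_lines)
```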

Grok filters: http://logstash.net/docs/1.0.17/filters/grok

https://github.com/Dieterbe/anthracite

Logstash can ship metrics to Graphite, so you can track the rate of certain message types and set up alerting on it
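
As a rough illustration of tracking message rates, the sketch below counts matching log lines per minute and formats
the counts as carbon plaintext lines. It is not Logstash configuration. It assumes each log line starts with a
"YYYY-MM-DD HH:MM:SS" timestamp in UTC; the sample lines, the match strings and the metric prefix are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def message_rates(lines, patterns):
    """Count matching lines per (minute epoch, pattern name) bucket."""
    counts = Counter()
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # Skip lines without a leading timestamp.
        # Assumption: log timestamps are UTC; truncate to the minute.
        minute = int(stamp.replace(second=0, tzinfo=timezone.utc).timestamp())
        for name, needle in patterns.items():
            if needle in line:
                counts[(minute, name)] += 1
    return counts


def to_carbon_lines(counts, prefix="ceph.log"):
    """Format the buckets as '<path> <count> <timestamp>' lines."""
    return [f"{prefix}.{name} {n} {minute}"
            for (minute, name), n in sorted(counts.items())]


# Hypothetical match strings and invented log lines.
patterns = {"slow_request": "slow request", "osd_down": "marked down"}
sample_log = [
    "2013-05-07 10:00:01.000000 osd.3 [WRN] slow request 30.5 seconds old",
    "2013-05-07 10:00:40.000000 osd.4 [WRN] slow request 31.0 seconds old",
    "2013-05-07 10:01:05.000000 mon.0 [INF] osd.3 marked down",
    "garbage",
]
carbon_lines = to_carbon_lines(message_rates(sample_log, patterns))
```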

Sensu: https://github.com/sensu/sensu

I like it overall, but I'm not a big fan of RabbitMQ. There is a ZeroMQ fork, but I'm unsure how well tested it is and
whether it's still under development.

1. Daemon/agent status checks
2.
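
A status check of this kind usually boils down to mapping some output to the Nagios plugin exit codes, which Sensu
checks also use. The sketch below maps the first word of `ceph health` output to those codes. It is a sketch of the
mapping only, not an existing upstream plugin; running the command and calling sys.exit with the code are left out.

```python
# Nagios plugin exit codes (Sensu checks follow the same convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_health(status_line):
    """Map a `ceph health` summary line to an (exit_code, message) pair."""
    text = status_line.strip()
    if text.startswith("HEALTH_OK"):
        return OK, f"OK - {text}"
    if text.startswith("HEALTH_WARN"):
        return WARNING, f"WARNING - {text}"
    if text.startswith("HEALTH_ERR"):
        return CRITICAL, f"CRITICAL - {text}"
    # Anything else (empty output, timeout message) is reported as UNKNOWN.
    return UNKNOWN, f"UNKNOWN - unexpected output: {text!r}"


code, message = check_health("HEALTH_WARN 1 osds down")
```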

Has anyone tried Riemann? http://riemann.io/

Could Logstash trigger events in anthracite?

https://github.com/Dieterbe/anthracite

Documentation tasks

  1. nagios
  2. collectd
  3. graphite
  4. ganglia
  5. saltstack

Suggestions:

Use Chef/Puppet to configure the relevant monitoring tool
Aggregate all the monitoring tool info on a single docs page/wiki