Version 1 - History - 2H - Ceph stats and monitoring tools - Ceph - Ceph


Jessica Mack

h1. 2H - Ceph stats and monitoring tools


h3. Live Pad


The live pad can be found here: "[pad]":http://pad.ceph.com/p/ceph_stats_and_monitoring_tools


h3. Summit Snapshot


Tasks

# upstream collectd plugin, updating as needed
# upstream nagios plugin
# ganglia integration?
# statsd?
# alternatives?

The collectd plugin for Ceph generates its types dynamically from the counters in the admin socket dump, and counters can be added to that dump at runtime.

Where is the collectd code? https://github.com/ceph/collectd-4.10.1

Caveat: the collectd network protocol does not send the dynamically detected type data to the collectd instance running on the server side (collectd/network with a carbon plugin, or into Bucky).

How many data types are there? Can you just pre-generate them?

I'm not sure of the count offhand. You can pre-generate them and pass them to the server side, but I recall having an issue if the collectd plugin suddenly gets a new metric and the types.db isn't updated on the server side.
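
A rough sketch of the pre-generation idea above: flatten a perf-counter dump into metric names, emit one types.db entry per name, and report names the server side does not know yet. The sample dump, the @ceph@ prefix, the choice of GAUGE for every counter, and all function names are assumptions for illustration, not what the collectd plugin actually does.

```python
import json


def flatten_counters(perf_dump, prefix="ceph"):
    """Flatten a nested counter dict into {dotted_name: number}."""
    flat = {}
    for key, value in perf_dump.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten_counters(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
    return flat


def typesdb_lines(flat):
    """One types.db entry per counter; GAUGE with range 0..U is an assumption."""
    return [f"{name.replace('.', '_')}\tvalue:GAUGE:0:U" for name in sorted(flat)]


def missing_types(flat, known_types):
    """Names seen on the agent that the server-side types.db does not list."""
    return sorted(set(n.replace(".", "_") for n in flat) - set(known_types))


# Hypothetical sample; a real admin socket dump has many more counters.
SAMPLE_DUMP = json.loads(
    '{"osd": {"op": 12, "op_latency": {"avgcount": 3, "sum": 0.5}}}'
)
FLAT = flatten_counters(SAMPLE_DUMP)
TYPES = typesdb_lines(FLAT)
MISSING = missing_types(FLAT, known_types=["ceph_osd_op"])
```

Running something like @missing_types@ periodically would at least flag the case described above, where a new metric shows up before the server-side types.db has been updated.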


Diamond (admin socket collector agent):


https://github.com/BrightcoveOS/Diamond


https://github.com/BrightcoveOS/Diamond/blob/master/src/collectors/ceph/ceph.py


Bucky (Collects stats from multiple protocols and writes to carbon, a component of Graphite)


https://github.com/cloudant/bucky

Bucky supports receiving statsd, ganglia, collectd and metricsd.

http://graphite.wikidot.com/
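
For reference, whatever sits in front of carbon (Bucky, Diamond, a collectd plugin) ends up writing lines in carbon's plaintext format, @path value timestamp@, one metric per line. A minimal sketch follows; the metric names, host and helper names are made up for illustration, and 2003 is carbon's default plaintext port.

```python
import socket


def carbon_lines(metrics, timestamp):
    """Render {path: value} as carbon plaintext lines: 'path value timestamp'."""
    return "".join(
        f"{path} {value} {int(timestamp)}\n"
        for path, value in sorted(metrics.items())
    )


def send_to_carbon(payload, host="localhost", port=2003):
    """Send a batch of lines over one TCP connection (not called here)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))


# Hypothetical metric paths and a fixed timestamp for a repeatable example.
PAYLOAD = carbon_lines({"ceph.osd.0.op": 12, "ceph.osd.0.op_r": 7}, 1367000000)
```

Many metrics can share one TCP connection here, which is the main contrast with the one-datagram-per-metric behaviour of statsd.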


Statsd is not appropriate for large volumes of metrics because each metric is sent as a UDP packet. If you have lots of metrics, like Ceph does, you end up with a UDP storm to your metric collection point. Statsd is fantastic for sending success/fail (0/1) counters as the result of CI-type tests (s3-tests running in Jenkins). You can use Graphite's drawAsInfinite() function to draw vertical bars on all your other graphs where tests failed, allowing you to correlate CI failures with anomalies in system/Ceph metrics.
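
A small sketch of the CI use case, assuming a statsd daemon on its default UDP port 8125. The bucket names (@ci.s3_tests.fail@) and the Graphite series path are hypothetical; where statsd actually stores counters in Graphite depends on its configuration.

```python
import socket


def statsd_counter(bucket, value=1):
    """Build one statsd counter datagram in the 'bucket:value|c' wire format."""
    return f"{bucket}:{value}|c".encode("ascii")


def ci_result_datagram(suite, passed):
    """One pass/fail counter per CI run; bucket naming is an assumption."""
    outcome = "pass" if passed else "fail"
    return statsd_counter(f"ci.{suite}.{outcome}")


def send_datagrams(datagrams, host="127.0.0.1", port=8125):
    """Each metric goes out as its own UDP packet (not called here)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for datagram in datagrams:
            sock.sendto(datagram, (host, port))
    finally:
        sock.close()


def failure_marker_target(series):
    """Graphite render target that draws non-zero points as vertical bars."""
    return f"drawAsInfinite({series})"


DATAGRAM = ci_result_datagram("s3_tests", passed=False)
TARGET = failure_marker_target("stats.ci.s3_tests.fail")
```

Adding @TARGET@ as an extra target on an existing graph overlays the failure markers on whatever system or Ceph metric is already plotted.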


Log collection:


Logstash: http://logstash.net/


Elasticsearch backend; push ceph logs with either syslog, logstash configured as a "shipper", or lumberjack (a lightweight log-gathering tool for sending to logstash).

1. Get information like 2xx, 3xx, 4xx, 5xx status codes from radosgw and any form of load balancer you have in front of it.
2. Get ceph.log from monitors (search for slow requests, osd up/down, remapping, etc.)
3. OSD logs could have lots of potentially interesting data but are very verbose.
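
To illustrate item 1, here is a sketch that buckets access-log lines by status class. It assumes Common Log Format style lines, where the status code follows the quoted request; the real format depends on the web server or load balancer in front of radosgw, and the sample lines are invented.

```python
import re
from collections import Counter

# Status code is the three digits right after the closing quote of the request.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def status_classes(lines):
    """Count lines per status class ('2xx', '4xx', ...); skip unparseable lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


# Invented sample lines in Common Log Format.
SAMPLE = [
    '10.0.0.1 - - [01/May/2013:10:00:00 +0000] "GET /bucket/key HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/May/2013:10:00:01 +0000] "PUT /bucket/key HTTP/1.1" 500 0',
    '10.0.0.3 - - [01/May/2013:10:00:02 +0000] "GET /missing HTTP/1.1" 404 0',
    '10.0.0.3 - - [01/May/2013:10:00:03 +0000] "GET /bucket/key HTTP/1.1" 204 0',
]
COUNTS = status_classes(SAMPLE)
```

In a logstash setup the same extraction would more likely be done with a grok filter; the Python version only shows what is being pulled out of each line.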


http://logstash.net/docs/1.0.17/filters/grok


Grok filters


https://github.com/Dieterbe/anthracite


Logstash can ship metrics to Graphite, so you can track the velocity of certain message types and set up alerting.
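
"Velocity" here means messages of a given type per time bucket. A minimal sketch of that counting step, outside of logstash: the log line layout (timestamp in the first 16 characters), the match strings and the sample lines are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical message types and the substrings used to recognise them.
PATTERNS = {"slow_request": "slow request", "osd_down": "marked down"}


def message_velocity(lines):
    """Count matching messages per (minute, message_type)."""
    counts = Counter()
    for line in lines:
        minute = line[:16]  # assumes lines start with "YYYY-MM-DD HH:MM"
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[(minute, name)] += 1
    return counts


# Invented lines, loosely modelled on cluster log entries.
SAMPLE = [
    "2013-05-01 10:00:12.100000 osd.3 [WRN] slow request 30.5 seconds old",
    "2013-05-01 10:00:48.200000 osd.7 [WRN] slow request 31.0 seconds old",
    "2013-05-01 10:01:05.300000 mon.a [INF] osd.3 marked down",
]
VELOCITY = message_velocity(SAMPLE)
```

Each (minute, type) count could then be written to Graphite as one data point, and an alert set on the resulting series.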


Sensu: https://github.com/sensu/sensu


I like it overall, but I'm not a big fan of RabbitMQ. There is a ZeroMQ fork, but I'm unsure how well tested it is and whether it's still under development.


1. Daemon/agent status checks


2.
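
A status check along these lines could be sketched as follows. Sensu checks use the Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), so the same script would also serve as a starting point for the nagios plugin task. The function names and message wording are assumptions; @ceph health@ is assumed to be on the PATH with a usable keyring.

```python
import subprocess

# Nagios plugin exit codes, which Sensu checks also use.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_health(output):
    """Map the leading token of `ceph health` output to (exit_code, message)."""
    text = output.strip()
    if text.startswith("HEALTH_OK"):
        return OK, "OK: " + text
    if text.startswith("HEALTH_WARN"):
        return WARNING, "WARNING: " + text
    if text.startswith("HEALTH_ERR"):
        return CRITICAL, "CRITICAL: " + text
    return UNKNOWN, "UNKNOWN: unexpected output: " + text


def run_check():
    """Run `ceph health` and classify it; defined here but not called."""
    try:
        result = subprocess.run(
            ["ceph", "health"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return UNKNOWN, f"UNKNOWN: could not run ceph health: {exc}"
    return classify_health(result.stdout)
```

A wrapper script would print the message and exit with the returned code, which is all either Sensu or Nagios needs from a check.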

Anyone tried Riemann? http://riemann.io/

Logstash trigger events in anthracite?


https://github.com/Dieterbe/anthracite


Documentation tasks

# nagios
# collectd
# graphite
# ganglia
# saltstack

Suggestions:

p(. Use Chef/Puppet to configure the relevant monitoring tool
    Aggregate all the monitoring tool info on a single docs page/wiki
